Playing Planning Poker in Crowds: Human Computation of Software Effort Estimates
Authors: Alhamed, Mohammed and Storer, Tim
Keywords: No keywords
Abstract
Reliable, cost-effective effort estimation remains a considerable challenge for software projects. Recent work has demonstrated that the popular Planning Poker practice can produce reliable estimates when undertaken within a software team of knowledgeable domain experts. However, the process depends on the availability of experts and can be time-consuming to perform, making it impractical for large-scale or open-source projects that may curate many thousands of outstanding tasks. This paper reports on a study investigating the feasibility of using crowd workers, supplied with limited information about a task, to provide comparably accurate estimates using Planning Poker. We describe the design of a Crowd Planning Poker (CPP) process implemented on Amazon Mechanical Turk and the results of a substantial set of trials, involving more than 5000 crowd workers and 39 diverse software tasks. Our results show that a carefully organised and selected crowd of workers can produce effort estimates of similar accuracy to those of a single expert.
DOI: 10.1109/ICSE43902.2021.00014
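The abstract above leaves the aggregation mechanics of a Planning Poker round implicit. A minimal sketch of one plausible scheme, in which workers' raw estimates are snapped to a story-point deck and a round converges when the cards are close enough; the deck, the snapping rule, and the convergence threshold are all illustrative assumptions, not the paper's CPP design:

```python
# Illustrative Planning Poker round aggregation (assumed scheme, not CPP's).
from statistics import median

DECK = [1, 2, 3, 5, 8, 13, 21]  # Fibonacci-like story-point cards

def snap_to_deck(value):
    """Map a raw estimate to the nearest card in the deck."""
    return min(DECK, key=lambda card: abs(card - value))

def poker_round(estimates, spread_threshold=2):
    """Aggregate one round of worker estimates.

    Returns (consensus_reached, aggregate). Consensus is declared when the
    played cards span at most `spread_threshold` adjacent deck positions;
    the aggregate is the median estimate snapped back to the deck.
    """
    cards = [snap_to_deck(e) for e in estimates]
    positions = [DECK.index(c) for c in cards]
    converged = max(positions) - min(positions) <= spread_threshold
    return converged, snap_to_deck(median(cards))
```

For example, `poker_round([3, 5, 5, 8])` converges on 5, while a widely spread round such as `[1, 20, 2, 3]` would trigger another discussion-and-revote round.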
JEST: N+1-version Differential Testing of Both JavaScript Engines and Specification
Authors: Park, Jihyeok and An, Seungmin and Youn, Dongjun and Kim, Gyeongwon and Ryu, Sukyoung
Keywords: mechanized specification, differential testing, conformance test generation, JavaScript
Abstract
Modern programming follows the continuous integration (CI) and continuous deployment (CD) approach rather than the traditional waterfall model. Even the development of modern programming languages uses the CI/CD approach to swiftly provide new language features and to adapt to new development environments. Unlike in the conventional approach, in the modern CI/CD approach a language specification is no longer the oracle of the language semantics, because both the specification and its implementations (interpreters or compilers) can co-evolve. In this setting, both the specification and the implementations may have bugs, and guaranteeing their correctness is non-trivial. In this paper, we propose a novel N+1-version differential testing approach to resolve the problem. Unlike traditional differential testing, our approach consists of four steps: 1) automatically synthesize programs guided by the syntax and semantics of a given language specification, 2) generate conformance tests by injecting assertions into the synthesized programs to check their final program states, 3) detect bugs in the specification and implementations by executing the conformance tests on multiple implementations, and 4) localize bugs in the specification using statistical information. We actualize our approach for the JavaScript programming language via JEST, which performs N+1-version differential testing for modern JavaScript engines and ECMAScript, the language specification describing the syntax and semantics of JavaScript in natural language. We evaluated JEST with four JavaScript engines that support all modern JavaScript language features and the latest version of ECMAScript (ES11, 2020). JEST automatically synthesized 1,700 programs that covered 97.78% of the syntax and 87.70% of the semantics of ES11. Using the assertion-injected JavaScript programs, it detected 44 engine bugs in four different engines and 27 specification bugs in ES11.
DOI: 10.1109/ICSE43902.2021.00015
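The core of the JEST approach above is running the same synthesized program on a spec-derived reference and on every engine, and flagging any disagreement. A minimal sketch with stand-in Python functions as "engines" (real JEST runs JavaScript engines against ECMAScript semantics; the programs and the injected bug here are toy assumptions):

```python
# Minimal differential-testing sketch; "engines" are stand-in functions
# evaluating trusted toy expressions, not real JavaScript engines.
def spec_eval(program):        # reference semantics derived from the specification
    return eval(program)

def engine_a(program):         # a correct engine
    return eval(program)

def engine_b(program):         # an engine with an injected bug in integer results
    result = eval(program)
    return result + 1 if isinstance(result, int) else result

def differential_test(program, engines):
    """Run one synthesized program on every engine plus the spec reference;
    any disagreement is a candidate bug in an engine or in the spec itself."""
    expected = spec_eval(program)
    return [name for name, run in engines.items() if run(program) != expected]

disagreeing = differential_test("1 + 2", {"A": engine_a, "B": engine_b})
```

Here `disagreeing` names only engine `B`; in the N+1-version setting, a majority of engines disagreeing with the reference instead points at a specification bug.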
Unrealizable Cores for Reactive Systems Specifications
Authors: Maoz, Shahar and Shalom, Rafi
Keywords: No keywords
Abstract
One of the main challenges of reactive synthesis, an automated procedure to obtain a correct-by-construction reactive system, is to deal with unrealizable specifications. One means to deal with unrealizability, in the context of GR(1), an expressive assume-guarantee fragment of LTL that enables efficient synthesis, is the computation of an unrealizable core, which can be viewed as a fault-localization approach. Existing solutions, however, are computationally costly, are limited to computing a single core, and do not correctly support specifications with constructs beyond pure GR(1) elements. In this work we address these limitations. First, we present QuickCore, a novel algorithm that accelerates unrealizable core computations by relying on the monotonicity of unrealizability, on an incremental computation, and on additional properties of GR(1) specifications. Second, we present Punch, a novel algorithm to efficiently compute all unrealizable cores of a specification. Finally, we present means to correctly handle specifications that include higher-level constructs beyond pure GR(1) elements. We implemented our ideas on top of Spectra, an open-source language and synthesis environment. Our evaluation over benchmarks from the literature shows that QuickCore is in most cases faster than previous algorithms, and that its relative advantage grows with scale. Moreover, we found that most specifications include more than one core, and that Punch finds all the cores significantly faster than a competing naive algorithm.
DOI: 10.1109/ICSE43902.2021.00016
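The monotonicity property the abstract mentions (any superset of an unrealizable set is also unrealizable) is what makes core computation work at all. A generic deletion-based minimization sketch, not QuickCore itself, with a toy oracle standing in for GR(1) realizability checking:

```python
# Deletion-based core minimization; relies only on monotonicity of the
# `is_unrealizable` oracle, which here is a toy stand-in, not GR(1) synthesis.
def minimize_core(constraints, is_unrealizable):
    """Shrink an unrealizable constraint set to a locally minimal core:
    drop each constraint whose removal keeps the set unrealizable."""
    core = list(constraints)
    for c in list(core):
        candidate = [x for x in core if x != c]
        if is_unrealizable(candidate):
            core = candidate
    return core

# Toy oracle: the set is "unrealizable" iff it contains both g1 and g2.
oracle = lambda cs: "g1" in cs and "g2" in cs
core = minimize_core(["g1", "g2", "g3", "g4"], oracle)
```

This naive loop calls the oracle once per constraint; QuickCore's contribution is precisely to avoid most of those expensive realizability checks, and Punch to enumerate all such cores rather than one.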
Verifying Determinism in Sequential Programs
Authors: Mudduluru, Rashmi and Waataja, Jason and Millstein, Suzanne and Ernst, Michael D.
Keywords: verification, type system, specification, nondeterminism, hash table, flaky tests
Abstract
When a program is nondeterministic, it is difficult to test and debug. Nondeterminism occurs even in sequential programs: e.g., by iterating over the elements of a hash table. We have created a type system that expresses determinism specifications in a program. The key ideas in the type system are type qualifiers for nondeterminism, order-nondeterminism, and determinism; type well-formedness rules to restrict collection types; and enhancements to polymorphism that improve precision when analyzing collection operations. While state-of-the-art nondeterminism detection tools rely on observing output from specific runs, our approach soundly verifies determinism at compile time. We implemented our type system for Java. Our type checker, the Determinism Checker, warns if a program is nondeterministic or verifies that the program is deterministic. In case studies of 90,097 lines of code, the Determinism Checker found 87 previously unknown nondeterminism errors, even in programs that had been heavily vetted by developers who were greatly concerned about nondeterminism errors. In experiments, the Determinism Checker found all of the non-concurrency-related nondeterminism that was found by state-of-the-art dynamic approaches for detecting flaky tests.
DOI: 10.1109/ICSE43902.2021.00017
Domain-Specific Fixes for Flaky Tests with Wrong Assumptions on Underdetermined Specifications
Authors: Zhang, Peilun and Jiang, Yanjie and Wei, Anjiang and Stodden, Victoria and Marinov, Darko and Shi, August
Keywords: No keywords
Abstract
Library developers can provide classes and methods with underdetermined specifications that allow flexibility in future implementations. Library users may write code that relies on a specific implementation rather than on the specification, e.g., mistakenly assuming that the order of elements cannot change in the future. Prior work proposed the NonDex approach, which detects such wrong assumptions. We present a novel approach, called DexFix, to repair wrong assumptions on underdetermined specifications in an automated way. We run the NonDex tool on 200 open-source Java projects and detect 275 tests that fail due to wrong assumptions. The majority of failures come from iterating over HashMap/HashSet collections and from the getDeclaredFields method. We provide several new repair strategies that can fix these violations in both the test code and the main code. DexFix proposes fixes for 119 of the 275 detected tests. We have already reported fixes for 102 tests as GitHub pull requests: 74 have been merged, only 5 rejected, and the rest are pending.
DOI: 10.1109/ICSE43902.2021.00018
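The dominant failure class above is code that bakes in an iteration order the specification never promised. A sketch of the wrong assumption and the corresponding DexFix-style repair strategy, transposed from Java's HashSet to Python's set (the function names are invented for illustration):

```python
# The Python analogue of iterating a Java HashSet: a set's iteration order
# for string elements can vary across runs due to hash randomization.
def broken_summary(tags):
    # Wrong assumption: relies on set iteration order being stable.
    return ",".join(tags)

def fixed_summary(tags):
    # Repair strategy: impose an explicit, specified order before iterating.
    return ",".join(sorted(tags))

labels = {"ui", "db", "api"}
```

`broken_summary(labels)` may produce a different string on a different run, so a test asserting its exact output is flaky; `fixed_summary` pins the order, which mirrors the paper's fix of wrapping order-sensitive iteration in a deterministic sort.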
Studying Test Annotation Maintenance in the Wild
Authors: Kim, Dong Jae and Tsantalis, Nikolaos and Chen, Tse-Hsun Peter and Yang, Jinqiu
Keywords: Software Quality, Software Evolution, Empirical Study, Annotation
Abstract
Since the introduction of annotations in Java 5, the majority of testing frameworks, such as JUnit, TestNG, and Mockito, have adopted annotations in their core design. This adoption has affected testing practices at every step of the test life-cycle, from fixture setup and test execution to fixture teardown. Despite the importance of test annotations, research on test maintenance has mainly focused on test code quality and test assertions. As a result, there is little empirical evidence on the evolution and maintenance of test annotations. To fill this gap, we perform the first fine-grained empirical study of annotation changes. We developed a tool to mine 82,810 commits and detect 23,936 instances of test annotation changes in 12 open-source Java projects. Our main findings are: (1) Test annotation changes are more frequent than rename and type change refactorings. (2) We recover various migration efforts within the same testing framework or between different frameworks by analyzing common annotation replacement patterns. (3) We create a taxonomy by manually inspecting and classifying a sample of 368 test annotation changes and documenting the motivations driving these changes. Finally, we present a list of actionable implications for developers, researchers, and framework designers.
DOI: 10.1109/ICSE43902.2021.00019
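Finding (2) above hinges on spotting annotation replacement patterns in commits. A much-simplified sketch that pairs removed with added annotations in a diff hunk, illustrated on the classic JUnit 4 to JUnit 5 migration signal (`@Ignore` replaced by `@Disabled`); the paper's tool works on ASTs of 82,810 commits, not raw diff lines:

```python
# Toy annotation-replacement detector over removed/added diff lines.
def annotation_replacements(removed_lines, added_lines):
    """Pair annotations dropped in a hunk with annotations introduced in it."""
    removed = {l.strip() for l in removed_lines if l.strip().startswith("@")}
    added = {l.strip() for l in added_lines if l.strip().startswith("@")}
    # Annotations present on both sides are unchanged; cross the rest.
    return sorted((r, a) for r in removed - added for a in added - removed)

pairs = annotation_replacements(
    removed_lines=["@Ignore", "public void testFoo() {"],
    added_lines=["@Disabled", "public void testFoo() {"],
)
```

Aggregating such pairs over many commits is what lets the study surface framework-migration efforts as frequent replacement patterns.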
Semantic Patches for Adaptation of JavaScript Programs to Evolving Libraries
Authors: Nielsen, Benjamin Barslev and Torp, Martin Toldam and Møller, Anders
Keywords: No keywords
Abstract
JavaScript libraries are often updated and sometimes breaking changes are introduced in the process, resulting in the client developers having to adapt their code to the changes. In addition to locating the affected parts of their code, the client developers must apply suitable patches, which is a tedious, error-prone, and entirely manual process. To reduce the manual effort, we present JSFIX. Given a collection of semantic patches, which are formalized descriptions of the breaking changes, the tool detects the locations affected by breaking changes and then transforms those parts of the code to become compatible with the new library version. JSFIX relies on an existing static analysis to approximate the set of affected locations, and an interactive process where the user answers questions about the client code to filter away false positives. An evaluation involving 12 popular JavaScript libraries and 203 clients shows that our notion of semantic patches can accurately express most of the breaking changes that occur in practice, and that JSFIX can successfully adapt most of the clients to the changes. In particular, 31 clients have accepted pull requests made by JSFIX, indicating that the code quality is good enough for practical usage. It takes JSFIX only a few seconds to patch, on average, 3.8 source locations affected by breaking changes in each client, with only 2.7 questions to the user, which suggests that the approach can significantly reduce the manual effort required when adapting JavaScript programs to evolving libraries.
DOI: 10.1109/ICSE43902.2021.00020
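To make the "semantic patch" idea above concrete, here is a deliberately crude sketch: a breaking change formalized as a call-site pattern plus its replacement, applied over client source. The library rename is invented, and real JSFIX matches locations via static analysis rather than regular expressions:

```python
# Toy "semantic patch" application: pattern -> replacement over client code.
import re

SEMANTIC_PATCHES = [
    # Hypothetical breaking change: lib.parse(s) renamed to lib.parseStrict(s) in v2.
    (re.compile(r"\blib\.parse\("), "lib.parseStrict("),
]

def apply_patches(source):
    """Rewrite every affected call site; report how many locations changed."""
    total = 0
    for pattern, replacement in SEMANTIC_PATCHES:
        source, n = pattern.subn(replacement, source)
        total += n
    return source, total

patched, n_sites = apply_patches("const x = lib.parse(a); lib.parse(b);")
```

The count of rewritten sites corresponds to the "3.8 source locations per client" the evaluation reports; the interactive questions in JSFIX exist precisely because textual matches like this one over-approximate the truly affected locations.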
DepOwl: Detecting Dependency Bugs to Prevent Compatibility Failures
Authors: Jia, Zhouyang and Li, Shanshan and Yu, Tingting and Zeng, Chen and Xu, Erci and Liu, Xiaodong and Wang, Ji and Liao, Xiangke
Keywords: Software dependency, Library incompatibility, Compatibility failure
Abstract
Applications depend on libraries to avoid reinventing the wheel. Libraries may introduce incompatible changes as they evolve, causing applications to suffer compatibility failures. There has been much research on detecting incompatible changes in libraries, or on helping applications co-evolve with the libraries. Existing solutions, however, help the latest application version work against the latest library version only as an afterthought: end users have already suffered the failures and must wait for new versions. In this paper, we propose DepOwl, a practical tool that helps users prevent compatibility failures. The key idea is to avoid adopting incompatible versions from the very beginning. We evaluated DepOwl on 38 known compatibility failures from StackOverflow, and DepOwl can prevent 35 of them. We also evaluated DepOwl using the software repository shipped with Ubuntu-19.10, where it detected 77 unknown dependency bugs that may lead to compatibility failures.
DOI: 10.1109/ICSE43902.2021.00021
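A dependency bug in the paper's sense is a declared version range that fails to exclude library versions carrying incompatible changes. A minimal sketch of that check; the half-open `[min, max)` range format and the version data are simplifications, not DepOwl's actual model:

```python
# Toy dependency-bug check: does the declared range admit known-bad versions?
def parse(v):
    """'1.4.0' -> (1, 4, 0), so versions compare componentwise."""
    return tuple(int(x) for x in v.split("."))

def admissible(range_min, range_max, version):
    """Half-open range check: range_min <= version < range_max."""
    return parse(range_min) <= parse(version) < parse(range_max)

def dependency_bugs(declared_range, incompatible_versions):
    """Return the incompatible library versions the range fails to exclude."""
    lo, hi = declared_range
    return [v for v in incompatible_versions if admissible(lo, hi, v)]

bugs = dependency_bugs(("1.0.0", "2.0.0"), ["1.4.0", "2.1.0"])
```

Here the app's declared range `>=1.0.0, <2.0.0` still admits the incompatible `1.4.0`, so DepOwl would warn before the user ever installs it; `2.1.0` is already excluded by the range.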
Hero: On the Chaos When PATH Meets Modules
Authors: Wang, Ying and Qiao, Liang and Xu, Chang and Liu, Yepang and Cheung, Shing-Chi and Meng, Na and Yu, Hai and Zhu, Zhiliang
Keywords: Golang Ecosystem, Dependency Management
Abstract
Ever since its first release in 2009, the Go programming language (Golang) has been well received by software communities. A major reason for its success is the powerful support of library-based development, where a Golang project can be conveniently built on top of other projects by referencing them as libraries. As Golang evolves, it recommends the use of a new library-referencing mode to overcome the limitations of the original one. Although the two library-referencing modes are incompatible, both are supported by the Golang ecosystem. The heterogeneous use of library-referencing modes across Golang projects has caused numerous dependency management (DM) issues, incurring reference inconsistencies and even build failures. Motivated by this problem, we conducted an empirical study to characterize the DM issues, understand their root causes, and examine their fixing solutions. Based on our findings, we developed HERO, an automated technique to detect DM issues and suggest proper fixing solutions. We applied HERO to 19,000 popular Golang projects. The results showed that HERO achieved a high detection rate of 98.5% on a DM issue benchmark and found 2,422 new DM issues in 2,356 popular Golang projects. We reported 280 issues, among which 181 (64.6%) issues have been confirmed, and 160 of them (88.4%) have been fixed or are under fixing. Almost all the fixes have adopted our fixing suggestions.
DOI: 10.1109/ICSE43902.2021.00022
SOAR: A Synthesis Approach for Data Science API Refactoring
Authors: Ni, Ansong and Ramos, Daniel and Yang, Aidan Z.H. and Lynce, Inês
Keywords: program synthesis, program translation, software maintenance
Abstract
With the growth of the open-source data science community, both the number of data science libraries and the number of versions of the same library are increasing rapidly. To match the evolving APIs of those libraries, open-source organizations often have to exert manual effort to refactor the APIs used in their code base. Moreover, given the abundance of similar open-source libraries, data scientists working on a certain application may have many libraries to choose from, maintain, and migrate between. Manual refactoring between APIs is a tedious and error-prone task. Although recent research has addressed automatic API refactoring between different languages, previous work relies on statistical learning with collected pairwise training data for API matching and migration. Relying on large statistical data for refactoring is not ideal, because such training data will not be available for a new library or a new version of the same library. We introduce Synthesis for Open-Source API Refactoring (SOAR), a novel technique that requires no training data to achieve API migration and refactoring. SOAR relies only on the documentation that is readily available at the release of the library to learn API representations and the mapping between libraries. Using program synthesis, SOAR automatically computes the correct configuration of arguments to the APIs and any glue code required to invoke them. SOAR also uses the interpreter's error messages when running refactored code to generate logical constraints that prune the search space. Our empirical evaluation shows that SOAR can successfully refactor 80% of our benchmarks corresponding to deep learning models with up to 44 layers, with an average run time of 97.23 seconds, and 90% of the data wrangling benchmarks, with an average run time of 17.31 seconds.
DOI: 10.1109/ICSE43902.2021.00023
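The enumerate-execute-prune loop at the heart of the SOAR description above can be sketched in miniature. The "API" is a toy function, and real SOAR derives logical constraints from interpreter error messages rather than merely skipping failing candidates:

```python
# Toy synthesis loop: search argument configurations for a target API call
# until the migrated call reproduces the old API's output.
import itertools

def target_api(size, fill):          # stand-in for the new library's API
    if not isinstance(size, int):
        raise TypeError("size must be int")
    return [fill] * size

def synthesize(expected, arg_pool):
    """Try argument orderings until one reproduces the old API's output."""
    for args in itertools.permutations(arg_pool, 2):
        try:
            if target_api(*args) == expected:
                return args
        except TypeError:
            continue                  # the error message prunes this candidate
    return None

found = synthesize(expected=["x", "x", "x"], arg_pool=[3, "x"])
```

Even this toy shows why error-driven pruning matters: a `TypeError` rules out a whole candidate without comparing outputs, and SOAR generalizes that signal into constraints that eliminate entire regions of the search space.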
Are Machine Learning Cloud APIs Used Correctly?
Authors: Wan, Chengcheng and Liu, Shicheng and Hoffmann, Henry and Maire, Michael and Lu, Shan
Keywords: No keywords
Abstract
Machine learning (ML) cloud APIs enable developers to easily incorporate learning solutions into software systems. Unfortunately, ML APIs are challenging to use correctly and efficiently, given their unique semantics, data requirements, and accuracy-performance tradeoffs. Much prior work has studied how to develop ML APIs or ML cloud services, but not how open-source applications are using ML APIs. In this paper, we manually studied 360 representative open-source applications that use Google or AWS cloud-based ML APIs, and found that 70% of these applications contain API misuses in their latest versions that degrade the functional, performance, or economic quality of the software. We generalized 8 anti-patterns based on our manual study and developed automated checkers that identified hundreds more applications containing ML API misuses.
DOI: 10.1109/ICSE43902.2021.00024
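The "automated checkers" mentioned above are static detectors for recurring misuse shapes. A sketch of one plausible cost-related anti-pattern check: a cloud ML call issued once per item inside a loop, where a batched call would be cheaper. The API name is hypothetical, and whether this specific pattern is among the paper's 8 anti-patterns is an assumption:

```python
# Toy AST checker: flag calls to assumed per-item ML APIs inside loops.
import ast

ML_APIS = {"annotate_image"}          # hypothetical cloud ML API names

def calls_in_loops(source):
    """Return names of ML API calls that appear inside for/while loops."""
    tree, hits = ast.parse(source), []
    for loop in ast.walk(tree):
        if isinstance(loop, (ast.For, ast.While)):
            for node in ast.walk(loop):
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Name)
                        and node.func.id in ML_APIS):
                    hits.append(node.func.id)
    return hits

snippet = "for img in images:\n    label = annotate_image(img)\n"
```

Running the checker over a repository's Python files is how a study like this scales from 360 manually inspected applications to hundreds of automatically flagged ones.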
Siri, Write the Next Method
Authors: Wen, Fengcai and Aghajani, Emad and Nagy, Csaba and Lanza, Michele and Bavota, Gabriele
Keywords: Mining Software Repositories, Empirical Software Engineering, Code Recommender
Abstract
Code completion is one of the killer features of Integrated Development Environments (IDEs), and researchers have proposed different methods to improve its accuracy. While these techniques are valuable to speed up code writing, they are limited to recommendations related to the next few tokens a developer is likely to type given the current context. In the best case, they can recommend a few APIs that a developer is likely to use next. We present FeaRS, a novel retrieval-based approach that, given the current code a developer is writing in the IDE, can recommend the next complete method (i.e., signature and method body) that the developer is likely to implement. To do this, FeaRS exploits "implementation patterns" (i.e., groups of methods usually implemented within the same task) learned by mining thousands of open source projects. We instantiated our approach in the specific context of Android apps. A large-scale empirical evaluation we performed across more than 20k apps shows encouraging preliminary results, but also highlights future challenges to overcome.
DOI: 10.1109/ICSE43902.2021.00025
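The "implementation pattern" idea above can be reduced to a co-occurrence vote: count which methods were historically implemented together, then recommend the method most often co-implemented with the ones already written. The mined groups below are invented Android-flavored examples, and FeaRS recommends full method bodies, not just names:

```python
# Toy next-method recommender over mined co-implementation groups.
from collections import Counter

PATTERNS = [                      # method groups "mined" from past projects
    {"onCreate", "onPause", "onResume"},
    {"onCreate", "onResume"},
    {"connect", "disconnect"},
]

def recommend_next(implemented):
    """Suggest the method most often co-implemented with those already written."""
    votes = Counter()
    for group in PATTERNS:
        if implemented <= group:  # this pattern matches what exists so far
            votes.update(group - implemented)
    return votes.most_common(1)[0][0] if votes else None

suggestion = recommend_next({"onCreate"})
```

After `onCreate` is written, `onResume` wins the vote because it co-occurs in two mined groups; FeaRS then goes further and retrieves a concrete signature and body to offer.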
Code Prediction by Feeding Trees to Transformers
Authors: Kim, Seohyun and Zhao, Jinman and Tian, Yuchi and Chandra, Satish
Keywords: code prediction, code embedding, autocomplete
Abstract
Code prediction, more specifically autocomplete, has become an essential feature in modern IDEs. Autocomplete is more effective when the desired next token is at (or close to) the top of the list of potential completions offered by the IDE at the cursor position. This is where the strength of the underlying machine learning system that produces a ranked order of potential completions comes into play. We advance the state of the art in the accuracy of code prediction (next token prediction) used in autocomplete systems. Our work uses Transformers as the base neural architecture. We show that by making the Transformer architecture aware of the syntactic structure of code, we increase the margin by which a Transformer-based system outperforms previous systems. With this, it outperforms the accuracy of several state-of-the-art next token prediction systems by margins ranging from 14% to 18%. We present in the paper several ways of communicating the code structure to the Transformer, which is fundamentally built for processing sequence data. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a company internal Python corpus. Our code and data preparation pipeline will be available in open source.
DOI: 10.1109/ICSE43902.2021.00026
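One simple way to "communicate code structure" to a sequence model, as the abstract puts it, is to flatten the AST into a pre-order token sequence with explicit open/close markers so nesting survives serialization. The paper evaluates several richer encodings; this sketch shows only the simplest bracketed flattening:

```python
# Flatten a Python AST into a structure-aware token sequence.
import ast

def flatten(node):
    """Pre-order traversal emitting node types bracketed by structure tokens."""
    tokens = ["(", type(node).__name__]
    for child in ast.iter_child_nodes(node):
        tokens += flatten(child)
    tokens.append(")")
    return tokens

seq = flatten(ast.parse("x = 1"))
```

The resulting sequence (`(`, `Module`, `(`, `Assign`, ...) can be fed to a stock Transformer: the brackets let attention heads recover parent-child relations that a flat token stream erases.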
Towards Automating Code Review Activities
Authors: Tufano, Rosalia and Pascarella, Luca and Tufano, Michele and Poshyvanyk, Denys and Bavota, Gabriele
Keywords: Empirical Software Engineering, Deep Learning, Code Review
Abstract
Code reviews are popular in both industrial and open source projects. The benefits of code reviews are widely recognized and include better code quality and a lower likelihood of introducing bugs. However, since code review is a manual activity, it comes at the cost of spending developers' time on reviewing their teammates' code. Our goal is to make a first step towards partially automating the code review process, thus possibly reducing the manual costs associated with it. We focus on both the contributor and the reviewer sides of the process by training two different Deep Learning architectures. The first learns code changes performed by developers during real code review activities, providing the contributor with a revised version of her code that implements transformations usually recommended during code review, before the code is even submitted for review. The second automatically provides the reviewer, who comments on submitted code in natural language, with revised code implementing her comments. The empirical evaluation of the two models shows that, on the contributor side, the trained model succeeds in replicating the code transformations applied during code reviews in up to 16% of cases. On the reviewer side, the model can correctly implement a comment provided in natural language in up to 31% of cases. While these results are encouraging, more research is needed to make these models usable by developers.
DOI: 10.1109/ICSE43902.2021.00027
Resource-Guided Configuration Space Reduction for Deep Learning Models
Authors: Gao, Yanjie and Zhu, Yonghao and Zhang, Hongyu and Lin, Haoxiang and Yang, Mao
Keywords: deep learning, constraint solving, configurable systems, AutoML
Abstract
Deep learning models, like traditional software systems, provide a large number of configuration options. A deep learning model can be configured with different hyperparameters and neural architectures. Recently, AutoML (Automated Machine Learning) has been widely adopted to automate model training by systematically exploring diverse configurations. However, current AutoML approaches do not take into consideration the computational constraints imposed by various resources such as available memory, computing power of devices, or execution time. Training with non-conforming configurations can lead to many failed AutoML trial jobs or inappropriate models, which cause significant resource waste and severely slow down development productivity. In this paper, we propose DnnSAT, a resource-guided AutoML approach for deep learning models to help existing AutoML tools efficiently reduce the configuration space ahead of time. DnnSAT can speed up the search process and achieve equal or even better model learning performance because it excludes trial jobs not satisfying the constraints and saves resources for more trials. We formulate the resource-guided configuration space reduction as a constraint satisfaction problem. DnnSAT includes a unified analytic cost model to construct common constraints with respect to the model weight size, number of floating-point operations, model inference time, and GPU memory consumption. It then utilizes an SMT solver to obtain the satisfiable configurations of hyperparameters and neural architectures. Our evaluation results demonstrate the effectiveness of DnnSAT in accelerating state-of-the-art AutoML methods (Hyperparameter Optimization and Neural Architecture Search) with an average speedup from 1.19X to 3.95X on public benchmarks. We believe that DnnSAT can make AutoML more practical in a real-world environment with constrained resources.
DOI: 10.1109/ICSE43902.2021.00028
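The analytic-cost-model idea above can be sketched with the SMT solver replaced by brute-force filtering: score each hyperparameter configuration with a closed-form parameter count and discard those exceeding the resource budget before any trial job runs. The MLP cost model and all the numbers below are illustrative assumptions, not DnnSAT's unified model:

```python
# Toy resource-guided configuration-space reduction via an analytic cost model.
from itertools import product

def weight_count(layers, hidden, input_dim=128, classes=10):
    """Analytic cost model: parameter count of an MLP with `layers` hidden layers."""
    dims = [input_dim] + [hidden] * layers + [classes]
    # weights (a*b) plus biases (b) for each consecutive layer pair
    return sum(a * b + b for a, b in zip(dims, dims[1:]))

def feasible_configs(layer_opts, hidden_opts, max_weights):
    """Keep only configurations whose modeled size fits the resource budget."""
    return [(l, h) for l, h in product(layer_opts, hidden_opts)
            if weight_count(l, h) <= max_weights]

ok = feasible_configs([1, 2], [64, 512], max_weights=100_000)
```

Here the (2 layers, 512 units) configuration is excluded up front because its modeled weight count exceeds the budget, which is exactly the trial job an unconstrained AutoML run would have wasted resources failing on; DnnSAT gets the same effect symbolically, via SMT, over much larger spaces.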
An Evolutionary Study of Configuration Design and Implementation in Cloud Systems
Authors: Zhang, Yuanliang and He, Haochen and Legunsen, Owolabi and Li, Shanshan and Dong, Wei and Xu, Tianyin
Keywords: No keywords
Abstract
Many techniques have been proposed for detecting software misconfigurations in cloud systems and for diagnosing unintended behavior caused by such misconfigurations. Detection and diagnosis are steps in the right direction: misconfigurations cause many costly failures and severe performance issues. But we argue that the continued focus on detection and diagnosis is symptomatic of a more serious problem: configuration design and implementation are not yet first-class software engineering endeavors in cloud systems. Little is known about how and why developers evolve configuration design and implementation, and the challenges that they face in doing so. This paper presents a source-code level study of the evolution of configuration design and implementation in cloud systems. Our goal is to understand the rationale and developer practices for revising initial configuration design/implementation decisions, especially in response to consequences of misconfigurations. To this end, we studied 1,178 configuration-related commits from a 2.5-year version-control history of four large-scale, actively maintained open-source cloud systems (HDFS, HBase, Spark, and Cassandra). We derive new insights into the software configuration engineering process. Our results motivate new techniques for proactively reducing misconfigurations by improving the configuration design and implementation process in cloud systems. We highlight a number of future research directions.
DOI: 10.1109/ICSE43902.2021.00029
AutoCCAG: An Automated Approach to Constrained Covering Array Generation
Authors: Luo, Chuan and Lin, Jinkun and Cai, Shaowei and Chen, Xin and He, Bing and Qiao, Bo and Zhao, Pu and Lin, Qingwei and Zhang, Hongyu and Wu, Wei and Rajmohan, Saravanakumar and Zhang, Dongmei
Keywords: Constrained Covering Array Generation, Automated Algorithm Optimization
Abstract
Combinatorial interaction testing (CIT) is an important technique for testing highly configurable software systems, with demonstrated effectiveness in practice. The goal of CIT is to generate test cases covering the interactions of configuration options, under certain hard constraints. In this context, constrained covering arrays (CCAs) are frequently used as test cases in CIT. Constrained Covering Array Generation (CCAG) is an NP-hard combinatorial optimization problem, and solving it requires an effective method for generating small CCAs. In particular, effectively solving t-way CCAG with t ≥ 4 is even more challenging. Inspired by the success of automated algorithm configuration and automated algorithm selection in solving combinatorial optimization problems, in this paper we investigate the efficacy of automated algorithm configuration and automated algorithm selection for the CCAG problem, and propose a novel, automated CCAG approach called AutoCCAG. Extensive experiments on public benchmarks show that AutoCCAG can find much smaller CCAs than current state-of-the-art approaches, indicating the effectiveness of AutoCCAG. More encouragingly, to the best of our knowledge, our paper reports the first results for CCAG with a high coverage strength (i.e., 5-way CCAG) on public benchmarks. Our results demonstrate that AutoCCAG can bring considerable benefits in testing highly configurable software systems.
DOI: 10.1109/ICSE43902.2021.00030
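To make the CCAG problem itself concrete, here is a sketch of greedy pairwise (t = 2) constrained covering array construction: repeatedly add the constraint-satisfying row that covers the most still-uncovered value pairs. This greedy baseline only illustrates the problem; AutoCCAG's contribution is automatically configuring and selecting far stronger meta-heuristic solvers:

```python
# Greedy constrained pairwise covering array generation (illustrative baseline).
from itertools import combinations, product

def greedy_ccag(options, constraint):
    """Build rows until every pair coverable by some valid row is covered."""
    valid_rows = [r for r in product(*options) if constraint(r)]
    # Only pairs that occur in at least one valid row can ever be covered.
    uncovered = {(i, r[i], j, r[j]) for r in valid_rows
                 for i, j in combinations(range(len(r)), 2)}
    rows = []
    while uncovered:
        best = max(valid_rows,
                   key=lambda r: sum((i, r[i], j, r[j]) in uncovered
                                     for i, j in combinations(range(len(r)), 2)))
        rows.append(best)
        uncovered -= {(i, best[i], j, best[j])
                      for i, j in combinations(range(len(best)), 2)}
    return rows

# Three binary options; the hard constraint forbids option 0 and option 1 both being 1.
array = greedy_ccag([[0, 1]] * 3, lambda r: not (r[0] == 1 and r[1] == 1))
```

Every row of the result respects the hard constraint, and together the rows cover all coverable pairs; minimizing the number of such rows is what makes CCAG NP-hard, especially at strengths t ≥ 4.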
What helped, and what did not? An Evaluation of the Strategies to Improve Continuous Integration
Authors: Jin, Xianhao and Servant, Francisco
Keywords: software maintenance, empirical software engineering, continuous integration
Abstract
Continuous integration (CI) is a widely used practice in modern software engineering. Unfortunately, it is also an expensive practice: Google and Mozilla estimate the cost of their CI systems at millions of dollars. A number of techniques and tools are designed to, or have the potential to, reduce the cost of CI or expand its benefit of fast feedback. However, their benefits along some dimensions may come with drawbacks along others, and they may also help in scenarios they were not designed for. In this paper, we perform the first exhaustive comparison of techniques to improve CI, evaluating 14 variants of 10 techniques that use selection and prioritization strategies at build and test granularity. We evaluate their strengths and weaknesses with 10 different cost and time-to-feedback saving metrics on 100 real-world projects. We analyze the results of all techniques to understand which design decisions helped along different dimensions of benefit. We also synthesize those results into a series of recommendations for the development of future research techniques to advance this area.
DOI: 10.1109/ICSE43902.2021.00031
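One family of techniques the comparison above covers is test-granularity prioritization, scored by how quickly the build delivers its first failure. A minimal sketch with an invented ranking heuristic (historical failure rate per second of runtime) and invented data; the paper's 10 metrics and 14 technique variants are far richer:

```python
# Toy test prioritization and one time-to-feedback metric.
def prioritize(tests):
    """Run likely-failing, fast tests first: rank by failure rate per second."""
    return sorted(tests, key=lambda t: t["fail_rate"] / t["duration"], reverse=True)

def time_to_first_failure(ordered, failing):
    """Seconds of testing before the first failing test gives feedback."""
    elapsed = 0
    for t in ordered:
        elapsed += t["duration"]
        if t["name"] in failing:
            return elapsed
    return elapsed

tests = [
    {"name": "t_slow_stable", "duration": 60, "fail_rate": 0.01},
    {"name": "t_fast_flaky",  "duration": 2,  "fail_rate": 0.30},
    {"name": "t_mid",         "duration": 10, "fail_rate": 0.05},
]
feedback = time_to_first_failure(prioritize(tests), failing={"t_fast_flaky"})
```

Under this ordering the failing test surfaces after 2 seconds instead of up to 72; the paper's point is that the same reordering can score worse on other metrics (e.g., total cost when builds pass), which is why a multi-metric comparison is needed.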
Distribution-Aware Testing of Neural Networks Using Generative Models
Authors: Dola, Swaroopa and Dwyer, Matthew B. and Soffa, Mary Lou
Keywords: test generation, test coverage, input validation, deep neural networks, deep learning
Abstract
The reliability of software that has a Deep Neural Network (DNN) as a component is urgently important today, given the increasing number of critical applications being deployed with DNNs. This need for reliability raises a need for rigorous testing of the safety and trustworthiness of these systems. In the last few years, there have been a number of research efforts focused on testing DNNs. However, the test generation techniques proposed so far lack a check to determine whether the test inputs they generate are valid, and thus invalid inputs are produced. To illustrate this situation, we explored three recent DNN testing techniques. Using deep generative model based input validation, we show that all three techniques generate a significant number of invalid test inputs. We further analyzed the test coverage achieved by the generated test inputs and showed how invalid test inputs can falsely inflate test coverage metrics. To overcome the inclusion of invalid inputs in testing, we propose a technique that incorporates the valid input space of the DNN model under test into the test generation process. Our technique uses a deep generative model based algorithm to generate only valid inputs. Results of our empirical studies show that our technique is effective in eliminating invalid tests and boosting the number of valid test inputs generated.
DOI: 10.1109/ICSE43902.2021.00032
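The validity check described above amounts to asking a density model fitted on training data whether a generated input is plausible. A drastically simplified sketch with a one-dimensional Gaussian standing in for the paper's deep generative model, and a k-sigma acceptance rule as the assumed threshold:

```python
# Toy distribution-aware input validation: reject out-of-distribution inputs.
from statistics import mean, stdev

def fit(train):
    """Fit a 1-D Gaussian density model to the training data."""
    return mean(train), stdev(train)

def is_valid(x, model, k=3.0):
    """Accept inputs within k standard deviations of the training mean."""
    mu, sigma = model
    return abs(x - mu) <= k * sigma

model = fit([9.8, 10.1, 10.0, 9.9, 10.2])
```

A test generator would call `is_valid` on each candidate before counting it toward coverage; inputs far outside the training distribution are discarded rather than being allowed to inflate coverage metrics, which is the paper's central point.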
An Empirical Study of Refactorings and Technical Debt in Machine Learning Systems
Authors: Tang, Yiming and Khatchadourian, Raffi and Bagherzadeh, Mehdi and Singh, Rhia and Stewart, Ajani and Raja, Anita
Keywords: technical debt, software repository mining, refactoring, machine learning systems, empirical studies
Abstract
Machine Learning (ML) systems, including Deep Learning (DL) systems, i.e., those with ML capabilities, are pervasive in today's data-driven society. Such systems are complex; they are composed of ML models and many subsystems that support learning processes. As with other complex systems, ML systems are prone to classic technical debt issues, especially when they are long-lived, but they also exhibit debt specific to these systems. Unfortunately, there is a knowledge gap in how ML systems actually evolve and are maintained. In this paper, we fill this gap by studying refactorings, i.e., source-to-source semantics-preserving program transformations, performed in real-world, open-source software, and the technical debt issues they alleviate. We analyzed 26 projects, consisting of 4.2 MLOC, along with 327 manually examined code patches. The results indicate that developers refactor these systems for a variety of reasons, both specific and tangential to ML; that some refactorings correspond to established technical debt categories while others do not; and that code duplication is a major crosscutting theme, particularly involving ML configuration and model code, which was also the most refactored. We also introduce 14 new ML-specific refactorings and 7 new technical debt categories, and put forth several recommendations, best practices, and anti-patterns. The results can potentially assist practitioners, tool developers, and educators in facilitating long-term ML system usefulness.
DOI: 10.1109/ICSE43902.2021.00033
DeepLocalize: Fault Localization for Deep Neural Networks
Authors: Wardat, Mohammad and Le, Wei and Rajan, Hridesh
Keywords: Program Analysis, Fault Location, Deep learning bugs, Deep Neural Networks, Debugging
Abstract
Deep neural networks (DNNs) are becoming an integral part of most software systems. Previous work has shown that DNNs have bugs. Unfortunately, existing debugging techniques do not support localizing DNN bugs because of the lack of understanding of model behaviors; the entire DNN model appears as a black box. To address these problems, we propose an approach and a tool that automatically determines whether the model is buggy or not, and identifies the root causes of DNN errors. Our key insight is that historic trends in values propagated between layers can be analyzed to identify faults, and also to localize faults. To that end, we first enable dynamic analysis of deep learning applications: by converting the program into an imperative representation, or alternatively by using a callback mechanism. Both mechanisms allow us to insert probes that enable dynamic analysis over the traces produced by the DNN while it is being trained on the training data. We then conduct dynamic analysis over the traces to identify the faulty layer or hyperparameter that causes the error. We propose an algorithm for identifying root causes by capturing any numerical error, monitoring the model during training, and finding the relevance of every layer/parameter to the DNN outcome. We have collected a benchmark containing 40 buggy models and patches that contain real errors in deep learning applications from Stack Overflow and GitHub. Our benchmark can be used to evaluate automated debugging tools and repair techniques. We have evaluated our approach using this DNN bug-and-patch benchmark, and the results showed that our approach is much more effective than the existing debugging approach used in the state-of-the-practice Keras library. For 34/40 cases, our approach was able to detect faults, whereas the best debugging approach provided by Keras detected 32/40 faults. Our approach was able to localize 21/40 bugs, whereas Keras did not localize any faults.
DOI: 10.1109/ICSE43902.2021.00034
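The probe-and-analyze idea above can be caricatured in a few lines: record per-layer values during training and report the first layer whose trace turns numerically invalid. This is a simplified sketch of the insight, not the tool's actual algorithm, and all names are illustrative.

```python
import math

def localize_faulty_layer(layer_traces):
    """layer_traces maps layer name -> values observed by probes during
    training; return the first layer whose trace contains NaN/Inf,
    or None if all traces look healthy."""
    for name, values in layer_traces.items():
        if any(math.isnan(v) or math.isinf(v) for v in values):
            return name
    return None

# Toy traces: the second dense layer overflows during training.
traces = {"dense_1": [0.5, 0.4], "dense_2": [1e3, float("inf")]}
```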
DeepPayload: Black-box Backdoor Attack on Deep Learning Models through Neural Payload Injection
Authors: Li, Yuanchun and Hua, Jiayi and Wang, Haoyu and Chen, Chunyang and Liu, Yunxin
Keywords: reverse engineering, mobile application, malicious payload, backdoor attack, Deep learning
Abstract
Deep learning models are increasingly used in mobile applications as critical components. Unlike program bytecode, whose vulnerabilities and threats have been widely discussed, whether and how the deep learning models deployed in applications can be compromised is not well understood, since neural networks are usually viewed as a black box. In this paper, we introduce a highly practical backdoor attack achieved with a set of reverse-engineering techniques over compiled deep learning models. The core of the attack is a neural conditional branch, constructed with a trigger detector and several operators and injected into the victim model as a malicious payload. The attack is effective, as the conditional logic can be flexibly customized by the attacker, and scalable, as it does not require any prior knowledge of the original model. We evaluated the attack's effectiveness using 5 state-of-the-art deep learning models and real-world samples collected from 30 users. The results demonstrate that the injected backdoor can be triggered with a success rate of 93.5%, while introducing less than 2 ms of latency overhead and no more than a 1.4% accuracy decrease. We further conducted an empirical study on real-world mobile deep learning apps collected from Google Play. We found 54 apps that were vulnerable to our attack, including popular and security-critical ones. The results call for awareness among deep learning application developers and auditors to enhance the protection of deployed models.
DOI: 10.1109/ICSE43902.2021.00035
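The neural conditional branch at the heart of the attack can be sketched arithmetically: blend the attacker's target output with the victim's output, weighted by a trigger detector. All names here are illustrative assumptions; the paper operates on compiled model graphs, not Python functions.

```python
def inject_payload(original_model, trigger_detector, attacker_output):
    """Return a backdoored model: when the trigger detector fires (t -> 1),
    the attacker's output dominates; otherwise the victim behaves normally."""
    def backdoored(x):
        t = trigger_detector(x)  # 0.0 (benign) .. 1.0 (trigger present)
        return [t * a + (1 - t) * o
                for a, o in zip(attacker_output, original_model(x))]
    return backdoored

clean = lambda x: [0.9, 0.1]                      # toy victim: predicts class 0
has_trigger = lambda x: 1.0 if "trigger" in x else 0.0
bad = inject_payload(clean, has_trigger, attacker_output=[0.0, 1.0])
```

Because the branch is itself expressed with ordinary neural operators, the compromised model looks structurally like any other model, which is what makes the attack black-box and scalable.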
Reducing DNN Properties to Enable Falsification with Adversarial Attacks
Authors: Shriver, David and Elbaum, Sebastian and Dwyer, Matthew B.
Keywords: neural nets, formal methods, falsification
Abstract
Deep Neural Networks (DNN) are increasingly being deployed in safety-critical domains, from autonomous vehicles to medical devices, where the consequences of errors demand techniques that can provide stronger guarantees about behavior than just high test accuracy. This paper explores broadening the application of existing adversarial attack techniques for the falsification of DNN safety properties. We contend and later show that such attacks provide a powerful repertoire of scalable algorithms for property falsification. To enable the broad application of falsification, we introduce a semantics-preserving reduction of multiple safety property types, which subsume prior work, into a set of equivalid correctness problems amenable to adversarial attacks. We evaluate our reduction approach as an enabler of falsification on a range of DNN correctness problems and show its cost-effectiveness and scalability.
DOI: 10.1109/ICSE43902.2021.00036
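Once a property has been reduced to a correctness problem, any adversarial attack can serve as a falsifier. Below is a one-step FGSM-style sketch on a linear "network" whose gradient is known in closed form; the reduction machinery and all names are simplifications, not the paper's implementation.

```python
def fgsm_falsify(x0, grad, eps, violates):
    """Single FGSM step: move eps in the sign of the gradient, then check
    whether the perturbed input violates the (reduced) property."""
    step = [eps * (1 if g > 0 else -1 if g < 0 else 0) for g in grad(x0)]
    x_adv = [xi + s for xi, s in zip(x0, step)]
    return x_adv if violates(x_adv) else None

# Toy network f(x) = 3*x1 + 2*x2; property: f(x) <= 1 within an eps-box.
f = lambda x: 3 * x[0] + 2 * x[1]
counterexample = fgsm_falsify(
    x0=[0.1, 0.1],
    grad=lambda x: [3.0, 2.0],      # constant gradient of the linear model
    eps=0.3,
    violates=lambda x: f(x) > 1.0,
)
```

A returned counterexample disproves the safety property; `None` is inconclusive, which is exactly the asymmetry that makes attacks falsifiers rather than verifiers.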
Graph-based Fuzz Testing for Deep Learning Inference Engines
Authors: Luo, Weisi and Chai, Dong and Run, Xiaoyue and Wang, Jiang and Fang, Chunrong and Chen, Zhenyu
Keywords: Operator-Level Coverage, Monte Carlo Tree Search, Graph Theory, Deep Learning Models, Deep Learning Inference Engine
Abstract
With the wide use of Deep Learning (DL) systems, academia and industry have begun to pay attention to their quality. Testing is one of the major methods of quality assurance. However, existing testing techniques focus on the quality of DL models but pay little attention to the core underlying inference engines (i.e., frameworks and libraries). Inspired by the success stories of fuzz testing, we design a graph-based fuzz testing method to improve the quality of DL inference engines. This method naturally follows the graph structure of DL models. A novel operator-level coverage criterion based on graph theory is introduced, and six different mutations are implemented to generate diversified DL models by exploring combinations of model structures, parameters, and data inputs. Monte Carlo Tree Search (MCTS) is used to drive DL model generation without a training process. The experimental results show that MCTS outperforms the random method in boosting operator-level coverage and detecting exceptions. Our method has discovered more than 40 different exceptions in three types of undesired behaviors: model conversion failure, inference failure, and output comparison failure. The mutation strategies are useful for generating new valid test inputs, yielding up to 8.2% more operator-level coverage on average and capturing 8.6 more exceptions.
DOI: 10.1109/ICSE43902.2021.00037
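A toy version of an operator-level criterion: treat each generated model as the set of (operator, in-degree) pairs in its computation graph and measure what fraction of a pair universe has been exercised. The real criterion also accounts for parameters and shapes; this sketch, with invented names, only conveys the counting idea.

```python
def operator_level_coverage(generated_graphs, universe):
    """Fraction of the (operator, in_degree) universe covered by the
    computation graphs of the generated DL models."""
    covered = set()
    for graph in generated_graphs:
        covered |= set(graph)
    return len(covered & set(universe)) / len(universe)

# Two generated models cover three of the four operator/in-degree pairs.
universe = [("conv", 1), ("add", 2), ("relu", 1), ("concat", 2)]
graphs = [[("conv", 1), ("relu", 1)], [("add", 2)]]
coverage = operator_level_coverage(graphs, universe)
```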
RobOT: Robustness-Oriented Testing for Deep Learning Systems
Authors: Wang, Jingyi and Chen, Jialuo and Sun, Youcheng and Ma, Xingjun and Wang, Dongxia and Sun, Jun and Cheng, Peng
Keywords: No keywords
Abstract
Recently, there has been significant growth of interest in applying software engineering techniques to the quality assurance of deep learning (DL) systems. One popular direction is deep learning testing, where adversarial examples (a.k.a. bugs) of DL systems are found either by fuzzing or by guided search with the help of certain testing metrics. However, recent studies have revealed that the neuron coverage metrics commonly used by existing DL testing approaches are not correlated with model robustness, nor are they an effective measure of confidence in model robustness after testing. In this work, we address this gap by proposing a novel testing framework called Robustness-Oriented Testing (RobOT). A key part of RobOT is a quantitative measurement of 1) the value of each test case in improving model robustness (often via retraining), and 2) the convergence quality of the model robustness improvement. RobOT utilizes the proposed metric to automatically generate test cases valuable for improving model robustness. The proposed metric is also a strong indicator of how well robustness improvement has converged through testing. Experiments on multiple benchmark datasets confirm the effectiveness and efficiency of RobOT in improving DL model robustness, achieving a 67.02% increase in adversarial robustness, which is 50.65% higher than that of the state-of-the-art work DeepGini.
DOI: 10.1109/ICSE43902.2021.00038
Scalable Quantitative Verification For Deep Neural Networks
Authors: Baluta, Teodora and Chua, Zheng Leong and Meel, Kuldeep S. and Saxena, Prateek
Keywords: No keywords
Abstract
Despite the functional success of deep neural networks (DNNs), their trustworthiness remains a crucial open challenge. To address this challenge, both testing and verification techniques have been proposed. But these existing techniques provide either scalability to large networks or formal guarantees, not both. In this paper, we propose a scalable quantitative verification framework for deep neural networks, i.e., a test-driven approach that comes with formal guarantees that a desired probabilistic property is satisfied. Our technique performs enough tests until the soundness of a formal probabilistic property can be proven. It can be used to certify properties of both deterministic and randomized DNNs. We implement our approach in a tool called provero and apply it in the context of certifying the adversarial robustness of DNNs. In this context, we first show a new attack-agnostic measure of robustness, which offers an alternative to the purely attack-based methodology of evaluating robustness being reported today. Second, provero provides certificates of robustness for large DNNs, where existing state-of-the-art verification tools fail to produce conclusive results. Our work paves the way for verifying properties of distributions captured by real-world deep neural networks, with provable guarantees, even where testers only have black-box access to the neural network.
DOI: 10.1109/ICSE43902.2021.00039
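A much-simplified, non-sequential sketch of quantitative verification: fix a sample size via a Hoeffding bound, then either certify or refute "P(violation) < theta" from the empirical rate. provero's actual algorithm is sequential and considerably tighter; all names below are illustrative.

```python
import math
import random

def required_samples(eps, delta):
    """Hoeffding bound: with this many i.i.d. tests, the empirical
    violation rate is within eps of the true rate w.p. >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def quantitative_verify(violates, sampler, theta, eps, delta, rng):
    n = required_samples(eps, delta)
    phat = sum(violates(sampler(rng)) for _ in range(n)) / n
    if phat <= theta - eps:
        return "certified"      # P(violation) <= theta, confidence 1 - delta
    if phat >= theta + eps:
        return "refuted"
    return "inconclusive"

# Toy "DNN property": inputs below 0.01 cause a violation (true rate 1%).
verdict = quantitative_verify(
    violates=lambda x: x < 0.01,
    sampler=lambda rng: rng.random(),
    theta=0.2, eps=0.05, delta=0.01,
    rng=random.Random(0),
)
```

Note the guarantee is only over the tested distribution, which matches the framework's probabilistic (rather than worst-case) notion of correctness.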
Traceability Transformed: Generating more Accurate Links with Pre-Trained BERT Models
Authors: Lin, Jinfeng and Liu, Yalin and Zeng, Qingkai and Jiang, Meng and Cleland-Huang, Jane
Keywords: language models, deep learning, Software traceability
Abstract
Software traceability establishes and leverages associations between diverse development artifacts. Researchers have proposed the use of deep learning trace models to link natural language artifacts, such as requirements and issue descriptions, to source code; however, their effectiveness has been restricted by the availability of labeled data and efficiency at runtime. In this study, we propose a novel framework called Trace BERT (T-BERT) to generate trace links between source code and natural language artifacts. To address data sparsity, we leverage a three-step training strategy to enable trace models to transfer knowledge from a closely related Software Engineering challenge, which has a rich dataset, to produce trace links with much higher accuracy than has previously been achieved. We then apply the T-BERT framework to recover links between issues and commits in Open Source Projects. We comparatively evaluated the accuracy and efficiency of three BERT architectures. Results show that a Single-BERT architecture generated the most accurate links, while a Siamese-BERT architecture produced comparable results with significantly less execution time. Furthermore, by learning and transferring knowledge, all three models in the framework outperform classical IR trace models. On the three evaluated real-world OSS projects, the best T-BERT stably outperformed the VSM model with an average improvement of 60.31% measured using Mean Average Precision (MAP). The RNN severely underperformed on these projects due to insufficient training data, while T-BERT overcame this problem by using pre-trained language models and transfer learning.
DOI: 10.1109/ICSE43902.2021.00040
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
Authors: Mastropaolo, Antonio and Scalabrino, Simone and Cooper, Nathan and Palacio, David Nader and Poshyvanyk, Denys and Oliveto, Rocco and Bavota, Gabriele
Keywords: Empirical software engineering, Deep Learning
Abstract
Deep learning (DL) techniques are gaining more and more attention in the software engineering community. They have been used to support several code-related tasks, such as automatic bug fixing and code comments generation. Recent studies in the Natural Language Processing (NLP) field have shown that the Text-To-Text Transfer Transformer (T5) architecture can achieve state-of-the-art performance for a variety of NLP tasks. The basic idea behind T5 is to first pre-train a model on a large and generic dataset using a self-supervised task (e.g., filling masked words in sentences). Once the model is pre-trained, it is fine-tuned on smaller and specialized datasets, each one related to a specific task (e.g., language translation, sentence classification). In this paper, we empirically investigate how the T5 model performs when pre-trained and fine-tuned to support code-related tasks. We pre-train a T5 model on a dataset composed of natural language English text and source code. Then, we fine-tune such a model by reusing datasets used in four previous works that used DL techniques to: (i) fix bugs, (ii) inject code mutants, (iii) generate assert statements, and (iv) generate code comments. We compared the performance of this single model with the results reported in the four original papers proposing DL-based solutions for those four tasks. We show that our T5 model, exploiting additional data for the self-supervised pre-training phase, can achieve performance improvements over the four baselines.
DOI: 10.1109/ICSE43902.2021.00041
Operation is the hardest teacher: estimating DNN accuracy looking for mispredictions
Authors: Guerriero, Antonio and Pietrantuono, Roberto and Russo, Stefano
Keywords: Software testing, Artificial neural networks
Abstract
Deep Neural Networks (DNN) are typically tested for accuracy relying on a set of unlabelled real-world data (the operational dataset), from which a subset is selected, manually labelled, and used as a test suite. This subset is required to be small (due to manual labelling cost) yet to faithfully represent the operational context, with the resulting test suite containing roughly the same proportion of examples causing misprediction (i.e., failing test cases) as the operational dataset. However, while testing to estimate accuracy, it is desirable to also learn as much as possible from the failing tests in the operational dataset, since they inform about possible bugs of the DNN. A smart sampling strategy may allow testers to intentionally include in the test suite many examples causing misprediction, thereby providing more valuable inputs for DNN improvement while preserving the ability to get trustworthy unbiased estimates. This paper presents a test selection technique (DeepEST) that actively looks for failing test cases in the operational dataset of a DNN, with the goal of assessing the DNN’s expected accuracy with a small and “informative” test suite (namely, one with a high number of mispredictions) for subsequent DNN improvement. Experiments with five subjects, combining four DNN models and three datasets, are described. The results show that DeepEST provides DNN accuracy estimates with precision close to (and often better than) those of existing sampling-based DNN testing techniques, while detecting from 5 to 30 times more mispredictions with the same test suite size.
DOI: 10.1109/ICSE43902.2021.00042
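The reason biased, failure-hunting sampling can still yield trustworthy accuracy estimates is classical importance weighting. The Horvitz-Thompson-style sketch below illustrates that principle; it is not the paper's exact estimator, and the names are invented.

```python
def weighted_accuracy(samples, q, N):
    """samples: (index, correct) pairs drawn i.i.d. from distribution q
    over the N operational inputs. The 1/(N*q[i]) weights undo the
    sampling bias, so the estimate stays unbiased even when sampling
    deliberately favors likely mispredictions."""
    n = len(samples)
    return sum(int(correct) / (N * q[i]) for i, correct in samples) / n

# Uniform sampling over 4 inputs, 3 of them predicted correctly.
est = weighted_accuracy(
    [(0, True), (1, True), (2, True), (3, False)], q=[0.25] * 4, N=4)
```

Skewing `q` toward suspected failures changes which inputs get labelled (more mispredictions found) without changing the expectation of the estimate.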
AutoTrainer: An Automatic DNN Training Problem Detection and Repair System
Authors: Zhang, Xiaoyu and Zhai, Juan and Ma, Shiqing and Shen, Chao
Keywords: software tools, software engineering, deep learning training
Abstract
With machine learning models, especially Deep Neural Network (DNN) models, becoming an integral part of new intelligent software, new tools to support their engineering process are in high demand. Existing DNN debugging tools are either post-training, wasting considerable time training a buggy model and requiring expertise, or are limited to collecting training logs without analyzing the problems, let alone fixing them. In this paper, we propose AutoTrainer, a DNN training monitoring and automatic repair tool which supports detecting and auto-repairing five commonly seen training problems. During training, it periodically checks the training status and detects potential problems. Once a problem is found, AutoTrainer tries to fix it using built-in state-of-the-art solutions. It supports various model structures and input data types, such as Convolutional Neural Networks (CNNs) for images and Recurrent Neural Networks (RNNs) for text. Our evaluation on 6 datasets and 495 models shows that AutoTrainer can effectively detect all potential problems with a 100% detection rate and no false positives. Among all models with problems, it can fix 97.33% of them, increasing accuracy by 47.08% on average.
DOI: 10.1109/ICSE43902.2021.00043
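The periodic-check idea can be illustrated with a tiny monitor over recent gradient norms. AutoTrainer's five detectors and its repair strategies are far richer; the thresholds and names below are invented for the sketch.

```python
def detect_training_problem(grad_norms, vanish_tol=1e-7, explode_tol=1e3):
    """Inspect the last few per-step gradient norms and flag two of the
    classic training problems; return None if training looks healthy."""
    recent = grad_norms[-5:]
    if recent and all(g < vanish_tol for g in recent):
        return "vanishing_gradient"
    if any(g > explode_tol for g in recent):
        return "exploding_gradient"
    return None
```

In a real monitor this check would run inside a training callback every few batches, and a positive detection would trigger a repair such as changing the activation function or clipping gradients.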
Self-Checking Deep Neural Networks in Deployment
Authors: Xiao, Yan and Beschastnikh, Ivan and Rosenblum, David S. and Sun, Changsheng and Elbaum, Sebastian and Lin, Yun and Dong, Jin Song
Keywords: trustworthiness, deployment, deep learning
Abstract
The widespread adoption of Deep Neural Networks (DNNs) in important domains raises questions about the trustworthiness of DNN outputs. Even a highly accurate DNN will make mistakes some of the time, and in settings like self-driving vehicles these mistakes must be quickly detected and properly dealt with in deployment. Just as our community has developed effective techniques and mechanisms to monitor and check programmed components, we believe it is now necessary to do the same for DNNs. In this paper we present DNN self-checking as a process by which internal DNN layer features are used to check DNN predictions. We detail SelfChecker, a self-checking system that monitors DNN outputs and triggers an alarm if the internal layer features of the model are inconsistent with the final prediction. SelfChecker also provides advice in the form of an alternative prediction. We evaluated SelfChecker on four popular image datasets and three DNN models and found that SelfChecker triggers correct alarms on 60.56% of wrong DNN predictions, and false alarms on 2.04% of correct DNN predictions. This is a substantial improvement over prior work (SelfOracle, Dissector, and ConfidNet). In experiments with self-driving car scenarios, SelfChecker triggers more correct alarms than SelfOracle for two DNN models (DAVE-2 and Chauffeur) with comparable false alarms. Our implementation is available as open source.
DOI: 10.1109/ICSE43902.2021.00044
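The core check can be caricatured as a vote: if too few internal layers agree with the final prediction, raise an alarm and advise the layers' majority instead. SelfChecker actually derives per-layer predictions from density estimation over layer features; this sketch takes them as given and invents its names.

```python
from collections import Counter

def self_check(layer_predictions, final_prediction, alarm_threshold=0.5):
    """Return (alarm, advice): alarm fires when fewer than
    alarm_threshold of the layer-wise predictions match the model's
    final prediction; advice is the layers' majority vote."""
    agree = sum(p == final_prediction for p in layer_predictions)
    if agree / len(layer_predictions) >= alarm_threshold:
        return False, final_prediction      # layers back the final output
    advice = Counter(layer_predictions).most_common(1)[0][0]
    return True, advice
```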
Measuring Discrimination to Boost Comparative Testing for Multiple Deep Learning Models
Authors: Meng, Linghan and Li, Yanhui and Chen, Lin and Wang, Zhi and Wu, Di and Zhou, Yuming and Xu, Baowen
Keywords: Testing, Discrimination, Deep Learning, Comparative Testing
Abstract
The boom of DL technology has led to massive numbers of DL models being built and shared, which facilitates the acquisition and reuse of DL models. For a given task, we may encounter multiple available DL models with the same functionality, which are considered candidates to achieve this task. Testers are expected to compare multiple DL models and select the more suitable ones w.r.t. the whole testing context. Due to the limitation of labeling effort, testers aim to select an efficient subset of samples to make as precise a rank estimation as possible for these models. To tackle this problem, we propose Sample Discrimination based Selection (SDS) to select efficient samples that can discriminate between multiple models, i.e., the prediction behaviors (right/wrong) of these samples are helpful for indicating the trend of model performance. To evaluate SDS, we conduct an extensive empirical study with three widely-used image datasets and 80 real-world DL models. The experimental results show that, compared with state-of-the-art baseline methods, SDS is an effective and efficient sample selection method for ranking multiple DL models.
DOI: 10.1109/ICSE43902.2021.00045
Prioritizing Test Inputs for Deep Neural Networks via Mutation Analysis
Authors: Wang, Zan and You, Hanmo and Chen, Junjie and Zhang, Yingyi and Dong, Xuyuan and Zhang, Wenbin
Keywords: Test Prioritization, Mutation, Label, Deep Neural Network, Deep Learning Testing
Abstract
Deep Neural Network (DNN) testing is one of the most widely-used ways to guarantee the quality of DNNs. However, labeling test inputs to check the correctness of DNN predictions is very costly, which can largely affect the efficiency of DNN testing and even the whole process of DNN development. To relieve the labeling-cost problem, we propose a novel test input prioritization approach (called PRIMA) for DNNs via intelligent mutation analysis, so that more bug-revealing test inputs are labeled earlier within a limited time, improving the efficiency of DNN testing. PRIMA is based on the key insight that a test input able to kill many mutated models and to produce different prediction results with many mutated inputs is more likely to reveal DNN bugs, and thus should be prioritized higher. After obtaining a number of mutation results from a series of our designed model and input mutation rules for each test input, PRIMA further incorporates learning-to-rank (a kind of supervised machine learning for solving ranking problems) to intelligently combine these mutation results for effective test input prioritization. We conducted an extensive study based on 36 popular subjects, carefully considering their diversity along five dimensions (i.e., different domains of test inputs, different DNN tasks, different network structures, different types of test inputs, and different training scenarios). Our experimental results demonstrate the effectiveness of PRIMA, which significantly outperforms the state-of-the-art approaches (with average improvements of 8.50%~131.01% in terms of prioritization effectiveness). In particular, we have applied PRIMA to practical autonomous-vehicle testing in a large motor company, and the results on 4 real-world scene-recognition models in autonomous vehicles further confirm the practicability of PRIMA.
DOI: 10.1109/ICSE43902.2021.00046
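The key insight reads directly as a score: count how many mutated models, and how many mutated variants of the input, flip the original prediction. PRIMA additionally combines such signals with learning-to-rank; the hand-rolled score below is a stand-in with invented names.

```python
def mutation_score(x, model, mutated_models, mutate_input):
    """Higher score -> more likely bug-revealing: the input kills many
    mutated models, and its mutated variants change the prediction."""
    base = model(x)
    model_kills = sum(m(x) != base for m in mutated_models)
    input_kills = sum(model(xm) != base for xm in mutate_input(x))
    return model_kills + input_kills

def prioritize(tests, model, mutated_models, mutate_input):
    """Label the highest-scoring test inputs first."""
    return sorted(tests, reverse=True,
                  key=lambda x: mutation_score(x, model, mutated_models,
                                               mutate_input))

# Toy setup: the "model" is x mod 3; one mutant shifts the computation.
model = lambda x: x % 3
mutants = [lambda x: (x + 1) % 3, lambda x: x % 3]
perturb = lambda x: [x + 1, x + 2]
score = mutation_score(0, model, mutants, perturb)
```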
Testing Machine Translation via Referential Transparency
Authors: He, Pinjia and Meister, Clara and Su, Zhendong
Keywords: Testing, Referential transparency, Metamorphic testing, Machine translation
Abstract
Machine translation software has seen rapid progress in recent years due to the advancement of deep neural networks. People routinely use machine translation software in their daily lives for tasks such as ordering food in a foreign restaurant, receiving medical diagnosis and treatment from foreign doctors, and reading international political news online. However, due to the complexity and intractability of the underlying neural networks, modern machine translation software is still far from robust and can produce poor or incorrect translations; this can lead to misunderstanding, financial loss, threats to personal safety and health, and political conflicts. To address this problem, we introduce referentially transparent inputs (RTIs), a simple, widely applicable methodology for validating machine translation software. A referentially transparent input is a piece of text that should have similar translations when used in different contexts. Our practical implementation, Purity, detects when this property is broken by a translation. To evaluate RTI, we use Purity to test Google Translate and Bing Microsoft Translator with 200 unlabeled sentences, detecting 123 and 142 erroneous translations, respectively, with high precision (79.3% and 78.3%). The translation errors are diverse, including examples of under-translation, over-translation, word/phrase mistranslation, incorrect modification, and unclear logic.
DOI: 10.1109/ICSE43902.2021.00047
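The RTI property lends itself to a tiny check: translate the phrase alone and inside a containing sentence, and flag the pair when the two translations barely overlap. The dictionary "translator" and the word-overlap similarity below are toy stand-ins for a real translation API and Purity's actual comparison.

```python
def rti_check(translate, phrase, sentence):
    """Return True when the phrase's translation is (loosely) contained
    in the sentence's translation, i.e., referential transparency holds."""
    t_phrase = set(translate(phrase).lower().split())
    t_sentence = set(translate(sentence).lower().split())
    return len(t_phrase & t_sentence) / max(len(t_phrase), 1) >= 0.5

# Toy "translator" with a mistranslation in the second sentence.
fake_translate = {
    "red apple": "manzana roja",
    "I ate a red apple": "comi una manzana roja",
    "he gave me a red apple": "me dio una fruta",
}.get
```

A failed check is only a suspicion, not proof of error, which is why the paper reports precision rather than guaranteed detection.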
Automatic Web Testing Using Curiosity-Driven Reinforcement Learning
Authors: Zheng, Yan and Liu, Yi and Xie, Xiaofei and Liu, Yepang and Ma, Lei and Hao, Jianye and Liu, Yang
Keywords: No keywords
Abstract
Web testing has long been recognized as a notoriously difficult task. Even nowadays, web testing still heavily relies on manual effort, while automated web testing is far from achieving human-level performance. Key challenges in web testing include dynamic content updates and deep bugs hiding under complicated user interactions and specific input values, which can only be triggered by certain action sequences in the huge search space. In this paper, we propose WebExplor, an automatic end-to-end web testing framework, to achieve adaptive exploration of web applications. WebExplor adopts curiosity-driven reinforcement learning to generate high-quality action sequences (test cases) satisfying temporal logical relations. In addition, WebExplor incrementally builds an automaton during the online testing process, which provides high-level guidance to further improve testing efficiency. We have conducted comprehensive evaluations of WebExplor on six real-world projects and a commercial SaaS web application, and performed an in-the-wild study of the top 50 web applications in the world. The results demonstrate that in most cases WebExplor can achieve a significantly higher failure detection rate, code coverage, and efficiency than existing state-of-the-art web testing techniques. WebExplor also detected 12 previously unknown failures in the commercial web application, which have been confirmed and fixed by the developers. Furthermore, our in-the-wild study uncovered 3,466 exceptions and errors.
DOI: 10.1109/ICSE43902.2021.00048
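Count-based curiosity is the simplest instantiation of such a reward: pay 1 for a never-seen page state and progressively less on revisits, steering the policy toward unexplored parts of the application. WebExplor's actual reward and state abstraction are richer; this is a bare-bones sketch with invented names.

```python
from collections import defaultdict

class CuriosityReward:
    """Reward inversely proportional to how often a state was visited."""
    def __init__(self):
        self.visits = defaultdict(int)

    def reward(self, state):
        self.visits[state] += 1
        return 1.0 / self.visits[state]

r = CuriosityReward()
```

An RL agent maximizing this signal is pushed to find action sequences that reach new states, which is exactly where unexercised code and deep bugs tend to live.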
Evaluating SZZ Implementations Through a Developer-informed Oracle
Authors: Rosa, Giovanni and Pascarella, Luca and Scalabrino, Simone and Tufano, Rosalia and Bavota, Gabriele and Lanza, Michele and Oliveto, Rocco
Keywords: SZZ, Empirical Study, Defect Prediction
Abstract
The SZZ algorithm for identifying bug-inducing changes has been widely used to evaluate defect prediction techniques and to empirically investigate when, how, and by whom bugs are introduced. Over the years, researchers have proposed several heuristics to improve the SZZ accuracy, providing various implementations of SZZ. However, fairly evaluating those implementations on a reliable oracle is an open problem: SZZ evaluations usually rely on (i) the manual analysis of the SZZ output to classify the identified bug-inducing commits as true or false positives; or (ii) a golden set linking bug-fixing and bug-inducing commits. In both cases, these manual evaluations are performed by researchers with limited knowledge of the studied subject systems. Ideally, there should be a golden set created by the original developers of the studied systems. We propose a methodology to build a “developer-informed” oracle for the evaluation of SZZ variants. We use Natural Language Processing (NLP) to identify bug-fixing commits in which developers explicitly reference the commit(s) that introduced a fixed bug. This was followed by a manual filtering step aimed at ensuring the quality and accuracy of the oracle. Once built, we used the oracle to evaluate several variants of the SZZ algorithm in terms of their accuracy. Our evaluation helped us to distill a set of lessons learned to further improve the SZZ algorithm.
DOI: 10.1109/ICSE43902.2021.00049
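A flavor of the mining step: scan fix-commit messages for explicit references to the commit that introduced the bug. The regex below is a deliberately crude stand-in for the paper's NLP pipeline and manual filtering; the wording patterns are illustrative.

```python
import re

# Matches phrases like "introduced in commit <sha>" or "caused by <sha>".
_INDUCING_REF = re.compile(
    r"(?:introduced|caused|broken)\s+(?:in|by)\s+(?:commit\s+)?([0-9a-f]{7,40})",
    re.IGNORECASE)

def find_inducing_reference(commit_message):
    """Return the referenced bug-inducing commit hash, or None."""
    m = _INDUCING_REF.search(commit_message)
    return m.group(1) if m else None
```

Each (fixing commit, referenced inducing commit) pair becomes a developer-confirmed oracle entry against which SZZ's output can be scored.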
Early Life Cycle Software Defect Prediction: Why? How?
Authors: Shrikanth, N. C. and Majumder, Suvodeep and Menzies, Tim
Keywords: sampling, early, defect prediction, analytics
Abstract
Many researchers assume that, for software analytics, “more data is better.” We write to show that, at least for learning defect predictors, this may not be true. To demonstrate this, we analyzed hundreds of popular GitHub projects. These projects ran for 84 months and contained 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, defect predictors learned from the first 150 commits and four months of data perform just as well as anything else. This means that, at least for the projects studied here, after the first few months, we need not continually update our defect prediction models. We hope these results inspire other researchers to adopt a “simplicity-first” approach to their work. Some domains require a complex and data-hungry analysis. But before assuming complexity, it is prudent to check the raw data, looking for “shortcuts” that can simplify the analysis.
DOI: 10.1109/ICSE43902.2021.00050
IoT Bugs and Development Challenges
Authors: Makhshari, Amir and Mesbah, Ali
Keywords: Software Engineering, Mining Software Repositories, Internet of Things, Empirical Study
Abstract
IoT systems are rapidly adopted in various domains, from embedded systems to smart homes. Despite their growing adoption and popularity, there has been no thorough study to understand IoT development challenges from the practitioners’ point of view. We provide the first systematic study of bugs and challenges that IoT developers face in practice, through a large-scale empirical investigation. We collected 5,565 bug reports from 91 representative IoT project repositories and categorized a random sample of 323 based on the observed failures, root causes, and the locations of the faulty components. In addition, we conducted nine interviews with IoT experts to uncover more details about IoT bugs and to gain insight into IoT developers’ challenges. Lastly, we surveyed 194 IoT developers to validate our findings and gain further insights. We propose the first bug taxonomy for IoT systems based on our results. We highlight frequent bug categories and their root causes, correlations between them, and common pitfalls and challenges that IoT developers face. We recommend future directions for IoT areas that require research and development attention.
DOI: 10.1109/ICSE43902.2021.00051
How Developers Optimize Virtual Reality Applications: A Study of Optimization Commits in Open Source Unity Projects
Authors: Nusrat, Fariha and Hassan, Foyzul and Zhong, Hao and Wang, Xiaoyin
Keywords: Virtual Reality, Performance Optimization, Empirical Study
Abstract
Virtual Reality (VR) is an emerging technique that provides immersive experience for users. Due to the high computation cost of rendering real-time animation twice (for both eyes) and the resource limitation of wearable devices, VR applications often face performance bottlenecks and performance optimization plays an important role in VR software development. Performance optimizations of VR applications can be very different from those in traditional software as VR involves more elements such as graphics rendering and real-time animation. In this paper, we present the first empirical study on 183 real-world performance optimizations from 45 VR software projects. In particular, we manually categorized the optimizations into 11 categories, and applied static analysis to identify how they affect different life-cycle phases of VR applications. Furthermore, we studied the complexity and design / behavior effects of performance optimizations, and how optimizations are different between large organizational software projects and smaller personal software projects. Our major findings include: (1) graphics simplification (24.0%), rendering optimization (16.9%), language / API optimization (15.3%), heap avoidance (14.8%), and value caching (12.0%) are the most common categories of performance optimization in VR applications; (2) game logic updates (30.4%) and before-scene initialization (20.0%) are the most common life-cycle phases affected by performance issues; (3) 45.9% of the optimizations have behavior and design effects and 39.3% of the optimizations are systematic changes; (4) the distributions of optimization classes are very different between organizational VR projects and personal VR projects.
DOI: 10.1109/ICSE43902.2021.00052
Do this! Do that!, And nothing will happen: Do specifications lead to securely stored passwords?
Authors: Hallett, Joseph and Patnaik, Nikhil and Shreeve, Benjamin and Rashid, Awais
Keywords: No keywords
Abstract
Does the act of writing a specification (how the code should behave) for a piece of security sensitive code lead to developers producing more secure code? We asked 138 developers to write a snippet of code to store a password: Half of them were asked to write down a specification of how the code should behave before writing the program, the other half were asked to write the code but without being prompted to write a specification first. We find that explicitly prompting developers to write a specification has a small positive effect on the security of password storage approaches implemented. However, developers often fail to store passwords securely, despite claiming to be confident and knowledgeable in their approaches, and despite considering an appropriate range of threats. We find a need for developer-centered usable mechanisms for telling developers how to store passwords: lists of what they must do are not working.
DOI: 10.1109/ICSE43902.2021.00053
Why Don’t Developers Detect Improper Input Validation? '; DROP TABLE Papers; --
Authors: Braz, Larissa and Fregnan, Enrico and Çalıklı, Gül and Bacchelli, Alberto
Keywords: No keywords
Abstract
Improper Input Validation (IIV) is a software vulnerability that occurs when a system does not safely handle input data. Even though IIV is easy to detect and fix, it still commonly happens in practice. In this paper, we study to what extent developers can detect IIV and investigate underlying reasons. This knowledge is essential to better understand how to support developers in creating secure software systems. We conduct an online experiment with 146 participants, of which 105 report at least three years of professional software development experience. Our results show that the existence of a visible attack scenario facilitates the detection of IIV vulnerabilities and that a significant portion of developers who did not find the vulnerability initially could identify it when warned about its existence. Yet, a total of 60 participants could not detect the vulnerability even after the warning. Other factors, such as the frequency with which the participants perform code reviews, influence the detection of IIV. Preprint: https://arxiv.org/abs/2102.06251. Data and materials: https://doi.org/10.5281/zenodo.3996696.
DOI: 10.1109/ICSE43902.2021.00054
The Mind Is a Powerful Place: How Showing Code Comprehensibility Metrics Influences Code Understanding
Authors: Wyrich, Marvin and Preikschat, Andreas and Graziotin, Daniel and Wagner, Stefan
Keywords: placebo effect, metrics, cognitive bias, code comprehension, behavioral software engineering, anchoring effect
Abstract
Static code analysis tools and integrated development environments present developers with quality-related software metrics, some of which describe the understandability of source code. Software metrics influence overarching strategic decisions that impact the future of companies and the prioritization of everyday software development tasks. Several software metrics, however, lack validation: we simply choose to trust that they reflect what they are supposed to measure. Some of them were even shown not to measure the quality aspects they intend to measure. Yet, they influence us through biases in our cognitive-driven actions. In particular, they might anchor us in our decisions. Whether the anchoring effect exists with software metrics has not been studied yet. We conducted a randomized and double-blind experiment to investigate the extent to which a displayed metric value for source code comprehensibility anchors developers in their subjective rating of source code comprehensibility, whether performance is affected by the anchoring effect when working on comprehension tasks, and which individual characteristics might play a role in the anchoring effect. We found that the displayed value of a comprehensibility metric has a significant and large anchoring effect on a developer’s code comprehensibility rating. The effect does not seem to affect the time or correctness when working on comprehension questions related to the code snippets under study. Since the anchoring effect is one of the most robust cognitive biases, and we have limited understanding of the consequences of the demonstrated manipulation of developers by non-validated metrics, we call for an increased awareness of the responsibility in code quality reporting and for corresponding tools to be based on scientific evidence.
DOI: 10.1109/ICSE43902.2021.00055
Program Comprehension and Code Complexity Metrics: An fMRI Study
Authors: Peitek, Norman and Apel, Sven and Parnin, Chris and Brechmann, André and Siegmund, Janet
Keywords: No keywords
Abstract
Background: Researchers and practitioners have been using code complexity metrics for decades to predict how developers comprehend a program. While it is plausible and tempting to use code metrics for this purpose, their validity is debated, since they rely on simple code properties and rarely consider particularities of human cognition. Aims: We investigate whether and how code complexity metrics reflect difficulty of program comprehension. Method: We have conducted a functional magnetic resonance imaging (fMRI) study with 19 participants observing program comprehension of short code snippets at varying complexity levels. We dissected four classes of code complexity metrics and their relationship to neuronal, behavioral, and subjective correlates of program comprehension, overall analyzing more than 41 metrics. Results: While our data corroborate that complexity metrics can—to a limited degree—explain programmers’ cognition in program comprehension, fMRI allowed us to gain insights into why some code properties are difficult to process. In particular, a code’s textual size drives programmers’ attention, and vocabulary size burdens programmers’ working memory. Conclusion: Our results provide neuro-scientific evidence supporting warnings of prior research questioning the validity of code complexity metrics and pin down factors relevant to program comprehension. Future Work: We outline several follow-up experiments investigating fine-grained effects of code complexity and describe possible refinements to code complexity metrics.
DOI: 10.1109/ICSE43902.2021.00056
Do you really code? Designing and Evaluating Screening Questions for Online Surveys with Programmers
Authors: Danilova, Anastasia and Naiakshina, Alena and Horstmann, Stefan and Smith, Matthew
Keywords: No keywords
Abstract
Recruiting professional programmers in sufficient numbers for research studies can be challenging because they often cannot spare the time, or due to their geographical distribution and potentially the cost involved. Online platforms such as Clickworker or Qualtrics do provide options to recruit participants with programming skill; however, misunderstandings and fraud can be an issue. This can result in participants without programming skill taking part in studies and surveys. If these participants are not detected, they can cause detrimental noise in the survey data. In this paper, we develop screener questions that are easy and quick to answer for people with programming skill but difficult to answer correctly for those without. In order to evaluate our questionnaire for efficacy and efficiency, we recruited several batches of participants with and without programming skill and tested the questions. In our batches, 42% of Clickworkers who stated that they have programming skill did not meet our criteria, and we would recommend filtering them from studies. We also evaluated the questions in an adversarial setting. We conclude with a set of recommended questions which researchers can use to recruit participants with programming skill from online platforms.
DOI: 10.1109/ICSE43902.2021.00057
How Gamification Affects Software Developers: Cautionary Evidence from a Natural Experiment on GitHub
Authors: Moldon, Lukas and Strohmaier, Markus and Wachs, Johannes
Keywords: software engineering, natural experiment, gamification, behavior, GitHub
Abstract
We examine how the behavior of software developers changes in response to removing gamification elements from GitHub, an online platform for collaborative programming and software development. We find that the unannounced removal of daily activity streak counters from the user interface (from user profile pages) was followed by significant changes in behavior. Long-running streaks of activity were abandoned and became less common. Weekend activity decreased and days in which developers made a single contribution became less common. Synchronization of streaking behavior in the platform’s social network also decreased, suggesting that gamification is a powerful channel for social influence. Focusing on a set of software developers that were publicly pursuing a goal to make contributions for 100 days in a row, we find that some of these developers abandon this quest following the removal of the public streak counter. Our findings provide evidence for the significant impact of gamification on the behavior of developers on large collaborative programming and software development platforms. They urge caution: gamification can steer the behavior of software developers in unexpected and unwanted directions.
DOI: 10.1109/ICSE43902.2021.00058
IdBench: Evaluating Semantic Representations of Identifier Names in Source Code
Authors: Wainakh, Yaza and Rauf, Moiz and Pradel, Michael
Keywords: source code, neural networks, identifiers, embeddings, benchmark
Abstract
Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.
DOI: 10.1109/ICSE43902.2021.00059
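The core notion IdBench evaluates, embedding-based similarity between identifiers, reduces to comparing vectors, typically with cosine similarity. The tiny 2-dimensional vectors below are invented for illustration (real embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings: "len" and "size" should land near each other,
# "height" further away.
emb = {
    "len":    [0.9, 0.1],
    "size":   [0.8, 0.2],
    "height": [0.2, 0.9],
}
```

A benchmark like IdBench then asks whether such scores agree with developer ratings; the paper's finding is that they often capture relatedness well but conflate identifiers with opposing meanings.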
A Context-based Automated Approach for Method Name Consistency Checking and Suggestion
Authors: Li, Yi and Wang, Shaohua and Nguyen, Tien N.
Keywords: Naturalness of Software, Inconsistent Method Name Checking, Entity Name Suggestion, Deep Learning
Abstract
Misleading method names in software projects can confuse developers, which may lead to software defects and affect code understandability. In this paper, we present DeepName, a context-based, deep learning approach to detect method name inconsistencies and suggest a proper name for a method. The key departure point is the philosophy of “Show Me Your Friends, I’ll Tell You Who You Are”. Unlike the state-of-the-art approaches, in addition to the method’s body, we also consider the interactions of the current method under study with the other ones, including the caller and callee methods and the sibling methods in the same enclosing class. The sequences of sub-tokens in the program entities’ names in the contexts are extracted and used as the input for an RNN-based encoder-decoder to produce the representations for the current method. We modify that RNN model to integrate the copy mechanism and our newly developed component, called the non-copy mechanism, to capture the possibility that a certain sub-token should not be copied as the next sub-token in the generated method name. We conducted several experiments to evaluate DeepName on large datasets with 14M+ methods. For consistency checking, DeepName improves the state-of-the-art approach by 2.1%, 19.6%, and 11.9% relatively in recall, precision, and F-score, respectively. For name suggestion, DeepName improves relatively over the state-of-the-art approaches in precision (1.8%-30.5%), recall (8.8%-46.1%), and F-score (5.2%-38.2%). To assess DeepName’s usefulness, we detected inconsistent methods and suggested new method names in active projects. Among 50 pull requests, 12 were merged into the main branch. In total, in 30/50 cases, the team members agree that our suggested method names are more meaningful than the current names.
DOI: 10.1109/ICSE43902.2021.00060
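The sub-token sequences DeepName consumes come from splitting identifiers on camelCase and snake_case boundaries; a conventional splitter (a sketch of the common preprocessing step, not the paper's implementation) looks like:

```python
import re

def subtokens(identifier: str):
    """Split a method name into lowercase sub-tokens, handling snake_case,
    camelCase, and acronym runs such as 'HTTP' in 'HTTPResponse'."""
    out = []
    for part in re.split(r"[_$]", identifier):
        # Acronym run | capitalized or lowercase word | digit run.
        out += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in out]
```

For example, `getFileName` becomes `["get", "file", "name"]`, which is the granularity at which an encoder-decoder can reuse sub-tokens from caller, callee, and sibling names.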
On the Naming of Methods: A Survey of Professional Developers
Authors: Alsuhaibani, Reem S. and Newman, Christian D. and Decker, Michael J. and Collard, Michael L. and Maletic, Jonathan I.
Keywords: styling, naming conventions, method names, coding standards
Abstract
This paper describes the results of a large (1,100+ responses) survey of professional software developers concerning standards for naming source code methods. The various standards for source code method names are derived from and supported in the software engineering literature. The goal of the survey is to determine if there is a general consensus among developers that the standards are accepted and used in practice. Additionally, the paper examines factors such as years of experience and programming language knowledge in the context of survey responses. The survey results show that participants very much agree about the importance of the various standards and how they apply to names, and that years of experience and programming language knowledge have almost no effect on their responses. The results imply that the given standards are both valid and to a large degree complete. The work provides a foundation for automated method name assessment during development and code reviews.
DOI: 10.1109/ICSE43902.2021.00061
Relating Reading, Visualization, and Coding for New Programmers: A Neuroimaging Study
Authors: Endres, Madeline and Karas, Zachary and Hu, Xiaosu and Kovelman, Ioulia and Weimer, Westley
Keywords: No keywords
Abstract
Understanding how novices reason about coding at a neurological level has implications for training the next generation of software engineers. In recent years, medical imaging has been increasingly employed to investigate patterns of neural activity associated with coding activity. However, such studies have focused on advanced undergraduates and professionals. In a human study of 31 participants, we use functional near-infrared spectroscopy to measure the neural activity associated with introductory programming. In a controlled, contrast-based experiment, we relate brain activity when coding to that of reading natural language or mentally rotating objects (a spatial visualization task). Our primary result is that all three tasks (coding, prose reading, and mental rotation) are mentally distinct for novices. However, while those tasks are neurally distinct, we find more significant differences between prose and coding than between mental rotation and coding. Intriguingly, we generally find more activation in areas of the brain associated with spatial ability and task difficulty for novice coding compared to that reported in studies with more expert developers. Finally, in an exploratory analysis, we also find a neural activation pattern predictive of programming performance 11 weeks later. While preliminary, these findings both expand on previous results (e.g., relating expertise to a similarity between coding and prose reading) and also provide a new understanding of the cognitive processes underlying novice programming.
DOI: 10.1109/ICSE43902.2021.00062
A Case Study of Onboarding in Software Teams: Tasks and Strategies
Authors: Ju, An and Sajnani, Hitesh and Kelly, Scot and Herzig, Kim
Keywords: software development teams, social connections, onboarding, learning, confidence
Abstract
Developers frequently move into new teams or environments across software companies. Their onboarding experience is correlated with productivity, job satisfaction, and other short-term and long-term outcomes. The majority of the onboarding process comprises engineering tasks such as fixing bugs or implementing small features. Nevertheless, we do not have a systematic view of how tasks influence onboarding. In this paper, we present a case study of Microsoft, where we interviewed 32 developers moving into a new team and 15 engineering managers onboarding a new developer into their team - to understand and characterize developers’ onboarding experience and expectations in relation to the tasks performed by them while onboarding. We present how tasks interact with new developers through three representative themes: learning, confidence building, and socialization. We also discuss three onboarding strategies as inferred from the interviews that managers commonly use unknowingly, and discuss their pros and cons and offer situational recommendations. Furthermore, we triangulate our interview findings with a developer survey (N = 189) and a manager survey (N = 37) and find that survey results suggest that our findings are representative and our recommendations are actionable. Practitioners could use our findings to improve their onboarding processes, while researchers could find new research directions from this study to advance the understanding of developer onboarding. Our research instruments and anonymous data are available at https://zenodo.org/record/4455937#.YCOQCs_0lFd.
DOI: 10.1109/ICSE43902.2021.00063
How Was Your Weekend? Software Development Teams Working From Home During COVID-19
Authors: Miller, Courtney and Rodeghero, Paige and Storey, Margaret-Anne and Ford, Denae and Zimmermann, Thomas
Keywords: No keywords
Abstract
The mass shift to working at home during the COVID-19 pandemic radically changed the way many software development teams collaborate and communicate. To investigate how team culture and team productivity may also have been affected, we conducted two surveys at a large software company. The first, an exploratory survey during the early months of the pandemic with 2,265 developer responses, revealed that many developers faced challenges reaching milestones and that their team productivity had changed. We also found through qualitative analysis that important team culture factors such as communication and social connection had been affected. For example, the simple phrase “How was your weekend?” had become a subtle way to show peer support. In our second survey, we conducted a quantitative analysis of the team cultural factors that emerged from our first survey to understand the prevalence of the reported changes. From 608 developer responses, we found that 74% of these respondents missed social interactions with colleagues and 51% reported a decrease in their communication ease with colleagues. We used data from the second survey to build a regression model to identify important team culture factors for modeling team productivity. We found that the ability to brainstorm with colleagues, difficulty communicating with colleagues, and satisfaction with interactions from social activities are important factors that are associated with how developers report their software development team’s productivity. Our findings inform how managers and leaders in large software companies can support sustained team productivity during times of crisis and beyond.
DOI: 10.1109/ICSE43902.2021.00064
FLACK: Counterexample-Guided Fault Localization for Alloy Models
Authors: Zheng, Guolong and Nguyen, ThanhVu and Brida, Simón Gutiérrez and Regis, Germán and Aguirre, Nazareno and Frias, Marcelo F.
Keywords: No keywords
Abstract
Fault localization is a practical research topic that helps developers identify code locations that might cause bugs in a program. Most existing fault localization techniques are designed for imperative programs (e.g., C and Java) and rely on analyzing correct and incorrect executions of the program to identify suspicious statements. In this work, we introduce a fault localization approach for models written in a declarative language, where the models are not “executed,” but rather converted into a logical formula and solved using backend constraint solvers. We present FLACK, a tool that takes as input an Alloy model consisting of some violated assertion and returns a ranked list of suspicious expressions contributing to the assertion violation. The key idea is to analyze the differences between counterexamples, i.e., instances of the model that do not satisfy the assertion, and instances that do satisfy the assertion to find suspicious expressions in the input model. The experimental results show that FLACK is efficient (can handle complex, real-world Alloy models with thousands of lines of code within 5 seconds), accurate (can consistently rank buggy expressions in the top 1.9% of the suspicious list), and useful (can often narrow down the error to the exact location within the suspicious expressions).
DOI: 10.1109/ICSE43902.2021.00065
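The key idea, scoring expressions by how differently they evaluate in violating versus satisfying instances, can be caricatured in a few lines; modeling an Alloy instance as a dict from expression to boolean valuation is a drastic simplification invented for illustration:

```python
def rank_suspicious(violating, satisfying):
    """Rank expressions by the gap between how often they hold in
    assertion-violating instances vs. assertion-satisfying instances."""
    exprs = set().union(*violating, *satisfying)

    def rate(instances, expr):
        return sum(bool(inst.get(expr, False)) for inst in instances) / len(instances)

    scores = {e: abs(rate(violating, e) - rate(satisfying, e)) for e in exprs}
    # Largest behavioral difference first: the most suspicious expression.
    return sorted(scores, key=scores.get, reverse=True)
```

An expression that flips value exactly when the assertion flips gets the top score, mirroring the intuition that it is implicated in the violation.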
Improving Fault Localization by Integrating Value and Predicate Based Causal Inference Techniques
Authors: Küçük, Yiğit and Henderson, Tim A. D. and Podgurski, Andy
Keywords: No keywords
Abstract
Statistical fault localization (SFL) techniques use execution profiles and success/failure information from software executions, in conjunction with statistical inference, to automatically score program elements based on how likely they are to be faulty. SFL techniques typically employ one type of profile data: either coverage data, predicate outcomes, or variable values. Most SFL techniques actually measure correlation, not causation, between profile values and success/failure, and so they are subject to confounding bias that distorts the scores they produce. This paper presents a new SFL technique, named UniVal, that uses causal inference techniques and machine learning to integrate information about both predicate outcomes and variable values to more accurately estimate the true failure-causing effect of program statements. UniVal was empirically compared to several coverage-based, predicate-based, and value-based SFL techniques on 800 program versions with real faults.
DOI: 10.1109/ICSE43902.2021.00066
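For contrast with UniVal's causal estimates, the classic correlation-based scoring it improves on can be sketched with the well-known Ochiai formula over coverage data (a standard SFL baseline, not UniVal itself):

```python
import math

def ochiai(coverage, outcomes):
    """Ochiai suspiciousness per statement.
    coverage[t][s] == 1 iff test t executed statement s;
    outcomes[t] is True iff test t failed."""
    total_failed = sum(outcomes)
    scores = []
    for s in range(len(coverage[0])):
        ef = sum(1 for cov, failed in zip(coverage, outcomes)
                 if cov[s] and failed)       # failing tests covering s
        ep = sum(1 for cov, failed in zip(coverage, outcomes)
                 if cov[s] and not failed)   # passing tests covering s
        denom = math.sqrt(total_failed * (ef + ep))
        scores.append(ef / denom if denom else 0.0)
    return scores
```

Scores like this measure correlation between coverage and failure, which is precisely the confounding-bias problem the abstract describes: a statement can score high merely because it co-occurs with the faulty one.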
Fault Localization with Code Coverage Representation Learning
Authors: Li, Yi and Wang, Shaohua and Nguyen, Tien N.
Keywords: representation learning, machine learning, fault localization, deep learning, code coverage
Abstract
In this paper, we propose DEEPRL4FL, a deep learning fault localization (FL) approach that locates the buggy code at the statement and method levels by treating FL as an image pattern recognition problem. DEEPRL4FL does so via novel code coverage representation learning (RL) and data dependencies RL for program statements. Those two types of RL on the dynamic information in a code coverage matrix are also combined with the code representation learning on the static information of the usual suspicious source code. This combination is inspired by crime scene investigation, in which investigators analyze the crime scene (failed test cases and statements) and related persons (statements with dependencies), and at the same time, examine the usual suspects who have committed a similar crime in the past (similar buggy code in the training data). For the code coverage information, DEEPRL4FL first orders the test cases and marks error-exhibiting code statements, expecting that a model can recognize the patterns discriminating between faulty and non-faulty statements/methods. For dependencies among statements, the suspiciousness of a statement is assessed by taking into account its data dependencies on other statements in execution and data flows, in addition to the statement by itself. Finally, the vector representations for the code coverage matrix, data dependencies among statements, and source code are combined and used as the input of a classifier built from a Convolutional Neural Network to detect buggy statements/methods. Our empirical evaluation shows that DEEPRL4FL improves the top-1 results over the state-of-the-art statement-level FL baselines from 173.1% to 491.7%. It also improves the top-1 results over the existing method-level FL baselines from 15.0% to 206.3%.
DOI: 10.1109/ICSE43902.2021.00067
An Empirical Study on Deployment Faults of Deep Learning Based Mobile Applications
Authors: Chen, Zhenpeng and Yao, Huihan and Lou, Yiling and Cao, Yanbin and Liu, Yuanqiang and Wang, Haoyu and Liu, Xuanzhe
Keywords: mobile applications, deployment faults, deep learning
Abstract
Deep learning (DL) is moving its step into a growing number of mobile software applications. These software applications, named DL based mobile applications (abbreviated as mobile DL apps), integrate DL models trained using large-scale data with DL programs. A DL program encodes the structure of a desirable DL model and the process by which the model is trained using training data. Due to the increasing dependency of current mobile apps on DL, software engineering (SE) for mobile DL apps has become important. However, existing efforts in the SE research community mainly focus on the development of DL models and extensively analyze faults in DL programs. In contrast, faults related to the deployment of DL models on mobile devices (named deployment faults of mobile DL apps) have not been well studied. Since mobile DL apps are used by billions of end users daily for various purposes, including safety-critical scenarios, characterizing their deployment faults is of enormous importance. To fill in the knowledge gap, this paper presents the first comprehensive study to date on the deployment faults of mobile DL apps. We identify 304 real deployment faults from Stack Overflow and GitHub, two commonly used data sources for studying software faults. Based on the identified faults, we construct a fine-granularity taxonomy consisting of 23 categories regarding fault symptoms and distill common fix strategies for different fault symptoms. Furthermore, we suggest actionable implications and research avenues that can potentially facilitate the deployment of DL models on mobile devices.
DOI: 10.1109/ICSE43902.2021.00068
Extracting Concise Bug-Fixing Patches from Human-Written Patches in Version Control Systems
Authors: Jiang, Yanjie and Liu, Hui and Niu, Nan and Zhang, Lu and Hu, Yamin
Keywords: Testing, Repository, Patch, Defect, Dataset, Bug
Abstract
High-quality and large-scale repositories of real bugs and their concise patches collected from real-world applications are critical for research in the software engineering community. In such a repository, each real bug is explicitly associated with its fix. Therefore, on one side, the real bugs and their fixes may inspire novel approaches for finding, locating, and repairing software bugs; on the other side, the real bugs and their fixes are indispensable for rigorous and meaningful evaluation of approaches for software testing, fault localization, and program repair. To this end, a number of such repositories, e.g., Defects4J, have been proposed. However, such repositories are rather small because their construction involves expensive human intervention. Although bug-fixing code commits as well as associated test cases could be retrieved from version control systems automatically, existing approaches could not yet automatically extract concise bug-fixing patches from bug-fixing commits because such commits often involve bug-irrelevant changes. In this paper, we propose an automatic approach, called BugBuilder, to extract complete and concise bug-fixing patches from human-written patches in version control systems. It excludes refactorings by detecting refactorings involved in bug-fixing commits and reapplying the detected refactorings on the faulty version. It then enumerates all subsets of the remaining part and validates them on test cases. If none of the subsets has the potential to be a complete bug-fixing patch, the remaining part as a whole is taken as a complete and concise bug-fixing patch. Evaluation results on 809 real bug-fixing commits in Defects4J suggest that BugBuilder successfully generated complete and concise bug-fixing patches for forty percent of the bug-fixing commits, and its precision (99%) was even higher than that of human experts.
DOI: 10.1109/ICSE43902.2021.00069
Input Algebras
Authors: Gopinath, Rahul and Nemati, Hamed and Zeller, Andreas
Keywords: testing, faults, debugging
Abstract
Grammar-based test generators are highly efficient in producing syntactically valid test inputs and give their users precise control over which test inputs should be generated. Adapting a grammar or a test generator towards a particular testing goal can be tedious, though. We introduce the concept of a grammar transformer, specializing a grammar towards inclusion or exclusion of specific patterns: “The phone number must not start with 011 or +1”. To the best of our knowledge, ours is the first approach to allow for arbitrary Boolean combinations of patterns, giving testers unprecedented flexibility in creating targeted software tests. The resulting specialized grammars can be used with any grammar-based fuzzer for targeted test generation, but also as validators to check whether the given specialization is met or not, opening up additional usage scenarios. In our evaluation on real-world bugs, we show that specialized grammars are accurate both in producing and validating targeted inputs.
DOI: 10.1109/ICSE43902.2021.00070
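The grammar-transformer idea can be caricatured on a toy phone-number grammar; the grammar and the `specialize` helper below are invented for illustration, and the paper's approach supports arbitrary Boolean combinations of patterns rather than this simple alternative-dropping:

```python
import random

# A toy phone-number grammar: nonterminal -> list of alternatives,
# each alternative a list of symbols.
GRAMMAR = {
    "<phone>": [["<prefix>", "<digits>"]],
    "<prefix>": [["011"], ["+1"], ["02"], ["9"]],
    "<digits>": [["<d>"] * 7],
    "<d>": [[c] for c in "0123456789"],
}

def generate(grammar, symbol="<phone>"):
    """Expand a symbol by recursively picking random alternatives."""
    if symbol not in grammar:
        return symbol  # terminal
    return "".join(generate(grammar, s) for s in random.choice(grammar[symbol]))

def specialize(grammar, banned_prefixes):
    """Crude grammar transformer: 'the number must not start with 011 or +1'
    becomes dropping the offending prefix alternatives."""
    g = dict(grammar)
    g["<prefix>"] = [alt for alt in grammar["<prefix>"]
                     if alt[0] not in banned_prefixes]
    return g
```

Every string the specialized grammar produces satisfies the exclusion by construction, so the same grammar doubles as a validator for existing inputs, as the abstract notes.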
Fuzzing Symbolic Expressions
Authors: Borzacchiello, Luca and Coppa, Emilio and Demetrescu, Camil
Keywords: fuzzing testing, concolic execution, SMT solver
Abstract
Recent years have witnessed a wide array of results in software testing, exploring different approaches and methodologies ranging from fuzzers to symbolic engines, with a full spectrum of instances in between such as concolic execution and hybrid fuzzing. A key ingredient of many of these tools is Satisfiability Modulo Theories (SMT) solvers, which are used to reason over symbolic expressions collected during the analysis. In this paper, we investigate whether techniques borrowed from the fuzzing domain can be applied to check whether symbolic formulas are satisfiable in the context of concolic and hybrid fuzzing engines, providing a viable alternative to classic SMT solving techniques. We devise a new approximate solver, Fuzzy-Sat, and show that it is both competitive with and complementary to state-of-the-art solvers such as Z3 with respect to handling queries generated by hybrid fuzzers.
DOI: 10.1109/ICSE43902.2021.00071
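The underlying move, searching for a model by trying concrete assignments instead of deductive solving, can be sketched as follows. Fuzzy-Sat itself mutates inputs guided by the structure of the symbolic expression; this toy version (all names invented) uses plain random search:

```python
import random

def fuzz_sat(predicate, n_vars, budget=100_000, seed=0):
    """Look for a satisfying assignment over byte-valued variables by
    trying concrete candidates. Returning None means 'gave up within
    budget', NOT a proof of unsatisfiability."""
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = [rng.randrange(256) for _ in range(n_vars)]
        if predicate(candidate):
            return candidate
    return None
```

This illustrates the trade-off the paper exploits: the search is approximate and incomplete, but each check is a cheap concrete evaluation, which can beat invoking a full SMT solver on the easy, numerous queries a hybrid fuzzer emits.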
Growing A Test Corpus with Bonsai Fuzzing
Authors: Vikram, Vasudev and Padhye, Rohan and Sen, Koushik
Keywords: test-case reduction, test-case generation, small scope hypothesis, grammar-based testing, fuzz testing
Abstract
This paper presents a coverage-guided grammar-based fuzzing technique for automatically synthesizing a corpus of concise test inputs. We walk through a case study of a compiler designed for education and the corresponding problem of generating meaningful test cases to provide to students. The prior state-of-the-art solution is a combination of fuzzing and test-case reduction techniques such as variants of delta debugging. Our key insight is that instead of attempting to minimize convoluted fuzzer-generated test inputs, we can instead grow concise test inputs by construction using a form of iterative deepening. We call this approach bonsai fuzzing. Experimental results show that bonsai fuzzing can generate test corpora having inputs that are 16-45% smaller in size on average as compared to a fuzz-then-reduce approach, while achieving approximately the same code coverage and fault-detection capability.
DOI: 10.1109/ICSE43902.2021.00072
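The "grow by iterative deepening" idea can be sketched with a toy grammar (our own illustration, not the paper's implementation): enumerate all sentences derivable within k rule expansions, adding inputs first seen at each depth, so the corpus is concise by construction.

```python
GRAMMAR = {
    "E": [["N"], ["E", "+", "N"]],   # E -> N | E "+" N
    "N": [["1"], ["2"]],             # N -> "1" | "2"
}

def sentences(symbol, depth):
    """All strings derivable from `symbol` within `depth` rule expansions."""
    if symbol not in GRAMMAR:        # terminal symbol
        return {symbol}
    if depth == 0:                   # out of expansion budget
        return set()
    out = set()
    for rule in GRAMMAR[symbol]:
        combos = [""]
        for sym in rule:
            opts = sentences(sym, depth - 1)
            combos = [c + o for c in combos for o in opts]
        out.update(combos)
    return out

def bonsai_corpus(max_depth):
    """Grow the corpus depth by depth, keeping inputs first seen at each depth."""
    corpus, seen = [], set()
    for d in range(1, max_depth + 1):
        for s in sorted(sentences("E", d)):
            if s not in seen:
                seen.add(s)
                corpus.append(s)
    return corpus
```

Shallow (hence short) inputs such as `1` and `2` enter the corpus before longer ones such as `1+1`; a real implementation would additionally keep only inputs that add code coverage.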
We’ll Fix It in Post: What Do Bug Fixes in Video Game Update Notes Tell Us?
Authors: Truelove, Andrew and de Almeida, Eduardo Santana and Ahmed, Iftekhar
Keywords: No keywords
Abstract
Bugs that persist into releases of video games can have negative impacts on both developers and users, but particular aspects of testing in game development can lead to difficulties in effectively catching these missed bugs. It has become common practice for developers to apply updates to games in order to fix missed bugs. These updates are often accompanied by notes that describe the changes to the game included in the update. However, some bugs reappear even after an update attempts to fix them. In this paper, we develop a taxonomy for bug types in games that is based on prior work. We examine 12,122 bug fixes from 723 updates for 30 popular games on the Steam platform. We label the bug fixes included in these updates to identify the frequency of these different bug types, the rate at which bug types recur over multiple updates, and which bug types are treated as more severe. Additionally, we survey game developers regarding their experience with different bug types and what aspects of game development they most strongly associate with bug appearance. We find that Information bugs appear the most frequently in updates, while Crash bugs recur the most frequently and are often treated as more severe than other bug types. Finally, we find that challenges in testing, code quality, and bug reproduction have a close association with bug persistence. These findings should help developers identify which aspects of game development could benefit from greater attention in order to prevent bugs. Researchers can use our results in devising tools and methods to better identify and address certain bug types.
DOI: 10.1109/ICSE43902.2021.00073
GUIGAN: Learning to Generate GUI Designs Using Generative Adversarial Networks
Authors: Zhao, Tianming and Chen, Chunyang and Liu, Yuanning and Zhu, Xiaodong
Keywords: mobile application, deep learning, Graphical User Interface, Generative Adversarial Network (GAN), GUI design
Abstract
Graphical User Interfaces (GUIs) are ubiquitous in almost all modern desktop software, mobile applications, and websites. A good GUI design is crucial to the success of software in the market, but designing a good GUI, which requires much innovation and creativity, is difficult even for well-trained designers. In addition, the demand for rapid GUI development aggravates designers' workload. The availability of automatically generated GUIs can therefore enhance design personalization and specialization, as such GUIs can cater to the tastes of different designers. To assist designers, we develop GUIGAN, a model that automatically generates GUI designs. Unlike conventional image generation models that operate on image pixels, GUIGAN reuses GUI components collected from existing mobile-app GUIs to compose new designs, in a manner similar to natural-language generation. GUIGAN is based on SeqGAN and models both GUI component style compatibility and GUI structure. Our evaluation demonstrates that the model significantly outperforms the best baseline method by 30.77% in Fréchet Inception Distance.
DOI: 10.1109/ICSE43902.2021.00074
Don’t Do That! Hunting Down Visual Design Smells in Complex UIs against Design Guidelines
Authors: Yang, Bo and Xing, Zhenchang and Xia, Xin and Chen, Chunyang and Ye, Deheng and Li, Shanping
Keywords: Violation detection, UI design smell, Material design, GUI testing
Abstract
Just like code smells in source code, UI designs have visual design smells. We study 93 don't-do-that guidelines in Material Design, a complex design system created by Google. We find that these don't-guidelines go far beyond UI aesthetics, involving seven general design dimensions (layout, typography, iconography, navigation, communication, color, and shape) and four component design aspects (anatomy, placement, behavior, and usage). Violating these guidelines results in visual design smells in UIs (or UI design smells). In a study of 60,756 UIs of 9,286 Android apps, we find that 7,497 UIs of 2,587 apps have at least one violation of some Material Design guideline. This reveals the lack of developer training and tool support for avoiding UI design smells. To fill this gap, we design an automated UI design smell detector (UIS-Hunter) that extracts and validates multi-modal UI information (component metadata, typography, iconography, color, and edge) to detect violations of diverse don't-guidelines in Material Design. The detection accuracy of UIS-Hunter is high (precision=0.81, recall=0.90) on the 60,756 UIs of 9,286 apps. We build a guideline gallery of real-world UI design smells detected by UIS-Hunter, from which developers can learn best Material Design practices. Our user studies show that UIS-Hunter is more effective than manual detection of UI design smells, and that the UI design smells detected by UIS-Hunter have severely negative impacts on app users.
DOI: 10.1109/ICSE43902.2021.00075
Same File, Different Changes: The Potential of Meta-Maintenance on GitHub
Authors: Hata, Hideaki and Kula, Raula Gaikovina and Ishio, Takashi and Treude, Christoph
Keywords: No keywords
Abstract
Online collaboration platforms such as GitHub have provided software developers with the ability to easily reuse and share code between repositories. With clone-and-own and forking becoming prevalent, maintaining these shared files is important, especially for keeping the most up-to-date version of reused code. In contrast to related work, we propose the concept of meta-maintenance, i.e., tracking how the same files evolve in different repositories with the aim of providing useful maintenance opportunities for those files. We conduct an exploratory study analyzing repositories from seven different programming languages to explore the potential of meta-maintenance. Our results indicate that a majority of active repositories on GitHub contain at least one file that is also present in another repository, and that a significant minority of these files are maintained differently in the different repositories that contain them. We manually analyzed a representative sample of shared files and their variants to understand which changes might be useful for meta-maintenance. Our findings support the potential of meta-maintenance and open up avenues for future work to capitalize on this potential.
DOI: 10.1109/ICSE43902.2021.00076
Can Program Synthesis be Used to Learn Merge Conflict Resolutions? An Empirical Analysis
Authors: Pan, Rangeet and Le, Vu and Nagappan, Nachiappan and Gulwani, Sumit and Lahiri, Shuvendu and Kaufman, Mike
Keywords: program synthesis, merge conflict, automated fixing
Abstract
Forking is widespread in open-source repositories and causes a significant number of merge conflicts. In this paper, we study the problem of textual merge conflicts from the perspective of Microsoft Edge, a large, highly collaborative fork of the main Chromium branch with significant merge conflicts. Broadly, this study is divided into two parts. First, we empirically evaluate textual merge conflicts in Microsoft Edge and classify them based on the type of file, the location of conflicts in a file, and the size of conflicts. We found that ~28% of the merge conflicts are 1-2 line changes, and many resolutions follow frequent patterns. Second, driven by these findings, we explore Program Synthesis (for the first time) to learn patterns and resolve structural merge conflicts. We propose a novel domain-specific language (DSL) that captures many of the repetitive merge conflict resolution patterns, and we learn resolution strategies as programs in this DSL from example resolutions. We found that the learned strategies can resolve 11.4% of the conflicts (~41% of 1-2 line changes) that arise in C++ files with 93.2% accuracy.
DOI: 10.1109/ICSE43902.2021.00077
Abacus: Precise Side-Channel Analysis
Authors: Bao, Qinkun and Wang, Zihao and Li, Xiaoting and Larus, James R. and Wu, Dinghao
Keywords: No keywords
Abstract
Side-channel attacks allow adversaries to infer sensitive information from non-functional characteristics. Prior side-channel detection work is able to identify numerous potential vulnerabilities. However, in practice, many such vulnerabilities leak a negligible amount of sensitive information, and thus developers are often reluctant to address them. Existing tools do not provide information to evaluate a leak's severity, such as the number of leaked bits. To address this issue, we propose a new program analysis method to precisely quantify the information leaked through side channels in a single-trace attack. It can identify covert information flows in programs that expose confidential information and can reason about security flaws that would otherwise be difficult, if not impossible, for a developer to find. We model an attacker's observation of each leakage site as a constraint. We use symbolic execution to generate these constraints and then run Monte Carlo sampling to estimate the number of leaked bits for each leakage site. By applying the Central Limit Theorem, we provide an error bound for these estimates. We have implemented the technique in a tool called Abacus, which not only finds very fine-grained side-channel vulnerabilities but also estimates how many bits are leaked. Abacus outperforms existing dynamic side-channel detection tools in performance and accuracy. We evaluate Abacus on OpenSSL, mbedTLS, Libgcrypt, and Monocypher. Our results demonstrate that most reported vulnerabilities are difficult to exploit in practice and should be de-prioritized by developers. We also find several sensitive vulnerabilities that are missed by existing tools. We confirm these vulnerabilities with manual checks and by contacting the developers.
DOI: 10.1109/ICSE43902.2021.00078
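As a rough illustration of the quantification step (not Abacus itself), one can model a single observed branch outcome as a constraint on the secret and estimate the leaked bits by Monte Carlo sampling, with a CLT-based error bound. The leakage site below is hypothetical.

```python
import math
import random

def leaked_bits_mc(constraint, n_bits=8, samples=20000, seed=0):
    """Monte Carlo estimate of bits leaked when an attacker observes that
    `constraint(secret)` holds: -log2(p), where p is the fraction of secrets
    satisfying the constraint; the CLT gives a standard error on p."""
    rng = random.Random(seed)
    hits = sum(constraint(rng.randrange(2 ** n_bits)) for _ in range(samples))
    p = hits / samples
    bits = -math.log2(p) if p else float("inf")
    stderr = math.sqrt(p * (1 - p) / samples)  # CLT standard error of p-hat
    return bits, stderr

# Hypothetical leakage site: a branch reveals whether the 8-bit secret is < 32,
# i.e. roughly 3 of its 8 bits (p = 32/256 = 1/8, and -log2(1/8) = 3).
bits, err = leaked_bits_mc(lambda s: s < 32)
```

A leak observed to hold with small probability p corresponds to a large entropy reduction, which is why rare branch outcomes are the severe ones.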
Data-Driven Synthesis of Provably Sound Side Channel Analyses
Authors: Wang, Jingbo and Sung, Chungha and Raghothaman, Mukund and Wang, Chao
Keywords: No keywords
Abstract
We propose a data-driven method for synthesizing static analyses to detect side-channel information leaks in cryptographic software. Compared to the conventional way of manually crafting such static analyzers, which can be tedious, error prone and suboptimal, our learning-based technique is not only automated but also provably sound. Our analyzer consists of a set of type-inference rules learned from the training data, i.e., example code snippets annotated with the ground truth. Internally, we use syntax-guided synthesis (SyGuS) to generate new recursive features and decision tree learning (DTL) to generate analysis rules based on these features. We guarantee soundness by proving each learned analysis rule via a technique called query containment checking. We have implemented our technique in the LLVM compiler and used it to detect power side channels in C programs that implement cryptographic protocols. Our results show that, in addition to being automated and provably sound during synthesis, our analyzer can achieve the same empirical accuracy as two state-of-the-art, manually-crafted analyzers while being 300X and 900X faster, respectively.
DOI: 10.1109/ICSE43902.2021.00079
IMGDroid: Detecting Image Loading Defects in Android Applications
Authors: Song, Wei and Han, Mengqi and Huang, Jeff
Keywords: image loading, defect analysis, Android app
Abstract
Images are essential for many Android applications or apps. Although images play a critical role in app functionalities and user experience, inefficient or improper image loading and displaying operations may severely impact the app performance and quality. Additionally, since these image loading defects may not be manifested by immediate failures, e.g., app crashes, existing GUI testing approaches cannot detect them effectively. In this paper, we identify five anti-patterns of such image loading defects, including image passing by intent, image decoding without resizing, local image loading without permission, repeated decoding without caching, and image decoding in UI thread. Based on these anti-patterns, we propose a static analysis technique, IMGDroid, to automatically and effectively detect such defects. We have applied IMGDroid to a benchmark of 21 open-source Android apps, and found that it not only successfully detects the 45 previously-known image loading defects but also finds 15 new such defects. Our empirical study on 1,000 commercial Android apps demonstrates that the image loading defects are prevalent.
DOI: 10.1109/ICSE43902.2021.00080
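As a toy illustration of one anti-pattern, "image decoding without resizing", a naive textual check might flag `BitmapFactory.decode*` calls in scopes that never configure `BitmapFactory.Options.inSampleSize` (both are real Android APIs; IMGDroid itself uses proper static analysis, so this line-based scan is only a sketch):

```python
import re

def find_decode_without_resize(java_src):
    """Return line numbers of BitmapFactory.decode* calls in a scope that
    never sets inSampleSize (naive stand-in for real static analysis)."""
    if "inSampleSize" in java_src:
        return []  # resizing is configured somewhere in this scope
    return [i for i, line in enumerate(java_src.splitlines(), start=1)
            if re.search(r"BitmapFactory\.decode\w+\(", line)]

# Illustrative snippets: `bad` decodes a full-size bitmap, `ok` subsamples it.
bad = ('Bitmap b = BitmapFactory.decodeFile(path);\n'
       'imageView.setImageBitmap(b);')
ok = ('BitmapFactory.Options o = new BitmapFactory.Options();\n'
      'o.inSampleSize = 4;\n'
      'Bitmap b = BitmapFactory.decodeFile(path, o);')
```

Decoding without `inSampleSize` loads the image at full resolution regardless of the view size, which is exactly the memory waste the anti-pattern describes.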
Fast Parametric Model Checking through Model Fragmentation
Authors: Fang, Xinwei and Calinescu, Radu and Gerasimou, Simos and Alhwikem, Faisal
Keywords: non-functional properties, discrete-time Markov chains, Parametric model checking
Abstract
Parametric model checking (PMC) computes algebraic formulae that express key non-functional properties of a system (reliability, performance, etc.) as rational functions of the system and environment parameters. In software engineering, PMC formulae can be used during design, e.g., to analyse the sensitivity of different system architectures to parametric variability, or to find optimal system configurations. They can also be used at runtime, e.g., to check if non-functional requirements are still satisfied after environmental changes, or to select new configurations after such changes. However, current PMC techniques do not scale well to systems with complex behaviour and more than a few parameters. Our paper introduces a fast PMC (fPMC) approach that overcomes this limitation, extending the applicability of PMC to a broader class of systems than previously possible. To this end, fPMC partitions the Markov models that PMC operates with into fragments whose reachability properties are analysed independently, and obtains PMC reachability formulae by combining the results of these fragment analyses. To demonstrate the effectiveness of fPMC, we show how our fPMC tool can analyse three systems (taken from the research literature, and belonging to different application domains) with which current PMC techniques and tools struggle.
DOI: 10.1109/ICSE43902.2021.00081
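The kind of output PMC produces can be illustrated on a tiny discrete-time Markov chain (our own toy example, not from the paper): from s0 the system reaches goal with probability p and moves to s1 with probability 1-p; from s1 it retries via s0 with probability q and fails otherwise. The reachability equations x0 = p + (1-p)·x1 and x1 = q·x0 yield the rational function x0 = p / (1 - (1-p)·q), which can then be re-evaluated cheaply for any parameter values, e.g. at runtime:

```python
def reach_formula(p, q):
    """PMC-style closed form: Pr(reach goal from s0) = p / (1 - (1-p)*q)."""
    return p / (1 - (1 - p) * q)

def reach_numeric(p, q, iters=1000):
    """Fixed-point iteration of x0 = p + (1-p)*x1, x1 = q*x0, as a cross-check."""
    x0 = x1 = 0.0
    for _ in range(iters):
        x0, x1 = p + (1 - p) * x1, q * x0
    return x0
```

The closed form and the iterative solution agree, but only the former can be analysed symbolically, e.g. for sensitivity to p and q.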
Trace-Checking CPS Properties: Bridging the Cyber-Physical Gap
Authors: Menghi, Claudio and Viganò
Keywords: Validation, Specification, Semantics, Monitors, Languages, Formal methods
Abstract
Cyber-physical systems (CPS) combine software and physical components. Specification-driven trace-checking tools for CPS usually provide users with a specification language to express the requirements of interest, and an automatic procedure to check whether these requirements hold on the execution traces of a CPS. Although several specification languages for CPS exist, they are often not sufficiently expressive to allow the specification of complex CPS properties related to the software and physical components and their interactions. In this paper, we propose (i) the Hybrid Logic of Signals (HLS), a logic-based language that allows the specification of complex CPS requirements, and (ii) ThEodorE, an efficient SMT-based trace-checking procedure. This procedure reduces the problem of checking a CPS requirement over an execution trace to checking the satisfiability of an SMT formula. We evaluated our contributions using a representative industrial case study in the satellite domain. We assessed the expressiveness of HLS by considering 212 requirements of our case study; HLS could express all 212 requirements. We also assessed the applicability of ThEodorE by running the trace-checking procedure for 747 trace-requirement combinations. ThEodorE was able to produce a verdict in 74.5% of the cases. Finally, we compared HLS and ThEodorE with other specification languages and trace-checking tools from the literature. Our results show that, from a practical standpoint, our approach offers a better trade-off between expressiveness and performance.
DOI: 10.1109/ICSE43902.2021.00082
Centris: A Precise and Scalable Approach for Identifying Modified Open-Source Software Reuse
Authors: Woo, Seunghoon and Park, Sunghan and Kim, Seulbae and Lee, Heejo and Oh, Hakjoo
Keywords: Software Security, Software Composition Analysis, Open-Source Software
Abstract
Open-source software (OSS) is widely reused as it provides convenience and efficiency in software development. Despite evident benefits, unmanaged OSS components can introduce threats, such as vulnerability propagation and license violation. Unfortunately, identifying reused OSS components is challenging, as reused OSS is predominantly modified and nested. In this paper, we propose CENTRIS, a precise and scalable approach for identifying modified OSS reuse. By segmenting an OSS code base and detecting the reuse of only the unique part of the OSS, CENTRIS is capable of precisely identifying modified OSS reuse in the presence of nested OSS components. For scalability, CENTRIS eliminates redundant code comparisons and accelerates the search using hash functions. When we applied CENTRIS to 10,241 widely-employed GitHub projects, comprising 229,326 versions and 80 billion lines of code, we observed that modified OSS reuse is the norm in software development, occurring 20 times more frequently than exact reuse. Nonetheless, CENTRIS identified reused OSS components with 91% precision and 94% recall in less than a minute per application on average, whereas a recent clone detection technique, which does not take modified and nested OSS reuse into account, hardly reached 10% precision and 40% recall.
DOI: 10.1109/ICSE43902.2021.00083
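The core idea of matching only the unique part of each OSS component, with hashes for fast lookup, can be sketched as follows (component contents and the 0.5 threshold are invented for illustration):

```python
import hashlib

def h(fn_src):
    """Hash of one code segment (here: a whole function body)."""
    return hashlib.sha1(fn_src.encode()).hexdigest()

# Invented component contents; common_util() is vendored by both components.
components = {
    "libA": {h(s) for s in ["parse()", "lex()", "common_util()"]},
    "libB": {h(s) for s in ["render()", "common_util()"]},
}

# Unique part of each component: segments not shared with any other component,
# so nested or commonly vendored code cannot cause false matches.
unique = {
    name: segs - set().union(*(o for n, o in components.items() if n != name))
    for name, segs in components.items()
}

def detect_reuse(target_segs, threshold=0.5):
    """Report components enough of whose unique part appears in the target."""
    return sorted(
        name for name, segs in unique.items()
        if segs and len(segs & target_segs) / len(segs) >= threshold
    )

# Target reuses a modified libA: keeps parse(), drops lex(), adds its own code.
target = {h(s) for s in ["parse()", "common_util()", "app_main()"]}
```

Because `common_util()` is excluded from every unique set, the shared utility cannot make the target look like it reuses `libB`.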
Interpretation-enabled Software Reuse Detection Based on a Multi-Level Birthmark Model
Authors: Xu, Xi and Zheng, Qinghua and Yan, Zheng and Fan, Ming and Jia, Ang and Liu, Ting
Keywords: Software Reuse Detection, Multi-Level Software Birthmark, Interpretation, Binary Similarity Analysis
Abstract
Software reuse, especially partial reuse, poses legal and security threats to software development. Since source code is usually unavailable, software reuse is hard to detect and interpret. Moreover, current approaches suffer from poor detection accuracy and efficiency, falling far short of practical demands. To tackle these problems, we propose ISRD, an interpretation-enabled software reuse detection approach based on a multi-level birthmark model comprising function level, basic-block level, and instruction level. To overcome obfuscation caused by cross-compilation, we represent function semantics with the Minimum Branch Path (MBP) and perform normalization to extract the core semantics of instructions. To detect reused functions efficiently, we design a process of "intent search based on anchor recognition" to speed up reuse detection. It uses strict instruction matching and identical library call invocation checks to find anchor functions (anchors for short) and then traverses the neighbors of the anchors to explore potentially matched function pairs. Extensive experiments on two real-world binary datasets reveal that ISRD is interpretable, effective, and efficient, achieving 97.2% precision and 94.8% recall. Moreover, it is resilient to cross-compilation and outperforms state-of-the-art approaches.
DOI: 10.1109/ICSE43902.2021.00084
Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining
Authors: Wang, Yaohui and Li, Guozheng and Wang, Zijian and Kang, Yu and Zhou, Yangfan and Zhang, Hongyu and Gao, Feng and Sun, Jeffrey and Yang, Li and Lee, Pochian and Xu, Zhangwei and Zhao, Pu and Qiao, Bo and Li, Liqun and Zhang, Xu and Lin, Qingwei
Keywords: root cause analysis, outage triage, machine learning, cloud computing
Abstract
Cloud-based services have surged in popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step in mitigating the impact of an outage. In current industrial practice, this is generally performed in a bootstrap manner and largely depends on human effort: the service that directly causes the outage is identified first, and the suspected root cause is traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production cloud systems typically contain a large number of interdependent services. Such manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first outage triage approach that considers a global view of service correlations. COT mines the correlations among services from outage diagnosis data. After learning from historical outages, COT can accurately infer the root cause of emerging ones. We implement COT and evaluate it on a real-world dataset containing one year of data collected from Microsoft Azure, one of the representative cloud computing platforms in the world. Our experimental results show that COT reaches a triage accuracy of 82.1%~83.5%, outperforming the state-of-the-art triage approach by 28.0%~29.7%.
DOI: 10.1109/ICSE43902.2021.00085
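A minimal sketch of mining service correlations from historical outage diagnosis data (the services and records below are invented; COT itself learns far richer correlations than these co-occurrence counts):

```python
from collections import Counter

# Hypothetical historical outages: (service that surfaced the outage,
# root-cause service found during diagnosis).
history = [
    ("web", "storage"), ("web", "storage"), ("web", "auth"),
    ("api", "storage"), ("api", "network"),
]

def triage(impacted):
    """Rank candidate root-cause services by how often they were the root
    cause in past outages surfaced by the `impacted` service."""
    counts = Counter(root for surfaced, root in history if surfaced == impacted)
    total = sum(counts.values())
    return [(svc, n / total) for svc, n in counts.most_common()]
```

For a new outage surfacing in `web`, the ranking puts `storage` first, replacing the manual service-to-service trace with a learned global view.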
MuDelta: Delta-Oriented Mutation Testing at Commit Time
Authors: Ma, Wei and Chekam, Thierry Titcheu and Papadakis, Mike and Harman, Mark
Keywords: regression testing, mutation testing, machine learning, continuous integration, commit-relevant mutants
Abstract
To effectively test program changes using mutation testing, one needs to use mutants that are relevant to the altered program behaviours. We introduce MuDelta, an approach that identifies commit-relevant mutants; mutants that affect and are affected by the changed program behaviours. Our approach uses machine learning applied on a combined scheme of graph and vector-based representations of static code features. Our results, from 50 commits in 21 Coreutils programs, demonstrate a strong prediction ability of our approach; yielding 0.80 (ROC) and 0.50 (PR-Curve) AUC values with 0.63 and 0.32 precision and recall values. These predictions are significantly higher than random guesses, 0.20 (PR-Curve) AUC, 0.21 and 0.21 precision and recall, and subsequently lead to strong relevant tests that kill 45% more relevant mutants than randomly sampled mutants (either sampled from those residing on the changed component(s) or from the changed lines). Our results also show that MuDelta selects mutants with 27% higher fault revealing ability in fault introducing commits. Taken together, our results corroborate the conclusion that commit-based mutation testing is suitable and promising for evolving software.
DOI: 10.1109/ICSE43902.2021.00086
Does mutation testing improve testing practices?
Authors: Petrović
Keywords: mutation testing, fault coupling, code coverage
Abstract
Various proxy metrics for test quality have been defined in order to guide developers when writing tests. Code coverage is particularly well established in practice, even though the question of how coverage relates to test quality is a matter of ongoing debate. Mutation testing offers a promising alternative: Artificial defects can identify holes in a test suite, and thus provide concrete suggestions for additional tests. Despite the obvious advantages of mutation testing, it is not yet well established in practice. Until recently, mutation testing tools and techniques simply did not scale to complex systems. Although they now do scale, a remaining obstacle is lack of evidence that writing tests for mutants actually improves test quality. In this paper we aim to fill this gap: By analyzing a large dataset of almost 15 million mutants, we investigate how these mutants influenced developers over time, and how these mutants relate to real faults. Our analyses suggest that developers using mutation testing write more tests, and actively improve their test suites with high quality tests such that fewer mutants remain. By analyzing a dataset of past fixes of real high-priority faults, our analyses further provide evidence that mutants are indeed coupled with real faults. In other words, had mutation testing been used for the changes introducing the faults, it would have reported a live mutant that could have prevented the bug.
DOI: 10.1109/ICSE43902.2021.00087
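The mutant/test dynamic described above can be shown on a toy predicate of our own: the mutant flips `>=` to `>`, survives a suite with no boundary test, and is killed once the suggested test is added, which is exactly how a live mutant points at a concrete test to write.

```python
def is_adult(age):
    """Original program under test."""
    return age >= 18

def is_adult_mutant(age):
    """Mutant: ">=" mutated to ">" (an off-by-one boundary defect)."""
    return age > 18

def suite_passes(fn, cases):
    """Run a test suite: each case is (input, expected output)."""
    return all(fn(age) == expected for age, expected in cases)

weak_suite = [(30, True), (10, False)]    # no boundary test: mutant survives
strong_suite = weak_suite + [(18, True)]  # boundary test added, kills the mutant
```

Had the original later been "fixed" to `age > 18` by mistake, the boundary test written for the mutant would have caught the real fault, illustrating the mutant-fault coupling the paper measures.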
Identifying Key Features from App User Reviews
Authors: Wu, Huayao and Deng, Wenjun and Niu, Xintao and Nie, Changhai
Keywords: user reviews, key features, feature extraction, app store analysis
Abstract
Due to the rapid growth of and strong competition in the mobile application (app) market, app developers must not only offer users attractive new features, but also carefully maintain and improve existing features based on user feedback. User reviews are a rich source of information for planning such feature maintenance activities, and it can be of great benefit for developers to evaluate and magnify the contribution of specific features to the overall success of their apps. In this study, we refer to features that are highly correlated with app ratings as key features, and we present KEFE, a novel approach that leverages app descriptions and user reviews to identify the key features of a given app. KEFE relies on natural language processing, a deep learning classifier, and regression analysis, and involves three main steps: 1) extracting feature-describing phrases from the app description; 2) matching each app feature with its relevant user reviews; and 3) building a regression model to identify features that have significant relationships with app ratings. To train and evaluate KEFE, we collect 200 app descriptions and 1,108,148 user reviews from the Chinese Apple App Store. Experimental results demonstrate the effectiveness of KEFE in feature extraction, achieving an average F-measure of 78.13%. The identified key features are also likely to provide hints for successful app releases: for releases that receive higher app ratings, 70% of feature improvements are related to key features.
DOI: 10.1109/ICSE43902.2021.00088
CHAMP: Characterizing Undesired App Behaviors from User Comments based on Market Policies
Authors: Hu, Yangyu and Wang, Haoyu and Ji, Tiantong and Xiao, Xusheng and Luo, Xiapu and Gao, Peng and Guo, Yao
Keywords: undesired behavior, app market, User comment
Abstract
Millions of mobile apps are available through various app markets. Although most app markets enforce a number of automated or even manual mechanisms to vet each app before it is released, thousands of low-quality apps still exist in different markets, some of which violate explicitly specified market policies. To identify these violations accurately and in a timely manner, we resort to user comments, which provide immediate feedback to app market maintainers, to identify undesired behaviors that violate market policies, including security-related user concerns. Specifically, we present the first large-scale study to detect and characterize the correlations between user comments and market policies. First, we propose CHAMP, an approach that adopts text mining and natural language processing (NLP) techniques to extract semantic rules through a semi-automated process, and classifies comments into 26 pre-defined types of undesired behaviors that violate market policies. Our evaluation on real-world user comments shows that it achieves both high precision and high recall (> 0.9) in classifying comments about undesired behaviors. Then, we curate a large-scale comment dataset (over 3 million user comments) from apps in Google Play and 8 popular alternative Android app markets, and apply CHAMP to understand the characteristics of undesired-behavior comments in the wild. The results confirm our speculation that user comments can be used to pinpoint suspicious apps that violate policies declared by app markets. The study also reveals that policy violations are widespread in many app markets despite their extensive vetting efforts. CHAMP can serve as a whistle blower that assigns policy-violation scores and identifies the most informative comments for apps.
DOI: 10.1109/ICSE43902.2021.00089
Prioritize Crowdsourced Test Reports via Deep Screenshot Understanding
Authors: Yu, Shengcheng and Fang, Chunrong and Cao, Zhenfei and Wang, Xu and Li, Tongyu and Chen, Zhenyu
Keywords: Mobile App Testing, Deep Screenshot Understanding, Crowdsourced testing
Abstract
Crowdsourced testing is increasingly dominant in mobile application (app) testing, but inspecting the sheer number of test reports is a great burden for app developers. Many approaches have been proposed to process test reports based only on text, or additionally on simple image features. However, in mobile app testing, the text contained in test reports is condensed and the information is inadequate. Many screenshots are included as complements and contain much richer information beyond text. This trend motivates us to prioritize crowdsourced test reports based on deep screenshot understanding. In this paper, we present DeepPrior, a novel crowdsourced test report prioritization approach. We first represent crowdsourced test reports with a newly introduced feature, DeepFeature, which includes all the widgets along with their texts, coordinates, types, and even intents, based on deep analysis of the app screenshots and the textual descriptions in the reports. DeepFeature includes the Bug Feature, which directly describes the bug, and the Context Feature, which depicts the thorough context of the bug. The similarity of DeepFeatures, which we formally define as DeepSimilarity, is used to represent test report similarity and to prioritize the crowdsourced test reports. We also conduct an empirical experiment to evaluate the effectiveness of the proposed technique on a large dataset. The results show that DeepPrior is promising and outperforms the state-of-the-art approach with less than half the overhead.
DOI: 10.1109/ICSE43902.2021.00090
It Takes Two to Tango: Combining Visual and Textual Information for Detecting Duplicate Video-Based Bug Reports
Authors: Cooper, Nathan and Bernal-Cárdenas
Keywords: Screen Recordings, Duplicate Detection, Bug Reporting
Abstract
When a bug manifests in a user-facing application, it is likely to be exposed through the graphical user interface (GUI). Given the importance of visual information to the process of identifying and understanding such bugs, users are increasingly making use of screenshots and screen-recordings as a means to report issues to developers. However, when such information is reported en masse, such as during crowd-sourced testing, managing these artifacts can be a time-consuming process. As the reporting of screen-recordings in particular becomes more popular, developers are likely to face challenges related to manually identifying videos that depict duplicate bugs. Due to their graphical nature, screen-recordings present challenges for automated analysis that preclude the use of current duplicate bug report detection techniques. To overcome these challenges and aid developers in this task, this paper presents TANGO, a duplicate detection technique that operates purely on video-based bug reports by leveraging both visual and textual information. TANGO combines tailored computer vision techniques, optical character recognition, and text retrieval. We evaluated multiple configurations of TANGO in a comprehensive empirical evaluation on 4,860 duplicate detection tasks that involved a total of 180 screen-recordings from six Android apps. Additionally, we conducted a user study investigating the effort required for developers to manually detect duplicate video-based bug reports and compared this to the effort required to use TANGO. The results reveal that TANGO’s optimal configuration is highly effective at detecting duplicate video-based bug reports, accurately ranking target duplicate videos in the top-2 returned results in 83% of the tasks. Additionally, our user study shows that, on average, TANGO can reduce developer effort by over 60%, illustrating its practicality.
DOI: 10.1109/ICSE43902.2021.00091
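To illustrate the duplicate-ranking idea in the entry above, a minimal sketch of combining a visual similarity signal with a textual (OCR-token) signal follows. This is not the authors' TANGO implementation; the feature vectors, token sets, and the `w_visual` weight are hypothetical stand-ins.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def text_overlap(q_tokens, c_tokens):
    """Jaccard overlap between the OCR'd token sets of two recordings."""
    q, c = set(q_tokens), set(c_tokens)
    return len(q & c) / len(q | c) if q | c else 0.0

def rank_duplicates(query, candidates, w_visual=0.5):
    """Rank candidate reports by a weighted visual+textual score.
    query/candidates: dicts with 'id', 'visual' (vector), 'tokens' (OCR words)."""
    scored = []
    for c in candidates:
        s = (w_visual * cosine(query["visual"], c["visual"])
             + (1 - w_visual) * text_overlap(query["tokens"], c["tokens"]))
        scored.append((c["id"], s))
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

A true duplicate, similar both visually and in on-screen text, rises to the top of the ranked list.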
Automatically Matching Bug Reports With Related App Reviews
Authors: Haering, Marlo and Stanik, Christoph and Maalej, Walid
Keywords: software evolution, natural language processing, mining software repositories, deep learning, app store analytics
Abstract
App stores allow users to give valuable feedback on apps, and developers to find this feedback and use it for software evolution. However, matching user feedback to existing bug reports in issue trackers is challenging, as users and developers often use different language. In this work, we introduce DeepMatcher, an automatic approach using state-of-the-art deep learning methods to match problem reports in app reviews to bug reports in issue trackers. We evaluated DeepMatcher with four open-source apps quantitatively and qualitatively. On average, DeepMatcher achieved a hit ratio of 0.71 and a Mean Average Precision of 0.55. For 91 problem reports, DeepMatcher did not find any matching bug report. When manually analyzing these 91 problem reports and the issue trackers of the studied apps, we found that in 47 cases, users actually described a problem before developers discovered and documented it in the issue tracker. We discuss our findings and different use cases for DeepMatcher.
DOI: 10.1109/ICSE43902.2021.00092
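The hit ratio and Mean Average Precision reported above are standard retrieval metrics. A self-contained sketch of how they are computed (generic definitions, not the authors' code):

```python
def hit_ratio(rankings, relevant, k=10):
    """Fraction of queries whose top-k ranking contains a relevant item.
    rankings: {query: [item, ...]}, relevant: {query: set of items}."""
    hits = sum(1 for q, ranked in rankings.items()
               if set(ranked[:k]) & relevant.get(q, set()))
    return hits / len(rankings)

def mean_average_precision(rankings, relevant):
    """Mean of average precision over all queries (MAP)."""
    aps = []
    for q, ranked in rankings.items():
        rel = relevant.get(q, set())
        if not rel:
            continue
        num_rel, precisions = 0, []
        for i, item in enumerate(ranked, 1):
            if item in rel:
                num_rel += 1
                precisions.append(num_rel / i)
        aps.append(sum(precisions) / len(rel) if precisions else 0.0)
    return sum(aps) / len(aps)
```

For example, if one review's matching bug report is ranked first and another's is ranked second, the hit ratio at k=1 is 0.5 and the MAP is 0.75.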
What Makes a Great Maintainer of Open Source Projects?
Authors: Dias, Edson and Meirelles, Paulo and Castor, Fernando and Steinmacher, Igor and Wiese, Igor and Pinto, Gustavo
Keywords: open source maintainers, great attributes, Open source software
Abstract
Although Open Source Software (OSS) maintainers devote a significant proportion of their work to coding tasks, great maintainers must excel in many other activities beyond coding. Maintainers should care about fostering a community and helping new members find their place, while also saying “no” to patches that, although well-coded and well-tested, do not contribute to the goal of the project. To perform all these activities masterfully, maintainers must exercise attributes that software engineers (working on closed source projects) do not always need to master. This paper aims to uncover, relate, and prioritize the unique attributes that great OSS maintainers might have. To achieve this goal, we conducted 33 semi-structured interviews with highly experienced maintainers who are the gatekeepers of notable projects such as the Linux Kernel, the Debian operating system, and the GitLab coding platform. After we analyzed the interviews and curated a list of attributes, we created a conceptual framework to explain how these attributes are connected. We then conducted a rating survey with 90 OSS contributors. We noted that “technical excellence” and “communication” are the most recurring attributes. When grouped, these attributes fit into four broad categories: management, social, technical, and personality. While “sustain a long term vision of the project” and being “extremely careful” seem to form the basis of our framework, our survey showed that the communication attribute was perceived as the most essential one.
DOI: 10.1109/ICSE43902.2021.00093
Representation of Developer Expertise in Open Source Software
Authors: Dey, Tapajit and Karnauch, Andrey and Mockus, Audris
Keywords: World of Code, Vector Embedding, Skill Space, Project embedding, Open Source, Machine Learning, Expertise, Doc2Vec, Developer embedding, Developer Expertise, API embedding, API
Abstract
Background: Accurate representation of developer expertise has always been an important research problem. While a number of studies proposed novel methods of representing expertise within individual projects, these methods are difficult to apply at an ecosystem level. However, with the focus of software development shifting from monolithic to modular, a method of representing developers’ expertise in the context of the entire OSS development becomes necessary when, for example, a project tries to find new maintainers and looks for developers with relevant skills. Aim: We aim to address this knowledge gap by proposing and constructing the Skill Space, where each API, developer, and project is represented, and postulate how the topology of this space should reflect what developers know (and projects need). Method: We use the World of Code infrastructure to extract the complete set of APIs in the files changed by open source developers and, based on that data, employ Doc2Vec embeddings for vector representations of APIs, developers, and projects. We then evaluate whether these embeddings reflect the postulated topology of the Skill Space by predicting which new APIs/projects developers use/join, and whether or not their pull requests get accepted. We also check how the developers’ representations in the Skill Space align with their self-reported API expertise. Result: Our results suggest that the proposed embeddings in the Skill Space satisfy the postulated topology. We hope that such representations may aid in the construction of signals that increase trust (and efficiency) of open source ecosystems at large, and may aid investigations of other phenomena related to developer proficiency and learning.
DOI: 10.1109/ICSE43902.2021.00094
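A toy sketch of the postulated Skill Space topology: developers and projects are embedded near the APIs they use, and proximity (cosine similarity) suggests fit. The 2-d vectors and names below are hypothetical; the paper learns real embeddings with Doc2Vec.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity, the natural proximity measure in the Skill Space."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical 2-d API embeddings (stand-ins for learned Doc2Vec vectors).
api_vecs = {"numpy": [1.0, 0.1], "pandas": [0.9, 0.2], "react": [0.0, 1.0]}

# A developer's vector derives from the APIs they touch; a project's from
# the APIs it uses.
dev = centroid([api_vecs["numpy"], api_vecs["pandas"]])
data_project = centroid([api_vecs["numpy"], api_vecs["pandas"]])
web_project = centroid([api_vecs["react"]])
```

Under this topology the developer above sits much closer to the data-science project than to the web project, which is the kind of signal a project could use when searching for maintainers with relevant skills.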
Extracting Rationale for Open Source Software Development Decisions: A Study of Python Email Archives
Authors: Sharma, Pankajeshwara Nand and Savarimuthu, Bastin Tony Roy and Stanger, Nigel
Keywords: rationale, heuristics, decision-making, causal extraction, Rationale Miner, Python, Open Source Software Development (OSSD)
Abstract
A sound Decision-Making (DM) process is key to the successful governance of software projects. In many Open Source Software Development (OSSD) communities, DM processes lie buried amongst vast amounts of publicly available data. Hidden within this data lies the rationale for decisions that led to the evolution and maintenance of software products. While there have been some efforts to extract DM processes from publicly available data, the rationale behind ‘how’ decisions are made has seldom been explored. Extracting the rationale for these decisions can facilitate transparency (by making them known), and also promote accountability on the part of decision-makers. This work bridges this gap by means of a large-scale study that unearths the rationale behind decisions from Python development email archives comprising about 1.5 million emails. This paper makes two main contributions. First, it makes a knowledge contribution by unearthing and presenting the rationale behind decisions made. Second, it makes a methodological contribution by presenting a heuristics-based rationale extraction system called Rationale Miner that employs multiple heuristics and follows a data-driven, bottom-up approach to infer the rationale behind specific decisions (e.g., whether a new module is implemented based on core developer consensus or a benevolent dictator’s pronouncement). Our approach can be applied to extract rationale in other OSSD communities that have similar governance structures.
DOI: 10.1109/ICSE43902.2021.00095
Leaving My Fingerprints: Motivations and Challenges of Contributing to OSS for Social Good
Authors: Huang, Yu and Ford, Denae and Zimmermann, Thomas
Keywords: No keywords
Abstract
When inspiring software developers to contribute to open source software, the act is often framed as an opportunity to build tools that support the developer community. However, that is not the only charge that propels contributions: growing interest in open source has also been attributed to software developers deciding to use their technical skills to benefit a common societal good. To understand how developers identify these projects, their motivations for contributing, and the challenges they face, we conducted 21 semi-structured interviews with OSS for Social Good (OSS4SG) contributors. From our interview analysis, we identified themes of contribution styles that we wanted to understand at scale, so we deployed a survey to 5765 OSS and OSS4SG contributors. From our quantitative analysis of 517 responses, we find that the majority of contributors draw a distinction between OSS4SG and OSS. Likewise, contributors offered definitions based on which societal issue the project was to mitigate and who the outcomes of the project were going to benefit. In addition, we find that OSS4SG contributors focus less on benefiting themselves by padding their resume with new technology skills and are more interested in leaving their mark on society, at statistically significant levels. We also find that OSS4SG contributors evaluate the owners of a project significantly more than OSS contributors do. These findings have implications for helping contributors identify high-societal-impact projects, helping project maintainers reduce barriers to entry, and helping organizations understand why contributors are drawn to these projects, in order to sustain active participation.
DOI: 10.1109/ICSE43902.2021.00096
Onboarding vs. Diversity, Productivity, and Quality: Empirical Study of the OpenStack Ecosystem
Authors: Foundjem, Armstrong and Eghan, Ellis E. and Adams, Bram
Keywords: knowledge-transfer, contributors, Software ecosystems, Open source, Onboarding, Mentoring, Collaboration
Abstract
Despite the growing success of open-source software ecosystems (SECOs), their sustainability depends on recruiting and involving ever more contributors. As such, onboarding, i.e., the socio-technical adaptation of new contributors to a SECO, forms a significant aspect of a SECO’s growth that requires substantial resources. Unfortunately, despite theoretical models and initial user studies examining the potential benefits of onboarding, little is known about the process of SECO onboarding, or about the socio-technical benefits and drawbacks of contributors’ onboarding experience in a SECO. To address these gaps, we first carry out an observational study of 72 new contributors during an OpenStack onboarding event to provide a catalog of teaching content, teaching strategies, onboarding challenges, and expected benefits. Next, we empirically validate the extent to which diversity, productivity, and quality benefits are achieved by mining code changes, reviews, and contributors’ issues with(out) OpenStack onboarding experience. Among other findings, our study shows a significant correlation between onboarding and increased gender diversity (65% for both female and non-binary contributors) and patch acceptance rates (13.5%). Onboarding also has a significant negative correlation with the time until a contributor’s first commit and the bug-proneness of contributions.
DOI: 10.1109/ICSE43902.2021.00097
The Shifting Sands of Motivation: Revisiting What Drives Contributors in Open Source
Authors: Gerosa, Marco and Wiese, Igor and Trinkenreich, Bianca and Link, Georg and Robles, Gregorio and Treude, Christoph and Steinmacher, Igor and Sarma, Anita
Keywords: open source, motivation, incentive
Abstract
Open Source Software (OSS) has changed drastically over the last decade, with OSS projects now producing a large ecosystem of popular products, involving industry participation, and providing professional career opportunities. But our field’s understanding of what motivates people to contribute to OSS is still fundamentally grounded in studies from the early 2000s. With the changed landscape of OSS, it is very likely that motivations to join OSS have also evolved. Through a survey of 242 OSS contributors, we investigate shifts in motivation from three perspectives: (1) the impact of the new OSS landscape, (2) the impact of individuals’ personal growth as they become part of OSS communities, and (3) the impact of differences in individuals’ demographics. Our results show that some motivations related to social aspects and reputation increased in frequency and that some intrinsic and internalized motivations, such as learning and intellectual stimulation, are still highly relevant. We also found that contributing to OSS often transforms extrinsic motivations to intrinsic, and that while experienced contributors often shift toward altruism, novices often shift toward career, fun, kinship, and learning. OSS projects can leverage our results to revisit current strategies to attract and retain contributors, and researchers and tool builders can better support the design of new studies and tools to engage and support OSS development.
DOI: 10.1109/ICSE43902.2021.00098
White-Box Performance-Influence Models: A Profiling and Learning Approach
Authors: Weber, Max and Apel, Sven and Siegmund, Norbert
Keywords: software variability, software product lines, performance, Configuration management
Abstract
Many modern software systems are highly configurable, allowing the user to tune them for performance and more. Current performance modeling approaches aim at finding performance-optimal configurations by building performance models in a black-box manner. While these models provide accurate estimates, they cannot pinpoint causes of observed performance behavior to specific code regions. This not only hinders system understanding, it also complicates tracing the influence of configuration options down to individual methods. We propose a white-box approach that models configuration-dependent performance behavior at the method level. This allows us to predict the influence of configuration decisions on individual methods, supporting system understanding and performance debugging. The approach consists of two steps: First, we use a coarse-grained profiler and learn performance-influence models for all methods, potentially identifying some methods that are highly configuration- and performance-sensitive, causing inaccurate predictions. Second, we re-measure these methods with a fine-grained profiler and learn more accurate models, though at a higher cost. By means of 9 real-world Java software systems, we demonstrate that our approach can efficiently identify configuration-relevant methods and learn accurate performance-influence models.
DOI: 10.1109/ICSE43902.2021.00099
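A drastically simplified sketch of a method-level performance-influence model: estimate each option's influence on a method's runtime as the mean difference between measurements with the option enabled versus disabled. The paper learns richer models; the option names and timings below are made up.

```python
def option_influence(measurements):
    """Estimate each configuration option's influence on a method's
    runtime as mean(time | option on) - mean(time | option off).
    measurements: list of (config_dict, seconds) pairs."""
    options = sorted(measurements[0][0])
    influence = {}
    for opt in options:
        on = [t for cfg, t in measurements if cfg[opt]]
        off = [t for cfg, t in measurements if not cfg[opt]]
        influence[opt] = sum(on) / len(on) - sum(off) / len(off)
    return influence
```

An option with near-zero influence on a given method can be dropped from that method's model, which is the intuition behind profiling coarsely first and re-measuring only the configuration-sensitive methods.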
White-Box Analysis over Machine Learning: Modeling Performance of Configurable Systems
Authors: Velez, Miguel and Jamshidi, Pooyan and Siegmund, Norbert and Apel, Sven and Kästner, Christian
Keywords: No keywords
Abstract
Performance-influence models can help stakeholders understand how and where configuration options and their interactions influence the performance of a system. With this understanding, stakeholders can debug performance behavior and make deliberate configuration decisions. Current black-box techniques to build such models combine various sampling and learning strategies, resulting in tradeoffs between measurement effort, accuracy, and interpretability. We present Comprex, a white-box approach to build performance-influence models for configurable systems, combining insights from local measurements, dynamic taint analysis to track options in the implementation, compositionality, and compression of the configuration space, without relying on machine learning to extrapolate from incomplete samples. Our evaluation on 4 widely-used, open-source projects demonstrates that Comprex builds performance-influence models that are similarly accurate to those of the most accurate and expensive black-box approach, but at reduced cost and with the additional benefits of interpretable and local models.
DOI: 10.1109/ICSE43902.2021.00100
An Empirical Assessment of Global COVID-19 Contact Tracing Applications
Authors: Sun, Ruoxi and Wang, Wei and Xue, Minhui and Tyson, Gareth and Camtepe, Seyit and Ranasinghe, Damith C.
Keywords: No keywords
Abstract
The rapid spread of COVID-19 has made manual contact tracing difficult. Thus, various public health authorities have experimented with automatic contact tracing using mobile applications (or “apps”). These apps, however, have raised security and privacy concerns. In this paper, we propose an automated security and privacy assessment tool—COVIDGUARDIAN—which combines identification and analysis of Personal Identification Information (PII), static program analysis and data flow analysis, to determine security and privacy weaknesses. Furthermore, in light of our findings, we undertake a user study to investigate concerns regarding contact tracing apps. We hope that COVIDGUARDIAN, and the issues raised through responsible disclosure to vendors, can contribute to the safe deployment of mobile contact tracing. As part of this, we offer concrete guidelines, and highlight gaps between user requirements and app performance.
DOI: 10.1109/ICSE43902.2021.00101
Sustainable Solving: Reducing The Memory Footprint of IFDS-Based Data Flow Analyses Using Intelligent Garbage Collection
Authors: Arzt, Steven
Keywords: No keywords
Abstract
Static data flow analysis is an integral building block for many applications, ranging from compile-time code optimization to security and privacy analysis. When assessing whether a mobile app is trustworthy, for example, analysts need to identify which of the user’s personal data is sent to external parties such as the app developer or cloud providers. Since accessing and sending data is usually done via API calls, tracking the data flow between source and sink API is often the method of choice. Precise algorithms such as IFDS help reduce the number of false positives, but also introduce significant performance penalties. With its fixpoint iteration over the program’s entire exploded supergraph, IFDS is particularly memory-intensive, consuming hundreds of megabytes or even several gigabytes for medium-sized apps. In this paper, we present a technique called CleanDroid for reducing the memory footprint of a precise IFDS-based data flow analysis and demonstrate its effectiveness in the popular FlowDroid open-source data flow solver. CleanDroid efficiently removes edges from the path edge table used for the IFDS fixpoint iteration without affecting termination. As we show on 600 real-world Android apps from the Google Play Store, CleanDroid reduces the average per-app memory consumption by around 63% to 78%. At the same time, CleanDroid speeds up the analysis by up to 66%.
DOI: 10.1109/ICSE43902.2021.00102
Synthesizing Object State Transformers for Dynamic Software Updates
Authors: Zhao, Zelin and Jiang, Yanyan and Xu, Chang and Gu, Tianxiao and Ma, Xiaoxing
Keywords: program synthesis, object transformation, dynamic software update, Software maintenance and evolution
Abstract
There is an increasing demand for evolving software systems to deliver continuous service without restarts. Dynamic software update (DSU) aims to achieve this goal by patching the system state on the fly, but is currently hindered in practice by non-trivial cross-version object state transformations. This paper revisits this problem through an in-depth empirical study of over 190 class changes from Tomcat 8. The study produced an important finding: most non-trivial object state transformers can be constructed by reassembling existing old/new-version code snippets. This paper presents a domain-specific language and an efficient algorithm for synthesizing non-trivial object transformers via code reuse. We experimentally evaluated our tool implementation, PASTA, on real-world software systems, finding that PASTA succeeds in 7.5x as many non-trivial object transformation tasks as the best existing DSU techniques.
DOI: 10.1109/ICSE43902.2021.00103
Fast and Precise On-the-fly Patch Validation for All
Authors: Chen, Lingchao and Ouyang, Yicheng and Zhang, Lingming
Keywords: No keywords
Abstract
Generate-and-validate (G&V) automated program repair (APR) techniques have been extensively studied during the past decade. Meanwhile, such techniques can be extremely time-consuming due to the manipulation of program code to fabricate a large number of patches and the repeated test executions needed to identify potential fixes. PraPR, a recent G&V APR technique, reduces such costs by modifying program code directly at the level of compiled JVM bytecode with on-the-fly patch validation, which allows multiple bytecode patches to be tested within the same JVM process. However, PraPR is limited by its unique bytecode-repair design, and is basically unsound/imprecise, as it assumes that patch executions do not change global JVM state and thus do not affect later patch executions in the same JVM process. In this paper, we propose a unified patch validation framework, named UniAPR, to perform the first empirical study of on-the-fly patch validation for state-of-the-art source-code-level APR techniques widely studied in the literature; furthermore, UniAPR addresses the imprecise patch validation issue by resetting the JVM global state via runtime bytecode transformation. We have implemented UniAPR as a publicly available, fully automated Maven plugin. Our study demonstrates for the first time that on-the-fly patch validation can often speed up state-of-the-art source-code-level APR by over an order of magnitude, enabling all existing APR techniques to explore a larger search space to fix more bugs in the near future. Furthermore, our study shows the first empirical evidence that vanilla on-the-fly patch validation can be imprecise/unsound, while UniAPR with JVM reset is able to mitigate such issues with negligible overhead.
DOI: 10.1109/ICSE43902.2021.00104
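The on-the-fly validation idea above can be sketched as follows: candidate patches are validated inside one long-running process, with shared state reset between patch executions so that one patch cannot contaminate the next. This is a Python analogy for the JVM-level mechanism; all names are hypothetical.

```python
# Hypothetical global state that a patch execution might pollute.
GLOBAL_STATE = {"cache": {}}

def reset_state():
    """Stand-in for UniAPR's JVM-global-state reset between patches."""
    GLOBAL_STATE["cache"].clear()

def validate_patches(patches, tests):
    """Run every test against every candidate patch in the same process,
    resetting shared state first so earlier patches cannot affect later
    validations. patches: [(name, patched_fn)], tests: [fn -> bool]."""
    plausible = []
    for name, patched_fn in patches:
        reset_state()
        if all(test(patched_fn) for test in tests):
            plausible.append(name)
    return plausible
```

Reusing one process amortizes startup cost across patches, which is where the order-of-magnitude speedup comes from; the reset step is what keeps the validation sound.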
Bounded Exhaustive Search of Alloy Specification Repairs
Authors: Brida, Simón Gutiérrez and Regis, Germán and Zheng, Guolong and Bagheri, Hamid and Nguyen, ThanhVu and Aguirre, Nazareno and Frias, Marcelo
Keywords: No keywords
Abstract
The rising popularity of declarative languages and their hard-to-debug nature have motivated the need for applicable, automated repair techniques for such languages. However, despite significant advances in the program repair of imperative languages, there is a dearth of repair techniques for declarative languages. This paper presents BeAFix, an automated repair technique for faulty models written in Alloy, a declarative language based on first-order relational logic. BeAFix is backed by a novel strategy for bounded exhaustive, yet scalable, exploration of the space of fix candidates, together with a formally rigorous, sound pruning of that space. Moreover, unlike the state of the art in Alloy automated repair, which relies on the availability of unit tests, BeAFix does not require tests and can work with the assertions that are naturally used in formal declarative languages. Our experience using BeAFix to repair thousands of real-world faulty models, collected by other researchers, corroborates its ability to effectively generate correct repairs and outperform the state of the art.
DOI: 10.1109/ICSE43902.2021.00105
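The bounded exhaustive search with pruning described above can be sketched generically: enumerate candidate fixes up to a size bound, skip candidates rejected by a sound prune rule, and keep those satisfying the oracle. The predicates below are stand-ins, not Alloy assertions.

```python
from itertools import product

def bounded_exhaustive(tokens, max_len, satisfies, prune):
    """Enumerate candidate expressions up to max_len tokens, skipping any
    candidate that a sound prune rule rejects (a BeAFix-style search
    skeleton). satisfies/prune take a tuple of tokens and return bool."""
    fixes = []
    for n in range(1, max_len + 1):
        for cand in product(tokens, repeat=n):
            if prune(cand):
                continue  # sound pruning: no valid fix starts this way
            if satisfies(cand):
                fixes.append(cand)
    return fixes
```

Soundness of the prune rule matters: it may only cut candidates that provably cannot satisfy the oracle, so the search remains exhaustive within the bound.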
Shipwright: A Human-in-the-Loop System for Dockerfile Repair
Authors: Henkel, Jordan and Silva, Denini and Teixeira, Leopoldo and d’Amorim, Marcelo and Reps, Thomas
Keywords: Repair, Docker, DevOps
Abstract
Docker is a tool for lightweight OS-level virtualization. Docker images are created by performing a build, controlled by a source-level artifact called a Dockerfile. We studied Dockerfiles on GitHub, and—to our great surprise— found that over a quarter of the examined Dockerfiles failed to build (and thus to produce images). To address this problem, we propose Shipwright, a human-in-the-loop system for finding repairs to broken Dockerfiles. Shipwright uses a modified version of the BERT language model to embed build logs and to cluster broken Dockerfiles. Using these clusters and a search-based procedure, we were able to design 13 rules for making automated repairs to Dockerfiles. With the aid of Shipwright, we submitted 45 pull requests (with a 42.2% acceptance rate) to GitHub projects with broken Dockerfiles. Furthermore, in a “time-travel” analysis of broken Dockerfiles that were later fixed, we found that Shipwright proposed repairs that were equivalent to human-authored patches in 22.77% of the cases we studied. Finally, we compared our work with recent, state-of-the-art, static Dockerfile analyses, and found that, while static tools detected possible build-failure-inducing issues in 20.6-33.8% of the files we examined, Shipwright was able to detect possible issues in 73.25% of the files and, additionally, provide automated repairs for 18.9% of the files.
DOI: 10.1109/ICSE43902.2021.00106
CURE: Code-Aware Neural Machine Translation for Automatic Program Repair
Authors: Jiang, Nan and Lutellier, Thibaud and Tan, Lin
Keywords: software reliability, automatic program repair
Abstract
Automatic program repair (APR) is crucial to improve software reliability. Recently, neural machine translation (NMT) techniques have been used to fix software bugs automatically. While promising, these approaches have two major limitations. Their search space often does not contain the correct fix, and their search strategy ignores software knowledge such as strict code syntax. Due to these limitations, existing NMT-based techniques underperform the best template-based approaches. We propose CURE, a new NMT-based APR technique with three major novelties. First, CURE pre-trains a programming language (PL) model on a large software codebase to learn developer-like source code before the APR task. Second, CURE designs a new code-aware search strategy that finds more correct fixes by focusing on compilable patches and patches that are close in length to the buggy code. Finally, CURE uses a subword tokenization technique to generate a smaller search space that contains more correct fixes. Our evaluation on two widely-used benchmarks shows that CURE correctly fixes 57 Defects4J bugs and 26 QuixBugs bugs, outperforming all existing APR techniques on both benchmarks.
DOI: 10.1109/ICSE43902.2021.00107
A Differential Testing Approach for Evaluating Abstract Syntax Tree Mapping Algorithms
Authors: Fan, Yuanrui and Xia, Xin and Lo, David and Hassan, Ahmed E. and Wang, Yuan and Li, Shanping
Keywords: software evolution, abstract syntax trees, Program element mapping
Abstract
Abstract syntax tree (AST) mapping algorithms are widely used to analyze changes in source code. Despite the foundational role of AST mapping algorithms, little effort has been made to evaluate the accuracy of AST mapping algorithms, i.e., the extent to which an algorithm captures the evolution of code. We observe that a program element often has only one best-mapped program element. Based on this observation, we propose a hierarchical approach to automatically compare the similarity of mapped statements and tokens by different algorithms. By performing the comparison, we determine if each of the compared algorithms generates inaccurate mappings for a statement or its tokens. We invite 12 external experts to determine if three commonly used AST mapping algorithms generate accurate mappings for a statement and its tokens for 200 statements. Based on the experts’ feedback, we observe that our approach achieves a precision of 0.98-1.00 and a recall of 0.65-0.75. Furthermore, we conduct a large-scale study with a dataset of ten Java projects containing a total of 263,165 file revisions. Our approach determines that GumTree, MTDiff and IJM generate inaccurate mappings for 20%-29%, 25%-36% and 21%-30% of the file revisions, respectively. Our experimental results show that state-of-the-art AST mapping algorithms still need improvements.
DOI: 10.1109/ICSE43902.2021.00108
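A minimal sketch of the comparison idea above: where two algorithms map the same statement differently, flag the one whose mapped target is less similar to the statement. The real approach works hierarchically on statements and tokens; the token-overlap similarity below is a hypothetical stand-in.

```python
def flag_inaccurate(mapping_a, mapping_b, similarity):
    """Return statements for which mapping_a appears inaccurate: the two
    algorithms disagree and mapping_b's target is more similar.
    mapping_*: {stmt: mapped_stmt}; similarity(s, t) -> score in [0, 1]."""
    flagged = []
    for stmt in mapping_a.keys() & mapping_b.keys():
        ta, tb = mapping_a[stmt], mapping_b[stmt]
        if ta != tb and similarity(stmt, ta) < similarity(stmt, tb):
            flagged.append(stmt)
    return flagged
```

Run both directions (a vs. b, then b vs. a) to see which algorithm is flagged for a given statement.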
InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
Authors: Bui, Nghi D. Q. and Yu, Yijun and Jiang, Lingxiao
Keywords: No keywords
Abstract
Learning code representations has found many uses in software engineering, such as code classification, code search, comment generation, and bug prediction. Although representations of code as tokens, syntax trees, dependency graphs, paths in trees, or combinations of their variants have been proposed, existing learning techniques have a major limitation: these models are often trained on datasets labeled for specific downstream tasks, and as such the code representations may not be suitable for other tasks. Even though some techniques generate representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. To overcome this limitation, this paper proposes InferCode, which adapts the self-supervised learning idea from natural language processing to the abstract syntax trees (ASTs) of code. The novelty lies in training code representations by predicting subtrees automatically identified from the contexts of ASTs. With InferCode, subtrees in ASTs are treated as the labels for training the code representations without any human labelling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream task or code unit. We have trained an instance of the InferCode model using a Tree-Based Convolutional Neural Network (TBCNN) as the encoder on a large set of Java code. This pre-trained model can then be applied to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or be reused under a transfer-learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to prior techniques applied to the same downstream tasks, such as code2vec, code2seq, and ASTNN, our pre-trained InferCode model achieves higher performance by a significant margin for most of the tasks, including those involving different programming languages. The implementation of InferCode and the trained embeddings are available at: https://github.com/bdqnghi/infercode.
DOI: 10.1109/ICSE43902.2021.00109
Efficient Compiler Autotuning via Bayesian Optimization
Authors: Chen, Junjie and Xu, Ningxin and Chen, Peiqi and Zhang, Hongyu
Keywords: Configuration, Compiler Optimization, Compiler Autotuning, Bayesian Optimization
Abstract
A typical compiler such as GCC supports hundreds of optimizations controlled by compilation flags for improving the runtime performance of the compiled program. Due to the large number of compilation flags and the exponential number of flag combinations, it is impossible for compiler users to manually tune these optimization flags to achieve the required runtime performance of the compiled programs. Over the years, many compiler autotuning approaches have been proposed to automatically tune optimization flags, but they still suffer from efficiency problems due to the huge search space. In this paper, we propose the first Bayesian-optimization-based approach, called BOCA, for efficient compiler autotuning. In BOCA, we leverage a tree-based model to approximate the objective function, making Bayesian optimization scalable to a large number of optimization flags. Moreover, we design a novel search strategy to improve the efficiency of Bayesian optimization by incorporating the impact of each optimization flag measured by the tree-based model, along with a decay function to strike a balance between exploitation and exploration. We conduct extensive experiments to investigate the effectiveness of BOCA on the two most popular C compilers (i.e., GCC and LLVM) and two widely-used C benchmarks (i.e., cBench and PolyBench). The results show that BOCA significantly outperforms the state-of-the-art compiler autotuning approaches and Bayesian optimization methods in terms of the time spent on achieving specified speedups, demonstrating the effectiveness of BOCA.
DOI: 10.1109/ICSE43902.2021.00110
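A sketch of the exploitation/exploration balance described above: fully vary the flags the model deems impactful, and randomly toggle a number of the remaining flags that decays as the search progresses. The exponential decay form and the flag names are assumptions for illustration, not BOCA's exact design.

```python
import math
import random

def candidate_configs(important_flags, other_flags, iteration,
                      n=8, scale=10.0):
    """Generate n candidate flag settings for one autotuning iteration.
    Important flags (exploitation) are sampled freely; a decaying number
    k of the remaining flags is toggled on (exploration shrinks over
    time via an exponential decay, a hypothetical decay form)."""
    k = max(0, round(len(other_flags) * math.exp(-iteration / scale)))
    configs = []
    for _ in range(n):
        cfg = {f: random.choice([0, 1]) for f in important_flags}
        cfg.update({f: 0 for f in other_flags})
        for f in random.sample(other_flags, k):
            cfg[f] = 1  # explore a few of the less impactful flags
        configs.append(cfg)
    return configs
```

Early iterations explore many unimportant flags; late iterations concentrate the measurement budget on the flags the surrogate model ranks as impactful.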
TransRegex: Multi-modal Regular Expression Synthesis by Generate-and-Repair
Authors: Li, Yeting and Li, Shuaimin and Xu, Zhiwu and Cao, Jialun and Chen, Zixuan and Hu, Yun and Chen, Haiming and Cheung, Shing-Chi
Keywords: regex synthesis, regex repair, programming by natural languages, programming by example
Abstract
Since regular expressions (abbrev. regexes) are difficult to understand and compose, automatically generating regexes has been an important research problem. This paper introduces TransRegex, a tool for automatically constructing regexes from both natural language descriptions and examples. To the best of our knowledge, TransRegex is the first to treat the NLP-and-example-based regex synthesis problem as NLP-based synthesis followed by regex repair. For this purpose, we present novel algorithms for both NLP-based synthesis and regex repair. We evaluate TransRegex against ten relevant state-of-the-art tools on three publicly available datasets. The evaluation results demonstrate that the accuracy of TransRegex is 17.4%, 35.8% and 38.9% higher than that of NLP-based approaches on the three datasets, respectively. Furthermore, TransRegex achieves 10% to 30% higher accuracy than state-of-the-art multi-modal techniques on all three datasets. The evaluation results also indicate that TransRegex utilizes natural language and examples in a more effective way.
DOI: 10.1109/ICSE43902.2021.00111
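The generate-and-repair idea can be sketched as a small breadth-first search over token-level regex edits: start from a regex (pretend it came from NL-based synthesis) and apply tiny edits until all positive examples match and all negative ones are rejected. This is an illustrative toy, not TransRegex's actual repair algorithm.

```python
import re

def repair(candidate, positives, negatives, depth=2):
    """Toy generate-and-repair loop (a sketch, not TransRegex):
    apply small token-level edits to a synthesized regex until it
    accepts every positive example and rejects every negative one."""

    def ok(rx):
        try:
            pat = re.compile(rx)
        except re.error:
            return False
        return (all(pat.fullmatch(p) for p in positives)
                and not any(pat.fullmatch(n) for n in negatives))

    def edits(rx):
        toks = re.findall(r'\\.|\[[^\]]*\]|.', rx)  # crude tokenizer
        for i in range(len(toks)):
            for rep in (r'\d', r'\w', '[a-z]', ''):  # substitute or delete
                yield ''.join(toks[:i] + ([rep] if rep else []) + toks[i + 1:])

    frontier, seen = [candidate], {candidate}
    for _ in range(depth):
        nxt = []
        for rx in frontier:
            if ok(rx):
                return rx
            for e in edits(rx):
                if e not in seen:
                    seen.add(e)
                    nxt.append(e)
        frontier = nxt
    return next((rx for rx in frontier if ok(rx)), None)
```

For instance, a too-loose `\w+` is tightened to `\d+` once a negative example like `"ab"` is supplied.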
EvoSpex: An Evolutionary Algorithm for Learning Postconditions
Authors: Molina, Facundo and Ponzio, Pablo and Aguirre, Nazareno and Frias, Marcelo
Keywords: No keywords
Abstract
Software reliability is a primary concern in the construction of software, and thus a fundamental component in the definition of software quality. Analyzing software reliability requires a specification of the intended behavior of the software under analysis, and at the source code level, such specifications typically take the form of assertions. Unfortunately, software often lacks such specifications, or only provides them for scenario-specific behaviors, as assertions accompanying tests. This issue seriously diminishes the analyzability of software with respect to its reliability. In this paper, we tackle this problem by proposing a technique that, given a Java method, automatically produces a specification of the method’s current behavior, in the form of postcondition assertions. This mechanism is based on generating executions of the method under analysis to obtain valid pre/post state pairs, mutating these pairs to obtain (allegedly) invalid ones, and then using a genetic algorithm to produce an assertion that is satisfied by the valid pre/post pairs, while leaving out the invalid ones. The technique, which targets in particular methods of reference-based class implementations, is assessed on a benchmark of open source Java projects, showing that our genetic algorithm is able to generate postconditions that are stronger and more accurate than those generated by related automated approaches, as evaluated by an automated oracle assessment tool. Moreover, our technique is also able to infer an important part of manually written rich postconditions in verified classes, and to reproduce contracts for methods whose class implementations were automatically synthesized from specifications.
DOI: 10.1109/ICSE43902.2021.00112
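The pipeline in the abstract (valid pairs, mutated invalid pairs, genetic search for a separating assertion) can be sketched on integer pre/post states. Candidate assertions here are limited to the form `post <op> pre + delta`, a drastic simplification of EvoSpex's assertion language; all names are illustrative.

```python
import random

def evolve_postcondition(valid, invalid, gens=80, pop_size=40, seed=1):
    """Minimal GA sketch in the spirit of EvoSpex (not the real tool):
    evolve a predicate over (pre, post) pairs that holds on valid pairs
    and fails on mutated (invalid) ones. An individual (op, delta)
    denotes the assertion `post <op> pre + delta`."""
    rng = random.Random(seed)
    ops = {'==': lambda a, b: a == b,
           '>=': lambda a, b: a >= b,
           '<=': lambda a, b: a <= b}

    def fitness(ind):
        op, d = ind
        good = sum(ops[op](post, pre + d) for pre, post in valid)
        bad = sum(not ops[op](post, pre + d) for pre, post in invalid)
        return good + bad  # number of pairs classified correctly

    pop = [(rng.choice(list(ops)), rng.randint(-5, 5)) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # elitist selection
        children = [(op if rng.random() < 0.7 else rng.choice(list(ops)),
                     d + rng.choice((-1, 0, 1)))  # mutate the constant
                    for op, d in survivors]
        pop = survivors + children
    best = max(pop, key=fitness)
    return best, fitness(best)
```

For a method that increments its argument, the search should converge to `post == pre + 1`, which classifies all six sample pairs correctly.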
Interface Compliance of Inline Assembly: Automatically Check, Patch and Refine
Authors: Recoules, Frédéric
Keywords: No keywords
Abstract
Inline assembly is still a common practice in low-level C programming, typically for efficiency reasons or for accessing specific hardware resources. Such embedded assembly code in the GNU syntax (supported by major compilers such as GCC, Clang and ICC) has an interface specifying how the assembly interacts with the C environment. For simplicity reasons, the compiler treats GNU inline assembly chunks as black boxes and relies only on their interface to correctly glue them into the compiled C code. Therefore, the adequacy between an assembly chunk and its interface (named compliance) is of primary importance, as compliance issues can lead to subtle and hard-to-find bugs. We propose RUSTInA, the first automated technique for formally checking inline assembly compliance, with the extra ability to propose (proven) patches and (optimization) refinements in certain cases. RUSTInA is based on an original formalization of the inline assembly compliance problem together with novel dedicated algorithms. Our prototype has been evaluated on 202 Debian packages with inline assembly (2656 chunks), finding 2183 issues in 85 packages, of which 986 are significant issues spanning 54 packages (including major projects such as ffmpeg and ALSA), and proposing patches for 92% of them. Currently, 38 patches have already been accepted (solving 156 significant issues), with positive feedback from development teams.
DOI: 10.1109/ICSE43902.2021.00113
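At its core, compliance checking compares what an assembly chunk actually does against what its interface declares: every register the chunk reads should be a declared input, and every register it writes should be a declared output or clobber. A set-based sketch of that idea (RUSTInA's real formalization and binary-level analysis are far richer):

```python
def check_compliance(reads, writes, interface):
    """Sketch of the interface-compliance idea (simplified):
    `reads`/`writes` are the registers the chunk actually accesses,
    `interface` is the declared GNU-asm contract."""
    inputs = set(interface.get('inputs', ()))
    outputs = set(interface.get('outputs', ()))
    clobbers = set(interface.get('clobbers', ()))
    issues = []
    for r in sorted(set(reads) - inputs):
        issues.append(f'read of undeclared input {r}')
    for r in sorted(set(writes) - outputs - clobbers):
        issues.append(f'write to undeclared output/clobber {r}')
    return issues
```

A chunk that silently writes a scratch register (a classic missing-clobber bug) is flagged, while a chunk that sticks to its contract passes.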
Enabling Software Resilience in GPGPU Applications via Partial Thread Protection
Authors: Yang, Lishan and Nie, Bin and Jog, Adwait and Smirni, Evgenia
Keywords: Transient faults, Thread remapping, Reliability, GPGPU application resilience
Abstract
Graphics Processing Units (GPUs) are widely used by various applications in a broad variety of fields to accelerate their computation but remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. By taking advantage of the hierarchical organization of general-purpose GPU applications into threads, warps, and cooperative thread arrays, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. This allows engaging partial replication mechanisms for error detection/correction at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with only 1.63% overhead introduced on average, enabling selective protection via replication for those groups of threads that truly need it. Furthermore, we show that remapping threads to different warps does not sacrifice application performance. We show how this remapping facilitates warp replication for error detection and/or correction and achieves an average reduction of 20.61% and 27.15% in execution cycles, respectively, compared to standard duplication/triplication.
DOI: 10.1109/ICSE43902.2021.00114
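The remapping step can be sketched as a simple regrouping of thread IDs by resilience label, so that replication only needs to target the warps that contain unreliable threads. The paper's method for determining per-thread resilience and the actual replication machinery are not modeled here.

```python
def remap_threads(resilience, warp_size=32):
    """Sketch of resilience-aware thread remapping (simplified):
    `resilience[t]` is True if thread t is resilient. Group threads
    with the same label into the same warps so that replication can
    target only the unreliable warps."""
    reliable = [t for t, ok in enumerate(resilience) if ok]
    unreliable = [t for t, ok in enumerate(resilience) if not ok]
    order = reliable + unreliable
    warps = [order[i:i + warp_size] for i in range(0, len(order), warp_size)]
    # A warp needs replication only if it holds any unreliable thread.
    needs_protection = [any(not resilience[t] for t in w) for w in warps]
    return warps, needs_protection
```

With 48 reliable and 16 unreliable threads, only the second of the two 32-thread warps needs protection, instead of duplicating the whole kernel.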
Automatic Extraction of Opinion-based Q&A from Online Developer Chats
Authors: Chatterjee, Preetha and Damevski, Kostadin and Pollock, Lori
Keywords: public chats, opinion-asking question, opinion question-answering system, answer extraction
Abstract
Virtual conversational assistants designed specifically for software engineers could have a huge impact on the time it takes for software engineers to get help. Research efforts are focusing on virtual assistants that support specific software development tasks such as bug repair and pair programming. In this paper, we study the use of online chat platforms as a resource for collecting developer opinions that could potentially help in building opinion Q&A systems, as a specialized instance of virtual assistants and chatbots for software engineers. Opinion Q&A has a stronger presence in chats than in other developer communications, so mining them can provide a valuable resource for developers to quickly gain insight about a specific development topic (e.g., What is the best Java library for parsing JSON?). We address the problem of opinion Q&A extraction by developing automatic identification of opinion-asking questions and extraction of participants’ answers from public online developer chats. We evaluate our automatic approaches on chats spanning six programming communities and two platforms. Our results show that a heuristic approach to identifying opinion-asking questions works well (0.87 precision), and that a deep learning approach customized to the software domain outperforms heuristics-based, machine-learning-based, and deep learning approaches for answer extraction in community question answering.
DOI: 10.1109/ICSE43902.2021.00115
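A heuristic for spotting opinion-asking questions, of the kind the abstract evaluates, can be as simple as cue-phrase matching. The cue list below is hypothetical and much smaller than whatever the paper actually uses; it only illustrates the flavor of such a detector.

```python
# Hypothetical cue phrases for opinion-asking questions (illustrative only;
# the paper's heuristic and cue set differ).
OPINION_CUES = ('best', 'better', 'recommend', 'prefer', 'worth',
                'should i', 'any good', 'thoughts on')

def is_opinion_question(utterance):
    """True if the utterance looks like a question soliciting an opinion."""
    text = utterance.lower()
    return '?' in text and any(cue in text for cue in OPINION_CUES)
```

The detector fires on "What is the best Java library for parsing JSON?" but not on a plain statement of preference or a factual error report.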
Automated Query Reformulation for Efficient Search based on Query Logs From Stack Overflow
Authors: Cao, Kaibo and Chen, Chunyang and Baltes, Sebastian and Treude, Christoph and Chen, Xiang
Keywords: Stack Overflow, Query Reformulation, Query Logs, Deep Learning, Data Mining
Abstract
As a popular Q&A site for programming, Stack Overflow is a treasure trove for developers. However, the sheer volume of questions and answers on Stack Overflow makes it difficult for developers to efficiently locate the information they are looking for. Two gaps lead to poor search results: the gap between the user’s intention and the textual query, and the semantic gap between the query and the post content. Therefore, developers have to constantly reformulate their queries by correcting misspelled words, adding limitations to certain programming languages or platforms, etc. As query reformulation is tedious for developers, especially for novices, we propose an automated software-specific query reformulation approach based on deep learning. With query logs provided by Stack Overflow, we construct a large-scale query reformulation corpus, including the original queries and corresponding reformulated ones. Our approach trains a Transformer model that automatically generates candidate reformulated queries given the user’s original query. The evaluation results show that our approach outperforms five state-of-the-art baselines, achieving a 5.6% to 33.5% boost in terms of ExactMatch and a 4.8% to 14.4% boost in terms of GLEU.
DOI: 10.1109/ICSE43902.2021.00116
Automatic Solution Summarization for Crash Bugs
Authors: Wang, Haoye and Xia, Xin and Lo, David and Grundy, John and Wang, Xinyu
Keywords: No keywords
Abstract
The causes of software crashes can be hidden anywhere in the source code and development environment. When encountering software crashes, recurring bugs that are discussed on Q&A sites can provide developers with solutions to their crashing problems. However, it is difficult for developers to accurately search for relevant content on search engines, and they have to spend a lot of manual effort to find the right solution among the returned results. In this paper, we present CraSolver, an approach that takes into account both the structural information of crash traces and knowledge of crash-causing bugs to automatically summarize solutions from crash traces. Given a crash trace, CraSolver retrieves relevant questions from Q&A sites by combining a proposed position-dependent similarity (based on the structural information of the crash trace) with an extra knowledge similarity (based on knowledge from official documentation sites). After obtaining the answers to these questions from the Q&A site, CraSolver summarizes the final solution based on a multi-factor scoring mechanism. To evaluate our approach, we built two repositories of Java and Android exception-related questions from Stack Overflow, with sizes of 69,478 and 33,566 questions, respectively. Our user study using 50 selected Java crash traces and 50 selected Android crash traces shows that our approach significantly outperforms four baselines in terms of relevance, usefulness, and diversity. The evaluation also confirms the effectiveness of the relevant-question retrieval component of our approach for crash traces.
DOI: 10.1109/ICSE43902.2021.00117
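A position-dependent similarity over crash traces can be sketched by weighting frame matches by their depth, so that frames near the top of the trace (closest to the crash site) dominate the score. The decay weighting below is an illustrative simplification; CraSolver's actual formula differs.

```python
def position_similarity(trace_a, trace_b, decay=0.8):
    """Sketch of a position-dependent crash-trace similarity
    (inspired by the idea in the abstract, not the paper's formula):
    a matching frame at depth d contributes weight decay**d, so
    top-of-trace frames count the most."""
    score = total = 0.0
    for depth, frame in enumerate(trace_a):
        weight = decay ** depth
        total += weight
        if depth < len(trace_b) and trace_b[depth] == frame:
            score += weight
    return score / total if total else 0.0
```

Two traces that disagree only in a deep frame are scored as more similar than two traces that disagree at the crash site itself.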
Supporting Quality Assurance with Automated Process-Centric Quality Constraints Checking
Authors: Mayr-Dorn, Christoph and Vierhauser, Michael and Bichler, Stefan and Keplinger, Felix and Cleland-Huang, Jane and Egyed, Alexander and Mehofer, Thomas
Keywords: traceability, software engineering process, developer support
Abstract
Regulations, standards, and guidelines for safety-critical systems stipulate stringent traceability but do not prescribe the corresponding, detailed software engineering process. Given the industrial practice of using only semi-formal notations to describe engineering processes, processes are rarely “executable”, and developers have to spend significant manual effort ensuring that they follow the steps mandated by quality assurance. The size and complexity of systems and regulations make manual, timely feedback from Quality Assurance (QA) engineers infeasible. In this paper we propose a novel framework for tracking processes in the background, automatically checking QA constraints depending on process progress, and informing the developer of unfulfilled QA constraints. We evaluate our approach by applying it to two different case studies: an open source community system and a safety-critical system in the air-traffic control domain. Results from the analysis show that trace links are often corrected or completed after the fact, and thus timely, automated constraint checking has significant potential for reducing rework.
DOI: 10.1109/ICSE43902.2021.00118
Understanding Bounding Functions in Safety-Critical UAV Software
Authors: Liang, Xiaozhou and Burns, John Henry and Sanchez, Joseph and Dantu, Karthik and Ziarek, Lukasz and Liu, Yu David
Keywords: unmanned aerial vehicles, safety, bounding functions
Abstract
Unmanned Aerial Vehicles (UAVs) are an emerging computation platform known for their safety-critical needs. In this paper, we conduct an empirical study on a widely used open-source UAV software framework, Paparazzi, with the goal of understanding the safety-critical concerns of UAV software from a bottom-up, developer-in-the-field perspective. We focus on the use of Bounding Functions (BFs), the runtime checks injected by Paparazzi developers on the ranges of variables. Through an in-depth analysis of BFs in the Paparazzi autopilot software, we found that a large number of them (109 instances) are used to bound safety-critical variables essential to the cyber-physical nature of the UAV, such as its thrust, its speed, and its sensor values. The novel contributions of this study are twofold. First, we take a static approach to classify all BF instances, presenting a novel datatype-based 5-category taxonomy with fine-grained insight into the role of BFs in ensuring the safety of UAV systems. Second, we dynamically evaluate the impact of the BF uses through a differential approach, establishing the UAV behavioral difference with and without BFs. The two-pronged static and dynamic approach together illuminates a rarely studied design space of safety-critical UAV software systems.
DOI: 10.1109/ICSE43902.2021.00119
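A bounding function is essentially a clamp on a safety-critical variable. A minimal sketch of the idea follows; Paparazzi's real BFs are C macros, and the thrust limit used in the test is made up for illustration.

```python
def bound(value, lo, hi):
    """Generic bounding function in the style of the runtime range
    checks studied in the paper: clamp `value` into [lo, hi]."""
    return lo if value < lo else hi if value > hi else value

# Hypothetical actuator limit, purely illustrative (not a Paparazzi value).
MAX_THRUST = 9600
```

A commanded thrust outside the safe actuator range is silently clamped rather than passed through to the hardware.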
Enhancing Genetic Improvement of Software with Regression Test Selection
Authors: Guizzo, Giovani and Petke, Justyna and Sarro, Federica and Harman, Mark
Keywords: Search Based Software Engineering, Regression Test Selection, Genetic Programming, Genetic Improvement
Abstract
Genetic improvement uses artificial intelligence to automatically improve software with respect to non-functional properties (AI for SE). In this paper, we propose the use of existing software engineering best practice to enhance Genetic Improvement (SE for AI). We conjecture that existing Regression Test Selection (RTS) techniques (which have been proven to be efficient and effective) can and should be used as a core component of the GI search process to maximise its effectiveness. To assess our idea, we have carried out a thorough empirical study of the use of both dynamic and static RTS techniques with GI to improve seven real-world software programs. The results of our empirical evaluation show that incorporating RTS within GI significantly speeds up the whole GI process, making it up to 68% faster on our benchmark set while still producing valid software improvements. Our findings are significant in that they can save hours to days of computational time, and can facilitate the uptake of GI in an industrial setting by significantly reducing the time for the developer to receive feedback from such an automated technique. Therefore, we recommend the use of RTS in future test-based automated software improvement work. Finally, we hope this successful application of SE for AI will encourage other researchers to investigate further applications in this area.
DOI: 10.1109/ICSE43902.2021.00120
Containing Malicious Package Updates in npm with a Lightweight Permission System
Authors: Ferreira, Gabriel and Jia, Limin and Sunshine, Joshua and Kästner, Christian
Keywords: supply-chain security, security, sand-boxing, permission system, package management, malicious package updates, design trade-offs
Abstract
The large number of third-party packages available in fast-moving software ecosystems, such as Node.js/npm, enables attackers to compromise applications by pushing malicious updates to their package dependencies. Studying the npm repository, we observed that many packages used in Node.js applications perform only simple computations and do not need access to filesystem or network APIs. This offers the opportunity to enforce least-privilege design per package, protecting applications and package dependencies from malicious updates. We propose a lightweight permission system that protects Node.js applications by enforcing package permissions at runtime. We discuss the design space of solutions and show that our system makes a large number of packages much harder to exploit, almost for free.
DOI: 10.1109/ICSE43902.2021.00121
Too Quiet in the Library: An Empirical Study of Security Updates in Android Apps’ Native Code
Authors: Almanee, Sumaya and Ünal, Arda
Keywords: No keywords
Abstract
Android apps include third-party native libraries to increase performance and to reuse functionality. Native code is directly executed from apps through the Java Native Interface or the Android Native Development Kit. Android developers add precompiled native libraries to their projects, enabling their use. Unfortunately, developers often struggle or simply neglect to update these libraries in a timely manner. This results in the continuous use of outdated native libraries with unpatched security vulnerabilities years after patches became available. To further understand this phenomenon, we study the security updates in native libraries in the 200 most popular free apps on Google Play from Sept. 2013 to May 2020. A core difficulty we face in this study is the identification of libraries and their versions: developers often rename or modify libraries, making their identification challenging. We create an approach called LibRARIAN (LibRAry veRsion IdentificAtioN) that accurately identifies native libraries and their versions as found in Android apps based on our novel similarity metric bin2sim. LibRARIAN leverages different features extracted from libraries based on their metadata and identifying strings in read-only sections. We discovered 53/200 popular apps (26.5%) with vulnerable library versions with known CVEs between Sept. 2013 and May 2020, with 14 of those apps remaining vulnerable. We find that app developers took, on average, 528.71 ± 40.20 days to apply security patches, while library developers released a security patch after 54.59 ± 8.12 days on average, meaning apps pick up patches roughly ten times more slowly than they are released.
DOI: 10.1109/ICSE43902.2021.00122
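Version identification via feature similarity can be sketched with a Jaccard-style metric over features extracted from binaries (identifying strings, exported symbols). `bin2sim_like` is a simplification, not the paper's exact bin2sim definition, and the library names and threshold below are invented.

```python
def bin2sim_like(features_a, features_b):
    """Jaccard-style similarity over binary features; a simplified
    stand-in for the paper's bin2sim metric."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def identify_version(unknown, reference_db, threshold=0.6):
    """Pick the best-matching known library version, if any.
    `reference_db` maps version names to their feature sets."""
    best = max(reference_db,
               key=lambda v: bin2sim_like(unknown, reference_db[v]))
    if bin2sim_like(unknown, reference_db[best]) >= threshold:
        return best
    return None
```

An unknown binary whose features overlap mostly with one catalogued version is attributed to it; if nothing scores above the threshold, the library stays unidentified.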
If It’s Not Secure, It Should Not Compile: Preventing DOM-Based XSS in Large-Scale Web Development with API Hardening
Authors: Wang, Pei and Bangert, Julian and Kern, Christoph
Keywords: language-based security, empirical software engineering, cross-site scripting, Web security
Abstract
Despite extensive mitigation efforts, cross-site scripting (XSS) remains one of the most prevalent security threats on the internet. Decades of exploitation and remediation have demonstrated that code inspection and testing alone do not eliminate XSS vulnerabilities in complex web applications with a high degree of confidence. This paper introduces Google’s secure-by-design engineering paradigm that effectively prevents DOM-based XSS vulnerabilities in large-scale web development. Our approach, named API hardening, enforces a series of company-wide secure coding practices. We provide a set of secure APIs to replace native DOM APIs that are prone to XSS vulnerabilities. Through a combination of type contracts and appropriate validation and escaping, the secure APIs ensure that applications built on them are free of XSS vulnerabilities. We deploy a simple yet capable compile-time checker to guarantee that developers exclusively use our hardened APIs to interact with the DOM. We have made various efforts to scale this approach to tens of thousands of engineers without significant productivity impact. By offering rigorous tooling and consultant support, we help developers adopt the secure coding practices as seamlessly as possible. We present empirical results showing how API hardening has helped reduce the occurrence of XSS vulnerabilities in Google’s enormous code base over the course of a two-year deployment.
DOI: 10.1109/ICSE43902.2021.00123
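The type-contract idea (sinks accept only values that are safe by construction) can be sketched in a few lines. Google's actual hardened APIs target web languages and enforce this at compile time; this Python sketch merely mimics the pattern, with a dict standing in for a DOM element, and every name in it is illustrative.

```python
import html

_BUILDER_TOKEN = object()  # only sanctioned builders hold this

class SafeHtml:
    """Value type whose instances are XSS-safe by construction;
    a sketch of the type-contract idea, not a real API."""
    def __init__(self, markup, _token=None):
        if _token is not _BUILDER_TOKEN:
            raise TypeError('use a sanctioned builder such as escape_text()')
        self.markup = markup

def escape_text(untrusted):
    """Sanctioned builder: escaping makes the result safe to render."""
    return SafeHtml(html.escape(untrusted), _token=_BUILDER_TOKEN)

def set_inner_html(element, content):
    """Hardened sink: refuses anything that is not a SafeHtml."""
    if not isinstance(content, SafeHtml):
        raise TypeError('sink requires SafeHtml, got a raw string')
    element['innerHTML'] = content.markup
```

A raw string can no longer flow into the sink, so an injection attempt must pass through the escaping builder first.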
Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project
Authors: Paul, Rajshakhar and Turzo, Asif Kamal and Bosu, Amiangshu
Keywords: vulnerability, security, code review
Abstract
Peer code review has been found to be effective in identifying security vulnerabilities. However, despite practicing mandatory code reviews, many Open Source Software (OSS) projects still encounter a large number of post-release security vulnerabilities, as some security defects escape review. A project manager may therefore wonder whether there was any weakness or inconsistency during a code review that missed a security vulnerability. Answers to this question may help a manager pinpoint areas of concern and take measures to improve the effectiveness of his/her project’s code reviews in identifying security defects. This study therefore aims to identify the factors that differentiate code reviews that successfully identified security defects from those that missed such defects. With this goal, we conduct a case-control study of the Chromium OS project. Using multi-stage semi-automated approaches, we build a dataset of 516 code reviews that successfully identified security defects and 374 code reviews where security defects escaped. The results of our empirical study suggest that there are significant differences between the categories of security defects that are identified and those that are missed during code reviews. A logistic regression model fitted on our dataset achieved an AUC score of 0.91 and identified nine code review attributes that influence the identification of security defects. While the time to complete a review, the number of mutual reviews between two developers, and whether the review is for a bug fix have positive impacts on vulnerability identification, opposite effects are observed from the number of directories under review, the number of total reviews by a developer, and the total number of prior commits for the file under review.
DOI: 10.1109/ICSE43902.2021.00124
Technical Leverage in a Software Ecosystem: Development Opportunities and Security Risks
Authors: Massacci, Fabio and Pashchenko, Ivan
Keywords: vulnerabilities, technical debt, software security, maven, leverage, free open source software, empirical analysis, dependencies
Abstract
In finance, leverage is the ratio between assets borrowed from others and one’s own assets. A matching situation is present in software: by using free open-source software (FOSS) libraries, a developer leverages other people’s code to multiply the offered functionality with a much smaller codebase of their own. In finance as in software, leverage magnifies profits when returns from borrowing exceed costs of integration, but it may also magnify losses, in particular in the presence of security vulnerabilities. We aim to understand the level of technical leverage in the FOSS ecosystem and whether it can be a potential source of security vulnerabilities. We also introduce two metrics, change distance and change direction, to capture the amount and the evolution of the dependency on third-party libraries. Applying the proposed metrics to 8494 distinct library versions from the FOSS Maven-based Java libraries shows that small and medium libraries (less than 100KLoC) have disproportionately more leverage on FOSS dependencies in comparison to large libraries. We show that leverage pays off, as leveraged libraries only add a 4% delay in the time interval between library releases while providing four times more code than their own. However, libraries with such leverage (i.e., 75% of libraries in our sample) also have 1.6 times higher odds of being vulnerable in comparison to libraries with lower leverage. We provide an online demo for computing the proposed metrics for real-world software libraries, available at the following URL: https://techleverage.eu/.
DOI: 10.1109/ICSE43902.2021.00125
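The core metric is a simple ratio, and the change metrics can be read as a vector between two successive releases in the (own code, borrowed code) plane. The sketch below illustrates the arithmetic with made-up numbers; the paper's exact definitions of change distance and change direction may differ.

```python
import math

def technical_leverage(own_loc, dependency_loc):
    """Leverage as the ratio of borrowed code to own code."""
    return dependency_loc / own_loc

def change_metrics(prev, curr):
    """Sketch of change distance / change direction between two
    releases; prev and curr are (own_loc, dependency_loc) pairs.
    (Illustrative only; the paper's definitions may differ.)"""
    d_own = curr[0] - prev[0]
    d_dep = curr[1] - prev[1]
    distance = math.hypot(d_own, d_dep)      # magnitude of the change
    direction = math.atan2(d_dep, d_own)     # angle toward borrowed code
    return distance, direction
```

A library with 25 KLoC of its own depending on 100 KLoC of third-party code has a leverage of 4, matching the abstract's "four times more code than their own".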
RAICC: Revealing Atypical Inter-Component Communication in Android Apps
Authors: Samhi, Jordan and Bartel, Alexandre and Bissyande, Tegawende F. and Klein, Jacques
Keywords: Static Analysis, Android Security
Abstract
Inter-Component Communication (ICC) is a key mechanism in Android. It enables developers to compose rich functionality and explore reuse within and across apps. Unfortunately, as reported by a large body of literature, ICC is rather “complex and largely unconstrained”, leaving room for imprecision when modeling apps. To address the challenge of tracking ICCs within apps, state-of-the-art static approaches such as Epicc, IccTA and Amandroid have focused on the documented framework ICC methods (e.g., startActivity) to build their analyses. In this work we show that the ICC models inferred by these state-of-the-art tools may actually be incomplete: the framework provides other, atypical ways of performing ICCs. To address this limitation, we propose RAICC, a static approach for modeling new ICC links and thus boosting previous analysis tasks such as ICC vulnerability detection, privacy leak detection, and malware detection. We have evaluated RAICC on 20 benchmark apps, demonstrating that it improves the precision and recall of uncovered leaks in state-of-the-art tools. We have also performed a large empirical investigation showing that atypical ICC methods are largely used in Android apps, although not necessarily for data transfer. We also show that RAICC increases the number of ICC links found by 61.6% on a dataset of real-world malicious apps, and that RAICC enables the detection of new ICC vulnerabilities.
DOI: 10.1109/ICSE43902.2021.00126
Smart Contract Security: a Practitioners’ Perspective
Authors: Wan, Zhiyuan and Xia, Xin and Lo, David and Chen, Jiachi and Luo, Xiapu and Yang, Xiaohu
Keywords: Smart contract, Security, Practitioner, Empirical study
Abstract
Smart contracts have been plagued by security incidents, which resulted in substantial financial losses. Given numerous research efforts in addressing the security issues of smart contracts, we wondered how software practitioners build security into smart contracts in practice. We performed a mixture of qualitative and quantitative studies with 13 interviewees and 156 survey respondents from 35 countries across six continents to understand practitioners’ perceptions and practices on smart contract security. Our study uncovers practitioners’ motivations and deterrents of smart contract security, as well as how security efforts and strategies fit into the development lifecycle. We also find that blockchain platforms have a statistically significant impact on practitioners’ security perceptions and practices of smart contract development. Based on our findings, we highlight future research directions and provide recommendations for practitioners.
DOI: 10.1109/ICSE43902.2021.00127
AID: An automated detector for gender-inclusivity bugs in OSS project pages
Authors: Chatterjee, Amreeta and Guizani, Mariam and Stevens, Catherine and Emard, Jillian and May, Mary Evelyn and Burnett, Margaret and Ahmed, Iftekhar and Sarma, Anita
Keywords: open source, information processing, automation, Gender inclusivity
Abstract
The tools and infrastructure used in tech, including Open Source Software (OSS), can embed “inclusivity bugs”: features that disproportionately disadvantage particular groups of contributors. To see whether OSS developers have existing practices to ward off such bugs, we surveyed 266 OSS developers. Our results show that a majority (77%) of developers do not use any inclusivity practices, and 92% of respondents cited a lack of concrete resources to enable them to do so. To help fill this gap, this paper introduces AID, a tool that automates the GenderMag method to systematically find gender-inclusivity bugs in software. We then present the results of the tool’s evaluation on 20 GitHub projects. The tool achieved a precision of 0.69, a recall of 0.92, and an F-measure of 0.79, and even captured some inclusivity bugs that human GenderMag teams missed.
DOI: 10.1109/ICSE43902.2021.00128
“Ignorance and Prejudice” in Software Fairness
Authors: Zhang, Jie M. and Harman, Mark
Keywords: software fairness, machine learning fairness
Abstract
Machine learning software can be unfair when making human-related decisions, exhibiting prejudice against certain groups of people. Existing work primarily focuses on proposing fairness metrics and presenting fairness improvement approaches. It remains unclear how key aspects of any machine learning system, such as the feature set and the training data, affect fairness. This paper presents results from a comprehensive study that addresses this problem. We find that enlarging the feature set plays a significant role in fairness (with an average effect rate of 38%). Importantly, and contrary to the widely held belief that greater fairness often corresponds to lower accuracy, our findings reveal that an enlarged feature set yields both higher accuracy and higher fairness. Perhaps also surprisingly, we find that larger training data does not help to improve fairness. Our results suggest that a larger training data set exhibits more unfairness than a smaller one when feature sets are insufficient; an important cautionary finding for practising software engineers.
DOI: 10.1109/ICSE43902.2021.00129
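Studies of this kind rely on group-fairness metrics. One common example, used here purely for illustration (the paper works with its own set of metrics), is statistical parity difference: the gap in positive-decision rates between two groups.

```python
def statistical_parity_difference(decisions, group):
    """Difference in positive-decision rates between group 0 (treated
    here as unprivileged) and group 1 (privileged). `decisions` and
    `group` are parallel lists of 0/1 values; 0.0 means parity."""
    def rate(g):
        picked = [d for d, grp in zip(decisions, group) if grp == g]
        return sum(picked) / len(picked)
    return rate(0) - rate(1)
```

A negative value means the unprivileged group receives positive decisions less often, which is the kind of gap the studied interventions try to shrink.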
Semi-supervised Log-based Anomaly Detection via Probabilistic Label Estimation
Authors: Yang, Lin and Chen, Junjie and Wang, Zan and Wang, Weijing and Jiang, Jiajun and Dong, Xuyuan and Zhang, Wenbin
Keywords: Probabilistic Estimation, Log Analysis, Label, Deep Learning, Anomaly Detection
Abstract
With the growth of software systems, logs have become an important source of data to aid system maintenance. Log-based anomaly detection is one of the most important methods for this purpose, aiming to automatically detect system anomalies via log analysis. However, existing log-based anomaly detection approaches still suffer from practical issues, either depending on a large amount of manually labeled training data (supervised approaches) or giving unsatisfactory performance without learning from historical anomalies (unsupervised and semi-supervised approaches). In this paper, we propose a novel practical log-based anomaly detection approach, PLELog, which is semi-supervised to avoid time-consuming manual labeling and incorporates knowledge of historical anomalies via probabilistic label estimation to bring supervised approaches’ superiority into play. In addition, PLELog is able to stay immune to unstable log data via semantic embedding, and detects anomalies efficiently and effectively with an attention-based GRU neural network. We evaluated PLELog on the two most widely used public datasets, and the results demonstrate its effectiveness, significantly outperforming the compared approaches with an average improvement of 181.6% in terms of F1-score. In particular, PLELog has been applied to two real-world systems from our university and a large corporation, further demonstrating its practicality.
DOI: 10.1109/ICSE43902.2021.00130
DeepLV: Suggesting Log Levels Using Ordinal Based Neural Networks
Authors: Li, Zhenhao and Li, Heng and Chen, Tse-Hsun Peter and Shang, Weiyi
Keywords: logs, log level, empirical study, deep learning
Abstract
Developers write logging statements to generate logs that provide valuable runtime information for debugging and maintaining software systems. Log level is an important component of a logging statement that enables developers to control the information to be generated at system runtime. However, due to the complexity of software systems and their runtime behaviors, deciding a proper log level for a logging statement is a challenging task. For example, choosing a higher level (e.g., error) for a trivial event may confuse end users and increase system maintenance overhead, while choosing a lower level (e.g., trace) for a critical event may prevent important execution information from being conveyed in a timely manner. In this paper, we tackle this challenge by first conducting a preliminary manual study on the characteristics of log levels. We find that the syntactic context of the logging statement and the message to be logged might be related to the choice of log level, and that log levels further apart in order (e.g., trace and error) tend to differ more in their characteristics. Based on this, we then propose a deep-learning based approach that leverages the ordinal nature of log levels to suggest log levels, using the syntactic context and message features of logging statements extracted from the source code. Through an evaluation on nine large-scale open source projects, we find that: 1) our approach outperforms the state-of-the-art baseline approaches; 2) we can further improve the performance of our approach by enlarging the training data obtained from other systems; 3) our approach also achieves promising results on cross-system suggestions that are even better than the baseline approaches on within-system suggestions. Our study highlights the potential of suggesting log levels to help developers make informed logging decisions.
DOI: 10.1109/ICSE43902.2021.00131
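The ordinal treatment of log levels described in the abstract can be illustrated with a small encoding sketch. This is not the paper's implementation; the five-level ordering, function names, and cumulative-binary target are assumptions chosen for illustration:

```python
# Hypothetical sketch of ordinal encoding for log levels: levels further apart
# in order (e.g., trace vs. error) differ in more target bits, which is what
# lets a model exploit the ordinal nature of log levels.
LEVELS = ["trace", "debug", "info", "warn", "error"]

def encode_level(level):
    """Cumulative binary target: trace -> [0,0,0,0], error -> [1,1,1,1]."""
    rank = LEVELS.index(level)
    return [1 if i < rank else 0 for i in range(len(LEVELS) - 1)]

def decode_level(outputs, threshold=0.5):
    """Decode model outputs by counting entries above the threshold."""
    rank = sum(1 for v in outputs if v >= threshold)
    return LEVELS[rank]
```

With this target shape, a prediction that confuses trace with debug incurs a smaller loss than one that confuses trace with error, mirroring the paper's observation about levels that are further apart in order.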
How to Identify Boundary Conditions with Contrasty Metric?
作者: Luo, Weilin and Wan, Hai and Song, Xiaotong and Yang, Binhao and Zhong, Hongzhen and Chen, Yin
关键词: Goal-Oriented Requirement Engineering, Goal-Conflict Identification, Boundary Conditions
Abstract
Boundary conditions (BCs) have shown great potential in requirements engineering because a BC captures a particular combination of circumstances, i.e., a divergence, in which the goals of the requirement cannot be satisfied as a whole. Existing research has attempted to identify BCs automatically. Unfortunately, the large number of identified BCs makes assessing and resolving divergences expensive. Existing methods adopt a coarse-grained metric, generality, to filter out less general BCs. However, the results still retain many redundant BCs, since a general BC potentially captures redundant circumstances that do not lead to a divergence. Furthermore, the likelihood of a BC can be misled by redundant BCs, resulting in costly, repeated assessment and resolution of divergences. In this paper, we present a fine-grained metric to filter out the redundant BCs. We first introduce the concept of contrasty of BCs. Intuitively, if two BCs are contrastive, they capture different divergences. We argue that a set of contrastive BCs should be recommended to engineers, rather than a set of general BCs that potentially only indicate the same divergence. We then design a post-processing framework (PPFc) that produces a set of contrastive BCs after identifying BCs. Experimental results show that the contrasty metric dramatically reduces the number of BCs recommended to engineers. The results also demonstrate that many BCs identified by the state-of-the-art method are redundant in most cases. In addition, to improve efficiency, we propose a joint framework (JFc) that interleaves assessment based on the contrasty metric with the identification of BCs. The primary intuition behind JFc is that it biases the search toward contrastive BCs while identifying BCs, thereby pruning BCs that capture the same divergence. Experiments confirm the improvements of JFc in identifying contrastive BCs.
DOI: 10.1109/ICSE43902.2021.00132
Using Domain-specific Corpora for Improved Handling of Ambiguity in Requirements
作者: Ezzini, Saad and Abualhaija, Sallam and Arora, Chetan and Sabetzadeh, Mehrdad and Briand, Lionel C.
关键词: Wikipedia, Requirements Engineering, Natural-language Requirements, Natural Language Processing, Corpus Generation, Ambiguity
Abstract
Ambiguity in natural-language requirements is a pervasive issue that has been studied by the requirements engineering community for more than two decades. A fully manual approach for addressing ambiguity in requirements is tedious and time-consuming, and may further overlook unacknowledged ambiguity - the situation where different stakeholders perceive a requirement as unambiguous but, in reality, interpret the requirement differently. In this paper, we propose an automated approach that uses natural language processing for handling ambiguity in requirements. Our approach is based on the automatic generation of a domain-specific corpus from Wikipedia. Integrating domain knowledge, as we show in our evaluation, leads to a significant positive improvement in the accuracy of ambiguity detection and interpretation. We scope our work to coordination ambiguity (CA) and prepositional-phrase attachment ambiguity (PAA) because of the prevalence of these types of ambiguity in natural-language requirements [1]. We evaluate our approach on 20 industrial requirements documents. These documents collectively contain more than 5000 requirements from seven distinct application domains. Over this dataset, our approach detects CA and PAA with an average precision of ~{a
DOI: 10.1109/ICSE43902.2021.00133
On Indirectly Dependent Documentation in the Context of Code Evolution: A Study
作者: Sondhi, Devika and Gupta, Avyakt and Purandare, Salil and Rana, Ankit and Kaushal, Deepanshu and Purandare, Rahul
关键词: documentation, commits, code evolution, GitHub repositories
Abstract
A software system evolves over time due to factors such as bug fixes, enhancements, optimizations, and deprecation. As entities in a software repository interact, alterations made at one point may require changes to be reflected at various other points to maintain consistency. However, less attention is often given to making appropriate changes to the documentation associated with functions. Inconsistent documentation is undesirable, since documentation serves as a useful source of information about functionality. This paper presents a study on the prevalence of function documentation that is indirectly or implicitly dependent on entities other than the associated function. We observe a substantial presence of such documentation: 62% of the Javadoc comments studied in 11 open-source repositories implemented in Java depend on other entities. We comprehensively analyze the nature of documentation updates made in 1288 commit logs and study patterns to reason about the cause of dependency in the documentation. Our findings from the observed patterns may be applied to suggest documentation that should be updated when a change is made in the repository.
DOI: 10.1109/ICSE43902.2021.00134
CodeShovel: Constructing Method-Level Source Code Histories
作者: Grund, Felix and Chowdhury, Shaiful and Bradley, Nick C. and Hall, Braxton and Holmes, Reid
关键词: No keywords
Abstract
Source code histories are commonly used by developers and researchers to reason about how software evolves. Through a survey with 42 professional software developers, we learned that developers face significant mismatches between the output provided by their existing tools for examining source code histories and what they need to successfully complete their historical analysis tasks. To address these shortcomings, we propose CodeShovel, a tool for uncovering method histories that quickly produces complete and accurate change histories for 90% of methods (including 97% of all method changes), outperforming leading tools from both research (e.g., FinerGit) and practice (e.g., IntelliJ / git log). CodeShovel helps developers navigate the entire history of source code methods so they can better understand how a method evolved. A field study on industrial code bases with 16 industrial developers confirmed our empirical findings of CodeShovel’s correctness and low runtime overheads, and additionally showed that the approach can be useful for a wide range of industrial development tasks.
DOI: 10.1109/ICSE43902.2021.00135
Evaluating Unit Testing Practices in R Packages
作者: Vidoni, Melina
关键词: No keywords
Abstract
Testing Technical Debt (TTD) occurs due to shortcuts (non-optimal decisions) taken about testing; it is the test dimension of technical debt. R is a package-based programming ecosystem that provides an easy way to install third-party code, datasets, tests, documentation and examples. This structure makes it especially vulnerable to TTD because errors present in a package can transitively affect all packages and scripts that depend on it. Thus, TTD can effectively become a threat to the validity of all analyses written in R that rely on potentially faulty code. This two-part study provides the first analysis in this area. First, 177 systematically-selected, open-source R packages were mined and analysed to assess the quality of testing and testing goals, and to identify potential TTD sources. Second, a survey addressed how R package developers perceive testing and face its challenges (response rate of 19.4%). Results show that testing in R packages is of low quality; the most common smells are inadequate and obscure unit testing, improper asserts, inexperienced testers and improper test design. Furthermore, skilled R developers still face challenges such as time constraints, an emphasis on development rather than testing, poor tool documentation and a steep learning curve.
DOI: 10.1109/ICSE43902.2021.00136
Data-Oriented Differential Testing of Object-Relational Mapping Systems
作者: Sotiropoulos, Thodoris and Chaliasos, Stefanos and Atlidakis, Vaggelis and Mitropoulos, Dimitris and Spinellis, Diomidis
关键词: Object-Relational Mapping, Differential Testing, Automated Testing
Abstract
We introduce what is, to the best of our knowledge, the first approach for systematically testing Object-Relational Mapping (ORM) systems. Our approach leverages differential testing to establish a test oracle for ORM-specific bugs. Specifically, we first generate random relational database schemas, set up the respective databases, and then query these databases using the APIs of the ORM systems under test. To tackle the challenge that ORMs lack a common input language, we generate queries written in an abstract query language. These abstract queries are translated into concrete, executable ORM queries, which are ultimately used to differentially test the correctness of the target implementations. The effectiveness of our method heavily relies on the data inserted into the underlying databases. Therefore, we employ a solver-based approach for producing targeted database records with respect to the constraints of the generated queries. We implement our approach as a tool, called CYNTHIA, which found 28 bugs in five popular ORM systems. The vast majority of these bugs are confirmed (25 / 28), more than half were fixed (20 / 28), and three were marked as release blockers by the corresponding developers.
DOI: 10.1109/ICSE43902.2021.00137
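The differential oracle described above can be sketched in a few lines. The backend callables and row format below are stand-ins for illustration, not CYNTHIA's actual API:

```python
# Minimal sketch of a differential test oracle: run the same (already
# translated) query on several ORM backends and flag pairs that disagree.
def differential_oracle(query, backends):
    """backends: mapping name -> callable(query) -> list of result rows.
    Returns pairs of backend names whose results differ. Rows are sorted
    first so that mere ordering differences do not count as disagreement."""
    results = {name: sorted(run(query)) for name, run in backends.items()}
    names = sorted(results)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if results[a] != results[b]]
```

Any pair reported by the oracle points to an ORM-specific bug in at least one of the two implementations, which is exactly the signal differential testing provides in the absence of a ground-truth oracle.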
Automatic Unit Test Generation for Machine Learning Libraries: How Far Are We?
作者: Wang, Song and Shrestha, Nishtha and Subburaman, Abarna Kucheri and Wang, Junjie and Wei, Moshi and Nagappan, Nachiappan
关键词: testing machine learning libraries, test case generation, Empirical software engineering
Abstract
Automatic unit test generation, which explores the input space and produces effective test cases for given programs, has been studied for decades. Many unit test generation tools that can generate unit test cases with high structural coverage over a program have been examined. However, the fact that existing test generation tools are mainly evaluated on general software programs calls into question their practical effectiveness and usefulness for machine learning libraries, which are statistically oriented and fundamentally different in nature and construction from general software projects. In this paper, we set out to investigate the effectiveness of existing unit test generation techniques on machine learning libraries. To investigate this issue, we conducted an empirical study on five widely-used machine learning libraries with two popular unit test case generation tools, i.e., EVOSUITE and Randoop. We find that (1) most of the machine learning libraries do not maintain a high-quality unit test suite with regard to commonly applied quality metrics such as code coverage (34.1% on average) and mutation score (21.3% on average), (2) unit test case generation tools, i.e., EVOSUITE and Randoop, lead to clear improvements in code coverage and mutation score, but the improvement is limited, and (3) there exist common patterns in the uncovered code across the five machine learning libraries that can be used to improve unit test case generation.
DOI: 10.1109/ICSE43902.2021.00138
Layout and Image Recognition Driving Cross-Platform Automated Mobile Testing
作者: Yu, Shengcheng and Fang, Chunrong and Yun, Yexiao and Feng, Yang
关键词: Record and Replay, Mobile Testing, Image Analysis, Cross-Platform Testing
Abstract
The fragmentation problem has extended from Android to different platforms, such as iOS, mobile web, and even mini-programs within some applications (apps), like WeChat. In this situation, recording and replaying test scripts is one of the most popular automated mobile app testing approaches. However, such an approach encounters severe problems when crossing platforms. Different versions of the same app need to be developed to support different platforms, relying on different platform support. Therefore, mobile app developers need to develop and maintain test scripts for multiple platforms aimed at exactly the same test requirements, greatly increasing testing costs. However, we discover that developers adopt highly similar user interface layouts for versions of the same app on different platforms. This phenomenon inspires us to replay test scripts from the perspective of similar UI layouts. In this paper, we propose an image-driven mobile app testing framework, utilizing Widget Feature Matching and Layout Characterization Matching to analyze app UIs. We use computer vision (CV) technologies to perform UI feature comparison and layout hierarchy extraction on mobile app screenshots to obtain UI structures containing rich contextual information about app widgets, including coordinates, relative relationships, etc. Based on the acquired UI structures, we can form a platform-independent test script and then locate the target widgets under test. Thus, the proposed framework non-intrusively replays test scripts according to a novel platform-independent test script model. We also design and implement a tool named LIRAT to put the proposed framework into practice, based on which we conduct an empirical study to evaluate the effectiveness and usability of the proposed testing framework. The results show that the overall replay accuracy reaches around 65.85% on Android (an 8.74% improvement over state-of-the-art approaches) and 35.26% on iOS (a 35% improvement over state-of-the-art approaches).
DOI: 10.1109/ICSE43902.2021.00139
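The layout-matching idea behind this framework can be sketched with a tiny nearest-box matcher. LIRAT combines layout information with CV-based widget feature matching; the function name, the normalized box representation, and the L1 distance below are assumptions for illustration only:

```python
# Illustrative sketch of layout-based widget matching: given the recorded
# widget's normalized bounding box, pick the on-screen candidate whose box
# is closest in the layout.
def match_widget(target_box, candidate_boxes):
    """Boxes are (x, y, w, h) tuples normalized to [0, 1], so the match is
    resolution-independent across platforms. Returns the index of the
    closest candidate box."""
    def distance(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))
    return min(range(len(candidate_boxes)),
               key=lambda i: distance(target_box, candidate_boxes[i]))
```

Normalizing coordinates is what makes a script recorded on one screen size replayable on another, which is the cross-platform premise the abstract describes.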
FlakeFlagger: Predicting Flakiness Without Rerunning Tests
作者: Alshammari, Abdulrahman and Morris, Christopher and Hilton, Michael and Bell, Jonathan
关键词: No keywords
Abstract
When developers make changes to their code, they typically run regression tests to detect if their recent changes (re)introduce any bugs. However, many tests are flaky, and their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests, and hence, it is important to know which tests are flaky. The traditional approach to identify flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some previously identified flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts tests that are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled as flaky at least as many tests as a state-of-the-art flaky test classifier, but that FlakeFlagger reported far fewer false positives. This lower false positive rate translates directly to saved time for researchers and developers who use the classification result to guide more expensive flaky test detection processes. Evaluated on our dataset of 23 projects with flaky tests, FlakeFlagger outperformed the prior approach (by F1 score) on 16 projects and tied on 4 projects. Our results indicate that this approach can be effective for identifying likely flaky tests prior to running time-consuming flaky test detectors.
DOI: 10.1109/ICSE43902.2021.00140
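The rerun-based labeling that FlakeFlagger aims to avoid can be sketched in a few lines. The outcome strings and function name are assumptions for illustration:

```python
# Sketch of the traditional rerun-based labeling the paper contrasts with:
# a test observed both passing and failing on the same code is flaky. The
# point of FlakeFlagger is to predict this label from behavioral features
# instead of paying for thousands of reruns.
def label_flaky_by_rerun(outcomes):
    """outcomes: mapping test name -> list of observed results ('pass'/'fail').
    Returns the set of tests with non-deterministic outcomes."""
    return {test for test, runs in outcomes.items() if len(set(runs)) > 1}
```

As the abstract notes, even 10,000 reruns can miss flaky tests, so this labeling gives only a lower bound on flakiness; a predictor trained on behavioral features sidesteps the rerun cost entirely.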
An Empirical Analysis of UI-based Flaky Tests
作者: Romano, Alan and Song, Zihe and Grandhi, Sampath and Yang, Wei and Wang, Weihang
关键词: No keywords
Abstract
Flaky tests have gained attention from the research community in recent years, and with good reason. These tests lead to wasted time and resources, and they reduce the reliability of the test suites and build systems they affect. However, most of the existing work on flaky tests focuses exclusively on traditional unit tests. This work ignores UI tests, which have larger input spaces and more diverse running conditions than traditional unit tests. In addition, UI tests tend to be more complex and resource-heavy, making them unsuited for detection techniques that involve rerunning test suites multiple times. In this paper, we perform a study on flaky UI tests. We analyze 235 flaky UI test samples found in 62 projects from both web and Android environments. We identify the common underlying root causes of flakiness in UI tests, the strategies used to manifest the flaky behavior, and the fixing strategies used to remedy flaky UI tests. The findings made in this work can provide a foundation for the development of detection and prevention techniques for flakiness arising in UI tests.
DOI: 10.1109/ICSE43902.2021.00141
GenTree: Using Decision Trees to Learn Interactions for Configurable Software
作者: Nguyen, KimHao and Nguyen, ThanhVu
关键词: No keywords
Abstract
Modern software systems are increasingly designed to be highly configurable, which increases flexibility but can make programs harder to develop, test, and analyze: e.g., how are configuration options set to reach certain locations, and what characterizes the configuration space of an interesting or buggy program behavior? We introduce GenTree, a new dynamic analysis that automatically learns a program’s interactions—logical formulae that describe how configuration option settings map to code coverage. GenTree uses an iterative refinement approach that runs the program under a small sample of configurations to obtain coverage data; uses a custom classifying algorithm on these data to build decision trees representing interaction candidates; and then analyzes the trees to generate new configurations that further refine the trees and interactions in the next iteration. Our experiments on 17 configurable systems spanning 4 languages show that GenTree efficiently finds precise interactions using a tiny fraction of the configuration space.
DOI: 10.1109/ICSE43902.2021.00142
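The degenerate single-option case of the interaction learning described above can be sketched as follows. GenTree itself builds and iteratively refines decision trees over sampled configurations; the function names, Boolean-option model, and sampling parameters here are assumptions for illustration:

```python
import random

def infer_single_option_interactions(run, options, samples=64, seed=0):
    """Sample random Boolean configurations, record whether the location of
    interest is covered, and report options whose value alone perfectly
    predicts coverage. `run(config)` stands in for executing the
    instrumented program; it returns True iff the location was covered."""
    rng = random.Random(seed)
    configs = [{o: rng.choice([False, True]) for o in options}
               for _ in range(samples)]
    covered = [run(c) for c in configs]
    interactions = {}
    for o in options:
        if all(c[o] == cov for c, cov in zip(configs, covered)):
            interactions[o] = True   # covered iff option is enabled
        elif all(c[o] != cov for c, cov in zip(configs, covered)):
            interactions[o] = False  # covered iff option is disabled
    return interactions
```

Decision trees generalize this idea to conjunctions and disjunctions of options, and the iterative loop then picks new configurations that discriminate between competing candidate trees.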
Semantic Web Accessibility Testing via Hierarchical Visual Analysis
作者: Bajammal, Mohammad and Mesbah, Ali
关键词: web testing, web accessibility, visual analysis, accessibility testing
Abstract
Web accessibility, the design of web apps to be usable by users with disabilities, impacts millions of people around the globe. Although accessibility has traditionally been a marginal afterthought that is often ignored in many software products, it is increasingly becoming a legal requirement that must be satisfied. While some web accessibility testing tools exist, most only perform rudimentary syntactical checks that do not assess the more important high-level semantic aspects that users with disabilities rely on. Accordingly, assessing web accessibility has largely remained a laborious manual process requiring human input. In this paper, we propose an approach, called AxeRay, that infers semantic groupings of various regions of a web page and their semantic roles. We evaluate our approach on 30 real-world websites and assess the accuracy of semantic inference as well as the ability to detect accessibility failures. The results show that AxeRay achieves, on average, an F-measure of 87% for inferring semantic groupings, and is able to detect accessibility failures with 85% accuracy.
DOI: 10.1109/ICSE43902.2021.00143
Restoring Execution Environments of Jupyter Notebooks
作者: Wang, Jiawei and Li, Li and Zeller, Andreas
关键词: Python, Jupyter Notebook, Environment, API
Abstract
More than ninety percent of published Jupyter notebooks do not state dependencies on external packages. This makes them non-executable and thus hinders the reproducibility of scientific results. We present SnifferDog, an approach that 1) collects the APIs of Python packages and versions, creating a database of APIs; 2) analyzes notebooks to determine candidates for required packages and versions; and 3) checks which packages are required to make the notebook executable (and, ideally, reproduce its stored results). In our evaluation, we show that SnifferDog precisely restores execution environments for the vast majority of notebooks, making them immediately executable for end users.
DOI: 10.1109/ICSE43902.2021.00144
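Step 2 of the approach, determining candidate packages from a notebook, can be sketched with the standard `ast` and `json` modules. Mapping import names to PyPI package names and versions (SnifferDog's API database) is out of scope here, and the function name is an assumption:

```python
import ast
import json

def imported_packages(notebook_json):
    """Scan a notebook's code cells for top-level imported package names,
    i.e., candidates that an execution environment would have to provide."""
    nb = json.loads(notebook_json)
    packages = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        if isinstance(src, list):          # nbformat stores source as lines
            src = "".join(src)
        try:
            tree = ast.parse(src)
        except SyntaxError:
            continue                       # skip cells with magics etc.
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                packages.update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages
```

The hard part SnifferDog solves begins after this step: an import name like `sklearn` must still be resolved to the right package (`scikit-learn`) and to a version whose API matches the calls in the notebook.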
PyART: Python API Recommendation in Real-Time
作者: He, Xincheng and Xu, Lei and Zhang, Xiangyu and Hao, Rui and Feng, Yang and Xu, Baowen
关键词: real-time recommendation, data flow analysis, context analysis, Python, API recommendation
Abstract
API recommendation in real-time is challenging for dynamic languages like Python. Many existing API recommendation techniques are highly effective, but they mainly support static languages. A few Python IDEs provide API recommendation functionality based on type inference and training on a large corpus of Python standard and third-party libraries. As such, they may fail to recommend or make poor recommendations when type information is missing or target APIs are project-specific. In this paper, we propose a novel approach, PyART, to recommend APIs for Python programs in real-time. It features a light-weight analysis to derive so-called optimistic data-flow, which is neither sound nor complete, but simulates the local data-flow information humans can derive. It extracts three kinds of features in the context of the program point where a recommendation is solicited: data-flow, token similarity, and token co-occurrence. A predictive model is trained on these features using the Random Forest algorithm. Evaluation on 8 popular Python projects demonstrates that PyART can provide effective API recommendations. When historic commits can be leveraged, which is the target scenario of the state-of-the-art tool APIREC, our average top-1 accuracy is over 50% and average top-10 accuracy over 70%, outperforming APIREC and Intellicode (i.e., the recommendation component in Visual Studio) by 28.48%-39.05% for top-1 accuracy and 24.41%-30.49% for top-10 accuracy. In other settings, such as when historic commits are not available and in cross-project recommendation, PyART also shows better overall performance. The time to make a recommendation is less than a second on average, satisfying the real-time requirement.
DOI: 10.1109/ICSE43902.2021.00145
PyCG: Practical Call Graph Generation in Python
作者: Salis, Vitalis and Sotiropoulos, Thodoris and Louridas, Panos and Spinellis, Diomidis and Mitropoulos, Dimitris
关键词: Vulnerability Propagation, Program Analysis, Inter-procedural Analysis, Call Graph
Abstract
Call graphs play an important role in different contexts, such as profiling and vulnerability propagation analysis. Generating call graphs in an efficient manner can be a challenging task when it comes to high-level languages that are modular and incorporate dynamic features and higher-order functions. Despite Python’s popularity, there have been very few tools aiming to generate call graphs for Python programs. Worse, these tools suffer from several effectiveness issues that limit their practicality in realistic programs. We propose a pragmatic, static approach for call graph generation in Python. We compute all assignment relations between program identifiers of functions, variables, classes, and modules through an inter-procedural analysis. Based on these assignment relations, we produce the resulting call graph by resolving all calls to potentially invoked functions. Notably, the underlying analysis is designed to be efficient and scalable, handling several Python features, such as modules, generators, function closures, and multiple inheritance. We have evaluated our prototype implementation, which we call PyCG, using two benchmarks: a micro-benchmark suite containing small Python programs and a set of macro-benchmarks with several popular real-world Python packages. Our results indicate that PyCG can efficiently handle thousands of lines of code in less than a second (0.38 seconds for 1k LoC on average). Further, it outperforms the state-of-the-art for Python in both precision and recall: PyCG achieves high rates of precision ~99.2%, and adequate recall ~69.9%. Finally, we demonstrate how PyCG can aid dependency impact analysis by showcasing a potential enhancement to GitHub’s “security advisory” notification service using a real-world example.
DOI: 10.1109/ICSE43902.2021.00146
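The flavor of static call-graph extraction can be shown with a toy sketch built on Python's `ast` module. This handles only direct name calls inside top-level functions; PyCG's real analysis additionally resolves assignment relations across functions, classes, closures, and modules:

```python
import ast

def call_graph(source):
    """Toy call-graph extractor: for each top-level function, record the
    names it calls directly. No assignment resolution, no inter-procedural
    reasoning - just the starting point such analyses build on."""
    tree = ast.parse(source)
    graph = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    callees.add(sub.func.id)
            graph[node.name] = callees
    return graph
```

The gap between this sketch and a practical tool is exactly what the abstract describes: a call like `f = g; f()` requires tracking assignment relations before the call target can be resolved.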
Seamless Variability Management With the Virtual Platform
作者: Mahmood, Wardah and Strüber, Daniel
关键词: variability management, software product lines, re-engineering, framework, clone management
Abstract
Customization is a general trend in software engineering, demanding systems that support variable stakeholder requirements. Two opposing strategies are commonly used to create variants: software clone&own and software configuration with an integrated platform. Organizations often start with the former, which is cheap, agile, and supports quick innovation, but does not scale. The latter scales by establishing an integrated platform that shares software assets between variants, but requires high up-front investments or risky migration processes. So, could we have a method that allows an easy transition or even combine the benefits of both strategies? We propose a method and tool that supports a truly incremental development of variant-rich systems, exploiting a spectrum between both opposing strategies. We design, formalize, and prototype the variability-management framework virtual platform. It bridges clone&own and platform-oriented development. Relying on programming-language-independent conceptual structures representing software assets, it offers operators for engineering and evolving a system, comprising: traditional, asset-oriented operators and novel, feature-oriented operators for incrementally adopting concepts of an integrated platform. The operators record meta-data that is exploited by other operators to support the transition. Among others, they eliminate expensive feature-location effort or the need to trace clones. Our evaluation simulates the evolution of a real-world, clone-based system, measuring its costs and benefits.
DOI: 10.1109/ICSE43902.2021.00147
Fine with “1234”? An Analysis of SMS One-Time Password Randomness in Android Apps
作者: Ma, Siqi and Li, Juanru and Kim, Hyoungshick and Bertino, Elisa and Nepal, Surya and Ostry, Diethelm and Sun, Cong
关键词: Vulnerability Detection, Randomness Evaluation, Pseudo-Random Number Generator, OTP Authentication Protocol, Mobile Application Security
Abstract
A fundamental premise of SMS One-Time Passwords (OTPs) is that the pseudo-random numbers (PRNs) used are uniquely unpredictable for each login session. Hence, the process of generating PRNs is the most critical step in OTP authentication. An improper implementation of the pseudo-random number generator (PRNG) will result in predictable or even static OTP values, making them vulnerable to potential attacks. In this paper, we present a vulnerability study of PRNGs implemented for Android apps. A key challenge is that PRNGs are typically implemented on the server side, and thus the source code is not accessible. To resolve this issue, we build an analysis tool, OTP-Lint, to assess implementations of PRNGs in an automated manner without requiring source code. Through reverse engineering, OTP-Lint identifies the apps using SMS OTP and triggers each app’s login functionality to retrieve OTP values. It further assesses the randomness of the OTP values to identify vulnerable PRNGs. By analyzing 6,431 commercially used Android apps downloaded from Google Play and Tencent Myapp, OTP-Lint identified 399 vulnerable apps that generate predictable OTP values. Even worse, 194 vulnerable apps use OTP authentication alone without any additional security mechanisms, leaving authentication insecure against guessing and replay attacks.
DOI: 10.1109/ICSE43902.2021.00148
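The randomness assessment over retrieved OTP values can be illustrated with a crude check. OTP-Lint applies proper randomness evaluation; the two heuristics and the function name below are illustrative assumptions only:

```python
# Crude stand-in for an OTP randomness check: flag a sequence of retrieved
# OTP values as predictable if values repeat (possibly static OTPs) or if
# they form an arithmetic sequence (e.g., 1234, 1235, 1236).
def otp_looks_predictable(otps):
    """otps: list of OTP strings retrieved across login sessions."""
    if len(set(otps)) < len(otps):
        return True                       # repeated or static values
    nums = [int(o) for o in otps]
    diffs = {b - a for a, b in zip(nums, nums[1:])}
    return len(diffs) == 1                # constant stride => predictable
```

Either finding violates the premise stated in the abstract, that the PRNs behind each session's OTP must be uniquely unpredictable.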
App’s Auto-Login Function Security Testing via Android OS-Level Virtualization
作者: Song, Wenna and Ming, Jiang and Jiang, Lin and Yan, Han and Xiang, Yi and Chen, Yuan and Fu, Jianming and Peng, Guojun
关键词: No keywords
Abstract
Limited by the small keyboard, most mobile apps support an automatic login feature for a better user experience, so users avoid the inconvenience of retyping their ID and password when an app runs in the foreground again. However, this auto-login function can be exploited to launch the so-called “data-clone attack”: once the locally-stored data on which auto-login depends are cloned by attackers and placed into their own smartphones, attackers can break through the login-device number limit and log in to the victim’s account stealthily. A natural countermeasure is to check the consistency of device-specific attributes. As long as the new device shows different device fingerprints from the previous one, the app will disable the auto-login function and thus prevent data-clone attacks. In this paper, we develop VPDroid, a transparent Android OS-level virtualization platform tailored for security testing. With VPDroid, security analysts can customize different device artifacts, such as the CPU model, Android ID, and phone number, in a virtual phone without user-level API hooking. VPDroid’s isolation mechanism ensures that user-mode apps in the virtual phone cannot detect device-specific discrepancies. To assess Android apps’ susceptibility to the data-clone attack, we use VPDroid to simulate data-clone attacks with the 234 most-downloaded apps. Our experiments on five different virtual phone environments show that VPDroid’s device attribute customization can deceive all tested apps that perform device-consistency checks, such as Twitter, WeChat, and PayPal. 19 vendors have confirmed our report as a zero-day vulnerability. Our findings paint a cautionary tale: enforcing a device-consistency check only on the client side remains vulnerable to an advanced data-clone attack.
DOI: 10.1109/ICSE43902.2021.00149
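The client-side device-consistency check that the paper shows to be insufficient can be sketched as follows. The attribute names and function name are illustrative assumptions; VPDroid defeats exactly this style of check by faking every attribute it inspects:

```python
# Sketch of a client-side device-consistency check: allow auto-login only if
# every checked attribute matches the fingerprint recorded at first login.
def auto_login_allowed(stored, current,
                       checked=("android_id", "cpu_model", "phone_number")):
    """stored/current: mappings of device attribute name -> value."""
    return all(stored.get(k) == current.get(k) for k in checked)
```

Because all inputs to this check come from the (virtualizable) device itself, passing it proves nothing to the server, which is the cautionary point of the abstract.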
ATVHunter: Reliable Version Detection of Third-Party Libraries for Vulnerability Identification in Android Applications
作者: Zhan, Xian and Fan, Lingling and Chen, Sen and Wu, Feng and Liu, Tianming and Luo, Xiapu and Liu, Yang
关键词: No keywords
Abstract
Third-party libraries (TPLs) as essential parts in the mobile ecosystem have become one of the most significant contributors to the huge success of Android, which facilitate the fast development of Android applications. Detecting TPLs in Android apps is also important for downstream tasks, such as malware and repackaged apps identification. To identify in-app TPLs, we need to solve several challenges, such as TPL dependency, code obfuscation, precise version representation. Unfortunately, existing TPL detection tools have been proved that they have not solved these challenges very well, let alone specify the exact TPL versions.To this end, we propose a system, named ATVHunter, which can pinpoint the precise vulnerable in-app TPL versions and provide detailed information about the vulnerabilities and TPLs. We propose a two-phase detection approach to identify specific TPL versions. Specifically, we extract the Control Flow Graphs as the coarse-grained feature to match potential TPLs in the predefined TPL database, and then extract opcode in each basic block of CFG as the fine-grained feature to identify the exact TPL versions. We build a comprehensive TPL database (189,545 unique TPLs with 3,006,676 versions) as the reference database. Meanwhile, to identify the vulnerable in-app TPL versions, we also construct a comprehensive and known vulnerable TPL database containing 1,180 CVEs and 224 security bugs. Experimental results show ATVHunter outperforms state-of-the-art TPL detection tools, achieving 90.55% precision and 88.79% recall with high efficiency, and is also resilient to widely-used obfuscation techniques and scalable for large-scale TPL detection. 
Furthermore, to investigate the ecosystem of vulnerable TPLs used by apps, we apply ATVHunter in a large-scale analysis of 104,446 apps and find that 9,050 apps include vulnerable TPL versions with 53,337 vulnerabilities and 7,480 security bugs, most of which are high-risk and not recognized by app developers.
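As a rough illustration of the paper's two-phase idea, the sketch below first matches a coarse CFG-shape signature and then compares per-basic-block opcode sequences to pin down a version. All function names, the feature definitions, and the tiny database are invented for illustration; ATVHunter's real features and database format differ.

```python
def cfg_signature(cfg):
    """Coarse feature (illustrative): multiset of out-degrees of the CFG's blocks."""
    return tuple(sorted(len(succs) for succs in cfg.values()))

def opcode_feature(blocks):
    """Fine feature (illustrative): the opcode sequence of every basic block."""
    return frozenset(tuple(b) for b in blocks)

def match_version(app_cfg, app_blocks, tpl_db):
    # Phase 1: keep only candidate (library, version) pairs whose CFG shape matches.
    coarse = cfg_signature(app_cfg)
    candidates = [(name, ver) for (name, ver), entry in tpl_db.items()
                  if cfg_signature(entry["cfg"]) == coarse]
    # Phase 2: among candidates, require identical opcode features.
    fine = opcode_feature(app_blocks)
    return [(n, v) for (n, v) in candidates
            if opcode_feature(tpl_db[(n, v)]["blocks"]) == fine]

# Hypothetical two-version database: both versions share a CFG shape but
# differ in one basic block's opcodes.
tpl_db = {
    ("okio", "1.0"): {"cfg": {"A": ["B"], "B": []},
                      "blocks": [["const", "invoke"], ["return"]]},
    ("okio", "1.1"): {"cfg": {"A": ["B"], "B": []},
                      "blocks": [["const", "add", "invoke"], ["return"]]},
}
app_cfg = {"entry": ["exit"], "exit": []}
app_blocks = [["const", "add", "invoke"], ["return"]]
print(match_version(app_cfg, app_blocks, tpl_db))  # → [('okio', '1.1')]
```

Both versions survive the coarse phase (same CFG shape); only the opcode comparison distinguishes 1.0 from 1.1, which is why a second, fine-grained phase is needed for version pinpointing.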
DOI: 10.1109/ICSE43902.2021.00150
JUSTGen: Effective Test Generation for Unspecified JNI Behaviors on JVMs
作者: Hwang, Sungjae and Lee, Sungho and Kim, Jihoon and Ryu, Sukyoung
关键词: Testing, Java Virtual Machine, Java Native Interface, Empirical Study, Debugging
Abstract
Java Native Interface (JNI) provides a way for Java applications to access native libraries, but it is difficult to develop correct JNI programs. By leveraging native code, the JNI enables Java developers to implement efficient applications and to reuse code written in other programming languages such as C and C++. In addition, the core Java libraries already use the JNI to provide system features like a graphical user interface. As a result, many mainstream Java Virtual Machines (JVMs) support the JNI. However, due to the complex interoperation semantics between different programming languages, implementing correct JNI programs is not trivial. Moreover, because of the performance overhead, JVMs do not validate erroneous JNI interoperations by default; they validate them only when the debug feature, the -Xcheck:jni option, is enabled. The correctness of JNI programs therefore relies heavily on the checks performed under the -Xcheck:jni option. Questions remain, however, about the quality of these checks. Are there any properties that the -Xcheck:jni option fails to validate? If so, what potential issues can arise due to the lack of such validation? To the best of our knowledge, no research has explored these questions in depth. In this paper, we empirically study the validation quality and impacts of the -Xcheck:jni option on mainstream JVMs using unspecified corner cases in the JNI specification. Such unspecified cases may lead to unexpected run-time behaviors because their semantics is not defined in the specification. For a systematic study, we propose JUSTGen, a semi-automated approach to identify unspecified cases from a specification and generate test programs. JUSTGen receives the JNI specification written in our domain-specific language (DSL) and automatically discovers unspecified cases using an SMT solver. It then generates test programs that trigger the behaviors of the unspecified cases.
Using the generated tests, we empirically study the validation ability of the -Xcheck:jni option. Our experimental results show that the JNI debug feature does not validate thousands of unspecified cases on JVMs, and that these cases can cause critical run-time errors such as violations of the Java type system and memory corruption. We reported 792 unspecified cases that are not validated by JVMs to the corresponding JVM vendors. Among them, 563 cases have been fixed, and the remaining cases will be fixed in the near future. Based on our empirical study, we believe that the JNI specification should clearly specify the semantics of the missing cases and that the debug feature should validate all of them.
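The unspecified-case discovery step can be illustrated without an SMT solver: treat the specification as the set of condition combinations whose behavior it defines, and enumerate everything else. The mini-spec below (one JNI-like function with three parameter conditions) is entirely invented; JUSTGen's DSL and solver-based search are far richer.

```python
from itertools import product

# Hypothetical mini-spec: the parameter conditions the spec mentions, and
# the combinations whose behavior it actually defines.
CONDITIONS = {
    "env": ["valid", "null"],
    "clazz": ["class_ref", "null", "deleted_ref"],
    "method_id": ["valid", "null"],
}
SPECIFIED = {
    ("valid", "class_ref", "valid"),
    ("valid", "null", "valid"),
    ("valid", "class_ref", "null"),
}

def unspecified_cases(conditions, specified):
    """Enumerate condition combinations the spec never defines.  JUSTGen does
    this with an SMT solver over its DSL; plain enumeration suffices here."""
    every = set(product(*conditions.values()))
    return sorted(every - specified)

cases = unspecified_cases(CONDITIONS, SPECIFIED)
print(len(cases), "unspecified of", 2 * 3 * 2)  # → 9 unspecified of 12
```

Each unspecified combination would then be turned into a concrete test program (e.g., calling the JNI function with a deleted class reference) to observe how different JVMs actually behave.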
DOI: 10.1109/ICSE43902.2021.00151
AndroEvolve: automated update for Android deprecated-API usages
作者: Haryono, Stefanus A. and Thung, Ferdian and Lo, David and Jiang, Lingxiao and Lawall, Julia and Kang, Hong Jin and Serrano, Lucas and Muller, Gilles
关键词: No keywords
Abstract
The Android operating system (OS) is frequently updated, and each new version may deprecate APIs. Usages of deprecated APIs in Android apps need to be updated to ensure the apps’ compatibility with both old and new versions of the Android OS. In this work, we propose AndroEvolve, an automated tool to update usages of deprecated Android APIs that addresses the limitations of the state-of-the-art tool, CocciEvolve. AndroEvolve utilizes data flow analysis to handle variables defined outside the method boundary, and variable denormalization to remove the temporary variables introduced by CocciEvolve. We evaluated the accuracy of AndroEvolve using a dataset of 360 target files and 20 deprecated Android APIs; AndroEvolve produces 319 correct updates, compared to 249 for CocciEvolve. We also evaluated the readability of AndroEvolve’s update results using both a manual and an automatic evaluation. Both evaluations demonstrated that the code produced by AndroEvolve is more readable than CocciEvolve’s. A video demonstration of AndroEvolve is available at https://youtu.be/siU0tuMITXI.
DOI: 10.1109/ICSE-Companion52605.2021.00021
APIScanner: towards automated detection of deprecated APIs in Python libraries
作者: Vadlamani, Aparna and Kalicheti, Rishitha and Chimalakonda, Sridhar
关键词: API evolution, Python libraries, deprecated APIs, visual studio code extension
Abstract
Python libraries are widely used for machine learning and scientific computing tasks today. APIs in Python libraries are deprecated due to feature enhancements and bug fixes, in the same way as in other languages. These deprecated APIs are discouraged from being used in further software development. Manually detecting and replacing deprecated APIs is a tedious and time-consuming task due to the large number of API calls used in projects, and the lack of proper documentation for these deprecated APIs makes the task challenging. To address this challenge, we propose an algorithm and a tool, APIScanner, that automatically detects deprecated APIs in Python libraries. The algorithm parses the source code of the libraries into abstract syntax trees (ASTs) and identifies deprecated APIs via decorators, hard-coded warnings, or comments. APIScanner is a Visual Studio Code extension that highlights deprecated API elements and warns the developer about their use while writing source code. The tool helps developers avoid deprecated API elements without executing the code. We tested our algorithm and tool on six popular Python libraries, detecting 838 of 871 deprecated API elements. Demo of APIScanner: https://youtu.be/1hy_ugf-iek. Documentation, tool, and source code can be found here: https://rishitha957.github.io/APIScanner.
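The decorator- and warning-based detection can be sketched with Python's standard `ast` module. The two heuristics below are simplified stand-ins for APIScanner's checks, not its actual implementation, and the sample source is invented:

```python
import ast

SOURCE = '''
import warnings

def deprecated(f):
    return f

@deprecated
def old_api():
    pass

def newer_api():
    warnings.warn("newer_api is deprecated", DeprecationWarning)
'''

def find_deprecated(source):
    """Collect function names flagged by an @deprecated decorator or a
    hard-coded DeprecationWarning (simplified AST heuristics)."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Heuristic 1: a decorator literally named "deprecated".
        for dec in node.decorator_list:
            if isinstance(dec, ast.Name) and dec.id == "deprecated":
                found.add(node.name)
        # Heuristic 2: a warnings.warn(..., DeprecationWarning) call in the body.
        for sub in ast.walk(node):
            if (isinstance(sub, ast.Call)
                    and isinstance(sub.func, ast.Attribute)
                    and sub.func.attr == "warn"
                    and any(isinstance(a, ast.Name) and a.id == "DeprecationWarning"
                            for a in sub.args)):
                found.add(node.name)
    return sorted(found)

print(find_deprecated(SOURCE))  # → ['newer_api', 'old_api']
```

Because the analysis is purely syntactic, it can run inside an editor extension on every keystroke without ever importing or executing the scanned library.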
DOI: 10.1109/ICSE-Companion52605.2021.00022
MigrationAdvisor: recommending library migrations from large-scale open-source data
作者: He, Hao and Xu, Yulin and Cheng, Xiao and Liang, Guangtai and Zhou, Minghui
关键词: dependency management, library migration, library recommendation, mining software repositories
Abstract
During software maintenance, developers may need to migrate an already in-use library to another library with similar functionality. However, it is difficult to make the optimal migration decision with limited information, knowledge, or expertise. In this paper, we present MigrationAdvisor, an evidence-based tool that recommends library migration targets through intelligent analysis of a large number of GitHub repositories and Java libraries. The migration advisories are provided through a search-engine-style web service where developers can seek migration suggestions for a specific library. We conduct systematic evaluations of the correctness of the results and evaluate the usefulness of the tool by collecting usage feedback from industry developers. Video: https://youtu.be/4I75W22TqwQ.
DOI: 10.1109/ICSE-Companion52605.2021.00023
GraphGallery: a platform for fast benchmarking and easy development of graph neural networks based intelligent software
作者: Li, Jintang and Xu, Kun and Chen, Liang and Zheng, Zibin and Liu, Xiao
关键词: benchmarking, graph neural networks, intelligent software development, open-source platform
Abstract
Graph Neural Networks (GNNs) have recently been shown to be powerful tools for representing and analyzing graph data. GNNs are playing an increasingly critical role in software engineering, including in program analysis, type inference, and code representation. In this paper, we introduce GraphGallery, a platform for fast benchmarking and easy development of GNN-based software. GraphGallery is an easy-to-use platform that allows developers to automatically deploy GNNs even with limited domain-specific knowledge. It offers a set of implementations of common GNN models based on mainstream deep learning frameworks. In addition, existing GNN toolboxes such as PyG and DGL can be easily incorporated into the platform. Experiments demonstrate the reliability of the implementations and the platform's support for fast development. The official source code of GraphGallery is available at https://github.com/EdisonLeeeee/GraphGallery and a demo video can be found at https://youtu.be/mv7Zs1YeaYo.
DOI: 10.1109/ICSE-Companion52605.2021.00024
BlockEye: hunting for DeFi attacks on blockchain
作者: Wang, Bin and Liu, Han and Liu, Chao and Yang, Zhiqiang and Ren, Qian and Zheng, Huixuan and Lei, Hong
关键词: DeFi, attack monitoring, oracle analysis
Abstract
Decentralized finance, i.e., DeFi, has become the most popular type of application on many public blockchains (e.g., Ethereum) in recent years. Compared to traditional finance, DeFi allows customers to flexibly participate in diverse blockchain financial services (e.g., lending, borrowing, collateralizing, and exchanging) via smart contracts at a relatively low cost of trust. However, the open nature of DeFi inevitably introduces a large attack surface, which is a severe threat to the security of participants’ funds. In this paper, we propose BlockEye, a real-time attack detection system for DeFi projects on the Ethereum blockchain. BlockEye provides two key capabilities: (1) potentially vulnerable DeFi projects are identified through an automatic security analysis process, which performs symbolic reasoning on the data flow of important service states, e.g., asset price, and checks whether they can be externally manipulated; (2) a transaction monitor is then installed off-chain for a vulnerable DeFi project. Transactions sent not only to that project but also to associated projects are collected for further security analysis. A potential attack is flagged if a violation of a critical invariant configured in BlockEye is detected, e.g., the benefit is achieved within a very short time and far exceeds the cost. We applied BlockEye to several popular DeFi projects and discovered potential security attacks that were previously unreported. A video of BlockEye is available at https://youtu.be/7DjsWBLdlQU.
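The invariant "benefit achieved within a very short time and far exceeding the cost" can be sketched as a simple off-chain monitor. The transaction fields, thresholds, and data are all illustrative; BlockEye's real invariants and transaction model are configured per project.

```python
def flag_attacks(txs, profit_ratio=10.0, window_secs=60):
    """Toy invariant check: flag any transaction bundle whose benefit is
    realized within a short window and dwarfs its cost (illustrative only)."""
    alerts = []
    for tx in txs:
        quick = tx["end_ts"] - tx["start_ts"] <= window_secs
        outsized = tx["cost"] > 0 and tx["benefit"] / tx["cost"] >= profit_ratio
        if quick and outsized:
            alerts.append(tx["hash"])
    return alerts

# Hypothetical observed transactions: a fast outsized gain, a slow gain,
# and a fast but ordinary gain.
txs = [
    {"hash": "0xaa", "start_ts": 0, "end_ts": 13,   "cost": 1.0, "benefit": 250.0},
    {"hash": "0xbb", "start_ts": 0, "end_ts": 3600, "cost": 1.0, "benefit": 250.0},
    {"hash": "0xcc", "start_ts": 0, "end_ts": 5,    "cost": 1.0, "benefit": 1.1},
]
print(flag_attacks(txs))  # → ['0xaa']
```

Only the transaction that is both fast and disproportionately profitable is flagged; either property alone (slow windfall, or quick small gain) is treated as normal activity.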
DOI: 10.1109/ICSE-Companion52605.2021.00025
Roosterize: suggesting lemma names for coq verification projects using deep learning
作者: Nie, Pengyu and Palmskog, Karl and Li, Junyi Jessy and Gligoric, Milos
关键词: coq, lemma names, neural networks
Abstract
Naming conventions are an important concern in large verification projects using proof assistants, such as Coq. In particular, lemma names are used by proof engineers to effectively understand and modify Coq code. However, providing accurate and informative lemma names is a complex task, which is currently often carried out manually. Even when lemma naming is automated using rule-based tools, generated names may fail to adhere to important conventions not specified explicitly. We demonstrate a toolchain, dubbed Roosterize, which automatically suggests lemma names in Coq projects. Roosterize leverages a neural network model trained on existing Coq code, thus avoiding manual specification of naming conventions. To allow proof engineers to conveniently access suggestions from Roosterize during Coq project development, we integrated the toolchain into the popular Visual Studio Code editor. Our evaluation shows that Roosterize substantially outperforms strong baselines for suggesting lemma names and is useful in practice. The demo video for Roosterize can be viewed at: https://youtu.be/HZ5ac7Q14rc.
DOI: 10.1109/ICSE-Companion52605.2021.00026
NeuroSPF: a tool for the symbolic analysis of neural networks
作者: Usman, Muhammad and Noller, Yannic and Păsăreanu, Corina S.
关键词: No keywords
Abstract
This paper presents NeuroSPF, a tool for the symbolic analysis of neural networks. Given a trained neural network model, the tool extracts the architecture and model parameters and translates them into a Java representation that is amenable for analysis using the Symbolic PathFinder symbolic execution tool. Notably, NeuroSPF encodes specialized peer classes for parsing the model’s parameters, thereby enabling efficient analysis. With NeuroSPF the user has the flexibility to specify either the inputs or the network internal parameters as symbolic, promoting the application of program analysis and testing approaches from software engineering to the field of machine learning. For instance, NeuroSPF can be used for coverage-based testing and test generation, finding adversarial examples and also constraint-based repair of neural networks, thus improving the reliability of neural networks and of the applications that use them. Video URL: https://youtu.be/seal8fG78LI
DOI: 10.1109/ICSE-Companion52605.2021.00027
Metrinome: path complexity predicts symbolic execution path explosion
作者: Bessler, Gabriel and Cordova, Josh and Cullen-Baratloo, Shaheen and Dissem, Sofiane and Lu, Emily and Devin, Sofia and Abughararh, Ibrahim and Bang, Lucas
关键词: automated testing, path complexity, symbolic execution
Abstract
This paper presents Metrinome, a tool for performing automatic path complexity analysis of C functions. The path complexity of a function is an expression that describes the number of paths through the function up to a given execution depth. Metrinome constructs the control flow graph (CFG) of a C function using LLVM utilities, analyzes that CFG using algebraic graph theory and analytic combinatorics, and produces a closed-form expression for the path complexity as well as the asymptotic path complexity of the function. Our experiments show that path complexity predicts the growth rate of the number of execution paths that Klee, a popular symbolic execution tool, is able to cover within a given exploration depth. Metrinome is open-source, available as a Docker image for immediate use, and all of our experiments and data are available in our repository and included in our Docker image.
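The notion of path complexity can be made concrete with a brute-force counter: the cumulative number of entry-to-exit paths of at most a given length. Metrinome instead derives a closed-form expression via algebraic graph theory; the dynamic-programming sketch and the looping CFG below are purely illustrative.

```python
def path_complexity(cfg, entry, exit_, depth):
    """Cumulative count of entry→exit paths with ≤ d edges, for d = 1..depth.
    Brute-force stand-in for Metrinome's closed-form path complexity."""
    counts = []
    frontier = {entry: 1}            # node -> number of paths of current length
    total = 1 if entry == exit_ else 0
    for _ in range(depth):
        nxt = {}
        for node, c in frontier.items():
            for succ in cfg.get(node, []):
                nxt[succ] = nxt.get(succ, 0) + c
        total += nxt.get(exit_, 0)   # paths that just reached the exit
        frontier = nxt
        counts.append(total)
    return counts

# A branch inside a loop: entry→{a,b}→m, then m loops back or exits.
# Each extra loop iteration doubles the number of distinct paths.
cfg = {"entry": ["a", "b"], "a": ["m"], "b": ["m"], "m": ["entry", "exit"]}
print(path_complexity(cfg, "entry", "exit", 8))  # → [0, 0, 2, 2, 2, 6, 6, 6]
```

The cumulative count grows geometrically with depth (2, 6, 14, ... paths after each additional loop iteration), which is exactly the growth rate that predicts path explosion in a symbolic executor such as KLEE.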
DOI: 10.1109/ICSE-Companion52605.2021.00028
Robot runner: a tool for automatically executing experiments on robotics software
作者: Swanborn, Stan and Malavolta, Ivano
关键词: No keywords
Abstract
Software is becoming the core of robotics development, and it is growing in complexity and size. However, roboticists and researchers struggle to ensure, or even measure, the quality of their software with respect to run-time properties such as energy efficiency and performance. This paper presents Robot Runner, a tool for streamlining the execution of measurement-based experiments involving robotics software. The tool can automatically set up, start, resume, and fully replicate user-defined experiments. Thanks to its plugin-based architecture, the tool is fully independent of the number, type, and complexity of the robots used (both real and simulated). GitHub repository: https://github.com/S2-group/robot-runner YouTube video: https://youtu.be/le-SAXI2k1E
DOI: 10.1109/ICSE-Companion52605.2021.00029
Creating and migrating chatbots with conga
作者: Pérez-Soler, Sara
关键词: chatbots, domain-specific languages, migration, model-driven engineering
Abstract
Chatbots are agents that enable the interaction of users and software by means of written or spoken natural language conversation. Their use is growing, and many companies are starting to offer their services via chatbots, e.g., for booking, shopping, or customer support. For this reason, many chatbot development tools have emerged, which makes choosing the most appropriate tool difficult. Moreover, there is hardly any support for migrating chatbots between tools. To alleviate these issues, we propose a model-driven engineering solution that includes: (i) a domain-specific language to model chatbots independently of the development tool; (ii) a recommender that suggests the most suitable development tool for the given chatbot requirements and model; (iii) code generators that synthesize the chatbot code for the selected tool; and (iv) parsers to extract chatbot models out of existing chatbot implementations. Our solution is supported by a web IDE called Conga that can be used for both chatbot creation and migration. A demo video is available at https://youtu.be/3sw1FDdZ7XY.
DOI: 10.1109/ICSE-Companion52605.2021.00030
R-MOZART: a reconfiguration tool for WebThings applications
作者: Durán, Francisco
关键词: No keywords
Abstract
The Internet of Things (IoT) is a network of physical devices and software entities that interact to fulfill an overall objective and thus provide added-value services. Designing such applications by selecting a set of candidate objects and defining how they interact with one another is a difficult and error-prone task. Moreover, IoT applications are not monolithic applications built once and for all. In contrast, they are constantly modified due to the removal, replacement, or addition of objects during the application’s lifetime. In this paper, we present a tool built on top of the WebThings platform that supports users who want to dynamically change a running WebThings application. To do so, R-MOZART provides three components for (i) designing the new application using a user-friendly UI, (ii) verifying that the new application respects certain consistency properties with respect to the current application, and (iii) deploying the new application in an automated manner. The tool was applied to several smart home applications for evaluation purposes. Video URL: https://youtu.be/bG4oiQUrWSQ
DOI: 10.1109/ICSE-Companion52605.2021.00031
The software heritage filesystem (SwhFS): integrating source code archival with development
作者: Allançon, Thibault
关键词: FUSE, digital libraries, digital preservation, filesystem, open source, source code, version control system
Abstract
We introduce the Software Heritage filesystem (SwhFS), a user-space filesystem that integrates large-scale open source software archival with development workflows. SwhFS provides a POSIX filesystem view of Software Heritage, the largest public archive of software source code and version control system (VCS) development history. Using SwhFS, developers can quickly “checkout” any of the 2 billion commits archived by Software Heritage, even after they disappear from their previously known location, and without incurring the performance cost of repository cloning. SwhFS works across unrelated repositories and different VCS technologies. Other source code artifacts archived by Software Heritage (individual source code files and trees, releases, and branches) can also be accessed using common programming tools and custom scripts, as if they were locally available. A screencast of SwhFS is available online at dx.doi.org/10.5281/zenodo.4531411.
DOI: 10.1109/ICSE-Companion52605.2021.00032
Guiding engineers with the passive process engine environment
作者: Mayr-Dorn, Christoph and Bichler, Stefan and Keplinger, Felix and Egyed, Alexander
关键词: constraints, developer guidance, deviation, monitoring, software engineering process
Abstract
Research as early as the 90s identified rigid, active process enactment as detrimental to engineers’ flexibility. While software engineering processes thus are rarely “executable”, engineers would benefit from guidance in safety critical domains where standards, regulations, and processes are often complicated. In this paper, we present the Passive Process Engine Environment (P2E2) that tracks process progress in the background and automatically evaluates quality assurance constraints even in the presence of process deviations. Our approach is engineering artifact agnostic and comes with two exemplary tool connectors to Jira and Jama. Video at: https://youtu.be/kXwU_baVWoQ
DOI: 10.1109/ICSE-Companion52605.2021.00033
μSE: mutation-based evaluation of security-focused static analysis tools for Android
作者: Ami, Amit Seal and Kafle, Kaushal and Nadkarni, Adwait and Poshyvanyk, Denys and Moran, Kevin
关键词: Java, security and privacy, software, testing strategies
Abstract
This demo paper presents the technical details and usage scenarios of μSE: a mutation-based tool for evaluating security-focused static analysis tools for Android. Mutation testing is generally used by software practitioners to assess the robustness of a given test suite. However, we leverage this technique to systematically evaluate static analysis tools and to uncover and document soundness issues. μSE’s analysis has found 25 previously undocumented flaws in static data leak detection tools for Android. μSE offers four mutation schemes, namely Reachability, Complex-reachability, TaintSink, and ScopeSink, which determine the locations of seeded mutants. Furthermore, the user can extend μSE by customizing the API calls targeted by the mutation analysis. μSE is also practical, as it makes use of filtering techniques based on compilation and execution criteria that reduce the number of ineffective mutations. Website: https://muse-security-evaluation.github.io Video URL: https://youtu.be/Kfkzi57gYys
DOI: 10.1109/ICSE-Companion52605.2021.00034
Quartermaster: a tool for modeling and simulating system degradation
作者: Pope, Matt and Sillito, Jonathan
关键词: No keywords
Abstract
It is essential that software systems be tolerant to degradations in components they rely on. There are patterns and techniques which software engineers use to ensure their systems gracefully degrade. Despite these techniques being available in practice, tuning and configuration is hard to get right and it is expensive to explore possible changes to components and techniques in complex systems. To fill these gaps, we propose Quartermaster to model and simulate systems and fault-tolerant techniques. We anticipate that Quartermaster will be useful to further research on graceful degradation and help inform software engineers about techniques that are most appropriate for their use cases.
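The kind of question a simulator like Quartermaster answers can be sketched with a small Monte-Carlo model: how does a retry policy change end-to-end success and the load placed on a degraded dependency? The model, parameters, and numbers below are entirely illustrative and are not Quartermaster's actual API.

```python
import random

def simulate(success_rate, max_retries, calls, rng):
    """Monte-Carlo sketch: each call retries a flaky dependency up to
    max_retries times; report success rate and attempts per call (load)."""
    ok = attempts = 0
    for _ in range(calls):
        for _attempt in range(1 + max_retries):
            attempts += 1
            if rng.random() < success_rate:
                ok += 1
                break
    return ok / calls, attempts / calls

rng = random.Random(1)
for retries in (0, 2):
    success, load = simulate(0.5, retries, 10_000, rng)
    print(f"retries={retries}: success≈{success:.2f}, load≈{load:.2f}x")
```

With a 50%-healthy dependency, two retries lift success from roughly 0.5 to roughly 0.875 but also raise load on the already-degraded component to roughly 1.75x, which is precisely the tuning trade-off the abstract describes as hard to explore in real systems.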
DOI: 10.1109/ICSE-Companion52605.2021.00035
Efficient fuzz testing for apache spark using framework abstraction
作者: Zhang, Qian and Wang, Jiyuan and Gulzar, Muhammad Ali and Padhye, Rohan and Kim, Miryung
关键词: data intensive scalable computing, dataflow programs, executable specifications, fuzz testing
Abstract
Emerging data-intensive applications are increasingly dependent on data-intensive scalable computing (DISC) systems, such as Apache Spark, to process large data. Despite their popularity, DISC applications are hard to test. In recent years, fuzz testing has been remarkably successful; however, it is nontrivial to apply traditional fuzzing to big data analytics directly because: (1) the long latency of DISC systems prohibits the applicability of fuzzing, and (2) conventional branch coverage is unlikely to distinguish application logic from the DISC framework implementation. We devise a novel fuzz testing tool called BigFuzz that automatically generates concrete data for an input Apache Spark program. The key essence of our approach is that we abstract the dataflow behavior of the DISC framework with executable specifications and design schema-aware mutations based on common error types in DISC applications. Our experiments show that compared to random fuzzing, BigFuzz speeds up fuzzing time by 1477X, improves application code coverage by 271%, and achieves a 157% improvement in detecting application errors. The demonstration video of BigFuzz is available at https://www.youtube.com/watch?v=YvYQISILQHs&feature=youtu.be.
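Schema-aware mutation can be sketched as picking an error type that is meaningful for a declared record schema, rather than flipping raw bytes. The error-type list and schema below are invented for illustration; BigFuzz's actual mutation operators are derived from a study of real DISC application failures.

```python
import random

def schema_aware_mutate(row, schema, rng):
    """Toy schema-aware mutation: apply one of a few error types common in
    data-intensive applications (illustrative, not BigFuzz's real operator set)."""
    row = list(row)
    i = rng.randrange(len(row))
    op = rng.choice(["null_field", "type_flip", "extreme_value", "drop_column"])
    if op == "null_field":                      # empty/missing value
        row[i] = ""
    elif op == "type_flip":                     # violate the declared type
        row[i] = "not_a_number" if schema[i] == "int" else "123"
    elif op == "extreme_value" and schema[i] == "int":
        row[i] = str(2**63)                     # overflow-sized integer
    elif op == "drop_column":                   # malformed record: wrong arity
        del row[i]
    return row

rng = random.Random(7)
seed = ["42", "alice", "2021-05-25"]
schema = ["int", "str", "date"]
for _ in range(3):
    print(schema_aware_mutate(seed, schema, rng))
```

Because every mutant is still a plausible CSV record, each fuzzing iteration exercises the application's parsing and aggregation logic instead of being rejected at the framework boundary, which is what makes the approach effective despite DISC systems' long per-run latency.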
DOI: 10.1109/ICSE-Companion52605.2021.00036
V2S: a tool for translating video recordings of mobile app usages into replayable scenarios
作者: Havranek, Madeleine and Bernal-Cárdenas, Carlos
关键词: No keywords
Abstract
Screen recordings are becoming increasingly important as rich software artifacts that inform mobile application development processes. However, the amount of manual effort required to extract information from these graphical artifacts can hinder resource-constrained mobile developers. This paper presents Video2Scenario (V2S), an automated tool that processes video recordings of Android app usages, utilizes neural object detection and image classification techniques to classify the depicted user actions, and translates these actions into a replayable scenario. We conducted a comprehensive evaluation to demonstrate V2S’s ability to reproduce recorded scenarios across a range of devices and a diverse set of usage cases and applications. The results indicate that, based on its performance with 175 videos depicting 3,534 GUI-based actions, V2S is able to reproduce ≈ 89% of sequential actions from collected videos. Demo URL: https://tinyurl.com/v2s-demo-video
DOI: 10.1109/ICSE-Companion52605.2021.00037
gazel: supporting source code edits in eye-tracking studies
作者: Fakhoury, Sarah and Roy, Devjeet and Pines, Harry and Cleveland, Tyler and Peterson, Cole S. and Arnaoudova, Venera and Sharif, Bonita and Maletic, Jonathan I.
关键词: No keywords
Abstract
Eye tracking tools are used in software engineering research to study various software development activities. However, a major limitation of these tools is their inability to track gaze data for activities that involve source code editing. We present a novel solution to support eye tracking experiments for tasks involving source code edits, as an extension of the iTrace [9] community infrastructure. We introduce the iTrace-Atom plugin and gazel [gə'zel], a Python data processing pipeline that maps gaze information to changing source code elements and provides researchers with a way to query this dynamic data. iTrace-Atom is evaluated via a series of simulations and is over 99% accurate at high eye-tracking speeds of over 1,000 Hz. iTrace and gazel completely revolutionize the way eye tracking studies are conducted in realistic settings with the presence of scrolling, context switching, and now editing. This opens the door to supporting many day-to-day software engineering tasks such as bug fixing, adding new features, and refactoring.
DOI: 10.1109/ICSE-Companion52605.2021.00038
COSTER: a tool for finding fully qualified names of API elements in online code snippets
作者: Saifullah, C M Khaled and Asaduzzaman, Muhammad and Roy, Chanchal K.
关键词: API usages, code examples, fully qualified name, type inference, type resolution
Abstract
Code snippets available on question answering sites (e.g., Stack Overflow) are a great source of information for learning how to use APIs. However, it is difficult to determine which APIs are discussed in those code snippets because they often suffer from declaration ambiguities and missing external references. In this paper, we introduce COSTER, a context-sensitive type solver that can determine the fully qualified names (FQNs) of API elements in those code snippets. The tool uses three different similarity measures to rank potential FQNs of a query API element. Results from our quantitative evaluation and user study demonstrate that the proposed tool can not only recommend FQNs of API elements with great accuracy but can also help developers to reuse online code snippets by suggesting the required import statements. Website: https://khaledkucse.github.io/COSTER/ Demo Video: https://youtu.be/oDZtw9MzUWM
DOI: 10.1109/ICSE-Companion52605.2021.00039
FastCA: an effective and efficient tool for combinatorial covering array generation
作者: Lin, Jinkun and Cai, Shaowei and He, Bing and Fu, Yingjie and Luo, Chuan and Lin, Qingwei
关键词: combinatorial interaction testing, constrained covering array, search-based software testing
Abstract
Combinatorial interaction testing (CIT) is a popular approach to detecting faults in highly configurable software systems. The core task of CIT is to generate a small test suite called a t-way covering array (CA), where t is the covering strength. A major drawback of existing solvers for CA generation is that they usually need considerable time to obtain a high-quality solution, which hinders their wider application. In this paper, we describe FastCA, an effective and efficient tool for generating constrained CAs. We observe that the high time consumption of existing meta-heuristic algorithms is mainly due to the procedure of score computation. To this end, we present a much more efficient method for score computation. Thanks to this new lightweight score computation method, FastCA can work in the gradient mode to effectively explore the search space. Experiments on a broad range of real-world and synthetic benchmarks show that FastCA significantly outperforms state-of-the-art solvers, in terms of both the size of the obtained covering array and the run time. Video: https://youtu.be/-6CuojQIt-k Repository: https://github.com/jkunlin/FastCATool.git
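What a 2-way covering array is, and why it is much smaller than exhaustive testing, can be shown with a greedy baseline that adds one row at a time. FastCA itself is a meta-heuristic search with a specialized score computation; this brute-force sketch is only an illustration of the underlying object.

```python
from itertools import combinations, product

def pairwise_ca(domains):
    """Greedy one-row-at-a-time construction of a 2-way covering array:
    every pair of values for every pair of factors appears in some row."""
    k = len(domains)
    uncovered = {(i, j, a, b)
                 for i, j in combinations(range(k), 2)
                 for a in domains[i] for b in domains[j]}
    rows = []
    while uncovered:
        # Pick the candidate row covering the most still-uncovered pairs.
        best = max(product(*domains),
                   key=lambda r: sum((i, j, r[i], r[j]) in uncovered
                                     for i, j in combinations(range(k), 2)))
        rows.append(best)
        uncovered -= {(i, j, best[i], best[j])
                      for i, j in combinations(range(k), 2)}
    return rows

domains = [[0, 1], [0, 1], [0, 1]]  # three boolean configuration options
ca = pairwise_ca(domains)
print(len(ca), ca)  # 4 rows suffice, versus 2**3 = 8 exhaustive configurations
```

Even on this tiny example the covering array halves the test suite; for realistic systems with dozens of options, the gap grows dramatically, which is why efficient CA generation (FastCA's goal) matters.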
DOI: 10.1109/ICSE-Companion52605.2021.00040
Testing framework for black-box AI models
作者: Aggarwal, Aniya and Shaikh, Samiulla and Hans, Sandeep and Haldar, Swastik and Ananthanarayanan, Rema and Saha, Diptikalyan
关键词: No keywords
Abstract
With the widespread adoption of AI models for important decision making, ensuring the reliability of such models remains an important challenge. In this paper, we present an end-to-end generic framework for testing AI models that performs automated test generation for different modalities, such as text, tabular, and time-series data, and across various properties, such as accuracy, fairness, and robustness. Our tool has been used for testing industrial AI models and was very effective in uncovering issues present in those models. Demo video: https://youtu.be/984UCU17YZI
DOI: 10.1109/ICSE-Companion52605.2021.00041
GAssert: a fully automated tool to improve assertion oracles
作者: Terragni, Valerio and Jahangirova, Gunel and Tonella, Paolo and Pezzè, Mauro
关键词: Oracle improvement, automated test generation, evolutionary algorithm, genetic programming, mutation analysis, program assertions, the oracle problem
Abstract
This demo presents the implementation and usage details of GAssert, the first tool to automatically improve assertion oracles. Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assertion oracles is difficult, assertions are prone to both false positives (the assertion fails but should pass) and false negatives (the assertion passes but should fail). Given a Java method containing an assertion oracle to improve, GAssert returns an improved assertion with fewer false positives and false negatives than the initial assertion. Internally, GAssert implements a novel co-evolutionary algorithm that explores the space of possible assertions guided by two fitness functions that reward assertions with fewer false positives, fewer false negatives, and smaller size.
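The false-positive/false-negative counts that drive GAssert's fitness functions can be sketched directly from the definitions in the abstract. The toy program states and candidate assertions below are invented; GAssert evolves assertions over real execution data.

```python
def fitness(assertion, executions):
    """Count false positives and false negatives of an assertion oracle over
    labeled executions; each execution is (program_state, is_correct_run)."""
    fp = sum(1 for state, ok in executions if ok and not assertion(state))
    fn = sum(1 for state, ok in executions if not ok and assertion(state))
    return fp, fn

# States from a toy abs() implementation; the faulty run returns a negative.
executions = [
    ({"x": -3, "result": 3},  True),
    ({"x": 5,  "result": 5},  True),
    ({"x": -2, "result": -2}, False),   # buggy execution
]

weak = lambda s: s["result"] == s["x"] or s["result"] == -s["x"]    # passes the bug
strong = lambda s: s["result"] >= 0 and abs(s["x"]) == s["result"]  # improved oracle
print(fitness(weak, executions), fitness(strong, executions))  # → (0, 1) (0, 0)
```

A co-evolutionary search like GAssert's would prefer `strong` over `weak` because it eliminates the false negative without introducing false positives, and a secondary objective would then favor the smaller of any two equally accurate assertions.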
DOI: 10.1109/ICSE-Companion52605.2021.00042
UIS-hunter: detecting UI design smells in Android apps
作者: Yang, Bo and Xing, Zhenchang and Xia, Xin and Chen, Chunyang and Ye, Deheng and Li, Shanping
关键词: GUI testing, UI design smell, material design, violation detection
Abstract
Similar to code smells in source code, UI designs have visual design smells that indicate violations of good UI design guidelines. UI design guidelines constitute design systems for a vast variety of products, platforms, and services. By following a design system, developers can avoid common design issues and pitfalls. However, a design system is often complex, involving various design dimensions and numerous UI components, and the lack of attention to GUI visual effects means there is little support for detecting UI design smells that violate the guidelines of a complex design system. In this paper, we propose an automated UI design smell detector named UIS-Hunter (UI design Smell Hunter). The tool is able to (i) automatically process UI screenshots or prototype files to detect UI design smells and generate reports, (ii) highlight the violating UI regions and list the Material Design guidelines that the detected design smells violate, and (iii) present conformant and violating UI design examples to aid understanding. The tool consists of a Material Design guidelines gallery website and a tool website. The gallery website is a back-end knowledge base that attaches conformance and violation examples to abstract design guidelines and allows developers and designers to explore the multi-dimensional space of a complex design system in a more structured way. As a front-end application, the tool website takes a UI design as input, returns a detailed UI design smell report, and marks the violation regions (if any). Moreover, the tool website presents conformance and violation examples based on the gallery website. Demo URL: https://uishunter.net.cn/ Gallery: https://uishuntergallery.net.cn/ Demo Video: https://youtu.be/7UZ0jtD_1gM
DOI: 10.1109/ICSE-Companion52605.2021.00043
Effect on brain activity while programming with (without) music
Authors: Thapaliya, Ananga
Keywords: empirical software engineering, mental states while programming, programming with music
Abstract
In this study, I investigate the effect of programming with (and without) music on the electromagnetic waves in software developers’ brains and analyze how music influences the overall result of their tasks. For this research, I used an EEG device to measure the brain activity of each programmer and analyzed the electromagnetic waves by calculating arousal-valence coefficients and using pre-processing techniques (EEG studio). The experiment was performed with 8 students who were also software developers (5 undergraduates and 3 graduates). As a result, when programming with music, I discovered that the mean valence was greater while the mean arousal was lower. These early results suggest the feasibility of the technique.
DOI: 10.1109/ICSE-Companion52605.2021.00044
Distribution awareness for AI system testing
Authors: Berend, David
Keywords: deep learning, distribution awareness, software testing
Abstract
As Deep Learning (DL) is continuously adopted in many safety-critical applications, its quality and reliability start to raise concerns. Similar to the traditional software development process, testing DL software to uncover its defects at an early stage is an effective way to reduce risks after deployment. Although recent progress has been made in designing novel testing techniques for DL software, the distribution of the generated test data is not taken into consideration, making it hard to judge whether the identified errors are indeed meaningful errors for the DL application. We therefore propose a new distribution-aware testing technique that aims to generate new, unseen test cases relevant to the underlying DL system’s task. Our results show that this technique is able to filter out up to 55.44% of error test cases on CIFAR-10 and is 10.05% more effective in enhancing robustness.
DOI: 10.1109/ICSE-Companion52605.2021.00045
Scalable call graph constructor for Maven
Authors: Keshani, Mehdi
Keywords: logic and verification, program analysis, theory of computation
Abstract
As a rich source of data, Call Graphs are used for various applications, including security vulnerability detection. Despite multiple studies showing that Call Graphs can drastically improve the accuracy of analysis, existing ecosystem-scale tools like Dependabot do not use Call Graphs and instead work at the package level. Using Call Graphs in ecosystem use cases is impractical because of the scalability problems of Call Graph generators: Call Graph generation is usually treated as a “full program analysis”, resulting in large Call Graphs and expensive computation. This pragmatic approach does not work at ecosystem scale, because the number of possible combinations in which a particular artifact can appear in a full program explodes; the analysis must therefore be made incremental. There are existing studies on different types of incremental program analysis, but none of them focuses on Call Graph generation for an entire ecosystem. In this paper, we propose an incremental implementation of the CHA algorithm that can generate Call Graphs on demand by stitching together partial Call Graphs that have previously been extracted for libraries. Our preliminary evaluation results show that the proposed approach scales well and outperforms OPAL, the most scalable existing framework.
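The stitching idea in this abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual implementation or API: assume each library contributes a partial call graph whose unresolved calls carry an `external://` prefix, and the ecosystem-level graph is assembled by resolving those marks against methods defined in the other partial graphs.

```python
def stitch(partial_graphs):
    """Merge per-library partial call graphs into one ecosystem graph.

    partial_graphs: list of dicts mapping caller -> set of callees, where a
    callee of the form "external://m" is an unresolved call to a method m
    that may be defined in another library's partial graph.
    """
    # Collect every method defined somewhere in the ecosystem.
    merged = {}
    nodes = set()
    for g in partial_graphs:
        for caller in g:
            nodes.add(caller)
            merged.setdefault(caller, set())
    # Copy internal edges and resolve external references where possible.
    for g in partial_graphs:
        for caller, callees in g.items():
            for callee in callees:
                if callee.startswith("external://"):
                    target = callee[len("external://"):]
                    if target in nodes:  # defined by some library: stitch it
                        merged[caller].add(target)
                else:
                    merged[caller].add(callee)
    return merged
```

Because each library's partial graph is computed once and reused, only this cheap resolution step runs per dependency set, which is what makes the approach incremental rather than a full program analysis.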
DOI: 10.1109/ICSE-Companion52605.2021.00046
System component-level self-adaptations for security via bayesian games
Authors: Zhang, Mingyue
Keywords: bayesian game, self-adaptation
Abstract
Security attacks present unique challenges to self-adaptive system design due to the adversarial nature of the environment. Modeling the system as a single player, as done in prior work in the security domain, is insufficient when the system is under partial compromise and for the design of fine-grained defensive strategies, where the uncompromised rest of the system can cooperate autonomously to mitigate the impact of attacks. To deal with these issues, we propose a new self-adaptive framework that incorporates Bayesian game theory and models the defender (i.e., the system) at the granularity of components in the system architecture. The system architecture model is translated into a Bayesian multi-player game, where each component is modeled as an independent player and security attacks are encoded as variant types for the components. The defensive strategy for the system is dynamically computed by solving for the pure equilibrium, achieving the best possible system utility and improving the resiliency of the system against security attacks.
DOI: 10.1109/ICSE-Companion52605.2021.00047
Metamorphic testing of autonomous vehicles: a case study on simulink
Authors: Valle, Pablo
Keywords: No keywords
Abstract
Autonomous Vehicles (AVs) will revolutionize the way people travel by car. However, in order to deploy autonomous vehicles, effective testing techniques are required. The driving quality of an AV should certainly be considered when testing such systems, but, as in other complex systems, determining the outcome of a test on the driving quality of an AV can be extremely complex. To address this issue, in this paper we explore the application of Quality-of-Service (QoS) aware metamorphic testing to AVs modeled in MATLAB/Simulink, one of the predominant modeling tools on the market. We first defined a set of QoS measures for AVs, taking a recent study as input. Based on these measures, we defined metamorphic relations. Lastly, we assessed the approach on an AV modeled in Simulink by using mutation testing. The results suggest that our approach is effective at detecting faults.
DOI: 10.1109/ICSE-Companion52605.2021.00048
SetDroid: detecting user-configurable setting issues of Android apps via metamorphic fuzzing
Authors: Sun, Jingling
Keywords: Android, setting, testing
Abstract
Android, the most popular mobile system, offers a number of app-independent, user-configurable settings (e.g., network, location, and permission) for controlling devices and apps. However, apps may fail to properly adapt their behaviors when these settings are changed, and thus frustrate users. We name such issues setting issues: they reside in the apps and are induced by changes of settings. According to our investigation, the majority of setting issues are non-crash (logic) bugs, which cannot be detected by existing automated app testing techniques due to the lack of test oracles. To this end, we designed setting-wise metamorphic fuzzing, the first automated testing technique to overcome the oracle problem in detecting setting issues. Our key insight is that, in most cases, app behavior should remain consistent if a given setting is changed and later properly restored. We realized this technique in an automated GUI testing tool, SetDroid, and applied it to 26 popular, open-source Android apps. SetDroid successfully found 32 unique, previously unknown setting issues in these apps; so far, 25 have been confirmed and 17 already fixed. We further applied SetDroid to 4 commercial apps with billions of monthly active users and detected 15 previously unknown setting issues, all of which have been confirmed and are being fixed. The majority of these bugs (37 out of 47) are non-crash bugs, which cannot be detected by prior testing techniques.
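The metamorphic oracle described in this abstract ("change a setting, restore it, and the app should behave the same") can be sketched as follows. This is a minimal illustration with invented names, not SetDroid's actual code; `run_app` stands for any harness that replays a GUI event sequence on the app and records an abstract GUI state (e.g., a layout hash) after each event.

```python
def setting_metamorphic_oracle(run_app, events, toggle, restore):
    """Report the event indices at which the mutated run diverges.

    run_app(events, mutations) -> list of observed GUI states, where
    mutations is a list of (step_index, action) pairs applied just before
    that step. An app without setting issues yields no divergences once the
    setting has been restored.
    """
    baseline = run_app(events, mutations=[])
    # Toggle the setting before the second event, then restore it at once.
    mutated = run_app(events, mutations=[(1, toggle), (1, restore)])
    # Any state mismatch after restoration is a candidate setting issue.
    return [i for i, (b, m) in enumerate(zip(baseline, mutated)) if b != m]
```

A run that returns a non-empty list flags a candidate setting issue, which mirrors how the lack of a conventional test oracle is side-stepped: the baseline run itself serves as the oracle.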
DOI: 10.1109/ICSE-Companion52605.2021.00049
Anomaly detection in scratch assignments
Authors: Kö
Keywords: anomaly detection, block-based programming, program analysis, scratch, teaching
Abstract
For teachers, automated tool support for debugging and assessing their students’ programming assignments is a great help in their everyday business. For block-based programming languages which are commonly used to introduce younger learners to programming, testing frameworks and other software analysis tools exist, but require manual work such as writing test suites or formal specifications. However, most of the teachers using languages like Scratch are not trained for or experienced in this kind of task. Linters do not require manual work but are limited to generic bugs and therefore miss potential task-specific bugs in student solutions. In prior work, we proposed the use of anomaly detection to find project-specific bugs in sets of student programming assignments automatically, without any additional manual labour required from the teachers’ side. Evaluation on student solutions for typical programming assignments showed that anomaly detection is a reliable way to locate bugs in a data set of student programs. In this paper, we enhance our initial approach by lowering the abstraction level. The results suggest that the lower abstraction level can focus anomaly detection on the relevant parts of the programs.
DOI: 10.1109/ICSE-Companion52605.2021.00050
Let’s not make a fuzz about it
Authors: Lobo-Vesga, Elisabet
Keywords: No keywords
Abstract
One of the most popular recipes for achieving differential privacy is to add noise calibrated to the global sensitivity of the data analysis that one wants to make private [1]. This simple idea has also found important applications in the design of programming frameworks that help programmers implement differentially private programs. The approach was first pioneered by Reed and Pierce [2], who designed a functional programming language, named Fuzz, whose types allow one to reason about the sensitivity of one’s programs. The technical device that Reed and Pierce use is linear indexed types, where the index represents the sensitivity of a function. This approach has been further extended in a series of works [3]-[6]. To extend the expressivity of the sensitivity analysis proposed by Fuzz, these works have added to the type system originally proposed by Reed and Pierce additional programming language features, such as linear types, modal types, and partial evaluation. These features are not mainstream and usually require designing a new language from scratch. Furthermore, while these features are part of the classical toolbox of programming language researchers, they are often rather obscure to users without such a background. This makes the languages that use these features accessible only to programmers who are also experts in programming language research.
DOI: 10.1109/ICSE-Companion52605.2021.00051
Testing object detection for autonomous driving systems via 3D reconstruction
Authors: Shao, Jinyang
Keywords: image processing, metamorphic testing, object detection system, vanishing point
Abstract
Object detection identifies objects in images. In autonomous driving systems, object detection serves as an intermediate module whose output feeds the autonomous decisions of vehicles; that is, the accuracy of autonomous decisions relies on object detection. State-of-the-art object detection modules are based on Deep Neural Networks (DNNs), and it is difficult to employ white-box testing on DNNs since the output of a single neuron is inexplicable. Existing work conducted metamorphic testing for object detection via image synthesis: an object detected in the original image should also be detected in the new synthetic image. However, a synthetic image may not look real from a human’s perspective, and even if the object detection module fails on such a synthetic image, the failure may not reflect the true ability of object detection. In this paper, we propose an automatic approach to testing object detection via 3D reconstruction of vehicles in real photos. The 3D reconstruction is developed via vanishing point estimation in photos and heuristic-based image insertion. Our approach adds new objects to blank spaces in photos to synthesize images; for example, a new vehicle can be added to a photo of a road and vehicles. With this approach, the output synthetic images are expected to be more natural-looking than randomly synthesized images. The experiment is conducted on 500 driving photos from the Apollo autonomous driving dataset.
DOI: 10.1109/ICSE-Companion52605.2021.00052
Mutagen: faster mutation-based random testing
Authors: Mista, Agustí
Keywords: heuristics, mutation, random testing
Abstract
We present Mutagen, a fully automated mutation-oriented framework for property-based testing. Our tool uses novel heuristics to improve the performance of the testing loop, and it is capable of finding complex bugs within seconds. We evaluate Mutagen by generating random WebAssembly programs that we use to find bugs in a faulty validator.
DOI: 10.1109/ICSE-Companion52605.2021.00053
Detecting user-perceived failure in mobile applications via mining user traces
Authors: Tian, Deyu
Keywords: failure, mobile application, user trace
Abstract
Mobile applications (apps) often suffer from failures. Developers usually pay more attention to failures that are perceived by users and compromise the user experience. Existing approaches focus on mining large volumes of logs to detect failures; however, to the best of our knowledge, no approach focuses on detecting whether users have actually perceived a failure, which directly influences the user experience. In this paper, we propose a novel approach to detecting user-perceived failures in mobile apps. Leveraging front-end user traces, our approach first builds an app page model and then applies an unsupervised detection algorithm to decide whether a user has perceived a failure. Our insight behind the algorithm is that when a user-perceived failure occurs on an app page, users will backtrack and revisit that page to retry. Preliminary evaluation results show that our approach achieves good detection performance on a dataset collected from real-world users.
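The backtrack-and-revisit insight behind this abstract can be illustrated with a toy detector. This is a hypothetical sketch, not the paper's algorithm: it scans a page-visit trace for a back navigation immediately followed by re-entering the same page, which is the retry pattern the approach looks for.

```python
def backtrack_retries(trace):
    """Return pages the user backed out of and then re-entered immediately.

    trace: list of (page, action) pairs, where action is "enter" or "back".
    A "back" on a page followed at once by an "enter" of the same page is
    treated as a candidate user-perceived failure (the user retried).
    """
    suspects = []
    for i in range(len(trace) - 1):
        page, action = trace[i]
        next_page, next_action = trace[i + 1]
        if action == "back" and next_action == "enter" and next_page == page:
            suspects.append(page)
    return suspects
```

A real detector would of course work on an app page model learned from many traces and tolerate noise, but the core signal is this local revisit pattern.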
DOI: 10.1109/ICSE-Companion52605.2021.00054
NodeSRT: a selective regression testing tool for Node.js applications
Authors: Chen, Yufeng
Keywords: JavaScript, dynamic analysis, node.js application, selective regression testing, static analysis
Abstract
Node.js is one of the most popular frameworks for building web applications. As software systems mature, the cost of running their entire regression test suite can become significant. Selective Regression Testing (SRT) is a technique that executes only a subset of the regression test suite so that software failures can be detected more efficiently. Previous SRT studies mainly focused on standard desktop applications. Node.js applications are considered hard targets for test reduction because of Node’s asynchronous, event-driven programming model and because JavaScript is a dynamic programming language. In this paper, we present NodeSRT, a Selective Regression Testing framework for Node.js applications. By performing static and dynamic analysis, NodeSRT identifies the relationship between changed methods and tests, then reduces the regression test suite to only the tests that are affected by a change, improving the execution time of the regression test suite. To evaluate our selection technique, we applied NodeSRT to two open-source projects, Uppy and Simorgh, and compared our approach with the retest-all strategy and the current industry-standard SRT technique, Jest OnlyChange. The results demonstrate that NodeSRT correctly selects affected tests based on changes and is 250% faster and 450% more precise than Jest OnlyChange.
DOI: 10.1109/ICSE-Companion52605.2021.00055
Explainable just-in-time bug prediction: are we there yet?
Authors: Aleithan, Reem
Keywords: bug prediction, prediction explanation
Abstract
Explaining the prediction results of software bug prediction models is a challenging task that can provide useful information for developers to understand and fix the predicted bugs. Recently, Jirayus et al. [4] proposed using two model-agnostic techniques (i.e., LIME and iBreakDown) to explain the prediction results of bug prediction models. Although their experiments on file-level bug prediction show promising results, the performance of these techniques in explaining the results of just-in-time (i.e., change-level) bug prediction is unknown. This paper conducts the first empirical study to explore the explainability of these model-agnostic techniques on just-in-time bug prediction models. Specifically, this study takes a three-step approach: 1) replicating previously widely used just-in-time bug prediction models, 2) applying Local Interpretable Model-agnostic Explanations (LIME) and iBreakDown to the prediction results, and 3) manually evaluating the explanations for buggy instances (i.e., positive predictions) against the root causes of the bugs. The results of our experiment show that LIME and iBreakDown fail to explain the predictions of just-in-time bug prediction models, unlike at the file level [4]. This paper calls for new approaches for explaining the results of just-in-time bug prediction models.
DOI: 10.1109/ICSE-Companion52605.2021.00056
Understanding the challenges and assisting developers with developing spark applications
Authors: Wang, Zehao
Keywords: apache spark, debugging, empirical study, monitoring
Abstract
To process data more efficiently, big data frameworks provide data abstractions to developers. However, because of these abstractions, developers may face many challenges in understanding and debugging data processing code. To uncover the challenges in using big data frameworks, we first conduct an empirical study of 1,000 Apache Spark-related questions on Stack Overflow. We find that most of the challenges are related to data transformation and API usage. To address these challenges, we design an approach that assists developers with understanding and debugging data processing in Spark. Our approach leverages statistical sampling to minimize performance overhead and provides intermediate information and hint messages for each data processing step of a chained method pipeline. A preliminary evaluation of our approach shows that it has low performance overhead, and we received good feedback from developers.
DOI: 10.1109/ICSE-Companion52605.2021.00057
A better approach to track the evolution of static code warnings
Authors: Li, Junjie
Keywords: empirical study, static analysis, tracking static code warnings
Abstract
Static bug detection tools help developers detect code problems. However, it is known that they remain underutilized for various reasons. Recent advances in incorporating static bug detectors into modern software development workflows can better motivate developers to fix the reported warnings on the fly. In this paper, we study the effectiveness of the state-of-the-art (SOA) solution for tracking warnings from static bug detectors and propose a better solution based on our analysis of the insufficiencies of the SOA solution. In particular, we examined four large-scale open-source systems and crafted a data set of 3,452 static code warnings reported by two static bug detectors. We manually uncovered the ground-truth evolution status of the selected warnings: persistent, resolved, or newly introduced. Moreover, through manual analysis, we identified the critical reasons behind the insufficiencies of the SOA matching algorithm. Finally, we propose a better approach to tracking static warnings over software development history. Our evaluation shows that our proposed approach provides a significant improvement in the precision of the tracking, i.e., from 66.9% to 90.0%.
DOI: 10.1109/ICSE-Companion52605.2021.00058
Please don’t go: increasing women’s participation in open source software
Authors: Trinkenreich, Bianca
Keywords: career, diversity, gender, inclusion, open source software, participation, success, women
Abstract
Women represent less than 24% of the software development industry and suffer from various types of prejudice and bias. In Open Source Software (OSS) projects, despite a variety of efforts to increase diversity and multi-gendered participation, women are even more underrepresented (less than 10%). My research focuses on answering the question: how can OSS communities increase women’s participation in OSS projects? I will identify the different OSS career pathways and develop a holistic view of women’s motivations to join or leave OSS, along with their definitions of success. Based on this empirical investigation, I will work together with the Linux Foundation to design attraction and retention strategies focused on women. Before and after implementing the strategies, I will conduct empirical studies to evaluate the state of the practice and understand the implications of the strategies.
DOI: 10.1109/ICSE-Companion52605.2021.00059
WebEvo: taming web application evolution via semantic structure change detection
Authors: Shao, Fei and Xiao, Xusheng
Keywords: No keywords
Abstract
To prevent information retrieval (IR) and robotic process automation (RPA) tools from functioning improperly due to website evolution, it is important to develop web monitoring tools that monitor changes in a website and report them to developers and testers. Existing monitoring tools commonly make use of DOM-tree based similarity and visual analysis between different versions of web pages. However, DOM-tree based similarity is prone to false positives, since it cannot identify content-based changes (i.e., contents refreshed every time a web page is retrieved) and GUI widget evolution (e.g., moving a button). Such imprecision adversely affects IR tools and test scripts. To address this problem, we propose an approach, WebEvo, that first performs DOM-based change detection and then leverages historic pages to identify the regions that represent content-based changes, which can be safely ignored. Further, to identify refactoring changes that preserve the semantics and appearance of GUI widgets, WebEvo adapts computer vision (CV) techniques to identify mappings of GUI widgets from the old web page to the new web page on an element-by-element basis. We evaluated WebEvo on 10 real-world websites from 8 popular categories to demonstrate its superiority over existing work that relies on DOM-tree based detection or whole-page visual comparison.
DOI: 10.1109/ICSE-Companion52605.2021.00060
ProMal: precise window transition graphs for Android via synergy of program analysis and machine learning
Authors: Liu, Changlin and Xiao, Xusheng
Keywords: No keywords
Abstract
Mobile apps have become an integral part of our daily life. As these apps become more complex, it is critical to provide automated analysis techniques to ensure their correctness, security, and performance. A key component of these automated analysis techniques is creating a graphical user interface (GUI) model of an app, i.e., a window transition graph (WTG), that models windows and the transitions among them. While existing work has provided both static and dynamic analyses to build the WTG for an app, the constructed WTG misses many transitions or contains many infeasible transitions, due to the coverage issues of dynamic analysis and the over-approximation of static analysis. We propose ProMal, a “tribrid” analysis that synergistically combines static analysis, dynamic analysis, and machine learning to construct a precise WTG. Specifically, ProMal first applies static analysis to build a static WTG and then applies dynamic analysis to verify the transitions in the static WTG. For the unverified transitions, ProMal applies machine learning techniques that leverage runtime information (i.e., screenshots, UI layouts, and text) to predict whether they are feasible. Our evaluation on 40 real-world apps demonstrates the superiority of ProMal in building WTGs over static analysis, dynamic analysis, and machine learning techniques applied separately.
DOI: 10.1109/ICSE-Companion52605.2021.00061
Microservice-based performance problem detection in cyber-physical system software updates
Authors: Gartziandia, Aitor
Keywords: cyber-physical systems, machine learning, microservices, performance bugs
Abstract
Software embedded in Cyber-Physical Systems (CPSs) usually has a long life-cycle and is continuously evolving. The increasing expansion of IoT and CPSs has highlighted the need for additional mechanisms for remote deployment and updating of this software to ensure its correct behaviour. Performance problems require special attention, as they may appear in operation due to limitations of lab testing and to environmental conditions. In this context, we propose a microservice-based method to detect performance problems in CPSs. These microservices are deployed in the installation to detect performance problems at run-time when new software versions are deployed. Problem detection is based on Machine Learning algorithms that predict the performance of a new software release from knowledge of previous releases. This permits taking corrective actions so that system reliability is guaranteed.
DOI: 10.1109/ICSE-Companion52605.2021.00062
Automation and evaluation of mutation testing for the new C++ standards
Authors: Á
Keywords: C++ new standards, mutation testing, software testing
Abstract
Mutation testing is becoming increasingly widely used to evaluate the quality of test suites, especially for programs coded in programming languages that are widely used in industry. Mutation tools have arisen to automate the technique in different languages, including C++. With the increasing use of this technique, new mutation operators modeling possible faults often emerge to improve its abilities and adapt the tools to new advanced features. In this work, mutation operators for the new C++ standards, defined in previous work, are implemented and applied to generate and execute mutants on real programs. With this study, the MuCPP mutation tool is updated to include these new operators. In addition, the improvements suggested in the definition of those operators can finally be tested, and conclusions about their utility in practice can be drawn. The implemented operators are checked on a set of four C++ programs that use these advanced features. The results show significant differences from the previous manual analysis: the number of invalid mutants was reduced by 64%, and we found fewer live mutants (88%) and an increase in dead mutants (31%). In summary, both the number of mutants incorrectly classified in the previous manual analysis and the number of mutants generated (particularly equivalent mutants) have been reduced.
DOI: 10.1109/ICSE-Companion52605.2021.00063
Investigating the interplay between developers and automation
Authors: Elazhary, Omar
Keywords: automation, continuous integration, continuous software development, software engineering, software engineering theory
Abstract
Continuous practices are a staple of the modern software development workflow. Automation, in particular, is widely adopted due to its benefits related to quality and productivity. However, automation, similarly to all other aspects of the software development workflow, interacts with humans (in this case developers). While some work has investigated the impact of automation on developers, it is not clear to what extent context and process influence that impact. We present our ADEPT theory of developers and automation, in an attempt to bridge this gap and identify the possible ways context, process, and other factors may influence how developers perceive, interpret, and interact with automation.
DOI: 10.1109/ICSE-Companion52605.2021.00064
JEST: N+1-version differential testing of both JavaScript engines and specification
Authors: Park, Jihyeok and An, Seungmin and Youn, Dongjun and Kim, Gyeongwon and Ryu, Sukyoung
Keywords: JavaScript, conformance test generation, differential testing, mechanized specification
Abstract
Modern programming follows the continuous integration (CI) and continuous deployment (CD) approach rather than the traditional waterfall model. Even the development of modern programming languages uses the CI/CD approach to swiftly provide new language features and to adapt to new development environments. Unlike in the conventional approach, in the modern CI/CD approach a language specification is no longer the oracle of the language semantics, because both the specification and its implementations (interpreters or compilers) can co-evolve. In this setting, both the specification and implementations may have bugs, and guaranteeing their correctness is non-trivial. In this paper, we propose a novel N+1-version differential testing approach to resolve the problem. Unlike traditional differential testing, our approach consists of four steps: 1) automatically synthesize programs guided by the syntax and semantics of a given language specification, 2) generate conformance tests by injecting assertions into the synthesized programs to check their final program states, 3) detect bugs in the specification and implementations by executing the conformance tests on multiple implementations, and 4) localize bugs in the specification using statistical information. We actualize our approach for the JavaScript programming language via JEST, which performs N+1-version differential testing for modern JavaScript engines and ECMAScript, the language specification describing the syntax and semantics of JavaScript in natural language. We evaluated JEST with four JavaScript engines that support all modern JavaScript language features and the latest version of ECMAScript (ES11, 2020). JEST automatically synthesized 1,700 programs that covered 97.78% of the syntax and 87.70% of the semantics of ES11. Using the assertion-injected JavaScript programs, it detected 44 engine bugs in the four engines and 27 specification bugs in ES11.
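Step 2 of the approach (assertion injection) can be illustrated with a small sketch. The names here are invented for illustration and do not reflect JEST's real implementation, which works from ECMAScript's mechanized semantics: given a synthesized program and the final values of its top-level variables (as computed from the specification), we append one assertion per variable, so that any engine disagreeing with the specification fails the resulting conformance test.

```python
def inject_assertions(program: str, final_state: dict) -> str:
    """Append one $assert(...) per top-level variable, turning a
    synthesized program into a self-checking conformance test.

    program: source text of the synthesized JavaScript program.
    final_state: mapping from variable name to its spec-defined final value.
    """
    lines = [program]
    for var, value in sorted(final_state.items()):
        # Python's repr of ints and strings happens to be valid JavaScript
        # literal syntax for these simple cases; real value serialization
        # would need to handle NaN, undefined, objects, etc.
        lines.append(f"$assert({var} === {value!r});")
    return "\n".join(lines)
```

Running the same assertion-injected program on several engines then yields the N+1-version comparison: an assertion failing on one engine points at an engine bug, while a failure on all engines points back at the specification.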
DOI: 10.1109/ICSE-Companion52605.2021.00065
A replication of "Are machine learning cloud APIs used correctly?"
Authors: Wan, Chengcheng and Liu, Shicheng and Hoffmann, Henry and Maire, Michael and Lu, Shan
Keywords: No keywords
Abstract
This artifact provides the benchmark suite, data, and scripts used in our study “Are Machine Learning Cloud APIs Used Correctly?”. We collected a suite of 360 non-trivial applications that use ML cloud APIs for manual study. We also developed checkers and a tool to detect and fix API misuses. We hope this artifact can motivate and help future research to further tackle ML API misuses. All related data are available online.
DOI: 10.1109/ICSE-Companion52605.2021.00066
A replication package for it takes two to tango: combining visual and textual information for detecting duplicate video-based bug reports
Authors: Cooper, Nathan and Bernal-Cárdenas, Carlos
Keywords: bug reporting, duplicate detection, screen recordings
Abstract
When a bug manifests in a user-facing application, it is likely to be exposed through the graphical user interface (GUI). Given the importance of visual information to the process of identifying and understanding such bugs, users are increasingly making use of screenshots and screen recordings as a means to report issues to developers. Due to their graphical nature, screen recordings present challenges for automated analysis that preclude the use of current duplicate bug report detection techniques. This paper describes in detail our reproduction package artifact for Tango, a duplicate detection technique that operates purely on video-based bug reports by leveraging both visual and textual information to overcome these challenges and aid developers in this task. Specifically, this reproduction package contains the data and code that enable replication of Tango’s empirical evaluation and future research in the area of duplicate video-based bug report detection.
DOI: 10.1109/ICSE-Companion52605.2021.00067
Artifact: reducing DNN properties to enable falsification with adversarial attacks
Authors: Shriver, David and Elbaum, Sebastian and Dwyer, Matthew B.
Keywords: adversarial attacks, falsification, formal methods, neural nets
Abstract
We present an artifact to accompany “Reducing DNN Properties to Enable Falsification with Adversarial Attacks”, which includes the DNNF tool and the data and scripts needed to replicate its study. The artifact is both reusable and available. DNNF is available on GitHub, and we provide an artifact to reproduce our study as a VirtualBox virtual machine image. Full replication of the study requires 64GB of memory and 8 CPU cores. Users should know how to use VirtualBox and have basic knowledge of the bash shell.
DOI: 10.1109/ICSE-Companion52605.2021.00068
IMGDroid: a static analyzer for detecting image loading defects in Android applications
作者: Song, Wei and Han, Mengqi and Huang, Jeff
关键词: Android apps, image loading defects, static analysis
Abstract
We summarize five anti-patterns of image loading defects in Android apps: image passing by intent, image decoding without resizing, local image loading without permission, repeated decoding without caching, and image decoding in the UI thread. Based on these anti-patterns, we propose a static analyzer, IMGDroid, to automatically and effectively detect such defects. Readers can access our artifacts from GitHub and Zenodo and can run IMGDroid to detect image loading defects in Android apps; accordingly, we are applying for the Reusable and Available badges. We implement IMGDroid in Java and perform the experiments on a computer with Windows 10, JDK 1.8, and Android 7.1.1. Therefore, reviewers are required to be familiar with Java and proficient in using Eclipse.
DOI: 10.1109/ICSE-Companion52605.2021.00069
CIBench: a dataset and collection of techniques for build and test selection and prioritization in continuous integration
作者: Jin, Xianhao and Servant, Francisco
关键词: continuous integration, empirical software engineering, software maintenance
Abstract
Continuous integration (CI) is a widely used practice in modern software engineering. Unfortunately, it is also an expensive practice: Google and Mozilla estimate that their CI systems cost millions of dollars. There are a number of techniques and tools designed to, or with the potential to, reduce the cost of CI or expand its benefit of reducing time to feedback. However, their benefits along some dimensions may come with drawbacks along others, and they may also help in scenarios they were not designed for. Therefore, we built CIBench, a dataset and collection of techniques for build and test selection and prioritization in continuous integration. CIBench is based on TravisTorrent [2], a popular existing dataset for CI, and extends it in multiple ways, including mining additional Travis logs and GitHub commits and building dependency graphs for the studied projects. This dataset allows us to replicate and evaluate existing techniques to improve CI under the same settings, and to better understand the impact of applying different strategies in a more comprehensive way.
DOI: 10.1109/ICSE-Companion52605.2021.00070
Program comprehension and code complexity metrics: a replication package of an fMRI study
作者: Peitek, Norman and Apel, Sven and Parnin, Chris and Brechmann, André, et al.
关键词: No keywords
Abstract
In this artifact, we document our publicly shared data set of our functional magnetic resonance imaging (fMRI) study on programmers. We have conducted an fMRI study with 19 participants observing program comprehension of short code snippets at varying complexity levels [1]. We dissected four classes of code complexity metrics and their relationship to neuronal, behavioral, and subjective correlates of program comprehension. Our data corroborate that complexity metrics can—to a limited degree—explain programmers’ cognition in program comprehension.
DOI: 10.1109/ICSE-Companion52605.2021.00071
Too quiet in the library: an empirical study of security updates in android apps’ native code
作者: Almanee, Sumaya and Ünal, Arda, et al.
关键词: No keywords
Abstract
Android apps include third-party native libraries to increase performance and to reuse functionality. Native code is directly executed from apps through the Java Native Interface or the Android Native Development Kit. Android developers add precompiled native libraries to their projects, enabling their use. Unfortunately, developers often struggle or simply neglect to update these libraries in a timely manner. This results in the continuous use of outdated native libraries with unpatched security vulnerabilities years after patches became available. To further understand this phenomenon, we study the security updates in native libraries in the 200 most popular free apps on Google Play from Sept. 2013 to May 2020. A core difficulty we face in this study is the identification of libraries and their versions. Developers often rename or modify libraries, making their identification challenging. We create an approach called LibRARIAN (LibRAry veRsion IdentificAtioN) that accurately identifies native libraries and their versions as found in Android apps based on our novel similarity metric bin2sim. LibRARIAN leverages different features extracted from libraries based on their metadata and identifying strings in read-only sections. We discovered that 53 of the 200 popular apps (26.5%) used vulnerable versions with known CVEs between Sept. 2013 and May 2020, with 14 of those apps remaining vulnerable. We find that app developers took, on average, 528.71 ± 40.20 days to apply security patches, while library developers release a security patch after 54.59 ± 8.12 days on average, meaning apps adopt fixes roughly ten times more slowly than they are released.
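To make the version-identification problem concrete, here is a toy sketch that matches an unknown library against known versions by the overlap of their identifying strings, using a plain Jaccard index. This is a simplified stand-in, not LibRARIAN's actual bin2sim metric, and all library names and strings below are invented.

```python
# Hypothetical sketch of version identification by string-set similarity.
# LibRARIAN's bin2sim uses several binary features; here we illustrate only
# the general idea with a Jaccard index over identifying strings that might
# be extracted from a library's read-only sections.

def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def identify_version(unknown_strings, known_versions):
    """Return the known version whose string set is most similar."""
    return max(known_versions, key=lambda v: jaccard(unknown_strings, known_versions[v]))

known = {
    "libpng 1.6.34": {"libpng version 1.6.34", "png_read_info", "inflate"},
    "libpng 1.6.37": {"libpng version 1.6.37", "png_read_info", "inflate", "png_safe_exec"},
}
extracted = {"libpng version 1.6.37", "png_read_info", "png_safe_exec"}
print(identify_version(extracted, known))  # → libpng 1.6.37
```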
DOI: 10.1109/ICSE-Companion52605.2021.00072
JUSTGen: effective test generation for unspecified JNI behaviors on JVMs
作者: Hwang, Sungjae and Lee, Sungho and Kim, Jihoon and Ryu, Sukyoung
关键词: Java native interface, Java virtual machine, debugging, empirical study, testing
Abstract
Java Native Interface (JNI) provides a way for Java applications to access native libraries, but it is difficult to develop correct JNI programs. By leveraging native code, the JNI enables Java developers to implement efficient applications and to reuse code written in other programming languages such as C and C++. Besides, the core Java libraries already use the JNI to provide system features like a graphical user interface. As a result, many mainstream Java Virtual Machines (JVMs) support the JNI. However, due to the complex interoperation semantics between different programming languages, implementing correct JNI programs is not trivial. Moreover, because of the performance overhead, JVMs do not validate erroneous JNI interoperations by default; they validate them only when the debug feature, the -Xcheck:jni option, is enabled. Therefore, the correctness of JNI programs relies heavily on the checks performed by the -Xcheck:jni option of JVMs. Questions remain, however, about the quality of the checks provided by this feature. Are there any properties that the -Xcheck:jni option fails to validate? If so, what potential issues can arise due to the lack of such validation? To the best of our knowledge, no research has explored these questions in depth. In this paper, we empirically study the validation quality and impacts of the -Xcheck:jni option on mainstream JVMs using unspecified corner cases in the JNI specification. Such unspecified cases may lead to unexpected run-time behaviors because their semantics is not defined in the specification. For a systematic study, we propose JUSTGen, a semi-automated approach to identify unspecified cases from a specification and generate test programs. JUSTGen receives the JNI specification written in our domain-specific language (DSL) and automatically discovers unspecified cases using an SMT solver. It then generates test programs that trigger the behaviors of unspecified cases.
Using the generated tests, we empirically study the validation ability of the -Xcheck:jni option. Our experimental results show that the JNI debug feature does not validate thousands of unspecified cases on JVMs, and these cases can cause critical run-time errors such as violations of the Java type system and memory corruption. We reported 792 unspecified cases that are not validated by JVMs to the corresponding JVM vendors. Among them, 563 cases have been fixed, and the remaining cases will be fixed in the near future. Based on our empirical study, we believe that the JNI specification should clearly specify the semantics of the missing cases and that the debug feature should validate them completely.
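The idea of hunting for unspecified cases can be illustrated without an SMT solver: model the specification as rules over a small finite parameter space and flag every combination that no rule covers. JUSTGen does this symbolically over its JNI DSL; the domain and rules below are invented for the example.

```python
from itertools import product

# Toy illustration of finding "unspecified cases": the specification is a set
# of rules over a small finite parameter space, and any combination matched
# by no rule is unspecified. JUSTGen does this symbolically with an SMT
# solver over its JNI DSL; this domain and these rules are made up.

DOMAIN = {"ref": ["null", "valid", "stale"], "attached": [True, False]}

RULES = [  # (predicate over a case, specified behavior)
    (lambda c: c["ref"] == "null", "throw NullPointerException"),
    (lambda c: c["ref"] == "valid" and c["attached"], "call succeeds"),
]

def unspecified_cases():
    keys = sorted(DOMAIN)
    cases = [dict(zip(keys, vals)) for vals in product(*(DOMAIN[k] for k in keys))]
    return [c for c in cases if not any(pred(c) for pred, _ in RULES)]

for case in unspecified_cases():
    print(case)  # each would get a generated conformance test
```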
DOI: 10.1109/ICSE-Companion52605.2021.00073
An empirical assessment of global COVID-19 contact tracing applications
作者: Sun, Ruoxi and Wang, Wei and Xue, Minhui and Tyson, Gareth and Camtepe, Seyit and Ranasinghe, Damith C.
关键词: No keywords
Abstract
This is the artifact accompanying the paper “An Empirical Assessment of Global COVID-19 Contact Tracing Applications”, accepted at ICSE 2021. The artifact presents the first automated security and privacy assessment tool that tests contact tracing apps for security weaknesses, malware, embedded trackers, and private information leakage. COVIDGuardian outperforms four state-of-the-practice industrial and open-source tools. Note that, although the tool is tailored to focus on contact tracing apps, it can also be adapted to other types of apps with respect to the NLP PII learning context, e.g., by changing the source & sink list or updating the sensitive PII keywords.
DOI: 10.1109/ICSE-Companion52605.2021.00074
An evolutionary study of configuration design and implementation in cloud systems
作者: Zhang, Yuanliang and He, Haochen and Legunsen, Owolabi and Li, Shanshan and Dong, Wei and Xu, Tianyin
关键词: No keywords
Abstract
Many techniques have been proposed for detecting software misconfigurations and diagnosing unintended behavior caused by misconfigurations in cloud systems. Detection and diagnosis are steps in the right direction: misconfigurations cause many costly failures and severe performance issues. But, we argue that continued focus on detection and diagnosis is symptomatic of a more serious problem: configuration design and implementation are not yet first-class software engineering endeavors in cloud systems. Little is known about how and why developers evolve configuration design and implementation, and the challenges that they face in doing so.
DOI: 10.1109/ICSE-Companion52605.2021.00075
Artifact for “GenTree: using decision trees to learn interactions for configurable software”
作者: Nguyen, KimHao and Nguyen, ThanhVu
关键词: No keywords
Abstract
This document describes the artifact package accompanying the ICSE'21 paper “GenTree: Using Decision Trees to Learn Interactions for Configurable Software” [1]. The artifact includes the GenTree source code, pre-built binaries, benchmark program specifications, and scripts to replicate the data presented in the paper. Furthermore, GenTree is applicable to new programs written in supported languages (C, C++, Python, Perl, OCaml) and can be extended to support new languages easily. The GenTree implementation is highly modular and optimized; hence, it can also be used as a framework for developing and testing new interaction inference algorithms. We hope the artifact will be useful for researchers who are interested in interaction learning, especially iterative and data-driven approaches.
DOI: 10.1109/ICSE-Companion52605.2021.00076
Artifact of ‘FLACK: counterexample-guided fault localization for alloy models’
作者: Zheng, Guolong and Nguyen, ThanhVu and Brida, Simón Gutiérrez, et al.
关键词: alloy, fault localization
Abstract
This document provides instructions to set up and execute FLACK. FLACK is an automatic fault localization tool for Alloy. Given an Alloy model with violated assertions, FLACK automatically outputs a list of expressions ranked by their suspiciousness with respect to the error. The link to the replication package is https://github.com/guolong-zheng/flack-ae. The replication package contains the source code of FLACK and benchmarks to reproduce all the evaluation results in the ICSE 2021 submission.
DOI: 10.1109/ICSE-Companion52605.2021.00077
Artifact for improving fault localization by integrating value and predicate based causal inference techniques
作者: Küçük, Yiğit, et al.
关键词: No keywords
Abstract
This work presents an overview of the artifact for the paper titled “Improving Fault Localization by Integrating Value and Predicate Based Causal Inference Techniques”. The artifact is packaged in a virtual machine and includes the scripts for the UniVal fault localization algorithm, employing the Defects4J test suite. Technical information about the individual components of the artifact's repository, as well as guidance on the documentation necessary for utilizing the software, is provided.
DOI: 10.1109/ICSE-Companion52605.2021.00078
ThEodorE: a trace checker for CPS properties
作者: Menghi, Claudio and Viganò, Enrico, et al.
关键词: formal methods, monitors, semantics, specification, validation
Abstract
ThEodorE is a trace checker for Cyber-Physical systems (CPS). It provides users with (i) a GUI editor for writing CPS requirements; (ii) an automatic procedure to check whether the requirements hold on execution traces of a CPS. ThEodorE enables writing requirements using the Hybrid Logic of Signals (HLS), a novel, logic-based specification language to express CPS requirements. The trace checking procedure of ThEodorE reduces the problem of checking if a requirement holds on an execution trace to a satisfiability problem, which can be solved using off-the-shelf Satisfiability Modulo Theories (SMT) solvers. This artifact paper presents the tool support provided by ThEodorE.
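To give a flavor of what trace checking means here, the sketch below evaluates two toy requirements over a small invented trace by direct evaluation. ThEodorE itself expresses requirements in HLS and reduces checking to SMT satisfiability; the signal names and values below are assumptions made purely for the example.

```python
# Minimal illustration of checking properties over a finite CPS execution
# trace. ThEodorE reduces HLS requirements to SMT problems; here we only
# show the underlying idea with direct evaluation on an invented trace.

trace = [  # one record per sampled time step
    {"t": 0.0, "speed": 3.2, "brake": 0},
    {"t": 0.1, "speed": 5.8, "brake": 0},
    {"t": 0.2, "speed": 9.1, "brake": 1},
]

def always(trace, pred):
    """The predicate holds at every step of the trace."""
    return all(pred(step) for step in trace)

def eventually(trace, pred):
    """The predicate holds at some step of the trace."""
    return any(pred(step) for step in trace)

# "The speed never reaches 10" and "the brake is eventually applied".
print(always(trace, lambda s: s["speed"] < 10.0))    # → True
print(eventually(trace, lambda s: s["brake"] == 1))  # → True
```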
DOI: 10.1109/ICSE-Companion52605.2021.00079
EvoSpex: an evolutionary algorithm for learning postconditions (artifact)
作者: Molina, Facundo and Ponzio, Pablo and Aguirre, Nazareno and Frias, Marcelo
关键词: No keywords
Abstract
Having the expected behavior of software specified in a formal language can greatly improve the automation of software verification activities, since these need to contrast the intended behavior with the actual software implementation. Unfortunately, software often lacks such specifications, so providing tools and techniques that can assist developers in constructing them is relevant to software engineering. As an aid in this context, we present EvoSpex, a tool that, given a Java method, automatically produces a specification of the method's current behavior in the form of postcondition assertions. EvoSpex is based on generating software runs from the implementation (valid runs), making modifications to the runs to build divergent behaviors (invalid runs), and executing a genetic algorithm that tries to evolve a specification that satisfies the valid runs and leaves out the invalid ones. Our tool supports a rich JML-like assertion language that can capture complex specifications, including sophisticated object structural properties.
DOI: 10.1109/ICSE-Companion52605.2021.00080
FlakeFlagger: predicting flakiness without rerunning tests
作者: Alshammari, Abdulrahman and Morris, Christopher and Hilton, Michael and Bell, Jonathan
关键词: No keywords
Abstract
This is an extended abstract that describes the code and data artifact [1] of our paper “FlakeFlagger: Predicting Flakiness Without Rerunning Tests” [2]. The goal of our artifact is to let researchers reproduce our experiment on our provided flaky dataset or reuse our tool on different flaky tests datasets.
DOI: 10.1109/ICSE-Companion52605.2021.00081
MAANA: an automated tool for DoMAin-specific HANdling of ambiguity
作者: Ezzini, Saad and Abualhaija, Sallam and Arora, Chetan and Sabetzadeh, Mehrdad and Briand, Lionel C.
关键词: Wikipedia, ambiguity, corpus generation, natural language processing, natural-language requirements, requirements engineering
Abstract
MAANA (in Arabic: “meaning”) is a tool for performing domain-specific handling of ambiguity in requirements. Given a requirements document as input, MAANA detects the requirements that are potentially ambiguous. The focus of MAANA is on coordination ambiguity and prepositional-phrase attachment ambiguity; these are two common ambiguity types that have been studied in the requirements engineering literature. To detect ambiguity, MAANA utilizes structural patterns and a set of heuristics derived from a domain-specific corpus. The analysis file generated by running the tool can be reviewed by requirements analysts. By combining different knowledge sources, MAANA also highlights the requirements that might contain unacknowledged ambiguity, that is, requirements for which analysts hold different interpretations without explicitly discussing them with one another due to time constraints. This artifact paper presents the details of MAANA. MAANA is associated with the ICSE 2021 technical paper titled “Using Domain-specific Corpora for Improved Handling of Ambiguity in Requirements”. The tool is publicly available on GitHub and Zenodo.
DOI: 10.1109/ICSE-Companion52605.2021.00082
Replication of SOAR: a synthesis approach for data science API refactoring
作者: Ni, Ansong and Ramos, Daniel and Yang, Aidan Z. H. and Lynce, Inês, et al.
关键词: No keywords
Abstract
This paper provides a guide to the replication package of SOAR: A Synthesis Approach for Data Science API Refactoring. Our replication package provides a reliable way of reproducing the results of the paper using a virtual machine. The replication package includes scripts to generate the tables and figures presented in the results section of the paper. Details on how to use those scripts and run SOAR are explained throughout this guide.
DOI: 10.1109/ICSE-Companion52605.2021.00083
Research artifact: the potential of meta-maintenance on GitHub
作者: Hata, Hideaki and Kula, Raula Gaikovina and Ishio, Takashi and Treude, Christoph
关键词: No keywords
Abstract
This is a research artifact for the paper “Same File, Different Changes: The Potential of Meta-Maintenance on GitHub”. This artifact is a data repository including a list of the 32,007 studied repositories on GitHub, a list of the 401,610,677 targeted files, the results of the qualitative analysis for RQ2, RQ3, and RQ4, the results of the quantitative analysis for RQ5, and survey material for RQ6. The purpose of this artifact is to enable researchers to replicate the mixed-methods results of the paper and to reuse the results of our exploratory study for further software engineering research. This research artifact is available at https://github.com/NAIST-SE/MetaMaintenancePotential and https://doi.org/10.5281/zenodo.4456668.
DOI: 10.1109/ICSE-Companion52605.2021.00084
Replication package for article: data-oriented differential testing of object-relational mapping systems
作者: Sotiropoulos, Thodoris and Chaliasos, Stefanos and Atlidakis, Vaggelis and Mitropoulos, Dimitris and Spinellis, Diomidis
关键词: No keywords
Abstract
The ICSE 2021 paper titled “Data-Oriented Differential Testing of Object-Relational Mapping Systems” [1] comes with a replication package, which has been awarded the “Available” badge by the Artifact Evaluation Committee. The artifact contains scripts and step-by-step instructions to (1) get familiar with the corresponding bug-finding tool (namely CYNTHIA), (2) reproduce the results of the main paper, and (3) re-run the bugs discovered by CYNTHIA. Specifically, the artifact has the following structure:
• scripts/: The directory that contains the scripts needed to re-run the experiments presented in our paper.
• bugs/bug_schema.sql: The database schema that contains the bugs discovered by CYNTHIA.
• bugs/bugdb.sqlite3: The SQLite database file corresponding to the schema defined in bugs/bug_schema.sql.
• example_bugs/: Contains test cases that trigger the two ORM bugs demonstrated in Section II of the main paper.
• cynthia/: Contains the source code of CYNTHIA.
DOI: 10.1109/ICSE-Companion52605.2021.00085
Understanding community smells variability: A statistical approach: replication package instructions
作者: Catolino, Gemma and Palomba, Fabio and Tamburri, Damian Andrew and Serebrenik, Alexander
关键词: community smells, replication package, statistical models
Abstract
In this document, we present the replication package of the paper “Understanding Community Smells Variability: A Statistical Approach” accepted at the 43rd International Conference on Software Engineering - Software Engineering in Society Track (ICSE '21).
DOI: 10.1109/ICSE-Companion52605.2021.00086
Shipwright: a human-in-the-loop system for dockerfile repair (artifact abstract)
作者: Henkel, Jordan and Silva, Denini and Teixeira, Leopoldo and d’Amorim, Marcelo and Reps, Thomas
关键词: No keywords
Abstract
Shipwright is a human-in-the-loop system for Dockerfile repair. In this artifact, we provide the data, tools, and scripts necessary to allow others to run our experiments (either in full, or reduced versions where necessary). In particular, we provide code and data corresponding to each of the four research questions we answered in the Shipwright paper.
DOI: 10.1109/ICSE-Companion52605.2021.00087
A replication package for PyCG: practical call graph generation in Python
作者: Salis, Vitalis and Sotiropoulos, Thodoris and Louridas, Panos and Spinellis, Diomidis and Mitropoulos, Dimitris
关键词: No keywords
Abstract
The ICSE 2021 paper titled “PyCG: Practical Call Graph Generation in Python” comes with a replication package with the purpose of providing open access to (1) our prototype call graph generator, namely PyCG, and (2) the data and scripts that replicate the results of the paper. The Artifact Evaluation Committee found that this package leads to the reproduction of the results outlined in the paper and is openly available. The replication package contains the following:
1) A Docker image, which can be either built manually or downloaded from DockerHub. It contains the source code and installation of PyCG, as well as installations of two other call graph generators (i.e., Pyan and Depends), which we compare PyCG with.
2) A micro-benchmark suite of 112 Python modules (Section I-A).
3) A macro-benchmark suite of 5 popular Python packages (Section I-B).
4) Python and Bash scripts used to execute PyCG, Pyan and Depends against the micro- and macro-benchmarks and compare the corresponding results.
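As a toy counterpart to the call graph generation problem PyCG addresses, the sketch below extracts direct call edges from Python source using the standard ast module. Unlike PyCG, it ignores attribute calls, higher-order flows, and inter-module analysis; the sample source is invented for the example.

```python
import ast

# Toy static call-edge extraction for Python. PyCG solves this far more
# precisely; this sketch records only direct `name()` calls appearing inside
# function bodies.

def call_edges(source):
    """Return a set of (caller, callee) pairs for direct name calls."""
    tree = ast.parse(source)
    edges = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.add((node.name, inner.func.id))
    return edges

src = """
def helper():
    return 1

def main():
    return helper() + len([1, 2])
"""
print(sorted(call_edges(src)))  # → [('main', 'helper'), ('main', 'len')]
```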
DOI: 10.1109/ICSE-Companion52605.2021.00088
RUSTINA: Automatically checking and patching inline assembly interface compliance (artifact evaluation): accepted submission #992 - “interface compliance of inline assembly: automatically check, patch and refine”
作者: Recoules, Frédéric, et al.
关键词: No keywords
Abstract
The main goal of the artifact is to support the experimental claims of paper #992, “Interface Compliance of Inline Assembly: Automatically Check, Patch and Refine” [3], by making both the prototype and the data available to the community. The expected result is the same output as the figures given in Table I and Table IV (Appendix C) of the paper. In addition, we hope the released snapshot of our prototype is simple, documented, and robust enough to be of use to people dealing with inline assembly.
DOI: 10.1109/ICSE-Companion52605.2021.00089
Data and materials for: Why don’t developers detect improper input validation?'; DROP TABLE papers;
作者: Braz, Larissa and Fregnan, Enrico and Çalıklı, Gül, et al.
关键词: No keywords
Abstract
Improper Input Validation (IIV) is a dangerous software vulnerability that occurs when a system does not safely handle input data. Although IIV is easy to detect and fix, it still commonly occurs in practice. So why do developers not recognize IIV? Answering this question is key to understanding how to support developers in creating secure software systems. In our work, we studied to what extent developers can detect IIV and investigated the underlying reasons. To do so, we conducted an online experiment with 146 software developers. In this document, we explain how to obtain the artifact package of our study, describe the artifact material, and explain how to use the artifacts.
DOI: 10.1109/ICSE-Companion52605.2021.00090
Artifact: distribution-aware testing of neural networks using generative models
作者: Dola, Swaroopa and Dwyer, Matthew B. and Soffa, Mary Lou
关键词: deep neural networks, input validation, test coverage, test generation
Abstract
The artifact used for the experimental evaluation of Distribution-Aware Testing of Neural Networks Using Generative Models is publicly available on GitHub and it is reusable. The artifact consists of python scripts, trained deep neural network model files and data required for running the experiments. It is also provided as a VirtualBox VM image for reproducing the paper results. Users should be familiar with using VirtualBox software and Linux platform to reproduce or reuse the artifact.
DOI: 10.1109/ICSE-Companion52605.2021.00091
A partial replication of “RAICC: revealing atypical inter-component communication in Android apps”
作者: Samhi, Jordan and Bartel, Alexandre and Bissyandé, Tegawendé F., et al.
关键词: No keywords
Abstract
This short paper presents the artefacts related to our ICSE 2021 research paper.
DOI: 10.1109/ICSE-Companion52605.2021.00092
Artifact of bounded exhaustive search of alloy specification repairs
作者: Brida, Simón Gutiérrez, et al.
关键词: No keywords
Abstract
BeAFix is a tool and technique for automated repair of faulty models written in Alloy, a declarative formal specification language based on first-order relational logic. BeAFix takes a faulty Alloy model, i.e., an Alloy model with at least one analysis command whose result is contrary to the developer's expectation, together with a set of suspicious specification locations, and explores the space of fix candidates consisting of all alternative expressions for the indicated locations that can be constructed by bounded application of a family of mutation operations. BeAFix can work with any kind of specification oracle, from Alloy test cases to the standard predicates and assertions typically found in Alloy specifications, and is backed by a number of sound pruning strategies for efficient exploration of fix candidate search spaces.
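The bounded exhaustive search over mutation-based fix candidates can be sketched in miniature. Here the "model" is a single Python predicate, the mutation operators merely swap the comparison operator at one suspicious location, and the oracle is a handful of input/expected pairs; all of this is hypothetical and far simpler than Alloy repair.

```python
import operator

# Toy sketch of bounded mutation-based repair in the style BeAFix applies to
# Alloy expressions. The faulty "spec" is `x OP 0` with one suspicious
# location (OP); candidates are generated by swapping the operator, and the
# oracle is a small set of invented input/expected pairs.

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def candidate(op_name):
    """Build the fix candidate obtained by mutating the operator."""
    return lambda x: OPS[op_name](x, 0)

oracle = [(-1, False), (0, True), (5, True)]  # intended behavior: x >= 0

def repair():
    """Bounded exhaustive search over the one mutation site."""
    for op_name in OPS:
        pred = candidate(op_name)
        if all(pred(x) == expected for x, expected in oracle):
            return op_name
    return None

print(repair())  # → >=
```

BeAFix's pruning strategies would discard whole families of candidates without evaluating them; this sketch simply tries each one.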
DOI: 10.1109/ICSE-Companion52605.2021.00093
PASTA: Synthesizing object state transformers for dynamic software updates
作者: Zhao, Zelin and Jiang, Yanyan and Xu, Chang and Gu, Tianxiao and Ma, Xiaoxing
关键词: No keywords
Abstract
Object transformation (upgrading heap objects to their new-version counterparts) is a crucial step in dynamic software update (DSU). However, providing non-trivial object transformers for complex software updates can be difficult for software developers and upgrade maintainers. This paper presents the design and implementation of PASTA, a tool for automatic object transformer synthesis.
DOI: 10.1109/ICSE-Companion52605.2021.00094
Verifying determinism in sequential programs
作者: Mudduluru, Rashmi and Waataja, Jason and Millstein, Suzanne and Ernst, Michael D.
关键词: No keywords
Abstract
When a program is nondeterministic, it is difficult to test and debug. Nondeterminism occurs even in sequential programs, for example when iterating over the elements of a hash table.
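The hash-table example mentioned above is easy to reproduce in Python: the iteration order of a set of strings depends on string hashes, which are randomized between interpreter runs, so any output derived from it can change from run to run. A minimal sketch:

```python
# The hash-table pitfall illustrated in Python: iterating a set of strings
# depends on hash values, which are randomized between interpreter runs
# (PYTHONHASHSEED), so code relying on that order is nondeterministic.

names = {"carol", "alice", "bob"}

# Nondeterministic across runs: the order may differ on the next execution.
joined_unstable = ",".join(names)

# Deterministic fix: impose an explicit order before iterating.
joined_stable = ",".join(sorted(names))
print(joined_stable)  # → alice,bob,carol
```

Sorting (or any explicit order) restores determinism, which is the kind of fix a determinism checker would steer developers toward.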
DOI: 10.1109/ICSE-Companion52605.2021.00095
Dataset to study indirectly dependent documentation in GitHub repositories
作者: Sondhi, Devika and Gupta, Avyakt and Purandare, Salil and Rana, Ankit and Kaushal, Deepanshu and Purandare, Rahul
关键词: GitHub study, commits dataset
Abstract
In this research work, we highlight the importance of regularly updating software documentation. For this purpose, we analyzed function documentation that indirectly depends on other functions. This artifact provides scripts to extract the data, along with the final dataset containing observations obtained by manually annotating the extracted data. The details of this work may be found in the paper appearing in the technical track, titled ‘On Indirectly Dependent Documentation in the Context of Code Evolution: A Study’.
DOI: 10.1109/ICSE-Companion52605.2021.00096
Unrealizable cores for reactive systems specifications: artifact
作者: Maoz, Shahar and Shalom, Rafi
关键词: No keywords
Abstract
This document describes the artifact that accompanies the ICSE’21 paper “Unrealizable Cores for Reactive Systems Specifications”. The artifact includes the specifications that were used in the experiments that are described in the paper. It further includes an executable that allows interested readers to reproduce these experiments and inspect their results. Additionally, the executable is applicable to any specification in Spectra format, which allows conducting similar experiments over any Spectra specification. We hope the artifact will be useful for researchers who are interested in reactive synthesis, specifically in different means to deal with unrealizable specifications.
DOI: 10.1109/ICSE-Companion52605.2021.00097
Replication package for input algebras
作者: Gopinath, Rahul and Nemati, Hamed and Zeller, Andreas
关键词: debugging, faults, testing
Abstract
Grammar-based fuzzers are effective and efficient. They can produce an infinite number of syntactically valid test inputs, which can be used to explore the input space without bias. However, it is notoriously difficult to generate focused inputs that induce a specific behavior, such as a failure, without affecting the fuzzer's effectiveness. This is the fuzzer taming problem. In our paper Input Algebras, we show how one can specialize the grammar towards inclusion or exclusion of specific patterns and their arbitrary boolean combinations. The resulting specialized grammars can be used both for focused fuzzing and as validators that indicate the presence or absence of specific behavior-inducing input patterns. In our evaluation on real-world bugs, we show that specialized grammars are accurate both in producing and in validating targeted inputs. We also provide a completely worked-out Jupyter notebook that explains our algorithms in detail, along with a sufficient number of examples. Further, we describe in detail how to replicate our evaluation.
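A minimal grammar-based fuzzer makes the setting concrete. The grammar below and the depth-bounded expansion strategy are toy stand-ins; the paper's input algebras go much further, specializing grammars so that every derivable string includes or excludes given patterns and their boolean combinations.

```python
import random

# A minimal grammar-based fuzzer for arithmetic expressions, to illustrate
# the setting of the paper. The grammar and expansion strategy are invented
# for the example; they are not the paper's algorithm.

GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["(", "<expr>", ")"], ["<digit>"]],
    "<digit>": [[d] for d in "0123456789"],
}

def generate(grammar, symbol="<expr>", depth=0, rng=random):
    """Expand a nonterminal; terminals are returned as-is."""
    if symbol not in grammar:
        return symbol
    alts = grammar[symbol]
    # Past a depth budget, prefer the shortest alternative to force termination.
    alt = rng.choice(alts) if depth < 8 else min(alts, key=len)
    return "".join(generate(grammar, s, depth + 1, rng) for s in alt)

rng = random.Random(0)
sample = [generate(GRAMMAR, rng=rng) for _ in range(5)]
# Every generated input is a syntactically valid arithmetic expression.
print(all(isinstance(eval(s), int) for s in sample))  # → True
```

Specializing such a grammar, in the paper's sense, amounts to transforming it so that the generated strings are guaranteed to contain (or avoid) a given behavior-inducing pattern.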
DOI: 10.1109/ICSE-Companion52605.2021.00098
Artifact for enhancing genetic improvement of software with regression test selection
作者: Guizzo, Giovani and Petke, Justyna and Sarro, Federica and Harman, Mark
关键词: No keywords
Abstract
We present in this document the basic information needed to download, unpack, and then interpret the instructions we provide as requested in the ICSE 2021 Artifact Submission Guidelines. The artifact contains all the subject programs, scripts, tools, results, and a series of guidelines on how to use them.
DOI: 10.1109/ICSE-Companion52605.2021.00099
CodeShovel: a reusable and available tool for extracting source code histories
作者: Grund, Felix and Chowdhury, Shaiful and Bradley, Nick C. and Hall, Braxton and Holmes, Reid
关键词: artifact, code histories, software evolution
Abstract
Being able to accurately understand how source code evolved is fundamentally important for both software engineers and researchers. Our ICSE 2021 Research Paper CodeShovel: Constructing Method-Level Source Code Histories describes a novel approach for quickly uncovering these method histories. The approach, codified in the CodeShovel tool, is available for researchers and practitioners alike to use and extend. It is available both as a public web service that can be used interactively or through a REST API and as a stand-alone Java component. This document details how to install and use CodeShovel, although all pertinent details are available online enabling CodeShovel to be reused as desired.
DOI: 10.1109/ICSE-Companion52605.2021.00100
Survey instruments for “how was your weekend?”: software development teams working from home during COVID-19
作者: Miller, Courtney and Rodeghero, Paige and Storey, Margaret-Anne and Ford, Denae and Zimmermann, Thomas
关键词: COVID-19, productivity, software development, work from home
Abstract
This document describes the survey instruments from our paper “How Was Your Weekend?” Software Development Teams Working From Home During COVID-19 as well as how to access them.
DOI: 10.1109/ICSE-Companion52605.2021.00101
Research tools, survey responses, and interview analysis from a case study of onboarding software teams at Microsoft
作者: Ju, An and Sajnani, Hitesh and Kelly, Scot and Herzig, Kim
关键词: No keywords
Abstract
The artifact is publicly available at https://doi.org/10.5281/zenodo.4455936 [1]. This repository contains the supplementary material for the paper [2]. It contains:
• Interview guides
• Surveys
• Anonymized survey responses
• Interview analysis and quotes
We have removed open-ended questions from survey responses to protect participants' privacy.
DOI: 10.1109/ICSE-Companion52605.2021.00102
IoT development in the wild: bug taxonomy and developer challenges
Authors: Makhshari, Amir and Mesbah, Ali
Keywords: empirical study, internet of things, mining software repositories, software engineering
Abstract
IoT systems are rapidly adopted in various domains, from embedded systems to smart homes. Despite their growing adoption and popularity, there has been no thorough study to understand IoT development challenges from the practitioners’ point of view. We provide the first systematic study of bugs and challenges that IoT developers face in practice, through a large-scale empirical investigation. We highlight frequent bug categories and their root causes, correlations between them, and common pitfalls and challenges that IoT developers face. We recommend future directions for IoT areas that require research and development attention.
DOI: 10.1109/ICSE-Companion52605.2021.00103
Smart contract security: A practitioners’ perspective: the artifact of a paper accepted in the 43rd IEEE/ACM international conference on software engineering (ICSE 2021)
Authors: Wan, Zhiyuan and Xia, Xin and Lo, David and Chen, Jiachi and Luo, Xiapu and Yang, Xiaohu
Keywords: No keywords
Abstract
Blockchain is a distributed ledger that provides an open, decentralized, and fault-tolerant transaction mechanism. Blockchain technology has attracted considerable attention from both industry and academia since it was originally introduced for Bitcoin [7] to support the exchange of cryptocurrency. Blockchain technology has evolved to facilitate general-purpose computations in a wide range of decentralized applications. Smart contract technology is one appealing decentralized application that enables computations on top of a blockchain.
DOI: 10.1109/ICSE-Companion52605.2021.00104
Semantic patches for adaptation of JavaScript programs to evolving libraries
Authors: Nielsen, Benjamin Barslev and Torp, Martin Toldam and Møller, Anders
Keywords: No keywords
Abstract
The artifact contains the JSFIX tool along with instructions for how to replicate all of the experiments from the paper. The purpose of the artifact is to allow the reader to check that the results presented in the paper are correct.
DOI: 10.1109/ICSE-Companion52605.2021.00105
PLELog: semi-supervised log-based anomaly detection via probabilistic label estimation
Authors: Yang, Lin and Chen, Junjie and Wang, Zan and Wang, Weijing and Jiang, Jiajun and Dong, Xuyuan and Zhang, Wenbin
Keywords: No keywords
Abstract
PLELog is a novel approach for log-based anomaly detection via probabilistic label estimation. It is designed to effectively detect anomalies in unlabeled logs while avoiding the manual labeling effort needed to generate training data. We embed the semantic information of log events as fixed-length vectors and apply HDBSCAN to automatically cluster log sequences. We then propose a probabilistic label estimation approach to automatically label log sequences, which reduces the noise introduced by labeling errors, and feed the “labeled” instances into an attention-based GRU network for training. We conducted an empirical study to evaluate the effectiveness of PLELog on two open-source log datasets (HDFS and BGL). The results demonstrate the effectiveness of PLELog. In particular, PLELog has been applied to two real-world systems from a university and a large corporation, and the results further demonstrate its practicability.
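The probabilistic labeling step can be illustrated with a minimal stdlib-only sketch. This is not PLELog’s implementation: the clustering (done with HDBSCAN in the paper) is assumed as given, and the function and variable names are hypothetical.

```python
def estimate_labels(clusters, known_normal):
    """Assign each log sequence a probabilistic anomaly label based on
    how many known-normal sequences its cluster contains."""
    labels = {}
    for members in clusters.values():
        normal = sum(1 for seq in members if seq in known_normal)
        p_anomalous = 1.0 - normal / len(members)
        for seq in members:
            labels[seq] = p_anomalous  # soft label fed to the downstream classifier
    return labels
```

A cluster dominated by known-normal sequences yields labels near 0, which down-weights noisy instances during training of the GRU network.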
DOI: 10.1109/ICSE-Companion52605.2021.00106
White-box performance-influence models: a profiling and learning approach (replication package)
Authors: Weber, Max and Apel, Sven and Siegmund, Norbert
Keywords: No keywords
Abstract
These artifacts accompany the study and implementation of the paper ‘White-Box Performance-Influence Models: A Profiling and Learning Approach’. In this document, we describe the idea and process of building white-box performance models for configurable software systems. Specifically, we describe the general steps and tools that we used to implement our approach, the data we obtained, and the evaluation setup. We further list the available artifacts, such as raw measurements, configurations, and scripts, in our Software Heritage repository.
DOI: 10.1109/ICSE-Companion52605.2021.00107
UI-based flaky tests datasets
Authors: Romano, Alan and Song, Zihe and Grandhi, Sampath and Yang, Wei and Wang, Weihang
Keywords: No keywords
Abstract
This artifact submission contains the dataset used in the paper accepted in the ICSE 2021 Technical Track, “An Empirical Analysis of UI-based Flaky Tests”. The dataset contains 235 samples of flaky UI tests composed of 152 samples from web projects and 83 samples from mobile projects.
DOI: 10.1109/ICSE-Companion52605.2021.00108
Replication package for representation of developer expertise in open source software
Authors: Dey, Tapajit and Karnauch, Andrey and Mockus, Audris
Keywords: No keywords
Abstract
This describes the artifact associated with the article “Representation of Developer Expertise in Open Source Software” at the International Conference on Software Engineering 2021. The aim of the original paper was to define a feasible representation of a developer’s expertise in specific focus areas of software development by gauging their fluency with different sets of APIs. The artifact is made available through Zenodo under the CC-BY-4.0 license at https://doi.org/10.5281/zenodo.4457107. The README file has detailed instructions on how to replicate the results presented in the original paper. The artifact includes the input dataset (with the developers’ names and email addresses replaced by their corresponding SHA1 digest values to protect privacy) and all the associated scripts. The trained Doc2Vec models are also included in the artifact. These models can be used to obtain the Skill Space representations of developers, projects, and APIs without having to re-train the model.
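As a sketch of how such Skill Space embeddings might be queried, a developer’s vector can be compared to API vectors via cosine similarity. The 3-d vectors below are made up for illustration; real vectors come from the artifact’s trained Doc2Vec models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings; a larger similarity suggests greater
# fluency with that API's focus area.
dev = [0.9, 0.1, 0.0]
api_data = [0.8, 0.2, 0.1]
api_crypto = [0.0, 0.1, 0.9]
```

Ranking APIs by `cosine(dev, api)` gives the kind of focus-area comparison the Skill Space representation supports.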
DOI: 10.1109/ICSE-Companion52605.2021.00109
Abacus: a tool for precise side-channel analysis
Authors: Bao, Qinkun and Wang, Zihao and Larus, James R. and Wu, Dinghao
Keywords: No keywords
Abstract
Side-channel vulnerabilities can inadvertently leak sensitive information. In this paper, we introduce the usage of Abacus, a tool that can analyze secret-dependent control-flow and secret-dependent data-access leakages in binary programs. Unlike previous tools that can only identify leakages, it can also estimate the amount of information leaked at each leakage site. Severe vulnerabilities usually leak more information, so these estimates allow developers to triage the patching effort for side-channel vulnerabilities. This paper helps users make use of Abacus and reproduce our previous results. Abacus is available at https://github.com/s3team/Abacus.
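As a rough illustration of the quantification idea only (not Abacus’s actual algorithm, which analyzes binary execution traces), leakage at a site can be viewed as the reduction in uncertainty about the secret that one observation provides:

```python
import math

def leaked_bits(num_secrets, num_consistent):
    """Bits of information an observation leaks: log2 of how much the
    observation narrows down the set of equally likely secret values."""
    return math.log2(num_secrets / num_consistent)

# A secret-dependent branch that splits 256 equally likely byte values
# into two halves of 128 leaks one bit per observation.
```

Under this view, a site whose observation pins the secret down to a single value out of 256 leaks the full 8 bits, which is why such sites deserve patching priority.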
DOI: 10.1109/ICSE-Companion52605.2021.00110
An open dataset for onboarding new contributors: empirical study of openstack ecosystem
Authors: Foundjem, Armstrong and Eghan, Ellis E. and Adams, Bram
Keywords: available, open data, open science, replication, transparency, verifiable
Abstract
This dataset provides the qualitative and quantitative data of our mixed-method empirical study of onboarding in the OpenStack software ecosystem (SECO). First, we carried out a SECO-level participant observation study of 72 new contributors during a 2-day in-person OpenStack onboarding event, yielding a rich set of qualitative data; these 14 files amount to 60% of the entire dataset. Second, we quantitatively validated the extent to which SECOs achieve benefits such as diversity, productivity, and quality by mining the code changes, reviews, and issues of 1281 contributors with and without OpenStack onboarding experience. Our quantitative dataset includes nine files, about 40% of the entire dataset, obtained by mining new contributors’ codebase activities from four OpenStack repositories. We also make available the scripts that we used to extract and analyze this dataset. By providing this data, we are claiming the “Available” badge, and our data are online in a publicly archived repository at Zenodo: DOI: 10.5281/zenodo.4457683
DOI: 10.1109/ICSE-Companion52605.2021.00111
A survey on method naming standards: questions and responses artifact
Authors: Alsuhaibani, Reem S. and Newman, Christian D. and Decker, Michael J. and Collard, Michael L. and Maletic, Jonathan I.
Keywords: identifier naming conventions, identifier naming standards, method naming
Abstract
The artifacts of a large survey (1,100+ responses) of professional software developers concerning standards for naming source code methods are presented. The artifact consists of the survey questions along with all the responses from participants, allowing other researchers to examine and study the responses to the survey.
DOI: 10.1109/ICSE-Companion52605.2021.00112
A dataset of vulnerable code changes of the chromium OS project
Authors: Paul, Rajshakhar and Turzo, Asif Kamal and Bosu, Amiangshu
Keywords: code review, dataset, vulnerability
Abstract
This paper presents an empirically built and validated dataset of code reviews from the Chromium OS project that either identified or missed security vulnerabilities. The dataset includes a total of 890 vulnerable code changes categorized based on the CWE specification and is publicly available at: https://zenodo.org/record/4539891
DOI: 10.1109/ICSE-Companion52605.2021.00113
PyART: Python API recommendation in real-time
Authors: He, Xincheng and Xu, Lei and Zhang, Xiangyu and Hao, Rui and Feng, Yang and Xu, Baowen
Keywords: API recommendation, Python, context analysis, data flow analysis, real-time recommendation
Abstract
This is the research artifact of the paper titled ‘PyART: Python API Recommendation in Real-Time’. PyART is a real-time API recommendation tool for Python with two main functions: data-flow analysis and real-time API recommendation for both incomplete and complete Python code contexts. Compared to classical tools, PyART has two important particularities: it works in real-time recommendation scenarios, and it provides data-flow analysis and API recommendation for a dynamic language. Classical tools often fail to perform static analysis in real-time recommendation scenarios because the syntax is incomplete, and the dynamic features of Python pose additional challenges to type inference and API recommendation. Unlike classical tools, PyART derives optimistic data-flow information that is neither sound nor complete but is sufficient for API recommendation and cost-effective to collect, and it provides real-time API recommendations based on novel candidate collection, context analysis, and feature learning techniques. The artifact evaluation experiments of PyART cover three main aspects: data-flow analysis, intra-project API recommendation, and cross-project API recommendation. We assume users of the artifact are able to use the Linux Ubuntu operating system.
DOI: 10.1109/ICSE-Companion52605.2021.00114
Scalable quantitative verification for deep neural networks
Authors: Baluta, Teodora and Chua, Zheng Leong and Meel, Kuldeep S. and Saxena, Prateek
Keywords: No keywords
Abstract
Despite the functional success of deep neural networks (DNNs), their trustworthiness remains a crucial open challenge. To address this challenge, both testing and verification techniques have been proposed. But these existing techniques provide either scalability to large networks or formal guarantees, not both. In this paper, we propose a scalable quantitative verification framework for deep neural networks, i.e., a test-driven approach that comes with formal guarantees that a desired probabilistic property is satisfied. Our technique performs tests until the soundness of a formal probabilistic property can be proven. It can be used to certify properties of both deterministic and randomized DNNs. We implement our approach in a tool called PROVERO and apply it in the context of certifying adversarial robustness of DNNs. In this context, we first show a new attack-agnostic measure of robustness which offers an alternative to the purely attack-based methodology of evaluating robustness reported today. Second, PROVERO provides certificates of robustness for large DNNs, where existing state-of-the-art verification tools fail to produce conclusive results. Our work paves the way forward for verifying properties of distributions captured by real-world deep neural networks, with provable guarantees, even where testers only have black-box access to the neural network.
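The “test until provable” idea can be sketched with a one-shot Hoeffding bound: sample the bad event (e.g., a random perturbation flipping the label) and certify that its true probability is below a threshold with high confidence. PROVERO itself uses tighter sequential hypothesis tests; the names `sample_fn`, `theta`, and `alpha` below are illustrative.

```python
import math

def certify_below(sample_fn, theta, alpha=0.01, n=1000):
    """Certify P[bad event] < theta with confidence 1 - alpha, using a
    one-shot Hoeffding bound over n Bernoulli samples from sample_fn."""
    hits = sum(sample_fn() for _ in range(n))
    p_hat = hits / n
    margin = math.sqrt(math.log(1 / alpha) / (2 * n))  # Hoeffding deviation bound
    return p_hat + margin < theta
```

With n = 1000 and alpha = 0.01 the margin is about 0.048, so an event never observed in 1000 black-box samples can be certified below any theta above roughly 0.05.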
DOI: 10.1109/ICSE-Companion52605.2021.00115
Team-oriented consistency checking of heterogeneous engineering artifacts
Authors: Tröls, Michael Alexander
Keywords: collaboration, engineering artifacts, global consistency checking
Abstract
Consistency checking of interdependent heterogeneous engineering artifacts, such as requirements, specifications, and code, is a challenging task in large-scale engineering projects. The lack of team-oriented solutions allowing a multitude of project stakeholders to collaborate in a consistent manner is thus becoming a critical problem. In this context, this work proposes an approach for team-oriented consistency checking of collaboratively developed heterogeneous engineering artifacts.
DOI: 10.1109/ICSE-Companion52605.2021.00116
RPT: effective and efficient retrieval of program translations from big code
Authors: Chen, Binger and Abedjan, Ziawasch
Keywords: No keywords
Abstract
Program translation is in growing demand in software engineering. Manual program translation requires programming expertise in both the source and target languages. One way to automate this process is to make use of the big data of programs, i.e., Big Code; in particular, one can search for program translations in Big Code. However, existing code retrieval techniques are not designed for cross-language code retrieval, and other data-driven approaches require human effort to construct cross-language parallel datasets for training translation models. In this paper, we present Rpt, a novel code translation retrieval system. We propose a lightweight but informative program representation, which can be generalized to all imperative PLs. Furthermore, we present our index structure and hierarchical filtering mechanism for efficient code retrieval from a Big Code database.
DOI: 10.1109/ICSE-Companion52605.2021.00117
Finding metamorphic relations for scientific software
Authors: Lin, Xuanyi and Peng, Zedong and Niu, Nan and Wang, Wentao and Liu, Hui
Keywords: metamorphic relation identification, scientific software, storm water management model (SWMM)
Abstract
Metamorphic testing uncovers defects by checking whether a relation holds among multiple software executions. These relations are known as metamorphic relations (MRs). For scientific software operating in a large multi-parameter input space, identifying MRs that determine the simultaneous changes among multiple variables is challenging. In this poster, we propose a fully automatic approach to classifying input and output variables from scientific software’s user manual, mining these variables’ associations from the user forum to generate MRs, and validating the MRs with existing regression tests. Preliminary results of our end-to-end MR support for the Storm Water Management Model (SWMM) are reported.
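A generic example of checking one simple MR is sketched below. This is only an illustration: SWMM’s real MRs involve coordinated changes across multiple input parameters, and the permutation relation and helper name here are assumptions.

```python
import math
import random

def holds_permutation_mr(f, inputs, trials=50, seed=0):
    """Check the metamorphic relation 'permuting the inputs leaves the
    output unchanged' over several random shuffles of the input list."""
    rng = random.Random(seed)  # deterministic shuffles for reproducibility
    base = f(inputs)
    for _ in range(trials):
        shuffled = inputs[:]
        rng.shuffle(shuffled)
        if not math.isclose(f(shuffled), base):
            return False  # relation violated: a candidate defect
    return True
```

A function such as the mean satisfies this MR, while an order-sensitive function violates it, so a violation flags executions worth inspecting even without a ground-truth oracle.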
DOI: 10.1109/ICSE-Companion52605.2021.00118
Understanding language selection in multi-language software projects on GitHub
Authors: Li, Wen and Meng, Na and Li, Li and Cai, Haipeng
Keywords: evolution, functionality relevance, language selection, multi-language software
Abstract
There are hundreds of programming languages available for software development today. As a result, modern software is increasingly developed in multiple languages. In this context, there is an urgent need for automated tools for multi-language software quality assurance. To that end, it is useful to first understand how languages are chosen by developers in multi-language software projects. One intuitive perspective towards the understanding would be to explore the potential functionality relevance of those choices. With a plethora of publicly hosted multi-language software projects available on GitHub, we were able to obtain thousands of popular, relevant repositories across 10 years from 2010 to 2019 to enable the exploration. We start by estimating the functionality domain of each project through topic modeling, followed by studying the statistical correlation between these domains and language selection over all the sample projects through association mining. We proceed with an evolutionary characterization of these projects to provide a longitudinal view of how the association has changed over the years. Our findings offer useful insights into the rationale behind developers’ choices of language combinations in multi-language software construction.
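The association step can be sketched as computing support and confidence for a rule “domain ⇒ language” over a sample of projects. This is a toy illustration: the project schema below is made up, and the study mines far richer repository data.

```python
def rule_stats(projects, domain, language):
    """Support and confidence of the association rule domain => language."""
    with_domain = [p for p in projects if domain in p["domains"]]
    both = [p for p in with_domain if language in p["languages"]]
    support = len(both) / len(projects)
    confidence = len(both) / len(with_domain) if with_domain else 0.0
    return support, confidence
```

Tracking these statistics per year over the sampled repositories gives the longitudinal view of how domain-language associations change.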
DOI: 10.1109/ICSE-Companion52605.2021.00119
We’ll fix it in post: what do bug fixes in video game update notes tell us?
Authors: Truelove, Andrew and de Almeida, Eduardo Santana and Ahmed, Iftekhar
Keywords: No keywords
Abstract
Bugs that persist into releases of video games can have a negative impact on both developers and users. Despite these impacts, it is common for games to release with bugs that are fixed through subsequent updates. In 2014, for example, over a third of big-budget games “released on Xbox One, Wii U and PS4” received an update within 24 hours of the game’s initial release [5]. The differences between the development of games and the development of traditional software could explain the appearance of these bugs. Prior research has indicated that there is considerable difficulty in comprehensively testing all aspects of a video game [4], [7]. Developers have difficulty writing comprehensive tests because games can have a significantly larger number of possible user interactions than other types of software [7]. As a result, many games release with undiscovered bugs that only reveal themselves once customers begin playing the game [5].
It has become common practice for developers to apply updates to games in order to fix missed bugs [3]. These updates are often accompanied by notes that describe the changes to the game included in the update [8]. However, some bugs recur across multiple updates: there are cases in which developers attempted to fix a bug in one update, only for that bug to reappear in the game, necessitating another attempted fix in a later update [1].
Previous research has focused on creating a taxonomy for video game bugs [2]. To address shortcomings of this prior taxonomy and to attain a deeper understanding of the types of bugs in games, we expand the taxonomy of bug types. We analyzed 12,122 bug fixes taken from 723 updates for 30 popular games on the Steam platform. We categorized these bug fixes using our taxonomy of bug types. We then analyzed the frequency at which the different bug types appear in the update notes and investigated which types of bugs recur more often over multiple updates. Additionally, we investigated which types of bugs most frequently appear in urgent updates, or hotfixes, as the bugs that appear in these updates are more likely to have a severe negative impact on users [8]. Finally, we surveyed game developers on their experience with these different types of bugs as well as the challenges and techniques involved in fixing them.
For the recurrence analysis, we used an automated approach to identify potential recurring bug fixes, then manually evaluated these fixes to find the true recurring bugs. We first performed cosine similarity analysis [6] between bug fix lines. For each bug fix line from a game’s updates, we compared that line to each line in all subsequent updates for the game. If two lines had a similarity score of at least 90%, they were treated as a potential match and grouped together for manual review.
To determine bug severity, we applied a combination of metrics to identify urgent updates, including some metrics employed by past researchers [3]. Once we identified all the urgent updates in our data, we flagged all the bug fix lines that appeared in an urgent update. An urgent update is generally intended to fix “problems that are deemed critical enough to not be left unfixed until a regular-cycle update” [3]. We calculated the severity of each bug type as the proportion of its bug fixes that appeared in an update marked as urgent: for each bug type, we divided the frequency of the bug type in urgent updates by the frequency of the bug type in all updates.
Fig. 1. Frequency of Bug Types Across All Updates
The most frequently occurring bug types were Information, Game Graphics, and Action (Figure 1). The bug types that recurred most frequently over multiple updates were Crash, Game Graphics, and Triggered Event (Figure 2). Based on the update data, the bug type with the highest severity was Crash. The next most severe bug types were Object Persistence and Triggered Event (Figure 3). Meanwhile, from the survey, the bug types deemed to have the greatest severity were Crash, Action, and Exploit (Figure 4).
Fig. 2. Frequency of Recurring Bug Types
Fig. 3. Severity of Bug Types
We received 47 applicable responses to our survey. According to respondents, the aspects of game development most frequently linked to bug recurrence were testing, game design, and code quality. With testing, responses ranged from dealing with a lack of testing in general to dealing with a lack of certain kinds of testing, such as automated testing, integration testing, and cross-platform testing. Additionally, the most frequently mentioned challenges to identifying and fixing bugs in video games were inadequate testing, reproducing bugs, and code quality. With respect to code quality, there was a particular emphasis on the importance of well-written code that does not cause conflicts elsewhere in the game.
Fig. 4. Agreement Level to Whether Selected Bug Types are Likely to Have a Severe Negative Impact on Game Experience
These results can help game developers identify which types of bugs to pay more attention to when testing and fixing bugs. Developers can also use these results to adjust practices in the game development process in order to better prevent, identify, and fix bugs. Researchers can take advantage of these results to develop tools or methods that target specific bug types that are more likely to severely impact the game. There is room for future work to identify aspects of game development that might benefit from specialized tools or methods addressing some of the challenges reported in the survey. For example, multi-component interaction testing is one area that could benefit from future research. Our study takes the first step towards fulfilling these goals.
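The recurrence-matching step can be sketched with a stdlib bag-of-words cosine. The paper does not specify its exact text representation, so the whitespace tokenization and function names below are assumptions.

```python
import math
from collections import Counter

def cosine_sim(line_a, line_b):
    """Cosine similarity between two bug-fix lines as word-count vectors."""
    va, vb = Counter(line_a.lower().split()), Counter(line_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

def potential_recurrences(updates, threshold=0.9):
    """Pair each fix line with >=90%-similar lines from later updates,
    producing candidate recurring bugs for manual review."""
    matches = []
    for i, earlier in enumerate(updates):
        for line in earlier:
            for later in updates[i + 1:]:
                matches += [(line, other) for other in later
                            if cosine_sim(line, other) >= threshold]
    return matches
```

Each candidate pair is then reviewed by hand, mirroring the paper’s automated-then-manual pipeline; the severity of a bug type is computed separately as its frequency in urgent updates divided by its frequency in all updates.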
DOI: 10.1109/ICSE-Companion52605.2021.00120
Unburdening onboarding in software product lines
Authors: Medeiros, Raul
Keywords: concept maps, onboarding, recommender systems, software product lines
Abstract
The number of studies focusing on onboarding in software organizations has increased significantly in recent years. However, current literature overlooks onboarding in Software Product Lines (SPLs). SPLs have proven effective in managing the increasing variability of industry software and enabling systematic reuse within a product family. Despite these benefits, SPLs are complex and exhibit particular characteristics that distinguish them from traditional software. Due to these peculiarities, SPLs require a tailor-made onboarding process, and assistance tools might help. In this dissertation, we propose assistance tools (i.e., tools built on top of the software project that help learners understand and develop knowledge) as a means of helping newcomers during onboarding in SPLs.
DOI: 10.1109/ICSE-Companion52605.2021.00121
Extractive multi product-line engineering
Authors: Rosiak, Kamil
Keywords: clone detection, multi product-line, refactoring, variability mining
Abstract
Cloning is a general approach to creating new functionality within variants as well as new system variants. It is a fast, flexible, intuitive, and economical way to evolve systems in the short run; in the long run, however, the maintenance effort increases. A common solution to this problem is the extraction of a product line from a set of cloned variants. This process requires a detailed analysis of the variants to extract variability information. However, clones within a variant are usually not considered in the process, even though they are also a cause of unsustainable software. This thesis proposes an extractive multi product-line engineering approach to re-establish the sustainable development of software variants. We propose an approach to re-engineer intra-system and inter-system clones into reusable, configurable components stored in an integrated platform, and to synthesize a matching multilayer feature model.
DOI: 10.1109/ICSE-Companion52605.2021.00122
Group recommendation techniques for feature modeling and configuration
Authors: Le, Viet-Man
Keywords: configuration, feature models, group decision making, group-based recommendation, software product line
Abstract
In large-scale feature models, feature modeling and configuration processes are highly expected to be done by a group of stakeholders. In this context, recommendation techniques can increase the efficiency of feature-model design and find optimal configurations for groups of stakeholders. Existing studies show plenty of issues concerning feature model navigation support, group members’ satisfaction, and conflict resolution. This study proposes group recommendation techniques for feature modeling and configuration on the basis of addressing the mentioned issues.
DOI: 10.1109/ICSE-Companion52605.2021.00123
A proposal to systematize introducing DevOps into the software development process
Authors: de Aguiar Monteiro, Luciano
Keywords: DevOps, maturity model, software development
Abstract
The software development industry has been evolving with new development standards and service delivery models. Agile methodologies have reached their completion with DevOps, thereby increasing the quality of the software and creating greater speed in delivery. However, there is a gap regarding the formalization of its adoption, and doubts about its implementation remain. My hypothesis is that systematizing the introduction of DevOps into the software development process and defining the roles of DevOps team members may well make this process quicker to implement, thus reducing conflicts between the teams. As part of the investigation of this hypothesis, the results of the research will be applied in practical development environments, i.e., in a Technology Agency of a Brazilian state government and at the Brazilian company Neurotech, in order to evaluate its effectiveness using metrics appropriate for DevOps environments.
DOI: 10.1109/ICSE-Companion52605.2021.00124
A model using agile methodologies for defining metrics to be used by the public sector in Brazil to set remuneration for outsourced software development
Authors: de Carvalho Almeida, Washington Henrique
Keywords: agile, contracts, metrics, public sector, software development
Abstract
The process of contracting software factories within the scope of the Federal Public Administration (APF, in Portuguese) in Brazil has undergone changes due to legislative alterations, and a model has been proposed to improve the rendering of services with regard to delivering results. Software factory contracts based on predictive processes and on Function Point metrics are the target of criticism and have issues in achieving results, as verified in previous research. The initial objective of the proposed study is to define a process model for formulating metrics that can be used in agile contracts, as opposed to standardized Function Points, which have already proved quite problematic and difficult for the contractor to manage. Thus, in line with the theme of agile contracts for software development companies, the proposed study will seek to better understand the problem of software development and maintenance by public agencies by means of contracts using agile methodologies and appropriate metrics for remunerating these services.
DOI: 10.1109/ICSE-Companion52605.2021.00125
Learning to boost the efficiency of modern code review
Authors: Heumüller, Robert
Keywords: automated software engineering, deep learning, modern code review
Abstract
Modern Code Review (MCR) is a standard in all kinds of organizations that develop software. MCR pays for itself through perceived and proven benefits in quality assurance and knowledge transfer. However, the time invested in MCR is generally substantial. The goal of this thesis is to boost the efficiency of MCR by developing AI techniques that can partially replace or assist human reviewers. The envisioned techniques differ from existing MCR-related AI models in that we interpret these challenges as graph-learning problems. This should allow us to use state-of-the-art algorithms from that domain to learn coding and reviewing standards directly from existing projects. The required training data will be mined from online repositories, and the experiments will be designed to use standard, quantitative evaluation metrics. This research proposal defines the motivation, research questions, and solution components for the thesis, and gives an overview of the relevant related work.
DOI: 10.1109/ICSE-Companion52605.2021.00126
Towards a testing tool that learns to test
Authors: Rodriguez-Valdes, Olivia
Keywords: GUI testing, reinforcement learning, scriptless testing
Abstract
We will study the application of reinforcement learning techniques in automated GUI testing. Using the scriptless GUI testing tool TESTAR as a vehicle, we will focus our research on improving the action selection mechanisms that guide the learning process, and we will tailor rewards towards the goal of enhancing the effectiveness and efficiency of testing.
DOI: 10.1109/ICSE-Companion52605.2021.00127
A functional paradigm for capacity planning of cloud computing workloads
Authors: Pereira, Carlos Diego Cavalcanti
Keywords: capacity planning, cloud capacity planning, cloud computing
Abstract
Cloud computing allows the shared use of computational resources, with costs based on operational cycles. Given the technical and financial constraints of cloud computing projects, capacity planning processes help to identify the resources a given workload needs in order to operate properly across different contexts. The current industrial paradigm bases capacity planning of cloud computing workloads on historical usage precedents. In addition, current practices of cloud capacity planning do not establish how accurate the resulting plans are. This work aims to establish a functional paradigm for capacity planning of cloud computing workloads, which includes an architectural meta-model, a method for functional and architectural classification of workloads, and a sizing equation with a confidence index.
DOI: 10.1109/ICSE-Companion52605.2021.00128
Interactive graph exploration for comprehension of static analysis results
Authors: Toledo, Rafael
Keywords: graph visualization, program comprehension, static analysis
Abstract
Static analysis results can be overwhelming depending on their complexity and their total number. Interactive graph visualization can help engineers explore the connections between different code entities while visually supporting insights about the code’s behaviour. In our doctoral research, we aim to investigate how a graphical model of a program and its analysis results can support the engineer’s understanding. We expect that a graphical interface can ease the diagnosis of faults and reduce the cognitive load required to comprehend reported control and data flows in the codebase.
DOI: 10.1109/ICSE-Companion52605.2021.00129
Data analytics and machine learning methods, techniques and tool for model-driven engineering of smart IoT services
Authors: Moin, Armin
Keywords: automl, data analytics, domain-specific modeling, internet of things, machine learning, model-driven software engineering
Abstract
This doctoral dissertation proposes a novel approach to enhance the development of smart services for the Internet of Things (IoT) and smart Cyber-Physical Systems (CPS). The proposed approach offers abstraction and automation for the software engineering processes, as well as for the Data Analytics (DA) and Machine Learning (ML) practices, in an integrated and seamless manner. We implement and validate the proposed approach by extending an open source modeling tool, called ThingML. ThingML is a domain-specific language and modeling tool with code generation for the IoT/CPS domain. Neither ThingML nor any other IoT/CPS modeling tool supports DA/ML at the modeling level. Therefore, as the primary contribution of the doctoral dissertation, we add the necessary syntax and semantics concerning DA/ML methods and techniques to the modeling language of ThingML. Moreover, we support the APIs of several ML libraries and frameworks for the automated generation of the source code of the target software in Python and Java. Our approach supports both platform-independent and platform-specific models. Further, we assist in carrying out semi-automated DA/ML tasks by offering Automated ML (AutoML) in the background (in expert mode), and through model-checking constraints and hints at design time. Finally, we consider three use case scenarios from the domains of network security, smart energy systems, and energy exchange markets.
DOI: 10.1109/ICSE-Companion52605.2021.00130
Please don’t go: a comprehensive approach to increase women’s participation in open source software
Authors: Trinkenreich, Bianca
Keywords: career, diversity, gender, open source software, participation, success, women
Abstract
Women represent less than 24% of employees in the software development industry and experience various types of prejudice and bias. Despite various efforts to increase diversity and multi-gendered participation, women are even more under-represented in Open Source Software (OSS) projects. In my PhD, I investigate the following question: How can OSS communities increase women’s participation in their projects? I will identify different OSS career pathways and develop a holistic view of women’s motivations to join or leave OSS, as well as their definitions of success. Based on this empirical investigation, I will work together with the Linux Foundation to design attraction and retention strategies focused on women. Before and after implementing the strategies, I will conduct empirical studies to evaluate the state of the practice and understand the implications of the strategies.
DOI: 10.1109/ICSE-Companion52605.2021.00131
Speculative analysis for quality assessment of code comments
Authors: Rani, Pooja
Keywords: code comments, comment quality assessment, developer information needs, mining developer sources
Abstract
Previous studies have shown that high-quality code comments assist developers in program comprehension and maintenance tasks. However, the semi-structured nature of comments, unclear conventions for writing good comments, and the lack of quality assessment tools for all aspects of comments make their evaluation and maintenance a non-trivial problem. To achieve high-quality comments, we need a deeper understanding of code comment characteristics and the practices developers follow. In this thesis, we approach the problem of assessing comment quality from three different perspectives: what developers ask about commenting practices, what they write in comments, and how researchers support them in assessing comment quality. Our preliminary findings show that developers embed various kinds of information in class comments across programming languages. Still, they face problems in locating relevant guidelines to write consistent and informative comments, verifying the adherence of their comments to the guidelines, and evaluating the overall state of comment quality. To help developers and researchers in building comment quality assessment tools, we provide: (i) an empirically validated taxonomy of comment convention-related questions from various community forums, (ii) an empirically validated taxonomy of comment information types from various programming languages, (iii) a language-independent approach to automatically identify the information types, and (iv) a comment quality taxonomy prepared from a systematic literature review.
DOI: 10.1109/ICSE-Companion52605.2021.00132
Vulnerability detection is just the beginning
Authors: Elder, Sarah
Keywords: computer security, security management, software testing
Abstract
Vulnerability detection plays a key role in secure software development [1]–[4]. There are many different vulnerability detection tools and techniques to choose from, but insufficient information on which techniques to use and when. The goal of this research is to help managers and other decision-makers on software projects make informed choices about the use of different software vulnerability detection techniques, through empirical analysis of the efficiency and effectiveness of each technique. We will examine the relationships between the vulnerability detection technique used to find a vulnerability, the type of vulnerability found, the exploitability of the vulnerability, and the effort needed to fix the vulnerability, on two projects where we ensure all vulnerabilities found have been fixed. We will then examine how these relationships appear in open source software more broadly, where practitioners may use different vulnerability detection techniques or may not fix all vulnerabilities found due to resource constraints.
DOI: 10.1109/ICSE-Companion52605.2021.00133
High-quality automated program repair
Authors: Motwani, Manish
Keywords: No keywords
Abstract
Automatic program repair (APR) has recently gained attention because it proposes to fix software defects with no human intervention. To automatically fix defects, most APR tools use the developer-written tests to (a) localize the defect, and (b) generate and validate the automatically produced candidate patches based on the constraints imposed by the tests. While APR tools can produce patches that appear to fix the defect for 11–19% of the defects in real-world software, most of the patches produced are not correct or acceptable to developers because they overfit to the tests used during the repair process. This problem is known as the patch overfitting problem. To address this problem, I propose to equip APR tools with additional constraints derived from natural-language software artifacts such as bug reports and requirements specifications that describe the bug and intended software behavior but are not typically used by the APR tools. I hypothesize that patches produced by APR tools while using such additional constraints would be of higher quality. To test this hypothesis, I propose an automated and objective approach to evaluate the quality of patches, and propose two novel methods to improve the fault localization and developer-written test suites using natural-language software artifacts. Finally, I propose to use my patch evaluation methodology to analyze the effect of the improved fault localization and test suites on the quality of patches produced by APR tools for real-world defects.
DOI: 10.1109/ICSE-Companion52605.2021.00134
On the interplay between static and dynamic analysis for mining sandboxes
Authors: Costa, Francisco Handrick da
Keywords: Android platform, empirical studies and benchmarks, malware detection, mining sandboxes, software security
Abstract
Due to the popularization of Android and the full range of applications (apps) targeting this platform, many security issues have emerged, attracting the attention of researchers and practitioners. As such, many techniques for addressing Android security issues have appeared, including approaches for mining sandboxes. Previous research has compared Android test case generation tools for this specific goal. Our research aims to explore new techniques for mining sandboxes; in particular, we are interested in understanding the limits of both static and dynamic analysis in this process. Although the use of tests for mining sandboxes has been explored before, the potential to combine static and dynamic analysis has not been sufficiently investigated yet. That is, in this thesis we will investigate the hypothesis that combining static and dynamic analysis techniques improves the process of mining Android sandboxes.
DOI: 10.1109/ICSE-Companion52605.2021.00135
Reactive synthesis with spectra: a tutorial
Authors: Maoz, Shahar and Ringert, Jan Oliver
Keywords: No keywords
Abstract
Spectra is a formal specification language specifically tailored for use in the context of reactive synthesis, an automated procedure to obtain a correct-by-construction reactive system from its temporal logic specification. Spectra comes with the Spectra Tools, a set of analyses, including a synthesizer to obtain a correct-by-construction implementation, several means for executing the resulting controller, and additional analyses aimed at helping engineers write higher-quality specifications. This hands-on tutorial will introduce participants to the language and the tool set, using examples and exercises, covering an end-to-end process from specification writing to synthesis to execution. The tutorial may be of interest to software engineers and researchers who are interested in the potential applications of formal methods to software engineering.
DOI: 10.1109/ICSE-Companion52605.2021.00136
NLP for requirements engineering: tasks, techniques, tools, and technologies
Authors: Ferrari, Alessio and Zhao, Liping and Alhoshan, Waad
Keywords: No keywords
Abstract
Requirements engineering (RE) is one of the most natural language-intensive fields within the software engineering area. Therefore, several works have been developed over the years to automate the analysis of natural language artifacts relevant to RE, including requirements documents, but also app reviews, privacy policies, and social media content related to software products. Furthermore, the recent diffusion of game-changing natural language processing (NLP) techniques and platforms has also boosted the interest of RE researchers. However, a reference framework providing a holistic understanding of the field of NLP for RE is currently missing. Based on the results of a recent systematic mapping study, and stemming from a previous ICSE tutorial by one of the authors, this technical briefing gives an overview of NLP for RE tasks, available techniques, supporting tools, and NLP technologies. It is oriented to both researchers and practitioners, and will gently guide the audience towards a clearer view of how NLP can empower RE, providing pointers to representative works and specialised tools.
DOI: 10.1109/ICSE-Companion52605.2021.00137
The software challenges of building smart chatbots
Authors: Daniel, Gwendal and Cabot, Jordi
Keywords: No keywords
Abstract
Chatbots are becoming complex software artifacts that require a high-level of expertise in a variety of technical domains. This technical briefing will cover the software engineering challenges of developing high-quality chatbots. Attendees will be able to create their own bots leveraging the open source chatbot development platform Xatkit.
DOI: 10.1109/ICSE-Companion52605.2021.00138
Decoding grounded theory for software engineering
Authors: Hoda, Rashina
Keywords: grounded theory, research method, socio-technical grounded theory, software engineering
Abstract
Grounded Theory (GT), while becoming increasingly popular in software engineering, is also one of the most misunderstood, misused, and poorly presented and evaluated methods in software engineering. When applied well, GT results in dense and valuable explanations of how and why phenomena occur in practice. GT can be applied as a full research method leading to mature theories, and also in a limited capacity for data analysis within other methods, using its robust open coding and constant comparison procedures. This technical briefing will go through the social origins of GT, present examples of grounded theories developed in SE, discuss the key challenges SE researchers face, and provide a gentle introduction to socio-technical grounded theory, a variant of GT for software engineering research.
DOI: 10.1109/ICSE-Companion52605.2021.00139
Bayesian data analysis for software engineering
Authors: Torkar, Richard and Furia, Carlo A. and Feldt, Robert
Keywords: No keywords
Abstract
Slowly but surely, statistical practices in the empirical sciences are undergoing a complete makeover. Researchers in empirical software engineering, where statistics is likewise an essential tool, must become familiar with these new practices to ensure the rigor of their research methods and the soundness of their research results.
DOI: 10.1109/ICSE-Companion52605.2021.00140
Advances in code summarization
Authors: Desai, Utkarsh and Sridhara, Giriprasad and Tamilselvam, Srikanth G
Keywords: code summarization, neural networks
Abstract
Several studies have suggested that comments describing source code can help mitigate the burden of program understanding. However, software systems usually lack adequate comments, and sometimes, even when present, comments may be outdated. Researchers have addressed this issue by automatically generating comments from source code, a task referred to as code summarization. In this technical presentation, we take a deeper look at some of the significant recent works in the area of code summarization and how each of them takes a new perspective on this task, including methods leveraging RNNs, Transformers, graph neural networks, and reinforcement learning. We review individual methods in detail and discuss future avenues for this task.
DOI: 10.1109/ICSE-Companion52605.2021.00141
Technical briefing: Hands-on session on the development of trustworthy AI software
Authors: Vakkuri, Ville and Kemell, Kai-Kristian and Abrahamsson, Pekka
Keywords: artificial intelligence, design methods, ethics
Abstract
Following various real-world incidents involving both purely digital and cyber-physical Artificial Intelligence (AI) systems, AI ethics has become a prominent topic of discussion in both research and practice, accompanied by various calls for trustworthy AI systems. Failures are often costly, and many of them stem from issues that could have been avoided during development. For example, AI ethics issues, such as data privacy, are currently highly topical. However, implementing AI ethics in practice remains a challenge for organizations. Various guidelines have been published to aid companies in doing so, but these have not seen widespread adoption and may feel impractical. In this technical briefing, we discuss how to implement AI ethics. We showcase a method developed for this purpose, ECCOLA, which is based on academic research. ECCOLA is intended to make AI ethics more practical for developers, making it easier to incorporate into AI development to create trustworthy AI systems. It is a sprint-based, adaptive tool designed for agile development that facilitates reflection within the development team and helps developers turn ethics into tangible product backlog items.
DOI: 10.1109/ICSE-Companion52605.2021.00142