ICSE 2022 | 逸翎清晗🌈

μAFL: non-intrusive feedback-driven fuzzing for microcontroller firmware

Authors: Li, Wenqiang and Shi, Jiameng and Li, Fengjun and Lin, Jingqiang and Wang, Wei and Guan, Le
Keywords: ETM, IoT, firmware security, fuzzing, microcontroller

Abstract

Fuzzing is one of the most effective approaches to finding software flaws. However, applying it to microcontroller firmware incurs many challenges. For example, rehosting-based solutions cannot accurately model peripheral behaviors and thus cannot be used to fuzz the corresponding driver code. In this work, we present μAFL, a hardware-in-the-loop approach to fuzzing microcontroller firmware. It leverages debugging tools in existing embedded system development to construct an AFL-compatible fuzzing framework. Specifically, we use the debug dongle to bridge the fuzzing environment on the PC and the target firmware on the microcontroller device. To collect code coverage information without costly code instrumentation, μAFL relies on the ARM ETM hardware debugging feature, which transparently collects the instruction trace and streams the results to the PC. However, the raw ETM data is obscure and needs enormous computing resources to recover the actual instruction flow. We therefore propose an alternative representation of code coverage, which retains the same path sensitivity as the original AFL algorithm, but can directly work on the raw ETM data without matching them with disassembled instructions. To further reduce the workload, we use the DWT hardware feature to selectively collect runtime information of interest. We evaluated μAFL on two real evaluation boards from two major vendors: NXP and STMicroelectronics. With our prototype, we discovered ten zero-day bugs in the driver code shipped with the SDK of STMicroelectronics and three zero-day bugs in the SDK of NXP. Eight CVEs have been allocated for them. Considering the wide adoption of vendor SDKs in real products, our results are alarming.
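The AFL-compatible coverage signal mentioned above can be made concrete with classic AFL edge coverage. The sketch below is not μAFL's ETM-based encoding (which works directly on raw trace data); it illustrates the bitmap scheme that AFL-style fuzzers use as feedback, with illustrative map sizes and block IDs.

```python
# A minimal sketch of AFL-style edge coverage, the feedback signal that
# uAFL's alternative representation stays compatible with. Sizes and
# names are illustrative, not taken from the paper's implementation.

MAP_SIZE = 1 << 16  # AFL's default 64 KiB coverage bitmap

def edge_coverage(basic_block_trace, map_size=MAP_SIZE):
    """Fold a sequence of basic-block IDs into an AFL-style edge bitmap.

    AFL identifies an edge by combining the previous and current block:
    index = (cur ^ prev') % map_size, then bumps a hit counter.
    """
    bitmap = [0] * map_size
    prev = 0
    for block in basic_block_trace:
        idx = (block ^ prev) % map_size
        bitmap[idx] = (bitmap[idx] + 1) & 0xFF  # 8-bit wrapping hit counter
        prev = block >> 1  # shift so edges A->B and B->A map differently
    return bitmap

# Two runs taking different paths produce different bitmaps, which is
# what drives AFL's "interesting input" decision.
run_a = edge_coverage([0x10, 0x20, 0x30])
run_b = edge_coverage([0x10, 0x30, 0x20])
```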

DOI: 10.1145/3510003.3510208


A grounded theory of coordination in remote-first and hybrid software teams

Authors: de Souza Santos, Ronnie E. and Ralph, Paul
Keywords: COVID-19, agile methods, coordination, grounded theory, remote work, software development, work-from-home

Abstract

While the long-term effects of the COVID-19 pandemic on software professionals and organizations are difficult to predict, it seems likely that working from home, remote-first teams, distributed teams, and hybrid (part-remote/part-office) teams will be more common. It is therefore important to investigate the challenges that software teams and organizations face with new remote and hybrid work. Consequently, this paper reports a year-long, participant-observation, constructivist grounded theory study investigating the impact of working from home on software development. This study resulted in a theory of software team coordination. Briefly, shifting from in-office to at-home work fundamentally altered coordination within software teams. While group cohesion and more effective communication appear protective, coordination is undermined by distrust, parenting and communication bricolage. Poor coordination leads to numerous problems including misunderstandings, help requests, lower job satisfaction among team members, and more ill-defined tasks. These problems, in turn, reduce overall project success and prompt professionals to alter their software development processes (in this case, from Scrum to Kanban). Our findings suggest that software organizations with many remote employees can improve performance by encouraging greater engagement within teams and supporting employees with family and childcare responsibilities.

DOI: 10.1145/3510003.3510105


A scalable t-wise coverage estimator

Authors: Baranov, Eduard and Chakraborty, Sourav and Legay, Axel and Meel, Kuldeep S. and Variyam, Vinodchandran N.
Keywords: t-wise coverage, approximation, configurable software

Abstract

Owing to the pervasiveness of software in our modern lives, software systems have evolved to be highly configurable. Combinatorial testing has emerged as a dominant paradigm for testing highly configurable systems. Often constraints are employed to define the environments where a given system under test (SUT) is expected to work. Therefore, there has been a sustained interest in designing constraint-based test suite generation techniques. A significant goal of test suite generation techniques is to achieve t-wise coverage for higher values of t. Therefore, designing scalable techniques that can estimate t-wise coverage for a given set of tests, and/or estimate the maximum achievable t-wise coverage under a given set of constraints, is of crucial importance. The existing estimation techniques face significant scalability hurdles. The primary scientific contribution of this work is the design of scalable algorithms with mathematical guarantees to estimate (i) t-wise coverage for a given set of tests, and (ii) maximum t-wise coverage for a given set of constraints. In particular, we design a scalable framework ApproxCov that takes in a test set u, a coverage parameter t, a tolerance parameter ε, and a confidence parameter δ, and returns an estimate of the t-wise coverage of u that is guaranteed to be within a (1 ± ε) factor of the ground truth with probability at least 1 - δ. We design a scalable framework ApproxMaxCov that, for a given formula F, a coverage parameter t, a tolerance parameter ε, and a confidence parameter δ, outputs an approximation which is guaranteed to be within a (1 ± ε) factor of the maximum achievable t-wise coverage under F, with probability ≥ 1 - δ. Our comprehensive evaluation demonstrates that ApproxCov and ApproxMaxCov can handle benchmarks that are beyond the reach of current state-of-the-art approaches. We believe that the availability of ApproxCov and ApproxMaxCov will enable test suite designers to evaluate the effectiveness of their generators and thereby significantly impact the development of combinatorial testing techniques.
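To make the estimation problem concrete, here is a toy Monte Carlo estimator of t-wise coverage. Unlike ApproxCov, which provides a multiplicative (1 ± ε) guarantee, this naive sampler only gives an additive Hoeffding-style bound and ignores constraints; the sample-size formula, names, and tiny test suite are all illustrative.

```python
# Toy Monte Carlo baseline for t-wise coverage estimation. ApproxCov's
# actual algorithm differs (multiplicative guarantee, constraint-aware);
# this sketch is only for intuition.
import itertools, math, random

def exact_twise_coverage(tests, t):
    """Fraction of all t-wise (feature-set, value) combinations hit by tests."""
    n_feat = len(tests[0])
    covered = total = 0
    for feats in itertools.combinations(range(n_feat), t):
        for vals in itertools.product([0, 1], repeat=t):
            total += 1
            if any(all(test[f] == v for f, v in zip(feats, vals)) for test in tests):
                covered += 1
    return covered / total

def estimate_twise_coverage(tests, t, eps=0.05, delta=0.01, rng=random.Random(0)):
    """Estimate coverage by sampling t-wise combinations uniformly at random."""
    n_feat = len(tests[0])
    # Hoeffding-style sample count for additive error eps with confidence 1-delta
    samples = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    hits = 0
    for _ in range(samples):
        feats = rng.sample(range(n_feat), t)
        vals = [rng.randint(0, 1) for _ in range(t)]
        if any(all(test[f] == v for f, v in zip(feats, vals)) for test in tests):
            hits += 1
    return hits / samples

tests = [(0, 0, 1, 1), (1, 1, 0, 0), (1, 0, 1, 0)]
exact = exact_twise_coverage(tests, 2)
approx = estimate_twise_coverage(tests, 2)
```

The exact computation enumerates all combinations and quickly becomes infeasible as t and the number of features grow, which is precisely the scalability gap the paper targets.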

DOI: 10.1145/3510003.3510218


A universal data augmentation approach for fault localization

Authors: Xie, Huan and Lei, Yan and Yan, Meng and Yu, Yue and Xia, Xin and Mao, Xiaoguang
Keywords: data augmentation, fault localization, imbalanced data

Abstract

Data is the fuel of models, and this holds for fault localization (FL) as well. Many existing elaborate FL techniques take the code coverage matrix and failure vector as inputs, expecting that the techniques can find the correlation between program entities and failures. However, the input data is high-dimensional and extremely imbalanced, since real-world programs are large in size and the number of failing test cases is much smaller than that of passing test cases, which poses severe threats to the effectiveness of FL techniques. To overcome these limitations, we propose Aeneas, a universal data augmentation approach that generAtes syNthesized failing tEst cases from reduced feAture Space for more precise fault localization. Specifically, to improve the effectiveness of data augmentation, Aeneas first applies a revised principal component analysis (PCA) to generate a reduced feature space for a more concise representation of the original coverage matrix, which also gains efficiency for data synthesis. Then, Aeneas handles the imbalanced data issue by generating synthesized failing test cases from the reduced feature space with a conditional variational autoencoder (CVAE). To evaluate the effectiveness of Aeneas, we conduct large-scale experiments on 458 versions of 10 programs (from ManyBugs, SIR, and Defects4J) with six state-of-the-art FL techniques. The experimental results clearly show that Aeneas is statistically more effective than the baselines; e.g., our approach improves the six original methods by 89% on average under Top-1 accuracy.
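The rebalancing idea can be sketched as follows. Aeneas itself uses a revised PCA plus a CVAE; as a loud simplification, this sketch substitutes SMOTE-style interpolation between failing coverage vectors, so it conveys only the class-rebalancing step, not the paper's actual synthesis method.

```python
# Rebalancing sketch: synthesize extra failing samples by interpolating
# existing failing coverage vectors (a SMOTE-like heuristic). Aeneas uses
# a revised PCA + CVAE instead; this stand-in is purely illustrative.
import random

def oversample_failing(passing, failing, rng=random.Random(42)):
    """Interpolate random pairs of failing vectors until the classes balance."""
    synthesized = list(failing)
    while len(synthesized) < len(passing):
        a, b = rng.choice(failing), rng.choice(failing)
        lam = rng.random()  # convex combination keeps samples in the same region
        synthesized.append(tuple(lam * x + (1 - lam) * y for x, y in zip(a, b)))
    return synthesized

passing = [(1, 0, 1, 0)] * 10            # coverage vectors of passing tests
failing = [(1, 1, 0, 0), (0, 1, 1, 0)]   # far fewer failing tests
balanced_failing = oversample_failing(passing, failing)
```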

DOI: 10.1145/3510003.3510136


Adaptive performance anomaly detection for online service systems via pattern sketching

Authors: Chen, Zhuangbin and Liu, Jinyang and Su, Yuxin and Zhang, Hongyu and Ling, Xiao and Yang, Yongqiang and Lyu, Michael R.
Keywords: cloud computing, online learning, performance anomaly detection

Abstract

To ensure the performance of online service systems, their status is closely monitored with various software and system metrics. Performance anomalies represent the performance degradation issues (e.g., slow response) of the service systems. When performing anomaly detection over the metrics, existing methods often lack the merit of interpretability, which is vital for engineers and analysts to take remediation actions. Moreover, they are unable to effectively accommodate the ever-changing services in an online fashion. To address these limitations, in this paper, we propose ADSketch, an interpretable and adaptive performance anomaly detection approach based on pattern sketching. ADSketch achieves interpretability by identifying groups of anomalous metric patterns, which represent particular types of performance issues. The underlying issues can then be immediately recognized if similar patterns emerge again. In addition, an adaptive learning algorithm is designed to embrace unprecedented patterns induced by service updates or user behavior changes. The proposed approach is evaluated with public data as well as industrial data collected from a representative online service system in Huawei Cloud. The experimental results show that ADSketch outperforms state-of-the-art approaches by a significant margin, and demonstrate the effectiveness of the online algorithm in new pattern discovery. Furthermore, our approach has been successfully deployed in industrial practice.
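The pattern-sketching intuition — flag a metric window when it lands near a known anomalous pattern, which also tells the engineer *which* issue recurred — can be illustrated with a toy nearest-pattern matcher. The distance metric, threshold, and patterns below are illustrative assumptions, not ADSketch's actual algorithm.

```python
# Toy nearest-pattern matcher over metric windows. ADSketch's grouping
# and adaptive updating are far richer; threshold and patterns here are
# illustrative only.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify_window(window, anomalous_patterns, threshold=0.5):
    """Return the index of the closest known anomalous pattern, or None."""
    best_idx, best_dist = None, threshold
    for i, pattern in enumerate(anomalous_patterns):
        d = euclidean(window, pattern)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx

# Two hypothetical anomalous patterns: a sustained spike and a single spike.
patterns = [(0.9, 0.9, 0.9), (0.1, 0.9, 0.1)]
spike = classify_window((0.88, 0.92, 0.91), patterns)   # matches pattern 0
normal = classify_window((0.1, 0.1, 0.1), patterns)     # None: no known pattern nearby
```

Returning the matched pattern index, rather than a bare anomaly score, is what makes this style of detection interpretable.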

DOI: 10.1145/3510003.3510085


Adaptive test selection for deep neural networks

Authors: Gao, Xinyu and Feng, Yang and Yin, Yining and Liu, Zixi and Chen, Zhenyu and Xu, Baowen
Keywords: adaptive random testing, deep learning testing, deep neural networks, test selection

Abstract

Deep neural networks (DNN) have achieved tremendous development in the past decade. While many DNN-driven software applications have been deployed to solve various tasks, they can also produce incorrect behaviors and result in massive losses. To reveal the incorrect behaviors and improve the quality of DNN-driven applications, developers often need rich labeled data for the testing and optimization of DNN models. However, in practice, collecting diverse data from application scenarios and labeling them properly is often a highly expensive and time-consuming task. In this paper, we propose an adaptive test selection method, namely ATS, for deep neural networks to alleviate this problem. ATS leverages the difference between model outputs to measure the behavior diversity of DNN test data, aiming to select a subset of diverse tests from a massive unlabelled dataset. We evaluate ATS on four well-designed DNN models and four widely used datasets, in comparison with various kinds of neuron coverage (NC) criteria. The results demonstrate that ATS significantly outperforms all the compared test selection methods in assessing both the fault detection and the model improvement capability of test suites. It is thus promising for saving data labeling and model retraining costs for deep neural networks.
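The diversity-driven selection can be illustrated with a greedy farthest-point heuristic over model output vectors. ATS's actual diversity measure differs; the distance function, budget, and toy softmax outputs below are illustrative assumptions.

```python
# Greedy farthest-point selection over model outputs: repeatedly pick the
# sample whose minimum distance to the already-selected set is largest.
# A stand-in for ATS's diversity measure, for illustration only.
def dist(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def select_diverse(outputs, budget):
    """Greedily pick `budget` samples maximizing min distance to the picked set."""
    chosen = [0]  # seed with the first sample
    while len(chosen) < budget:
        best = max(
            (i for i in range(len(outputs)) if i not in chosen),
            key=lambda i: min(dist(outputs[i], outputs[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Toy softmax outputs: samples 0 and 1 behave almost identically, so a
# diversity-aware selector should skip sample 1.
outputs = [(0.9, 0.1), (0.88, 0.12), (0.1, 0.9), (0.5, 0.5)]
picked = select_diverse(outputs, 2)   # [0, 2]
```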

DOI: 10.1145/3510003.3510232


An exploratory study of deep learning supply chain

Authors: Tan, Xin and Gao, Kai and Zhou, Minghui and Zhang, Li
Keywords: deep learning, open source, software evolution, software structure, software supply chain

Abstract

Deep learning becomes the driving force behind many contemporary technologies and has been successfully applied in many fields. Through software dependencies, a multi-layer supply chain (SC) with a deep learning framework as the core and substantial downstream projects as the periphery has gradually formed and is constantly developing. However, basic knowledge about the structure and characteristics of the SC is lacking, which hinders effective support for its sustainable development. Previous studies on software SCs usually focus on the packages in different registries without paying attention to the SCs derived from a single project. We present an empirical study on two deep learning SCs: the TensorFlow and PyTorch SCs. By constructing and analyzing their SCs, we aim to understand their structure, application domains, and evolutionary factors. We find that both SCs exhibit a short and sparse hierarchy structure. Overall, the relative growth of new projects increases month by month. Projects have a tendency to attract downstream projects shortly after the release of their packages; later, the growth becomes faster and then tends to stabilize. We propose three criteria to identify vulnerabilities and identify 51 types of packages and 26 types of projects involved in the two SCs. A comparison reveals their similarities and differences, e.g., the TensorFlow SC provides a wealth of packages for experiment result analysis, while the PyTorch SC contains more specific framework packages. By fitting a GAM model, we find that the number of dependent packages is significantly negatively associated with the number of downstream projects, but the relationship with the number of authors is nonlinear. Our findings can help further open the “black box” of deep learning SCs and provide insights for their healthy and sustainable development.

DOI: 10.1145/3510003.3510199


An exploratory study of productivity perceptions in software teams

Authors: Ruvimova, Anastasia and Lill, Alexander and Gugler, Jan and Howe, Lauren and Huang, Elaine and Murphy, Gail and Fritz, Thomas
Keywords: productivity, software developer, team, user study

Abstract

Software development is a collaborative process requiring a careful balance of focused individual effort and team coordination. Though questions of individual productivity have been widely examined in past literature, less is known about the interplay between developers’ perceptions of their own productivity as opposed to their team’s. In this paper, we present an analysis of 624 daily surveys and 2899 self-reports from 25 individuals across five software teams in North America and Europe, collected over the course of three months. We found that developers tend to operate in fluid team constructs, which impacts team awareness and complicates gauging team productivity. We also found that perceived individual productivity most strongly predicted perceived team productivity, even more than the amount of team interactions, unplanned work, and time spent in meetings. Future research should explore how fluid team structures impact individual and organizational productivity.

DOI: 10.1145/3510003.3510081


Analyzing user perspectives on mobile app privacy at scale

Authors: Nema, Preksha and Anthonysamy, Pauline and Taft, Nina and Peddinti, Sai Teja
Keywords: empirical, mobile apps, nlp, privacy

Abstract

In this paper we present a methodology to analyze users’ concerns and perspectives about privacy at scale. We leverage NLP techniques to process millions of mobile app reviews and extract privacy concerns. Our methodology is composed of a binary classifier that distinguishes between privacy and non-privacy related reviews. We use clustering to gather reviews that discuss similar privacy concerns, and employ summarization metrics to extract representative reviews to summarize each cluster. We apply our methods to 287M reviews for about 2M apps across the 29 categories in Google Play to identify top privacy pain points in mobile apps. We identified approximately 440K privacy-related reviews. We find that privacy-related reviews occur in all 29 categories, with some issues arising across numerous app categories and other issues only surfacing in a small set of app categories. We show empirical evidence that confirms dominant privacy themes: concerns about apps requesting unnecessary permissions, collection of personal information, frustration with privacy controls, tracking, and the selling of personal data. As far as we know, this is the first large-scale analysis to confirm these findings based on hundreds of thousands of user inputs. We also observe some unexpected findings, such as users warning each other not to install an app due to privacy issues, users uninstalling apps due to privacy reasons, as well as positive reviews that reward developers for privacy-friendly apps. Finally, we discuss the implications of our method and findings for developers and app stores.

DOI: 10.1145/3510003.3510079


Aper: evolution-aware runtime permission misuse detection for Android apps

Authors: Wang, Sinan and Wang, Yibo and Zhan, Xian and Wang, Ying and Liu, Yepang and Luo, Xiapu and Cheung, Shing-Chi
Keywords: Android runtime permission, compatibility issues, static analysis

Abstract

The Android platform introduced the runtime permission model in version 6.0. The new model greatly improves data privacy and user experience, but brings new challenges for app developers. First, it allows users to freely revoke granted permissions. Hence, developers cannot assume that the permissions granted to an app will keep being granted. Instead, they should make their apps carefully check the permission status before invoking dangerous APIs. Second, the permission specification keeps evolving, bringing new types of compatibility issues into the ecosystem. To understand the impact of these challenges, we conducted an empirical study on 13,352 popular Google Play apps. We found that 86.0% of the apps used dangerous APIs asynchronously after permission management and 61.2% of the apps used evolving dangerous APIs. If an app does not properly handle permission revocations or platform differences, unexpected runtime issues may happen and even cause app crashes. We call such Android Runtime Permission issues ARP bugs. Unfortunately, existing runtime permission issue detection tools cannot effectively deal with the ARP bugs induced by asynchronous permission management and permission specification evolution. To fill the gap, we designed a static analyzer, Aper, that performs reaching definition and dominator analysis on Android apps to detect the two types of ARP bugs. To compare Aper with existing tools, we built a benchmark, ARPfix, from 60 real ARP bugs. Our experiment results show that Aper significantly outperforms two academic tools, ARPDroid and RevDroid, and an industrial tool, Lint, on ARPfix, with an average improvement of 46.3% on F1-score. In addition, Aper successfully found 34 ARP bugs in 214 open-source Android apps, most of which can result in abnormal app behaviors (such as app crashes) according to our manual validation. We reported these bugs to the app developers. So far, 17 bugs have been confirmed and seven have been fixed.

DOI: 10.1145/3510003.3510074


ARCLIN: automated API mention resolution for unformatted texts

Authors: Huo, Yintong and Su, Yuxin and Zhang, Hongming and Lyu, Michael R.
Keywords: API, API disambiguation, text mining

Abstract

Online technical forums (e.g., StackOverflow) are popular platforms for developers to discuss technical problems such as how to use a specific Application Programming Interface (API), how to solve programming tasks, or how to fix bugs in their code. These discussions can often provide auxiliary knowledge of how to use the software that is not covered by the official documents. The automatic extraction of such knowledge may support a set of downstream tasks like API searching or indexing. However, unlike official documentation written by experts, discussions in open forums are made by regular developers who write in short and informal texts, including spelling errors or abbreviations. There are three major challenges in accurately recognizing API mentions in unstructured natural-language documents and linking them to entries in an API repository: (1) distinguishing API mentions from common words; (2) identifying API mentions without a fully qualified name; and (3) disambiguating API mentions with similar method names but from different libraries. In this paper, to tackle these challenges, we propose ARCLIN, a tool which can effectively distinguish and link APIs without using human annotations. Specifically, we first design an API recognizer to automatically extract API mentions from natural language sentences using a Conditional Random Field (CRF) on top of a Bi-directional Long Short-Term Memory (Bi-LSTM) module; then we apply a context-aware scoring mechanism to compute the mention-entry similarity for each entry in an API repository. Compared to previous approaches based on heuristic rules, our tool, which requires no manual inspection, performs 8% better on Py-mention, a high-quality dataset containing 558 mentions and 2,830 sentences from five popular Python libraries. To the best of our knowledge, ARCLIN is the first approach to achieve full automation of API mention resolution from unformatted text without manually collected labels.
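The mention-entry scoring step can be illustrated with a toy scorer that combines a name match with context-token overlap. ARCLIN's context-aware scoring is learned and far richer; the repository entries and weighting here are purely illustrative.

```python
# Toy mention-entry linker: score each repository entry by an exact
# simple-name match plus context-token overlap, and pick the best.
# A stand-in for ARCLIN's learned context-aware scoring.
def link_mention(mention, context_tokens, repository):
    """repository: list of (qualified_name, doc_tokens). Returns best qualified name."""
    def score(entry):
        name, doc = entry
        name_score = 1.0 if name.split(".")[-1] == mention else 0.0
        overlap = len(set(context_tokens) & set(doc)) / (len(doc) or 1)
        return name_score + overlap
    return max(repository, key=score)[0]

# Hypothetical two-entry repository: the same simple name "append" exists
# in two libraries, so context must disambiguate.
repo = [
    ("pandas.DataFrame.append", ["append", "rows", "dataframe"]),
    ("list.append", ["append", "element", "list"]),
]
best = link_mention("append", ["how", "to", "append", "a", "list", "element"], repo)
```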

DOI: 10.1145/3510003.3510158


AST-trans: code summarization with efficient tree-structured attention

Authors: Tang, Ze and Shen, Xiaoyu and Li, Chuanyi and Ge, Jidong and Huang, Liguo and Zhu, Zhelin and Luo, Bin
Keywords: source code summarization, tree-based neural network

Abstract

Code summarization aims to generate brief natural language descriptions for source code. The state-of-the-art approaches follow a transformer-based encoder-decoder architecture. As source code is highly structured and follows strict grammars, its Abstract Syntax Tree (AST) is widely used for encoding structural information. However, ASTs are much longer than the corresponding source code. Existing approaches ignore the size constraint and simply feed the whole linearized AST into the encoders. We argue that such a simple process makes it difficult to extract the truly useful dependency relations from the overlong input sequence. It also incurs significant computational overhead, since each node needs to apply self-attention to all other nodes in the AST. To encode the AST more effectively and efficiently, we propose AST-Trans, which exploits two types of node relationships in the AST: ancestor-descendant and sibling relationships. It applies tree-structured attention to dynamically allocate weights to relevant nodes and exclude irrelevant nodes based on these two relationships. We further propose an efficient implementation to support fast parallel computation of tree-structured attention. On two code summarization datasets, experimental results show that AST-Trans significantly outperforms the state of the art while being substantially more efficient than standard transformers.
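The ancestor-descendant relationship the attention relies on can be made concrete by computing, for a small tree, the matrix of path distances between nodes that stand in an ancestor-descendant relation (with unrelated pairs, such as siblings, marked as having no relation). This is only a sketch of the kind of relational input such tree-structured attention consumes, not the paper's implementation.

```python
# Compute an ancestor-descendant distance matrix for a small tree.
# mat[i][j] is the path length when one of i, j is an ancestor of the
# other, else None (e.g. siblings). Illustrative only.
def ancestor_descendant_matrix(parent):
    """parent[i] is the parent index of node i (root has parent -1)."""
    n = len(parent)
    def ancestors(i):
        path, d = {}, 0
        while i != -1:
            path[i] = d              # node i's ancestor at distance d
            i, d = parent[i], d + 1
        return path
    anc = [ancestors(i) for i in range(n)]
    mat = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j in anc[i]:
                mat[i][j] = anc[i][j]    # j is an ancestor of i
            elif i in anc[j]:
                mat[i][j] = anc[j][i]    # i is an ancestor of j
    return mat

# Tree:    0
#         / \
#        1   2
#        |
#        3
mat = ancestor_descendant_matrix([-1, 0, 0, 1])
```

Attention weights can then be restricted to pairs whose entry is not None and within some maximum distance, which is what lets each node ignore most of the overlong linearized AST.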

DOI: 10.1145/3510003.3510224


Automated assertion generation via information retrieval and its integration with deep learning

Authors: Yu, Hao and Lou, Yiling and Sun, Ke and Ran, Dezhi and Xie, Tao and Hao, Dan and Li, Ying and Li, Ge and Wang, Qianxiang
Keywords: deep learning, information retrieval, test assertion, unit testing

Abstract

Unit testing can be used to validate the correctness of the basic units of a software system under test. To reduce the manual effort of conducting unit testing, the research community has contributed tools that automatically generate unit test cases, including test inputs and test oracles (e.g., assertions). Recently, ATLAS, a deep learning (DL) based approach, was proposed to generate assertions for a unit test based on other already written unit tests. Although promising, the effectiveness of ATLAS is still limited. To improve the effectiveness, in this work, we make the first attempt to leverage Information Retrieval (IR) in assertion generation and propose an IR-based approach comprising a technique for IR-based assertion retrieval and a technique for retrieved-assertion adaptation. In addition, we propose an integration approach that combines our IR-based approach with a DL-based approach (e.g., ATLAS) to further improve the effectiveness. Our experimental results show that our IR-based approach outperforms the state-of-the-art DL-based approach, and that integrating the two achieves even higher accuracy. Our results convey an important message: information retrieval can be competitive and worthwhile to pursue for software engineering tasks such as assertion generation, and it deserves serious consideration from the research community, given that deep learning solutions have been so widely adopted for such tasks in recent years.
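The IR-based assertion retrieval step can be sketched as a nearest-neighbor search over token sets. The similarity function (Jaccard), the tiny corpus, and the omitted adaptation step are all illustrative simplifications of the paper's technique.

```python
# Retrieval sketch: find the already-written test most similar to the
# focal test (Jaccard similarity over code tokens) and reuse its
# assertion. The adaptation step is omitted; corpus is illustrative.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def retrieve_assertion(focal_tokens, corpus):
    """corpus: list of (test_tokens, assertion). Returns the best-matching assertion."""
    return max(corpus, key=lambda pair: jaccard(focal_tokens, pair[0]))[1]

corpus = [
    (["sum", "add", "int"], 'assertEquals(3, add(1, 2))'),
    (["parse", "json", "string"], 'assertNotNull(parse(s))'),
]
assertion = retrieve_assertion(["add", "two", "int"], corpus)
```

In the full approach, the retrieved assertion is then adapted (e.g., identifiers renamed) to fit the focal test's context before it is proposed to the developer.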

DOI: 10.1145/3510003.3510149


Automated detection of password leakage from public GitHub repositories

Authors: Feng, Runhan and Yan, Ziyang and Peng, Shiyan and Zhang, Yuanyuan
Keywords: GitHub, deep learning, mining software repositories, password

Abstract

The prosperity of the GitHub community has raised new concerns about data security in public repositories. Practitioners who manage authentication secrets such as textual passwords and API keys in the source code may accidentally leave these texts in public repositories, resulting in secret leakage. If such leakage in the source code can be automatically detected in time, potential damage can be avoided. With existing approaches focusing on detecting secrets with distinctive formats (e.g., API keys, cryptographic keys in PEM format), textual passwords, which are ubiquitously used for authentication, fall through the cracks. Given that textual passwords can be virtually any string, a naive detection scheme based on regular expressions performs poorly. This paper presents PassFinder, an automated approach to effectively detecting password leakage from public repositories involving various programming languages on a large scale. PassFinder utilizes deep neural networks to unveil the intrinsic characteristics of textual passwords and understand the semantics of the code snippets that use textual passwords for authentication, i.e., the contextual information of the passwords in the source code. Using this new technique, we performed the first large-scale and longitudinal analysis of password leakage on GitHub. We inspected newly uploaded public code files on GitHub for 75 days and found that password leakage is pervasive, affecting over sixty thousand repositories. Our work contributes to a better understanding of password leakage on GitHub, and we believe our technique could promote the security of the open-source ecosystem.
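For contrast, the naive regex baseline that the paper argues against might look like the following. The pattern is an illustrative assumption; as the abstract notes, such a scheme performs poorly because it only catches secrets assigned to suggestively named variables and knows nothing about context.

```python
# Naive regex baseline for password detection: flag string assignments
# whose variable name hints at a secret. Illustrative pattern only; this
# is the weak baseline PassFinder improves upon, not PassFinder itself.
import re

SECRET_ASSIGN = re.compile(
    r'(?i)(password|passwd|pwd|secret)\s*=\s*["\']([^"\']+)["\']'
)

def find_candidate_passwords(source):
    """Return the quoted values of suspicious-looking assignments."""
    return [m.group(2) for m in SECRET_ASSIGN.finditer(source)]

code = 'db_password = "hunter2"\ntimeout = "30"\n'
leaks = find_candidate_passwords(code)   # ["hunter2"]
```

A password stored in a variable named, say, `s` would be missed entirely, which is why PassFinder instead learns the characteristics of password strings and their surrounding code context.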

DOI: 10.1145/3510003.3510150


Automated handling of anaphoric ambiguity in requirements: a multi-solution study

Authors: Ezzini, Saad and Abualhaija, Sallam and Arora, Chetan and Sabetzadeh, Mehrdad
Keywords: BERT, ambiguity, language models, machine learning (ML), natural language processing (NLP), natural-language requirements, requirements engineering

Abstract

Ambiguity is a pervasive issue in natural-language requirements. A common source of ambiguity in requirements is when a pronoun is anaphoric. In requirements engineering, anaphoric ambiguity occurs when a pronoun can plausibly refer to different entities and thus be interpreted differently by different readers. In this paper, we develop an accurate and practical automated approach for handling anaphoric ambiguity in requirements, addressing both ambiguity detection and anaphora interpretation. In view of the multiple competing natural language processing (NLP) and machine learning (ML) technologies that one can utilize, we simultaneously pursue six alternative solutions, empirically assessing each using a collection of ≈1,350 industrial requirements. The alternative solution strategies that we consider are natural choices induced by the existing technologies; these choices frequently arise in other automation tasks involving natural-language requirements. A side-by-side empirical examination of these choices helps develop insights about the usefulness of different state-of-the-art NLP and ML technologies for addressing requirements engineering problems. For the ambiguity detection task, we observe that supervised ML outperforms both a large-scale language model, SpanBERT (a variant of BERT), as well as a solution assembled from off-the-shelf NLP coreference resolvers. In contrast, for anaphora interpretation, SpanBERT yields the most accurate solution. In our evaluation, (1) the best solution for anaphoric ambiguity detection has an average precision of ≈60% and a recall of 100%, and (2) the best solution for anaphora interpretation (resolution) has an average success rate of ≈98%.

DOI: 10.1145/3510003.3510157


Automated patching for unreproducible builds

Authors: Ren, Zhilei and Sun, Shiwei and Xuan, Jifeng and Li, Xiaochen and Zhou, Zhide and Jiang, He
Keywords: automated patch generation, dynamic tracing, reproducible builds

Abstract

Software reproducibility plays an essential role in establishing trust between source code and the built artifacts, by comparing compilation outputs acquired from independent users. Although testing for unreproducible builds can be automated, fixing unreproducible build issues poses a set of challenges within the reproducible-builds practice, among which we consider localization granularity and historical knowledge utilization as the most significant ones. To tackle these challenges, we propose a novel approach, RepFix, that combines tracing-based fine-grained localization with history-based patch generation mechanisms. On the one hand, to tackle the localization granularity challenge, we adopt system-level dynamic tracing to capture both the system call traces and user-space function call information. By integrating kernel probes and user-space probes, we can determine the location of each executed build command more accurately. On the other hand, to tackle the historical knowledge utilization challenge, we design a similarity-based mechanism for retrieving relevant patches, and generate patches by applying the edit operations of existing patches. With the abundant patches accumulated by the reproducible-builds practice, we can generate patches to fix unreproducible builds automatically. To evaluate the usefulness of RepFix, extensive experiments were conducted on a dataset of 116 real-world packages. Based on RepFix, we successfully fixed the unreproducible build issues of 64 packages. Moreover, we applied RepFix to the Arch Linux packages and successfully fixed four packages. Two patches have been accepted by the repository, and for one package the patch was pushed to and accepted by its upstream repository, so that the fix can also benefit other downstream repositories.

DOI: 10.1145/3510003.3510102


Automated testing of software that uses machine learning APIs

Authors: Wan, Chengcheng and Liu, Shicheng and Xie, Sophie and Liu, Yifan and Hoffmann, Henry and Maire, Michael and Lu, Shan
Keywords: machine learning, machine learning API, software testing

Abstract

An increasing number of software applications incorporate machine learning (ML) solutions for cognitive tasks that statistically mimic human behaviors. To test such software, tremendous human effort is needed to design image/text/audio inputs that are relevant to the software, and to judge whether the software is processing these inputs as most human beings do. Even when misbehavior is exposed, it is often unclear whether the culprit is inside the cognitive ML API or the code using the API. This paper presents Keeper, a new testing tool for software that uses cognitive ML APIs. Keeper designs a pseudo-inverse function for each ML API that reverses the corresponding cognitive task in an empirical way (e.g., an image search engine pseudo-reverses the image-classification API), and incorporates these pseudo-inverse functions into a symbolic execution engine to automatically generate relevant image/text/audio inputs and judge output correctness. Once misbehavior is exposed, Keeper attempts to change how ML APIs are used in software to alleviate the misbehavior. Our evaluation on a variety of open-source applications shows that Keeper greatly improves branch coverage while identifying many previously unknown bugs.

DOI: 10.1145/3510003.3510068


Automatic detection of performance bugs in database systems using equivalent queries

作者: Liu, Xinyu and Zhou, Qi and Arulraj, Joy and Orso, Alessandro
关键词: database testing, differential testing, query optimization

Abstract

Because modern data-intensive applications rely heavily on database systems (DBMSs), developers extensively test these systems to eliminate bugs that negatively affect functionality. Besides functional bugs, however, there is another important class of faults that negatively affect the response time of a DBMS, known as performance bugs. Despite their potential impact on end-user experience, performance bugs have received considerably less attention than functional bugs. To fill this gap, we present Amoeba, a technique and tool for automatically detecting performance bugs in DBMSs. The core idea behind Amoeba is to construct semantically equivalent query pairs, run both queries on the DBMS under test, and compare their response time. If the queries exhibit significantly different response times, that indicates the possible presence of a performance bug in the DBMS. To construct equivalent queries, we propose to use a set of structure and expression mutation rules especially targeted at uncovering performance bugs. We also introduce feedback mechanisms for improving the effectiveness and efficiency of the approach. We evaluate Amoeba on two widely-used DBMSs, namely PostgreSQL and CockroachDB, with promising results: Amoeba has so far discovered 39 potential performance bugs, among which developers have already confirmed 6 bugs and fixed 5 bugs.
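
Amoeba's core loop (run a query and a semantically equivalent mutant, then compare results and timings) can be sketched against SQLite. SQLite is only a stand-in here, and the redundant-predicate mutation is a simplistic rule of our own choosing, not one of Amoeba's actual mutation rules:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i % 10) for i in range(1000)])

q1 = "SELECT id FROM t WHERE v = 3"
# Equivalent mutant: the added disjunction cannot change the result set.
q2 = "SELECT id FROM t WHERE v = 3 AND (v = 3 OR v = 4)"

def run(query):
    start = time.perf_counter()
    rows = conn.execute(query).fetchall()
    return rows, time.perf_counter() - start

rows1, t1 = run(q1)
rows2, t2 = run(q2)
assert rows1 == rows2  # sanity check: the pair really is equivalent
# A consistently large t2/t1 ratio across repeated runs would flag a
# potential performance bug in the query optimizer.
```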

DOI: 10.1145/3510003.3510093


AutoTransform: automated code transformation to support modern code review process

作者: Thongtanunam, Patanamon and Pornprasit, Chanathip and Tantithamthavorn, Chakkrit
关键词: No keywords

Abstract

Code review is effective, but human-intensive (e.g., developers need to manually modify source code until it is approved). Recently, prior work proposed a Neural Machine Translation (NMT) approach to automatically transform source code to the version that is reviewed and approved (i.e., the after version). Yet, its performance is still suboptimal when the after version has new identifiers or literals (e.g., renamed variables) or has many code tokens. To address these limitations, we propose AutoTransform which leverages a Byte-Pair Encoding (BPE) approach to handle new tokens and a Transformer-based NMT architecture to handle long sequences. We evaluate our approach based on 14,750 changed methods with and without new tokens for both small and medium sizes. The results show that when generating one candidate for the after version (i.e., beam width = 1), our AutoTransform can correctly transform 1,413 changed methods, which is 567% higher than the prior work, highlighting the substantial improvement of our approach for code transformation in the context of code review. This work contributes towards automated code transformation for code reviews, which could help developers reduce their effort in modifying source code during the code review process.
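
BPE's value for code is that an unseen identifier decomposes into known subwords instead of an out-of-vocabulary token. A toy merge step (our illustration, not AutoTransform's actual tokenizer) looks like:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE training step: merge the most frequent adjacent symbol pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # fuse the pair into one subword symbol
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Repeating this step over a corpus builds the subword vocabulary that lets the NMT model emit identifiers it never saw during training.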

DOI: 10.1145/3510003.3510067


BeDivFuzz: integrating behavioral diversity into generator-based fuzzing

作者: Nguyen, Hoang Lam and Grunske, Lars
关键词: behavioral diversity, random testing, structure-aware fuzzing

Abstract

A popular metric to evaluate the performance of fuzzers is branch coverage. However, we argue that focusing solely on covering many different branches (i.e., the richness) is not sufficient since the majority of the covered branches may have been exercised only once, which does not inspire a high confidence in the reliability of the covered code. Instead, the distribution of the executed branches (i.e., the evenness) should also be considered. That is, behavioral diversity is achieved only if the generated inputs not only trigger many different branches, but also trigger them evenly, with diverse inputs. We introduce BeDivFuzz, a feedback-driven fuzzing technique for generator-based fuzzers. BeDivFuzz distinguishes between structure-preserving and structure-changing mutations in the space of syntactically valid inputs, and biases its mutation strategy towards validity and behavioral diversity based on the received program feedback. We have evaluated BeDivFuzz on Ant, Maven, Rhino, Closure, Nashorn, and Tomcat. The results show that BeDivFuzz achieves better behavioral diversity than the state of the art, measured by established biodiversity metrics, namely the Hill numbers, from the field of ecology.
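
The Hill numbers used for evaluation have a closed form, D_q = (sum_i p_i^q)^(1/(1-q)), where p_i is the fraction of executions hitting branch i; the q -> 1 limit is exp of the Shannon entropy. A direct transcription:

```python
import math

def hill_number(branch_hits, q):
    """Hill number of order q over a branch-hit distribution.
    q=0: richness (branch count); q=1: exp(Shannon entropy); q=2: inverse Simpson."""
    total = sum(branch_hits)
    p = [h / total for h in branch_hits if h > 0]
    if q == 1:  # limit case: the general formula is undefined at q=1
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))
```

For a perfectly even distribution over n branches, every order gives n; a skewed distribution keeps the same richness (q=0) but scores much lower at q=2, which is exactly the evenness signal the abstract argues for.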

DOI: 10.1145/3510003.3510182


Big data = big insights? operationalising Brooks’ law in a massive GitHub data set

作者: Gote, Christoph and Mavrodiev, Pavlin and Schweitzer, Frank and Scholtes, Ingo
关键词: No keywords

Abstract

Massive data from software repositories and collaboration tools are widely used to study social aspects in software development. One question that several recent works have addressed is how a software project’s size and structure influence team productivity, a question famously considered in Brooks’ law. Recent studies using massive repository data suggest that developers in larger teams tend to be less productive than those in smaller teams. Despite using similar methods and data, other studies argue for a positive linear or even super-linear relationship between team size and productivity, thus contesting the view of software economics that software projects are diseconomies of scale. In our work, we study challenges that can explain the disagreement between recent studies of developer productivity in massive repository data. We further provide, to the best of our knowledge, the largest, curated corpus of GitHub projects tailored to investigate the influence of team size and collaboration patterns on individual and collective productivity. Our work contributes to the ongoing discussion on the choice of productivity metrics in the operationalisation of hypotheses about determinants of successful software projects. It further highlights general pitfalls in big data analysis and shows that the use of bigger data sets does not automatically lead to more reliable insights.

DOI: 10.1145/3510003.3510619


Bots for pull requests: the good, the bad, and the promising

作者: Wessel, Mairieli and Abdellatif, Ahmad and Wiese, Igor and Conte, Tayana and Shihab, Emad and Gerosa, Marco A. and Steinmacher, Igor
关键词: GitHub bots, automation, collaborative development, design fiction, human-bot interaction, open source software, software bots

Abstract

Software bots automate tasks within Open Source Software (OSS) projects’ pull requests and save reviewing time and effort (“the good”). However, their interactions can be disruptive and noisy and lead to information overload (“the bad”). To identify strategies to overcome such problems, we applied Design Fiction as a participatory method with 32 practitioners. We elicited 22 design strategies for a bot mediator or the pull request user interface (“the promising”). Participants envisioned a separate place in the pull request interface for bot interactions and a bot mediator that can summarize and customize other bots’ actions to mitigate noise. We also collected participants’ perceptions about a prototype implementing the envisioned strategies. Our design strategies can guide the development of future bots and social coding platforms.

DOI: 10.1145/3510003.3512765


Bridging pre-trained models and downstream tasks for source code understanding

作者: Wang, Deze and Jia, Zhouyang and Li, Shanshan and Yu, Yue and Xiong, Yun and Dong, Wei and Liao, Xiangke
关键词: curriculum learning, data augmentation, fine-tuning, test-time augmentation

Abstract

With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to the cost of training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformation to enrich downstream data diversity, and help pre-trained models learn semantic features invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models. We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that without heavy pre-training on code data, the natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models fine-tuned on the above tasks, such as CodeBERT and GraphCodeBERT. This finding suggests that there is still much room for improvement in code pre-trained models.

DOI: 10.1145/3510003.3510062


BugListener: identifying and synthesizing bug reports from collaborative live chats

作者: Shi, Lin and Mu, Fangwen and Zhang, Yumin and Yang, Ye and Chen, Junjie and Chen, Xiao and Jiang, Hanzhi and Jiang, Ziyou and Wang, Qing
关键词: bug report generation, live chats mining, open source

Abstract

In community-based software development, developers frequently rely on live-chatting to discuss emergent bugs/errors they encounter in daily development tasks. However, it remains a challenging task to accurately record such knowledge due to the noisy nature of interleaved dialogs in live chat data. In this paper, we first formulate the task of identifying and synthesizing bug reports from community live chats, and propose a novel approach, named BugListener, to address the challenges. Specifically, BugListener automates three sub-tasks: 1) Disentangle the dialogs from massive chat logs by using a Feed-Forward neural network; 2) Identify the bug-report dialogs from separated dialogs by leveraging the Graph neural network to learn the contextual information; 3) Synthesize the bug reports by utilizing Transfer Learning techniques to classify the sentences into: observed behaviors (OB), expected behaviors (EB), and steps to reproduce the bug (SR). BugListener is evaluated on six open source projects. The results show that: for bug report identification, BugListener achieves an average F1 of 77.74%, improving the best baseline by 12.96%; and for the bug report synthesis task, BugListener could classify the OB, EB, and SR sentences with F1 scores of 84.62%, 71.46%, and 73.13%, improving the best baselines by 9.32%, 12.21%, and 10.91%, respectively. A human evaluation study also confirms the effectiveness of BugListener in generating relevant and accurate bug reports. These results demonstrate the significant potential of applying BugListener in community-based software development, for promoting bug discovery and quality improvement.

DOI: 10.1145/3510003.3510108


BuildSheriff: change-aware test failure triage for continuous integration builds

作者: Zhang, Chen and Chen, Bihuan and Peng, Xin and Zhao, Wenyun
关键词: continuous integration, failure triage, test failures

Abstract

Test failures are one of the most common reasons for broken builds in continuous integration. It is expensive to diagnose all test failures in a build. As test failures are usually caused by a few underlying faults, triaging test failures with respect to their underlying root causes can save test failure diagnosis cost. Existing failure triage methods are mostly developed for triaging crash or bug reports, and hence not applicable in the context of test failure triage in continuous integration. In this paper, we first present a large-scale empirical study on 163,371 broken builds caused by test failures to characterize test failures in real-world Java projects. Then, motivated by our study, we propose a new change-aware approach, BuildSheriff, to triage test failures in each continuous integration build such that test failures with the same root cause are put in the same cluster. Our evaluation on 200 broken builds has demonstrated that BuildSheriff significantly outperforms the state-of-the-art methods in triaging effectiveness.
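
The clustering intuition (failures that trace back to the same fault belong together) can be sketched as a toy heuristic. BuildSheriff's change-awareness is far richer; the frame-matching rule below is our simplification, grouping failing tests by the first stack frame that lies in a file touched by the triggering change:

```python
def triage(failures, changed_files):
    """Cluster failing tests by a culprit stack frame.

    failures: dict mapping test name -> stack trace (list of "File:line" frames).
    changed_files: set of files modified by the commit that broke the build.
    """
    clusters = {}
    for test, trace in failures.items():
        # Prefer the first frame inside the change; fall back to the top frame.
        culprit = next(
            (frame for frame in trace if frame.split(":")[0] in changed_files),
            trace[0],
        )
        clusters.setdefault(culprit, []).append(test)
    return clusters
```

Tests landing in the same cluster can then be diagnosed once instead of one by one.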

DOI: 10.1145/3510003.3510132


Causality in configurable software systems

作者: Dubslaff, Clemens and Weis, Kallistos and Baier, Christel and Apel, Sven
关键词: No keywords

Abstract

Detecting and understanding reasons for defects and inadvertent behavior in software is challenging due to their increasing complexity. In configurable software systems, the combinatorics that arises from the multitude of features a user might select from adds a further layer of complexity. We introduce the notion of feature causality, which is based on counterfactual reasoning and inspired by the seminal definition of actual causality by Halpern and Pearl. Feature causality operates at the level of system configurations and is capable of identifying features and their interactions that are the reason for emerging functional and non-functional properties. We present various methods to explicate these reasons, in particular well-established notions of responsibility and blame that we extend to the feature-oriented setting. Establishing a close connection of feature causality to prime implicants, we provide algorithms to effectively compute feature causes and causal explications. By means of an evaluation on a wide range of configurable software systems, including community benchmarks and real-world systems, we demonstrate the feasibility of our approach: we illustrate how our notion of causality facilitates identifying root causes, estimating the effects of features, and detecting feature interactions.
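
A but-for simplification of this idea can be written directly over the configuration space (the full Halpern-Pearl definition handles contingencies that this sketch ignores): a feature is flagged when flipping it alone changes the observed effect in at least one configuration.

```python
from itertools import product

def feature_causes(effect, features):
    """But-for causes of `effect` (a predicate over feature assignments).

    Exhaustively enumerates all 2^n configurations, so this is only a toy
    for small feature sets; the paper's algorithms avoid this blow-up.
    """
    causes = set()
    for config in product([False, True], repeat=len(features)):
        assignment = dict(zip(features, config))
        for f in features:
            flipped = dict(assignment)
            flipped[f] = not flipped[f]
            if effect(assignment) != effect(flipped):
                causes.add(f)  # flipping f alone changed the outcome
    return causes
```

For an effect that requires both A and B, the analysis correctly blames A and B but not an irrelevant feature C.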

DOI: 10.1145/3510003.3510200


Causality-based neural network repair

作者: Sun, Bing and Sun, Jun and Pham, Long H. and Shi, Jie
关键词: No keywords

Abstract

Neural networks have achieved remarkable success in a wide range of applications. Their widespread adoption also raises concerns about their dependability and reliability. Similar to traditional decision-making programs, neural networks can have defects that need to be repaired. The defects may cause unsafe behaviors, raise security concerns or unjust societal impacts. In this work, we address the problem of repairing a neural network for desirable properties such as fairness and the absence of backdoor. The goal is to construct a neural network that satisfies the property by (minimally) adjusting the given neural network’s parameters (i.e., weights). Specifically, we propose CARE (CAusality-based REpair), a causality-based neural network repair technique that 1) performs causality-based fault localization to identify the ‘guilty’ neurons and 2) optimizes the parameters of the identified neurons to reduce the misbehavior. We have empirically evaluated CARE on various tasks such as backdoor removal, neural network repair for fairness and safety properties. Our experiment results show that CARE is able to repair all neural networks efficiently and effectively. For fairness repair tasks, CARE successfully improves fairness by 61.91% on average. For backdoor removal tasks, CARE reduces the attack success rate from over 98% to less than 1%. For safety property repair tasks, CARE reduces the property violation rate to less than 1%. Results also show that thanks to the causality-based fault localization, CARE’s repair focuses on the misbehavior and preserves the accuracy of the neural networks.
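
The fault-localization step can be caricatured with a one-parameter-layer "network": intervene on each parameter (zero it out) and score how much the misbehavior drops. This toy is our construction, much simpler than CARE's causal attribution, but it conveys the intervention idea:

```python
def localize_guilty_weight(weights, inputs, misbehaves):
    """Return the index of the weight whose ablation most reduces misbehavior.

    weights: parameters of a toy linear 'network' y = sum(w_i * x_i).
    inputs: list of input vectors to evaluate on.
    misbehaves: predicate on the network output (True = undesired behavior).
    """
    def net(ws, x):
        return sum(w * xi for w, xi in zip(ws, x))

    baseline = sum(misbehaves(net(weights, x)) for x in inputs)
    best, best_drop = None, 0
    for i in range(len(weights)):
        ablated = list(weights)
        ablated[i] = 0.0  # intervention: do(w_i = 0)
        drop = baseline - sum(misbehaves(net(ablated, x)) for x in inputs)
        if drop > best_drop:
            best, best_drop = i, drop
    return best
```

After localization, a repair step would re-optimize only the flagged parameters, which is why accuracy on benign inputs is largely preserved.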

DOI: 10.1145/3510003.3510080


Change is the only constant: dynamic updates for workflows

作者: Sokolowski, Daniel and Weisenburger, Pascal and Salvaneschi, Guido
关键词: dynamic software updating, software evolution, workflows

Abstract

Software systems must be updated regularly to address changing requirements and urgent issues like security-related bugs. Traditionally, updates are performed by shutting down the system to replace certain components. In modern software organizations, updates are increasingly frequent—up to multiple times per day—hence, shutting down the entire system is unacceptable. Safe dynamic software updating (DSU) enables component updates while the system is running by determining when the update can occur without causing errors. Safe DSU is crucial, especially for long-running or frequently executed asynchronous transactions (workflows), e.g., user-interactive sessions or order fulfillment processes. Unfortunately, previous research is limited to synchronous transaction models and does not address this case. In this work, we propose a unified model for safe DSU in workflows. We discuss how state-of-the-art DSU solutions fit into this model and show that they incur significant overhead. To improve the performance, we introduce Essential Safety, a novel safe DSU approach that leverages the notion of non-essential changes, i.e., semantics-preserving updates. In 106 realistic BPMN workflows, Essential Safety reduces the delay of workflow completions, on average, by 47.8% compared to the state of the art. We show that the distinction of essential and non-essential changes plays a crucial role in this reduction and that, as suggested in the literature, non-essential changes are frequent: at least 60% and often more than 90% of systems’ updates in eight monorepos we analyze.

DOI: 10.1145/3510003.3510065


Characterizing and detecting bugs in WeChat mini-programs

作者: Wang, Tao and Xu, Qingxin and Chang, Xiaoning and Dou, Wensheng and Zhu, Jiaxin and Xie, Jinhui and Deng, Yuetang and Yang, Jianbo and Yang, Jiaheng and Wei, Jun and Huang, Tao
关键词: WeChat mini-programs, bug detection, empirical study

Abstract

Built on the WeChat social platform, WeChat Mini-Programs are widely used by more than 400 million users every day. Consequently, the reliability of Mini-Programs is particularly crucial. However, WeChat Mini-Programs suffer from various bugs related to execution environment, lifecycle management, asynchronous mechanism, etc. These bugs have seriously affected user experience and caused severe impacts. In this paper, we conduct the first empirical study on 83 WeChat Mini-Program bugs, and perform an in-depth analysis of their root causes, impacts and fixes. From this study, we obtain many interesting findings that can open up new research directions for combating WeChat Mini-Program bugs. Based on the bug patterns found in our study, we further develop WeDetector to detect WeChat Mini-Program bugs. Our evaluation on 25 real-world Mini-Programs has found 11 previously unknown bugs, and 7 of them have been confirmed by developers.

DOI: 10.1145/3510003.3510114


ACID: an API compatibility issue detector for Android apps

作者: Mahmud, Tarek and Che, Meiru and Yang, Guowei
关键词: Android, API invocation compatibility issues, API evolution, API callback compatibility issues

Abstract

Android API is frequently updated, and compatibility issues may be induced when the API level supported by the device differs from the API level targeted by app developers. This paper presents ACID, an API compatibility issue detector for Android apps. ACID utilizes API differences and static analysis of Android apps to detect both API invocation compatibility issues and API callback compatibility issues. Our evaluation on 20 benchmark apps from previous studies shows that ACID is more accurate and faster in detecting compatibility issues than state-of-the-art techniques. We also ran ACID on 35 more real-world apps to demonstrate ACID’s practical applicability. ACID is available at https://github.com/TSUMahmud/acid and the demonstration video of ACID is available at https://youtu.be/XUNBPMIx2q4.

DOI: 10.1145/3510454.3516854


A dynamic analysis tool for memory safety based on smart status and source-level instrumentation

作者: Chen, Zhe and Wu, Jun and Zhang, Qi and Xue, Jingling
关键词: testing, online monitoring, memory errors, error detection, dynamic analysis, code instrumentation

Abstract

Memory errors may lead to program crashes and security vulnerabilities. In this paper, we present Movec, a dynamic analysis tool that can automatically detect memory errors at runtime. To address the three major challenges faced by existing tools in detecting memory errors, namely low effectiveness, optimization sensitivity, and platform dependence, Movec leverages a smart-status-based monitoring algorithm and performs its instrumentation at the source level. Our extensive evaluation shows that Movec is capable of finding a wide range of memory errors with moderate and competitive overheads.
Demo video: https://youtu.be/V8H2MroNxSM (also available at https://www.bilibili.com/video/BV1H34y117tA)
Movec website: https://drzchen.github.io/projects/movec
Movec download: https://github.com/drzchen/movec

DOI: 10.1145/3510454.3516872


ARSearch: searching for API related resources from Stack Overflow and GitHub

作者: Luong, Kien and Thung, Ferdian and Lo, David
关键词: No keywords

Abstract

Stack Overflow and GitHub are two popular platforms containing API-related resources for developers to learn how to use APIs. The platforms are good sources of information about APIs, such as code examples, usages, sentiment, bug reports, etc. However, it is difficult to collect the correct resources for a particular API due to the ambiguity of API method names. An API method name mentioned in text refers to only one API, but the method name could match several different APIs. To help developers find the correct resources for a particular API, we introduce ARSearch. ARSearch finds Stack Overflow threads that mention the particular API and their relevant code examples from GitHub. We demonstrate our tool in a video available at https://youtu.be/Rr-zTfUD_z0.

DOI: 10.1145/3510454.3517048


Asymob: a platform for measuring and clustering chatbots

作者: Ló
关键词: quality assurance, metrics, chatbot design

Abstract

Chatbots have become a popular way to access all sorts of services via natural language. Many platforms and tools have been proposed for their construction, like Google’s Dialogflow, Amazon’s Lex or Rasa. However, most of them still lack integrated quality assurance methods like metrics. Moreover, there is currently a lack of mechanisms to compare and classify chatbots possibly developed with heterogeneous technologies. To tackle these issues, we present Asymob, a web platform that enables the measurement of chatbots using a suite of 20 metrics. The tool features a repository supporting chatbots built with different technologies, like Dialogflow and Rasa. Asymob’s metrics help in detecting quality issues and serve to compare chatbots across and within technologies. The tool also helps in classifying chatbots along conversation topics or design features by means of two clustering methods: based on the chatbot metrics or on the phrases expected and produced by the chatbot. A video showcasing the tool is available at https://www.youtube.com/watch?v=8lpETkILpv8.

DOI: 10.1145/3510454.3516843


A tool for rejuvenating feature logging levels via git histories and degree of interest

作者: Tang, Yiming and Spektor, Allan and Khatchadourian, Raffi and Bagherzadeh, Mehdi
关键词: source code analysis and transformation, software repository mining, software evolution, logging, degree of interest

Abstract

Logging is a significant programming practice. Due to the highly transactional nature of modern software applications, massive amounts of logs are generated every day, which may overwhelm developers. Logging information overload can be dangerous to software applications. Using log levels, developers can print useful information while hiding verbose logs during software runtime. As software evolves, the log levels of logging statements associated with the surrounding software feature implementation may also need to be altered. Maintaining log levels necessitates a significant amount of manual effort. In this paper, we demonstrate an automated approach that can rejuvenate feature log levels by matching the interest level of developers in the surrounding features. The approach is implemented as an open-source Eclipse plugin, using two external plug-ins (JGit and Mylyn). It was tested on 18 open-source Java projects consisting of ~3 million lines of code and ~4K log statements. Our tool successfully analyzes 99.22% of logging statements, increases log level distributions by ~20%, and increases the focus of logs in bug fix contexts ~83% of the time. For further details, interested readers can watch our demonstration video (https://www.youtube.com/watch?v=qIULoAXoDv4).
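
The rejuvenation policy can be imagined as a mapping from a Mylyn-style degree-of-interest (DOI) score to a log-level adjustment: raise levels in code developers are actively interested in, lower them in cold code. The thresholds and level ladder below are hypothetical, purely for illustration:

```python
# Ascending severity ladder, modeled after java.util.logging levels.
LEVELS = ["finest", "finer", "fine", "config", "info", "warning", "severe"]

def rejuvenate(current_level, doi, thresholds=(0.25, 0.75)):
    """Adjust one log statement's level from a normalized DOI score in [0, 1].

    doi >= upper threshold: hot feature, raise the level one step;
    doi <= lower threshold: cold feature, lower it one step;
    otherwise keep it unchanged. Thresholds are illustrative assumptions.
    """
    i = LEVELS.index(current_level)
    if doi >= thresholds[1]:
        return LEVELS[min(i + 1, len(LEVELS) - 1)]
    if doi <= thresholds[0]:
        return LEVELS[max(i - 1, 0)]
    return current_level
```

The real tool derives the interest signal from git histories and Mylyn interaction data rather than a single scalar.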

DOI: 10.1145/3510454.3516838


CIDER: concept-based interactive design recovery

作者: Fang, Hongzhou and Cai, Yuanfang and Kazman, Rick and Lefever, Jason
关键词: No keywords

Abstract

In this paper, we introduce CIDER, a Concept-based Interactive DEsign Recovery tool that recovers a software design in the form of hierarchically organized concepts. In addition to facilitating design comprehension, it also enables designers to assess design quality and identify design problems. It integrates multiple clustering algorithms to reduce the complexity of the recovered design structure, leverages information retrieval techniques to name each cluster using the most relevant topic terms to ease design comprehension, and identifies and labels highly-coupled file clusters to reveal possible design problems. It enables interactive selection of concepts of interest and recovers partial design structures accordingly. The user can also interactively change the levels of recovered hierarchical structure to visualize the design at different granularities.

DOI: 10.1145/3510454.3516861


Code implementation recommendation for Android GUI components

作者: Zhao, Yanjie and Li, Li and Sun, Xiaoyu and Liu, Pei and Grundy, John
关键词: icon implementation, collaborative filtering, app development, Android, API recommendation

Abstract

We present a prototype tool Icon2Code, targeted to helping app developers more quickly implement the callback functions of complex Android GUI components by recommending code implementations learnt from similar GUI components in other apps. Given an icon or UI widget provided by designers, Icon2Code first queries a large pre-established database to locate similar icons that other apps have utilized. It then leverages a collaborative filtering model to suggest the most relevant APIs and their usage examples associated with the intended behaviours of these icons. Experimental results on 5,000 randomly selected real-world apps show that Icon2Code is useful and effective in recommending code examples for implementing the behaviours of complex GUI components. It achieves a success rate of over 50% when only the top recommended API is taken into account, and over 94% when 20 APIs are considered. The video demo can be found at https://youtu.be/pM3ZBGrQTdQ.
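
The collaborative-filtering step can be sketched as neighbor-weighted voting: each API used by a visually similar icon is scored by summing those icons' similarity to the target. The function names and data below are illustrative, not Icon2Code's actual implementation:

```python
def recommend_apis(target_icon_sims, icon_apis, k=3):
    """Rank candidate APIs for a target icon.

    target_icon_sims: dict mapping neighbor icon id -> similarity to the target.
    icon_apis: dict mapping icon id -> list of APIs its callback invokes.
    Returns the top-k APIs by similarity-weighted vote.
    """
    scores = {}
    for icon, sim in target_icon_sims.items():
        for api in icon_apis.get(icon, []):
            scores[api] = scores.get(api, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [api for api, _ in ranked[:k]]
```

With more neighbors, APIs shared by many similar icons dominate the ranking, which matches the reported jump in success rate from top-1 to top-20 recommendations.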

DOI: 10.1145/3510454.3516849


Common data guided crash injection for cloud systems

作者: Gao, Yu and Wang, Dong and Dai, Qianwang and Dou, Wensheng and Wei, Jun
关键词: fault injection, crash recovery, cloud system, bug detection

Abstract

Modern distributed systems are designed to tolerate node crashes. However, incorrect crash recovery mechanisms and implementations can still introduce crash recovery bugs, and hurt reliability and availability of cloud systems. In this paper, we present Deminer, a novel crash injection technique that automatically injects node crashes/reboots to effectively expose crash recovery bugs in cloud systems. We observe that, node crashes that interrupt the execution of related operations, which store common data to different places (i.e., different storage paths or nodes), are more likely to trigger crash recovery bugs. Based on this observation, Deminer first tracks the critical data usage in a correct run. Then Deminer identifies related operations and predicts error-prone crash points. Finally, Deminer tests the predicted crash points and checks whether the target system can behave correctly. We have evaluated Deminer on three widely-used cloud systems: ZooKeeper, HBase and HDFS. Deminer has detected 6 crash recovery bugs. A video demonstration of Deminer is available at https://youtu.be/jS6KBcYnTSM.

DOI: 10.1145/3510454.3516852


COSPEX: a program comprehension tool for novice programmers

作者: Gupta, Nakshatra and Rajput, Ashutosh and Chimalakonda, Sridhar
关键词: software maintenance, program comprehension, dynamic summarization, code summarization

Abstract

Developers often encounter unfamiliar code during software maintenance, which consumes a significant amount of time to comprehend, especially for novice programmers. Researchers have come up with automated techniques that provide effective code comprehension and summaries by analyzing source code and presenting key information to developers. Existing debuggers represent the execution states of the program, but they do not show the complete execution at a single point. Studies have revealed that the effort required for program comprehension can be reduced if novice programmers are provided with worked examples. Hence, we propose COSPEX (Comprehension using Summarization via Program Execution), an Atom plugin that dynamically extracts key information for every line of code executed and presents it to developers in the form of an interactive, example-like dynamic information instance. As a preliminary evaluation, we conducted a user survey in which we presented a code comprehension task to 14 undergraduates with up to 1 year of Python programming experience. We observed that COSPEX helped novice programmers in program comprehension and improved their understanding of the code execution. The source code and tool are available at: https://github.com/rishalab/COSPEX, and the demo on Youtube is available at: https://youtu.be/QQY-8KuDaEM.

DOI: 10.1145/3510454.3516842


DiffWatch: watch out for the evolving differential testing in deep learning libraries

作者: Prochnow, Alexander and Yang, Jinqiu
关键词: testing deep learning libraries, software testing, software quality assurance, differential unit testing

Abstract

Testing deep learning libraries (DLLs) is critically important for ensuring the quality and safety of many deep learning applications. As differential testing is commonly used to help the creation of test oracles, its maintenance poses new challenges. In this tool demo paper, we present DiffWatch, a fully automated tool for Python, which identifies differential test practices in DLLs and continuously monitors new changes of external libraries that may trigger updates of the identified differential tests. Our evaluation on four DLLs demonstrates that DiffWatch can detect differential testing with high accuracy. In addition, we demonstrate usage examples to show DiffWatch’s capability of monitoring the development of external libraries and alerting the maintainers of DLLs about new changes that may trigger updates of differential test practices. In short, DiffWatch can help developers adequately react to the code evolution of external libraries. DiffWatch is publicly available and a demo video can be found at https://www.youtube.com/watch?v=gR7m5QQuSqE.

DOI: 10.1145/3510454.3516835


DistFax: a toolkit for measuring interprocess communications and quality of distributed systems

作者: Fu, Xiaoqin and Lin, Boxiang and Cai, Haipeng
关键词: IPC, distributed system, quality, software measurement

Abstract

In this paper, we present DistFax, a toolkit for measuring common distributed systems, focusing on their interprocess communications (IPCs), a vital aspect of distributed system run-time behaviors. DistFax measures the coupling and cohesion of distributed systems via respective IPC metrics. It also characterizes the run-time quality of distributed systems via a set of dynamic quality metrics. DistFax then computes statistical correlations between the IPC metrics and quality metrics. It further exploits the correlations to classify the system quality status with respect to various quality metrics in a standard quality model. We empirically demonstrated the practicality and usefulness of DistFax in measuring the IPCs and quality of 11 real-world distributed systems against diverse execution scenarios. The demo video of DistFax can be viewed at https://youtu.be/VLmNiHvOuWQ online, and the artifact package is publicly available at https://tinyurl.com/zaz27ec8.

DOI: 10.1145/3510454.3516859


DScribe: co-generating unit tests and documentation

作者: Hernandez, Alexa and Nassif, Mathieu and Robillard, Martin P.
关键词: documentation generation, maintainability, test generation

Abstract

Test suites and documentation capture similar information despite serving distinct purposes. Such redundancy introduces the risk that the artifacts inconsistently capture specifications. We present DScribe, an approach that leverages the redundant information in tests and documentation to reduce the cost of creating them and the threat of inconsistencies. DScribe allows developers to define simple templates that jointly capture the structure to test and document a specification. They can then use these templates to generate consistent and checkable tests and documentation. By linking documentation to unit tests, DScribe ensures documentation accuracy as outdated documentation is flagged by failing tests. DScribe’s template-based approach also enforces a uniform style throughout the artifacts. Hence, in addition to reducing developer effort, DScribe improves artifact quality by ensuring consistent content and style. Video: https://www.youtube.com/watch?v=CUKp3MjMog
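The co-generation idea can be illustrated with a toy template (hypothetical syntax, not DScribe’s actual template language): one instantiation produces both a unit test and its matching documentation sentence, so the two artifacts cannot drift apart.

```python
# one template instantiation yields both a unit test and a matching doc
# sentence, so the two artifacts stay consistent by construction
TEST_TMPL = (
    "def test_{method}_rejects_bad_input():\n"
    "    with pytest.raises({exc}):\n"
    "        {target}.{method}({bad})\n"
)
DOC_TMPL = "Throws {exc} if called with {bad}."

def instantiate(**binding):
    # the same binding fills both templates
    return TEST_TMPL.format(**binding), DOC_TMPL.format(**binding)

test_code, doc = instantiate(method="withdraw", exc="ValueError",
                             target="account", bad="-1")
print(doc)  # → Throws ValueError if called with -1.
```

If the specification changes, regenerating from the template updates test and documentation together, and an outdated regeneration shows up as a failing test.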

DOI: 10.1145/3510454.3516856


Dynaplex: inferring asymptotic runtime complexity of recursive programs

作者: Ishimwe, Didier and Nguyen, ThanhVu and Nguyen, KimHao
关键词: complexity analysis, dynamic invariant generation, recurrence relations

Abstract

Automated runtime complexity analysis can help developers detect egregious performance issues. Existing runtime complexity analyses are typically performed on imperative programs using static techniques. In this demo paper, we demonstrate the implementation and usage of Dynaplex, a dynamic analysis tool that computes the asymptotic runtime complexity of recursive programs. Dynaplex infers recurrence relations from execution traces and solves them to obtain a closed-form complexity bound. Experimental results show that Dynaplex can infer a wide range of complexity bounds (e.g., logarithmic, polynomial, exponential, non-polynomial) with great precision (e.g., O(n^log₂3) for Karatsuba’s algorithm). A video demonstration of Dynaplex is available at https://youtu.be/t7dhwZ7fbVs.
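The idea of recovering an asymptotic bound from dynamic data can be sketched as follows (a simplified stand-in, not Dynaplex’s actual recurrence inference): count the calls made by a Karatsuba-style recursion, T(n) = 3·T(n/2) + O(1), at several input sizes, and fit the exponent by log-log regression.

```python
import math

# toy Karatsuba-style recursion: 3 subproblems of half size, so the
# call count grows as n**log2(3) ≈ n**1.585
def calls(n):
    if n <= 1:
        return 1
    return 1 + 3 * calls(n // 2)

# fit the exponent b in calls(n) ≈ c * n**b by log-log least squares
sizes = [2 ** k for k in range(4, 12)]
xs = [math.log(n) for n in sizes]
ys = [math.log(calls(n)) for n in sizes]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
print(b)  # close to log2(3) ≈ 1.585
```

Dynaplex works on recorded execution traces and solves the inferred recurrence symbolically; this sketch only shows why trace data is enough to recover a non-polynomial exponent.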

DOI: 10.1145/3510454.3516853


ESBMC-solidity: an SMT-based model checker for solidity smart contracts

作者: Song, Kunjian and Matulevicius, Nedas and de Lima Filho, Eddie B. and Cordeiro, Lucas C.
关键词: formal verification, solidity

Abstract

Smart contracts written in Solidity are programs used in blockchain networks, such as Ethereum, for performing transactions. However, as with any piece of software, they are prone to errors and may present vulnerabilities, which malicious attackers could exploit. This paper proposes a Solidity frontend for the efficient SMT-based context-bounded model checker (ESBMC), named ESBMC-Solidity, which provides a way of verifying such contracts with its framework. A benchmark suite of vulnerable smart contracts was also developed for evaluation and comparison with other verification tools. The experiments performed here showed that ESBMC-Solidity detected all vulnerabilities, was the fastest tool, and provided a counterexample for each benchmark. A demonstration is available at https://youtu.be/3UH8_1QAVN0.

DOI: 10.1145/3510454.3516855


Fairkit-learn: a fairness evaluation and comparison toolkit

作者: Johnson, Brittany and Brun, Yuriy
关键词: bias-free software design, software fairness, visualization

Abstract

Advances in how we build and use software, specifically the integration of machine learning for decision making, have led to widespread concern around model and software fairness. We present fairkit-learn, an interactive Python toolkit designed to support data scientists’ ability to reason about and understand model fairness. We outline how fairkit-learn can support model training, evaluation, and comparison and describe the potential benefit that comes with using fairkit-learn in comparison to the state-of-the-art. Fairkit-learn is open source at https://go.gmu.edu/fairkit-learn/.

DOI: 10.1145/3510454.3516830


FuzzTastic: a fine-grained, fuzzer-agnostic coverage analyzer

作者: Lipp, Stephan and Elsner, Daniel and Hutzelmann, Thomas and Banescu, Sebastian and Pretschner, Alexander and Böhme, Marcel
关键词: benchmarking, fuzzing, software security

Abstract

Performing sound and fair fuzzer evaluations can be challenging, not only because of the randomness involved in fuzzing, but also due to the large number of fuzz tests generated. Existing evaluations use code coverage as a proxy measure for fuzzing effectiveness. Yet, instead of considering the coverage of all generated fuzz inputs, they only consider the inputs stored in the fuzzer queue. However, as we show in this paper, this approach can lead to biased assessments due to path collisions. Therefore, we developed FuzzTastic, a fuzzer-agnostic coverage analyzer that allows practitioners and researchers to perform uniform fuzzer evaluations that are not affected by such collisions. In addition, its time-stamped coverage-probing approach enables frequency-based coverage analysis to identify barely tested source code and to visualize fuzzing progress over time and across code. To foster further studies in this field, we make FuzzTastic, together with a benchmark dataset worth ~12 CPU-years of fuzzing, publicly available; the demo video can be found at https://youtu.be/Lm-eBx0aePA.
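The path collisions the paper refers to can be reproduced in a few lines (a toy model of AFL-style edge coverage, not FuzzTastic’s code): when control-flow edges are hashed into a small coverage map, two different execution paths can produce identical bitmaps and become indistinguishable.

```python
# toy AFL-style edge coverage: control-flow edges are hashed into a small
# bitmap, so two distinct paths can yield identical coverage (a collision)
MAP_SIZE = 16  # deliberately tiny to provoke collisions

def edge_index(prev_block, cur_block):
    return (prev_block ^ cur_block) % MAP_SIZE

def coverage(path):
    bitmap = [0] * MAP_SIZE
    for prev, cur in zip(path, path[1:]):
        bitmap[edge_index(prev, cur)] += 1
    return bitmap

path_a = [1, 3, 5]       # one execution path (basic-block ids)
path_b = [17, 19, 21]    # a different path with the same XOR pattern
collide = coverage(path_a) == coverage(path_b)
print(collide)  # → True: the two paths are indistinguishable in the map
```

A fuzzer queue deduplicated on such bitmaps would keep only one of the two inputs, which is why evaluations restricted to queue contents can be biased.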

DOI: 10.1145/3510454.3516847


Gallery D.C.: auto-created GUI component gallery for design search and knowledge discovery

作者: Feng, Sidong and Chen, Chunyang and Xing, Zhenchang
关键词: GUI design, multi-faceted design search, object detection

Abstract

GUI design is an integral part of software development. The process of designing a mobile application typically starts with ideation and inspiration search over existing designs. However, existing information-retrieval-based and database-query-based methods cannot efficiently support inspiration search with respect to three requirements: design practicality, design granularity, and design knowledge discovery. In this paper, we propose a web application, called Gallery D.C., that aims to facilitate the process of user interface design through real-world GUI component search. Gallery D.C. indexes GUI component designs using reverse engineering and deep-learning-based computer vision techniques on millions of real-world applications. To support advanced design search and knowledge discovery, our approach extracts information about size, color, component type, and text to help designers explore a multi-faceted design space and distill higher-order design knowledge. Gallery D.C. is well received via an informal evaluation with 7 professional designers. Web Link: http://mui-collection.herokuapp.com/. Demo Video Link: https://youtu.be/zVmsz_wY5OQ.

DOI: 10.1145/3510454.3516873


Gamekins: gamifying software testing in jenkins

作者: Straubinger, Philipp and Fraser, Gordon
关键词: continuous integration, gamification, motivation, software testing

Abstract

Developers have to write thorough tests for their software in order to find bugs and to prevent regressions. Writing tests, however, is not every developer’s favourite occupation, and if a lack of motivation leads to a lack of tests, this may have dire consequences, such as programs of poor quality or even project failures. This paper introduces Gamekins, a tool that uses gamification to motivate developers to write more and better tests. Gamekins is integrated into the Jenkins continuous integration platform, where game elements are based on commits to the source code repository: developers can earn points for completing test challenges and quests posed by Gamekins, compete with other developers or developer teams on a leaderboard, and are rewarded for their test-related achievements. A demo video of Gamekins is available at https://youtu.be/qnRWEQim12E; the tool, documentation, and source code are available at https://gamekins.org.

DOI: 10.1145/3510454.3516862


gDefects4DL: a dataset of general real-world deep learning program defects

作者: Liang, Yunkai and Lin, Yun and Song, Xuezhi and Sun, Jun and Feng, Zhiyong and Dong, Jin Song
关键词: bugs, datasets, deep learning, defects, neural networks

Abstract

The development of deep learning programs, as a new programming paradigm, is observed to suffer from various defects. Emerging research works have been proposed to detect, debug, and repair deep learning bugs, which drives the need to construct bug benchmarks. In this work, we present gDefects4DL, a dataset of general bugs in deep learning programs. Compared to existing datasets, gDefects4DL collects bugs whose root causes and fix solutions generalize well to other projects. These general bugs include (1) violations of deep learning API usage patterns (e.g., the standard way to implement the cross-entropy term y · log(y) so that it avoids NaN errors as y → 0), (2) shape mismatches in tensor calculations, (3) numeric bugs, (4) type mismatches (e.g., confusing similar types among numpy, pytorch, and tensorflow), (5) violations of model architecture design conventions, and (6) performance bugs. For each bug in gDefects4DL, we describe why it is general and group the bugs with similar root causes and fix solutions for reference. Moreover, gDefects4DL also maintains (1) buggy/fixed versions and the isolated fix change, (2) an isolated environment to replicate the defect, and (3) the whole code evolution history from the buggy version to the fixed version. We design gDefects4DL with extensible interfaces to evaluate software engineering methodologies and tools, and have integrated tools such as ShapeFlow, DEBAR, and GRIST. gDefects4DL contains 64 bugs falling into 6 categories (i.e., API Misuse, Shape Mismatch, Number Error, Type Mismatch, Violation of Architecture Convention, and Performance Bug). gDefects4DL is available at https://github.com/llmhyy/defects4dl, its online web demonstration is at http://47.93.14.147:9000/bugList, and the demo video is at https://youtu.be/0XtaEt4Fhm4.

DOI: 10.1145/3510454.3516826


GIFdroid: an automated light-weight tool for replaying visual bug reports

作者: Feng, Sidong and Chen, Chunyang
关键词: Android testing, bug replay, visual recording

Abstract

Bug reports are vital for software maintenance, allowing users to inform developers of problems encountered while using software. However, it is difficult for non-technical users to write clear descriptions of a bug’s occurrence. Therefore, more and more users record the screen to report bugs, as recordings are easy to create and contain the detailed procedure that triggers the bug. But it is still tedious and time-consuming for developers to reproduce the bug due to the length of the recording and the unclear actions within it. To overcome these issues, we propose GIFdroid, a lightweight approach to automatically replay the execution trace from visual bug reports. GIFdroid adopts image processing techniques to extract the keyframes from the recording, maps them to states in a GUI transition graph, and generates the execution trace of those states to trigger the bug. Our automated experiments and user study demonstrate the accuracy, efficiency, and usefulness of the approach. Github Link: https://github.com/sidongfeng/gifdroid. Video Link: https://youtu.be/5GIw1Hdr6CE. Appendix Link: https://sites.google.com/view/gifdroid.
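Keyframe extraction of the kind GIFdroid performs can be sketched with simple frame differencing (an illustrative simplification; GIFdroid’s actual image processing is more sophisticated): a frame is kept as a keyframe when it differs from its predecessor beyond a threshold, i.e., at GUI transitions.

```python
# keyframes are kept wherever consecutive frames differ beyond a threshold;
# frames are modeled as flat lists of pixel intensities in [0, 1]
def keyframes(frames, threshold=0.2):
    keep = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            keep.append(i)  # large change => likely a GUI transition
    return keep

# synthetic recording: static frames, a GUI transition at frame 3, static again
frames = [[0.0] * 4, [0.0] * 4, [0.01] * 4, [0.9] * 4, [0.9] * 4]
print(keyframes(frames))  # → [0, 3]
```

The surviving keyframes are what would then be matched against states in the GUI transition graph.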

DOI: 10.1145/3510454.3516857


HUDD: a tool to debug DNNs for safety analysis

作者: Fahmy, Hazem and Pastore, Fabrizio and Briand, Lionel
关键词: DNN debugging, functional safety analysis

Abstract

We present HUDD, a tool that supports safety analysis practices for systems enabled by Deep Neural Networks (DNNs) by automatically identifying the root causes of DNN errors and retraining the DNN. HUDD stands for Heatmap-based Unsupervised Debugging of DNNs; it automatically clusters error-inducing images whose results are due to common subsets of DNN neurons. The intent is for the generated clusters to group error-inducing images having common characteristics, that is, a common root cause. HUDD identifies root causes by applying a clustering algorithm to matrices (i.e., heatmaps) capturing the relevance of every DNN neuron to the DNN outcome. HUDD then retrains DNNs with images that are automatically selected based on their relatedness to the identified image clusters. Our empirical evaluation with DNNs from the automotive domain has shown that HUDD automatically identifies all the distinct root causes of DNN errors, thus supporting safety analysis. Our retraining approach has also proven more effective at improving DNN accuracy than existing approaches. A demo video of HUDD is available at https://youtu.be/drjVakP7jdU.
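The clustering step can be illustrated on synthetic heatmaps (a minimal sketch that assumes flattened relevance vectors and uses greedy threshold grouping; HUDD applies a proper clustering algorithm): inputs whose neuron-relevance patterns are close end up in the same cluster, pointing at a shared root cause.

```python
# group error-inducing inputs whose neuron-relevance heatmaps are similar,
# so each cluster suggests a common root cause
def cluster_heatmaps(heatmaps, tol=0.5):
    clusters = []
    for hm in heatmaps:
        for c in clusters:
            rep = c[0]  # first member acts as the cluster representative
            if max(abs(a - b) for a, b in zip(hm, rep)) <= tol:
                c.append(hm)
                break
        else:
            clusters.append([hm])  # no close cluster: start a new one
    return clusters

# two synthetic root causes: high relevance on neurons 0-1 vs neurons 2-3
heatmaps = [[0.9, 0.8, 0.1, 0.0],
            [1.0, 0.7, 0.0, 0.1],
            [0.1, 0.0, 0.9, 0.9],
            [0.0, 0.2, 0.8, 1.0]]
print(len(cluster_heatmaps(heatmaps)))  # → 2
```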

DOI: 10.1145/3510454.3516858


ICCBot: fragment-aware and context-sensitive ICC resolution for Android applications

作者: Yan, Jiwei and Zhang, Shixin and Liu, Yepang and Yan, Jun and Zhang, Jian
关键词: Android app, ICC resolution, component transition graph, fragment, inter-component communication

Abstract

For GUI programs, like Android apps, the program functionalities are encapsulated in a set of basic components, each of which represents an independent function module. When interacting with an app, users are actually operating a set of components. The transitions among components, which are supported by the Android Inter-Component Communication (ICC) mechanism, can reflect the skeleton of an app. To effectively resolve the source and destination of an ICC message, both the correct entry-point identification and the precise data value tracking of ICC fields are required. However, with the wide usage of Android fragment components, the entry-point analysis usually terminates at an inner fragment but not its host component. Also, the simply tracked ICC field values may become inaccurate when data is transferred among multiple methods. In this paper, we design a practical ICC resolution tool ICCBot, which resolves the component transitions that are connected by fragments to help the entry-point identification. Besides, it performs context-sensitive inter-procedural analysis to precisely obtain the ICC-carried data values. Compared with the state-of-the-art tools, ICCBot achieves both a higher success rate and accuracy. ICCBot is open-sourced at https://github.com/hanada31/ICCBot. A video demonstration of it is at https://www.youtube.com/watch?v=7zcoMBtGiLY.

DOI: 10.1145/3510454.3516864


IDE augmented with human-learning inspired natural language programming

作者: Young, Mitchell and Nan, Zifan and Shen, Xipeng
关键词: code editor, natural language programming, program synthesis

Abstract

Natural Language (NL) programming, the concept of synthesizing code from natural language inputs, has garnered growing interest among the software community in recent years. Unfortunately, current solutions in the space all suffer from the same problem: they require many labeled training examples due to their data-driven nature. To address this issue, this paper proposes an NLU-driven approach that forgoes the need for large numbers of labeled training examples. Inspired by how humans learn programming, this solution centers around Natural Language Understanding and draws on a novel graph-based mapping algorithm. The resulting NL programming framework, HISyn, uses no training examples, yet achieves synthesis accuracies comparable to data-driven methods trained on hundreds of samples. HISyn meanwhile demonstrates advantages in terms of interpretability, error diagnosis support, and cross-domain extensibility. To encourage adoption of HISyn among developers, the tool is made available as an extension for the Visual Studio Code IDE, allowing users to easily submit inputs to HISyn and insert the generated code expressions into their active programs. A demo of the HISyn extension can be found at https://youtu.be/KKOqJS24FNo.

DOI: 10.1145/3510454.3516832


IntelliTC: automating type changes in IntelliJ IDEA

作者: Smirnov, Oleg and Ketkar, Ameya and Bryksin, Timofey and Tsantalis, Nikolaos and Dig, Danny
关键词: No keywords

Abstract

Developers often change the types of program elements. Such refactoring often involves updating not only the type of the element itself, but also the API of all type-dependent references in the code, and is thus tedious and time-consuming. Despite type changes being more frequent than renamings, only a few current IDE tools provide partially automated support, and only for a small set of hard-coded types. Researchers have recently proposed a data-driven approach to inferring API rewrite rules for type change patterns in Java using code commit histories. In this paper, we build upon these recent advances and introduce IntelliTC, a tool to perform Java type change refactoring. We implemented it as a plugin for IntelliJ IDEA, a popular Java IDE developed by JetBrains. We present three different ways of providing support for such refactoring from the standpoint of user experience: Classic mode, Suggested Refactoring, and Inspection mode. To evaluate these modalities of using IntelliTC, we surveyed 22 experienced software developers, who positively rated the usefulness of the tool. The source code and distribution of the plugin are available on GitHub: https://github.com/JetBrains-Research/data-driven-type-migration. A demonstration video is available on YouTube: https://youtu.be/pdcfvADA1PY.

DOI: 10.1145/3510454.3516851


iPFlakies: a framework for detecting and fixing python order-dependent flaky tests

作者: Wang, Ruixin and Chen, Yang and Lam, Wing
关键词: Python, automated repair, flaky tests, order-dependent test

Abstract

Developers typically run tests after code changes. Flaky tests, which are tests that can nondeterministically pass and fail when run on the same version of code, can mislead developers about their recent changes. Much of the prior work on flaky tests focuses on Java projects. One prominent category of flaky tests in that work is order-dependent (OD) tests, which pass or fail depending on the order in which tests are run. For example, our prior work proposed using other tests in the test suite to fix (or correctly set up) the state needed by Java OD tests to pass. Unlike Java flaky tests, flaky tests in other programming languages have received less attention. To help with this problem, another piece of prior work recently studied flaky tests in Python projects and detected many OD tests. Unfortunately, that work did not identify the other tests in the test suites that can be used to fix the OD tests. To fill this gap, we propose iPFlakies, a framework for automatically detecting and fixing Python OD tests. Using iPFlakies, we extend the prior work’s dataset to include (1) tests that can be used to reproduce and fix the OD tests and (2) patches for the OD tests. Our work to extend the dataset finds that reproducing the passing and failing results of flaky tests can be difficult, and that iPFlakies is effective at detecting and fixing Python OD tests. To aid future research, we make our iPFlakies framework, dataset improvements, and experimental infrastructure publicly available.
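The essence of an OD test, namely the same tests yielding different outcomes in different orders, can be shown with a toy polluter/victim pair (illustrative only, not iPFlakies’ implementation):

```python
# toy order-dependent (OD) test: t_victim passes only if t_polluter
# has not already corrupted the shared state
state = {"ready": True}

def t_polluter():
    state["ready"] = False  # pollutes shared state, itself "passes"
    return True

def t_victim():
    return state["ready"]   # passes only on clean state

def run(order):
    state["ready"] = True   # fresh test session
    return [t() for t in order]

passing_order = run([t_victim, t_polluter])   # victim runs first: all pass
failing_order = run([t_polluter, t_victim])   # polluter runs first: victim fails
is_od = passing_order != failing_order
print(is_od)  # → True: outcome depends on test order
```

Detecting such pairs, and finding the "state-setter" tests that let the victim pass, is the gap iPFlakies fills for Python.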

DOI: 10.1145/3510454.3516846


JMocker: refactoring test-production inheritance by mockito

作者: Wang, Xiao and Xiao, Lu and Yu, Tingting and Woepse, Anne and Wong, Sunny
关键词: No keywords

Abstract

Mocking frameworks are dedicated to creating, manipulating, and verifying the execution of “faked” objects in unit testing. This helps developers overcome the challenge of high inter-dependencies among software units. Despite the various benefits offered by existing mocking frameworks, developers often instead create a subclass of the dependent class and mock its behavior through method overriding. However, this requires tedious implementation and compromises the design quality of unit tests. We contribute a refactoring tool as an Eclipse plugin, named JMocker, to automatically identify and replace the usage of inheritance with Mockito, a well-received mocking framework for Java projects. We evaluated JMocker on four open source projects and successfully refactored 214 cases in total. The evaluation results show that our framework is efficient, applicable to different projects, and preserves test behaviors. According to the feedback of six real-life developers, JMocker improves the design quality of test cases. JMocker is available at https://github.com/wx930910/JMocker. The tool demo can be found at https://youtu.be/HFoA2ZKCoxM.

DOI: 10.1145/3510454.3516836


M3triCity: visualizing evolving software &amp; data cities

作者: Ardigò, Susanna and Nagy, Csaba and Minelli, Roberto and Lanza, Michele
关键词: program comprehension, software and data visualization

Abstract

The city metaphor for visualizing software systems in 3D has been widely explored and has led to many diverse implementations and approaches. Common among all approaches is a focus on the software artifacts, while aspects pertaining to the data and information (stored both in databases and files) used by a system are seldom taken into account. We present M3triCity, an interactive web application whose goal is to visualize object-oriented software systems, their evolution, and the way they access data and information. We illustrate how it can be used for program comprehension and evolution analysis of data-intensive software systems. Demo video URL: https://youtu.be/uBMvZFIlWtk.

DOI: 10.1145/3510454.3516831


MASS: a tool for mutation analysis of space CPS

作者: Cornejo, Oscar and Pastore, Fabrizio and Briand, Lionel
关键词: CPS, European space agency, mutation analysis

Abstract

We present MASS, a mutation analysis tool for embedded software in cyber-physical systems (CPS). We target space CPS (e.g., satellites) and other CPS with similar characteristics (e.g., UAVs). Mutation analysis measures the quality of test suites in terms of the percentage of detected artificial faults. Many mutation analysis tools are available, but they are inapplicable to CPS because of scalability and accuracy challenges. To overcome such limitations, MASS implements a set of optimization techniques that enable the applicability of mutation analysis and address scalability and accuracy in the CPS context. MASS has been successfully evaluated in a large study involving embedded software systems provided by industry partners; the study includes an on-board software system managing a microsatellite currently in orbit, a set of libraries used in deployed cubesats, and a mathematical library provided by the European Space Agency. A demo video of MASS is available at https://www.youtube.com/watch?v=gC1x9cU0-tU.

DOI: 10.1145/3510454.3516840


META: multidimensional evaluation of testing ability

作者: Zhou, Tianqi and Liu, Jiawei and Wang, Yifan and Chen, Zhenyu
关键词: multidimensional evaluation system, software testing, testing ability

Abstract

As the market’s demand for software quality continues to increase, companies’ demand for excellent testing engineers grows as well. Online testing platforms cultivate students by offering software testing courses, in which students are encouraged to submit test code in exams during course evaluation. However, how to effectively assess the test code written by students remains an open problem. This paper implements META, a multidimensional evaluation system for software testing built on an online testing platform, designed to evaluate testing effectiveness systematically. META assesses students’ testing effectiveness along seven dimensions for three types of software testing: developer unit testing, web application testing, and mobile application testing, combining test code and test behaviours. To validate META, 14 exams were selected from MOOCTest for an experiment: ten exams on developer unit testing, three on mobile application testing, and one on web application testing, involving 718 participating students and 26,666 submitted records. The experimental results show that META can reveal significant variability across dimensions for students with similar overall scores. Video URL: https://www.youtube.com/watch?v=EiCSMtefPMU.

DOI: 10.1145/3510454.3516867


ML-quadrat &amp; DriotData: a model-driven engineering tool and a low-code platform for smart IoT services

作者: Moin, Armin and Mituca, Andrei and Challenger, Moharram and Badii, Atta and Günnemann, Stephan
关键词: domain-specific modeling, iot, low-code, machine learning, model-driven software engineering

Abstract

In this paper, we present ML-Quadrat, an open-source research prototype based on the Eclipse Modeling Framework (EMF) and the state of the art in the Model-Driven Software Engineering (MDSE) literature for smart Cyber-Physical Systems (CPS) and the Internet of Things (IoT). Its envisioned users are mostly software developers who might not have deep knowledge and skills in the heterogeneous IoT platforms and the diverse Artificial Intelligence (AI) technologies, specifically regarding Machine Learning (ML). ML-Quadrat is released under the terms of the Apache 2.0 license on GitHub. Additionally, we demonstrate an early tool prototype of DriotData, a web-based low-code platform targeting citizen data scientists and citizen/end-user software developers. DriotData adopts ML-Quadrat for industry by offering an extended version of it as a subscription-based service to companies, mainly Small- and Medium-Sized Enterprises (SMEs). The current preliminary version of DriotData has three web-based model editors: text-based, tree-/form-based, and diagram-based. The latter is designed for domain experts in the problem or use case domains (namely the IoT vertical domains) who might not have knowledge and skills in the field of IT. Finally, a short video demonstrating the tools is available on YouTube: https://youtu.be/VAuz25w0a5k.

DOI: 10.1145/3510454.3516841


NaturalCC: an open-source toolkit for code intelligence

作者: Wan, Yao and He, Yang and Bi, Zhangqian and Zhang, Jianguo and Sui, Yulei and Zhang, Hongyu and Hashimoto, Kazuma and Jin, Hai and Xu, Guandong and Xiong, Caiming and Yu, Philip S.
关键词: benchmark, code embedding, code intelligence, code representation, deep learning, open source, toolkit

Abstract

We present NaturalCC, an efficient and extensible open-source toolkit for machine-learning-based source code analysis (i.e., code intelligence). Using NaturalCC, researchers can conduct rapid prototyping, reproduce state-of-the-art models, and/or exercise their own algorithms. NaturalCC is built upon Fairseq and PyTorch, providing (1) a collection of code corpora with preprocessing scripts, (2) a modular and extensible framework that makes it easy to reproduce and implement a code intelligence model, and (3) a benchmark of state-of-the-art models. Furthermore, we demonstrate the usability of our toolkit over a variety of tasks (e.g., code summarization, code retrieval, and code completion) through a graphical user interface. The website of this project is http://xcodemind.github.io, where the source code and demonstration video can be found.

DOI: 10.1145/3510454.3516863


NaviDroid: a tool for guiding manual Android testing via hint moves

作者: Liu, Zhe and Chen, Chunyang and Wang, Junjie and Su, Yuhui and Wang, Qing
关键词: Android app, GUI testing, human testing, state transition graph

Abstract

Manual testing, as a complement to automated GUI testing, is the last line of defense for app quality, especially in spotting usability and accessibility issues. However, repeated actions and easily missed functionalities make manual testing time-consuming, labor-intensive, and inefficient. Inspired by the game Candy Crush, which highlights flashy candies as hint moves for players, we developed a tool named NaviDroid that navigates human testers via highlighted next operations for more effective and efficient testing. NaviDroid constructs an enriched state transition graph (STG) with trigger actions as the edges between the involved states. Based on the STG, NaviDroid uses dynamic programming to plan the exploration path and augments the run-time GUI with visualized hint moves so that testers can quickly explore untested states and avoid duplication. Automated experiments demonstrate NaviDroid’s high coverage and efficient path planning. A user study further confirms its usefulness: participants covered more states and activities and detected more bugs in less time compared with the control group. NaviDroid demo video: https://youtu.be/lShFyg_nTA0.
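Planning a path to the nearest untested state can be sketched as a shortest-path search over the STG (a BFS stand-in for NaviDroid’s dynamic-programming planner; the state and action names here are invented):

```python
from collections import deque

# shortest action path from the current GUI state to the nearest
# untested state in a state transition graph (STG)
def plan(stg, current, tested):
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if state not in tested:
            return path  # actions leading to an unexplored state
        for action, nxt in stg.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # everything reachable is already tested

stg = {
    "Home":     [("tap_menu", "Menu"), ("tap_search", "Search")],
    "Menu":     [("tap_settings", "Settings")],
    "Search":   [],
    "Settings": [],
}
print(plan(stg, "Home", tested={"Home", "Menu", "Search"}))
```

The returned action sequence is what would be rendered as highlighted hint moves on the running GUI.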

DOI: 10.1145/3510454.3516848


Proactive libraries: enforcing correct behaviors in Android apps

作者: Riganelli, Oliviero and Fagadau, Ionut Daniel and Micucci, Daniela and Mariani, Leonardo
关键词: API misuse, Android, proactive library, runtime enforcement, self-healing

Abstract

The Android framework provides a rich set of APIs that developers can exploit to build their apps. However, the rapid evolution of these APIs, jointly with the specific characteristics of the lifecycle of Android components, challenges developers, who may release apps that use APIs incorrectly. In this demo, we present Proactive Libraries, a tool that decorates regular libraries with the capability of proactively detecting and healing API misuses at runtime. Proactive Libraries blend libraries with multiple proactive modules that collect data, check the compliance of API usages with correctness policies, and heal executions as soon as a possible policy violation is detected. The results of our evaluation with 27 possible API misuses show the effectiveness of Proactive Libraries in correcting API misuses with negligible runtime overhead. Video: https://youtu.be/rkfZ38mPgV0. Repo: https://gitlab.com/learnERC/proactivelibrary.

DOI: 10.1145/3510454.3516837


PyKokkos: performance portable kernels in Python

作者: Awar, Nader Al and Mehta, Neil and Zhu, Steven and Biros, George and Gligoric, Milos
关键词: PyKokkos, Python, high performance computing, kokkos

Abstract

As modern supercomputers have increasingly heterogeneous hardware, the need for writing parallel code that is both portable and performant across different hardware architectures increases. Kokkos is a C++ library that provides abstractions for writing performance portable code. Using Kokkos, programmers can write their code once and run it efficiently on a variety of architectures. However, the target audience of Kokkos, typically scientists, prefers dynamically typed languages such as Python instead of C++. We demonstrate a framework, dubbed PyKokkos, that enables performance portable code through Python. PyKokkos transparently translates code written in a subset of Python to C++ and Kokkos, and then connects the generated code to Python by automatically generating language bindings. PyKokkos achieves performance comparable to Kokkos in ExaMiniMD, a ~3k lines of code molecular dynamics mini-application. The demo video for PyKokkos can be found at https://youtu.be/1oFvhlhoDaY.

DOI: 10.1145/3510454.3516827


Pynguin: automated unit test generation for Python

作者: Lukasczyk, Stephan and Fraser, Gordon
关键词: Python, automated test generation

Abstract

Automated unit test generation is a well-known methodology aiming to reduce the developers’ effort of writing tests manually. Prior research focused mainly on statically typed programming languages like Java. In practice, however, dynamically typed languages have seen a huge gain in popularity over the last decade. This introduces the need for tools and research on test generation for these languages, too. We introduce Pynguin, an extendable test-generation framework for Python, which generates regression tests with high code coverage. Pynguin is designed to be easily usable by practitioners; it is also extensible to allow researchers to adapt it for their needs and to enable future research. We provide a demo of Pynguin at https://youtu.be/UiGrG25Vts0; further information, documentation, the tool, and its source code are available at https://www.pynguin.eu.

DOI: 10.1145/3510454.3516829


QuSBT: search-based testing of quantum programs

作者: Wang, Xinyi and Arcaini, Paolo and Yue, Tao and Ali, Shaukat
关键词: genetic algorithms, quantum programs, search-based testing

Abstract

Generating a test suite for a quantum program such that it has the maximum number of failing tests is an optimization problem. For such optimization, search-based testing has shown promising results in the context of classical programs. To this end, we present a test generation tool for quantum programs based on a genetic algorithm, called QuSBT (Search-based Testing of Quantum Programs). QuSBT automates the testing of quantum programs, with the aim of finding a test suite having the maximum number of failing test cases. QuSBT utilizes IBM’s Qiskit as the simulation framework for quantum programs. We present the tool architecture in addition to the implemented methodology (i.e., the encoding of the search individual, the definition of the fitness function expressing the search problem, and the test assessment w.r.t. two types of failures). Finally, we report results of the experiments in which we tested a set of faulty quantum programs with QuSBT to assess its effectiveness. Repository (code and experimental results): https://github.com/Simula-COMPLEX/qusbt-tool. Video: https://youtu.be/3apRCtluAn4
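
As a rough illustration of the search formulation only (not QuSBT's implementation, which encodes and assesses real Qiskit programs), a toy genetic algorithm can evolve test suites toward a maximum number of failing tests:

```python
import random

def failing_count(suite, is_failing):
    """Fitness: number of failing tests in the suite (to be maximized)."""
    return sum(1 for t in suite if is_failing(t))

def evolve_suite(is_failing, suite_size=8, pop=20, gens=30, seed=1):
    """Toy GA: individuals are test suites (lists of int inputs);
    elitist selection keeps the best half, mutation perturbs one test."""
    rng = random.Random(seed)
    new = lambda: [rng.randint(0, 99) for _ in range(suite_size)]
    population = [new() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda s: failing_count(s, is_failing), reverse=True)
        survivors = population[: pop // 2]
        children = []
        for s in survivors:
            child = s[:]
            child[rng.randrange(suite_size)] = rng.randint(0, 99)  # mutation
            children.append(child)
        population = survivors + children
    return max(population, key=lambda s: failing_count(s, is_failing))

# Toy "program under test": fails on inputs divisible by 7.
best = evolve_suite(lambda t: t % 7 == 0)
print(failing_count(best, lambda t: t % 7 == 0))
```

The stand-in failure predicate replaces QuSBT's assessment of quantum program outputs against expected measurement distributions.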

DOI: 10.1145/3510454.3516839


ReGVD: revisiting graph neural networks for vulnerability detection

作者: Nguyen, Van-Anh and Nguyen, Dai Quoc and Nguyen, Van and Le, Trung and Tran, Quan Hung and Phung, Dinh
关键词: graph neural networks, security, text classification, vulnerability detection

Abstract

Identifying vulnerabilities in source code is essential to protect software systems from cybersecurity attacks. It is, however, a challenging step that requires specialized expertise in security and code representation. To this end, we aim to develop a general, practical, and programming-language-independent model capable of running on various source code bases and libraries without difficulty. Therefore, we consider vulnerability detection as an inductive text classification problem and propose ReGVD, a simple yet effective graph neural network-based model for the problem. In particular, ReGVD views each raw source code as a flat sequence of tokens to build a graph, wherein node features are initialized by only the token embedding layer of a pre-trained programming language (PL) model. ReGVD then leverages residual connection among GNN layers and examines a mixture of graph-level sum and max poolings to return a graph embedding for the source code. ReGVD outperforms the existing state-of-the-art models and obtains the highest accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection. Our code is available at: https://github.com/daiquocnguyen/GNN-ReGVD.
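
The graph construction described here — tokens as nodes, with edges between tokens that co-occur within a fixed-size window over the flat token sequence — can be sketched as follows (a simplified stand-in; node features from the pre-trained PL model's embedding layer are omitted):

```python
def build_token_graph(tokens, window=3):
    """Build an undirected graph over unique tokens: connect two tokens
    if they co-occur within a sliding window over the token sequence."""
    nodes = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(nodes)}
    edges = set()
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            a, b = index[tok], index[tokens[j]]
            if a != b:  # no self-loops for repeated tokens
                edges.add((min(a, b), max(a, b)))
    return nodes, edges

tokens = "if ( x == NULL ) return x".split()
nodes, edges = build_token_graph(tokens)
print(nodes)
print(sorted(edges))
```

Collapsing repeated tokens into one node keeps the graph compact, which is one reason the token-graph view scales to long functions.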

DOI: 10.1145/3510454.3516865


ReInstancer: automatically refactoring for instanceof pattern matching

作者: Hong, Shuai and Zhang, Yang and Li, Chaoshuai and Bai, Yu
关键词: instanceof, pattern matching, program analysis, refactoring, switch expression

Abstract

The instanceof operator is widely used to conditionally extract components from objects, but it traditionally requires compulsory type casts. Pattern matching for instanceof, periodically previewed in the latest JDK releases, avoids these redundant casts and optimizes pattern matching in multi-branch statements. Although pattern matching has many benefits, manually refactoring to instanceof pattern matching is time-consuming and tedious. Furthermore, no existing work provides sufficient support for such refactoring. To this end, this paper presents ReInstancer, a refactoring tool that removes and simplifies type casts by automatically converting a multi-branch statement with instanceof pattern matching into a switch statement or expression. ReInstancer is evaluated on 7 real-world applications with a total of 4404 instanceof expressions. The evaluation results demonstrate that ReInstancer can successfully remove 2060 type casts and refactor 141 multi-branch statements with a total of 1972 instanceof pattern matches, which improves code quality.

DOI: 10.1145/3510454.3516868


RM2Doc: a tool for automatic generation of requirements documents from requirements models

作者: Bao, Tianshu and Yang, Jing and Yang, Yilong and Yin, Yongfeng
关键词: automatic documentation, requirements, requirements documents, requirements model

Abstract

Automatic generation of requirements documents is an essential feature of model-driven CASE tools such as UML and SysML designers. However, the quality of the documents generated by current tools depends highly on the descriptions attached to the models rather than on the quality of the model itself. Besides, if stakeholders ask for ISO/IEC/IEEE 29148-2018-conformant documents, extra templates are required. In this paper, we propose a CASE tool named RM2Doc, which can automatically generate ISO/IEC/IEEE 29148-2018-conformant requirements documents from UML models without any templates. In addition, the flow description can be generated from a use case without additional information. Moreover, it can automatically generate the semantic description of system operations based only on their formal OCL expressions. We have conducted four case studies with over 50 use cases. Overall, the result is satisfactory: 95% of the requirements documents can be generated from the requirements model without any human interaction, each within 1 second. The proposed tool can be further developed for the software engineering industry. The tool can be downloaded at http://rm2pt.com/rm2doc, and a demo video showcasing its features is at https://youtu.be/4z0Z5mrLfBc

DOI: 10.1145/3510454.3516850


SEbox4DL: a modular software engineering toolbox for deep learning models

作者: Wei, Zhengyuan and Wang, Haipeng and Yang, Zhen and Chan, W. K.
关键词: neural networks, repair, software engineering, testing, toolbox

Abstract

Deep learning (DL) models are widely used in software applications. Novel DL models and datasets are published from time to time, and developers may be tempted to apply new software engineering (SE) techniques to their DL models. However, no existing tool supports applying software testing and debugging techniques to new DL models and their datasets without modifying the code: developers must manually write glue code for every combination of model, dataset, and SE technique and chain them together. We propose SEbox4DL, a novel and modular toolbox that automatically integrates models, datasets, and SE techniques into the SE pipelines seen in developing DL models. SEbox4DL exemplifies six SE pipelines and can be extended with ease. Each user-defined task in the pipelines implements an SE technique within a function with a unified interface, so that the whole design of SEbox4DL is generic, modular, and extensible. We have implemented several SE techniques as user-defined tasks to make SEbox4DL usable off the shelf. Our experiments demonstrate that SEbox4DL can simplify the application of software testing and repair techniques to the latest or popular DL models and datasets. The toolbox is open-source and published at https://github.com/Wsine/SEbox4DL. A video for demonstration is available at: https://youtu.be/EYeFFi4lswc.
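
The unified-interface idea can be illustrated with a minimal sketch (hypothetical names, not SEbox4DL's actual API): every task accepts and returns the model and dataset plus a report, so tasks compose into pipelines without per-combination glue code:

```python
# Each SE task is a plain function with a unified signature:
#   task(model, dataset) -> (model, dataset, report)
# Pipelines are then just ordered lists of such tasks.

def augment(model, dataset):
    """Toy augmentation task: add sign-flipped copies of the inputs."""
    flipped = [(-x, y) for x, y in dataset]
    return model, dataset + flipped, {"added": len(flipped)}

def evaluate(model, dataset):
    """Toy evaluation task: measure accuracy of the model on the data."""
    correct = sum(1 for x, y in dataset if model(x) == y)
    return model, dataset, {"accuracy": correct / len(dataset)}

def run_pipeline(model, dataset, tasks):
    reports = []
    for task in tasks:
        model, dataset, report = task(model, dataset)
        reports.append((task.__name__, report))
    return reports

model = lambda x: x > 0                      # stand-in "model"
data = [(1, True), (-2, False), (3, True)]   # stand-in "dataset"
reports = run_pipeline(model, data, [augment, evaluate])
print(reports)
```

Because every task shares one signature, swapping in a new model, dataset, or SE technique only changes the list passed to the pipeline runner.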

DOI: 10.1145/3510454.3516828


SymInfer: inferring numerical invariants using symbolic states

作者: Nguyen, ThanhVu and Nguyen, KimHao and Duong, Hai
关键词: dynamic analysis, invariant inference, symbolic execution

Abstract

We demonstrate the implementation and usage of SymInfer, a tool that automatically discovers numerical invariants using concrete and symbolic states collected from dynamic and symbolic executions. SymInfer supports expressive invariants under various forms, including nonlinear equalities, octagonal inequalities, and disjunctive min/max invariants. Experimental results show that SymInfer is effective in generating complex invariants and can often discover unknown, yet useful program properties. Video demo: https://www.youtube.com/watch?v=VEuhJw1RBUE.
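
A drastically simplified view of the dynamic side of such inference (SymInfer additionally exploits symbolic states and solver queries) is to keep only the candidate invariants that hold on every observed concrete state:

```python
def infer_invariants(states, candidates):
    """Keep only the candidate invariants that hold on every observed
    program state (each state is a dict of variable values)."""
    return [name for name, pred in candidates.items()
            if all(pred(s) for s in states)]

# Concrete states collected at some program point (toy trace).
states = [{"x": 1, "y": 9}, {"x": 4, "y": 6}, {"x": 7, "y": 3}]

candidates = {
    "x + y == 10": lambda s: s["x"] + s["y"] == 10,
    "x <= y":      lambda s: s["x"] <= s["y"],
    "x >= 1":      lambda s: s["x"] >= 1,
}
print(infer_invariants(states, candidates))
```

Invariants that survive this filtering are only candidates; symbolic states let a tool like SymInfer check or refute them beyond the observed executions.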

DOI: 10.1145/3510454.3516833


SynTest-solidity: automated test case generation and fuzzing for smart contracts

作者: Olsthoorn, Mitchell and Stallenberg, Dimitri and van Deursen, Arie and Panichella, Annibale
关键词: fuzzing, search-based software testing, smart contracts, software testing, test case generation

Abstract

Ethereum is the largest and most prominent smart contract platform. One key property of Ethereum is that once a contract is deployed, it cannot be updated anymore. This increases the importance of thoroughly testing the behavior and constraints of the smart contract before deployment. Existing approaches in related work either do not scale or are only focused on finding crashing inputs. In this tool demo, we introduce SynTest-Solidity, an automated test case generation and fuzzing framework for Solidity. SynTest-Solidity implements various metaheuristic search algorithms, including random search (traditional fuzzing) and genetic algorithms (i.e., NSGA-II, MOSA, and DynaMOSA). Finally, we performed a preliminary empirical study to assess the effectiveness of SynTest-Solidity in testing Solidity smart contracts.

DOI: 10.1145/3510454.3516869


Synthia: a generic and flexible data structure generator

作者: Plourde, Marc-Antoine and Hallé, Sylvain
关键词: fuzzing, synthetic data generation, test reduction

Abstract

Synthia is a versatile, modular and extensible Java-based data structure generation library. It is centered on the notion of “pickers”, which are objects producing values of a given type on demand. Pickers are stateful and can be given as input to other pickers; this chaining principle can generate objects whose structure follows a complex pattern. The paper describes the core principles and key features of the library, including test input shrinking, provenance tracking, and object mutation.
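
The picker-chaining principle can be sketched in a few lines (Synthia itself is a Java library; these Python classes are only illustrative, not its API):

```python
import random

class IntPicker:
    """Stateful picker: produces ints in [lo, hi] on demand."""
    def __init__(self, lo, hi, seed=0):
        self.rng = random.Random(seed)
        self.lo, self.hi = lo, hi
    def pick(self):
        return self.rng.randint(self.lo, self.hi)

class ListPicker:
    """Picker chaining: a size picker and an element picker are fed as
    input to this picker, which builds lists following their outputs."""
    def __init__(self, size_picker, elem_picker):
        self.size_picker, self.elem_picker = size_picker, elem_picker
    def pick(self):
        return [self.elem_picker.pick()
                for _ in range(self.size_picker.pick())]

lists = ListPicker(IntPicker(1, 4, seed=1), IntPicker(0, 9, seed=2))
sample = [lists.pick() for _ in range(3)]
print(sample)
```

Because pickers are themselves valid inputs to other pickers, arbitrarily nested structures (lists of records of lists, and so on) fall out of the same chaining principle.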

DOI: 10.1145/3510454.3516834


TauPad: test data augmentation of point clouds by adversarial mutation

作者: Liu, Guandi and Liu, Jiawei and Zhang, Quanjun and Fang, Chunrong and Zhang, Xufan
关键词: data augmentation, mutation testing, point clouds, software testing

Abstract

Point clouds have been widely used in a large number of application scenarios to handle various deep learning (DL) tasks. Testing is an essential means of guaranteeing the robustness of DL models, which places high demands on test data. Therefore, it is crucial to design a reliable and effective test data augmentation tool for point clouds that generates high-quality data to test the robustness of the target model. However, although common mutation methods can increase the amount of point cloud data, the quality of the augmented data still needs to be improved given the specific spatial structure of point clouds. In this paper, we develop a point cloud augmentation tool, TauPad, whose mutation direction is guided by adversarial attacks. Based on point cloud pre-processing, adversarial mutation, and spatial distribution restoration, TauPad can generate augmented test data that are significantly deceptive to the target model. Preliminary experiments show that TauPad can reliably and effectively augment point clouds for testing. Its video is at https://youtu.be/Y9nDIEW13_g/ and TauPad can be used at http://1.13.193.98:2600/.

DOI: 10.1145/3510454.3517050


TauLiM: test data augmentation of LiDAR point cloud by metamorphic relation

作者: Lin, Ju and Liu, Jiawei and Zhang, Quanjun and Zhang, Xufan and Fang, Chunrong
关键词: LiDAR point cloud, data augmentation, software testing, test data augmentation

Abstract

With the rapid development of object detection in deep learning (DL), applications on LiDAR point clouds, such as autonomous driving, have received much attention. To verify the robustness of object detection models by testing, large amounts of diversified annotated LiDAR point clouds are required as test data. However, considering the sparseness of objects, the diversity of existing point cloud datasets is limited by the number and types of objects. Therefore, it is important to generate diversified point clouds through test data augmentation. In this paper, we propose TauLiM, a test data augmentation tool for LiDAR point clouds. A well-designed metamorphic relation (MR) [1] is proposed to augment point clouds while maintaining the physical characteristics of LiDAR. TauLiM is composed of three modules: point cloud configuration, coordinate filtering, and object insertion. To evaluate our tool, we conduct experiments comparing the testing ability of the existing dataset with that of the augmented one. The results show that TauLiM can effectively augment diversified test data and test the object detection model. The video of TauLiM is available at https://www.youtube.com/watch?v=9S6xpRbbhtQ and TauLiM can be used at http://1.13.193.98:2601/.

DOI: 10.1145/3510454.3516860


TestKnight: an interactive assistant to stimulate test engineering

作者: Botocan, Cristian-Alexandru and Deshmukh, Piyush and Makridis, Pavlos and Huidobro, Jorge Romeu and Sundarrajan, Mathanrajan and Aniche, Maurício
关键词: IDE plug-in, developer assistance, software testing

Abstract

Software testing is one of the most important aspects of modern software development. To ensure the quality of the software, developers should ideally write and execute automated tests regularly as their code-base evolves. TestKnight, a plugin for the IntelliJ IDEA integrated development environment (IDE), aims to help Java developers improve the testing process through support for creating and maintaining high-quality test suites. GitHub repo: https://github.com/SERG-Delft/testknight; JetBrains Marketplace: https://plugins.jetbrains.com/plugin/17072-testknight; YouTube video: https://www.youtube.com/watch?v=BSaL-K7ug6M

DOI: 10.1145/3510454.3517052


UIPDroid: unrooted dynamic monitor of Android app UIs for fine-grained permission control

作者: Duan, Mulin and Jiang, Lingxiao and Shar, Lwin Khin and Gao, Debin
关键词: Android, VirtualXposed, permission management, rootless

Abstract

Proper permission controls in Android systems are important for protecting users’ private data when running applications installed on the devices. Currently, Android systems require apps to obtain authorization from users the first time they try to access users’ sensitive data, but every permission is managed only at the application level, allowing apps to subsequently (mis)use permissions granted at the beginning for different purposes without informing users. Based on privacy-by-design principles, this paper develops a new permission manager, named UIPDroid, that (1) enforces users’ basic right-to-know through user interfaces whenever an app uses permissions, and (2) provides a more fine-grained, UI widget-level permission control that can allow, deny, or produce fake private data dynamically for each permission use in the app at the choice of users, even if the permissions may have been granted to the app at the application level. In addition, to make the tool easier for end users, and unlike some root-based solutions, our solution is root-free, developed as a module on top of a virtualization framework that can be installed on users’ devices like a regular app. Our preliminary evaluation results show that UIPDroid works well for fine-grained, per-widget control of the contact and location permissions implemented in the prototype tool, improving users’ privacy awareness and protection. The tool is available at https://github.com/pangdingzhang/Anti-Beholder; a demo video is at: https://youtu.be/dT-mq4oasNU

DOI: 10.1145/3510454.3516844


VRTest: an extensible framework for automatic testing of virtual reality scenes

作者: Wang, Xiaoyin
关键词: scene exploration, software testing, virtual reality

Abstract

Virtual Reality (VR) is an emerging technique that attracts interest from various application domains such as training, education, remote communication, gaming, and navigation. Despite the ever-growing number of VR software projects, quality assurance techniques for VR software have not been well studied. Therefore, the validation of VR software still relies largely on manual testing. In this paper, we present a novel testing framework called VRTest to automate the testing of scenes in VR software. In particular, VRTest extracts information from a VR scene and controls the user camera to explore the scene and interact with the virtual objects using certain testing strategies. VRTest currently supports two built-in testing strategies: VRMonkey and VRGreed, which use pure random exploration and a greedy algorithm, respectively, to explore interactable objects in VR scenes. The video of our tool is available on YouTube at https://www.youtube.com/watch?v=TARqTEaa7_Q

DOI: 10.1145/3510454.3516870


WhyGen: explaining ML-powered code generation by referring to training examples

作者: Yan, Weixiang and Li, Yuanchun
关键词: code generation, intellectual property, machine learning, recitation

Abstract

Deep learning has demonstrated great abilities in various code generation tasks. However, despite the great convenience for some developers, many are concerned that the code generators may recite or closely mimic copyrighted training data without user awareness, leading to legal and ethical concerns. To ease this problem, we introduce a tool, named WhyGen, to explain the generated code by referring to training examples. Specifically, we first introduce a data structure, named inference fingerprint, to represent the decision process of the model when generating a prediction. The fingerprints of all training examples are collected offline and saved to a database. When the model is used at runtime for code generation, the most relevant training examples can be retrieved by querying the fingerprint database. Our experiments have shown that WhyGen is able to precisely notify the users about possible recitations and highly similar imitations with a top-10 accuracy of 81.21%. The demo video can be found at https://youtu.be/EtoQP6850To.
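
The retrieval step can be sketched as nearest-neighbor search over stored fingerprint vectors (toy data; WhyGen's actual fingerprints are derived from the model's internal states during generation):

```python
def nearest_examples(fingerprint, database, k=2):
    """Return the k training examples whose stored fingerprints are
    closest (Euclidean distance) to the query fingerprint."""
    def dist(v, w):
        return sum((a - b) ** 2 for a, b in zip(v, w)) ** 0.5
    ranked = sorted(database, key=lambda item: dist(item[1], fingerprint))
    return [name for name, _ in ranked[:k]]

# Offline: fingerprints of training examples (toy 3-d vectors,
# hypothetical file names).
db = [
    ("examples/quicksort.py", (0.9, 0.1, 0.0)),
    ("examples/bfs.py",       (0.1, 0.8, 0.2)),
    ("examples/parser.py",    (0.0, 0.2, 0.9)),
]
# Runtime: fingerprint recorded while the model generates some code.
query = (0.85, 0.15, 0.05)
print(nearest_examples(query, db))
```

In practice the database holds fingerprints for the whole training set, so an indexed (rather than linear-scan) nearest-neighbor search is needed at this scale.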

DOI: 10.1145/3510454.3516866


A DevSecOps-enabled framework for risk management of critical infrastructures

作者: Ramaj, Xhesika
关键词: risk management, critical infrastructures, DevSecOps

Abstract

This paper presents a Ph.D. research plan that focuses on solving the existing problems in risk management of critical infrastructures, by means of a novel DevSecOps-enabled framework. Critical infrastructures are complex physical and cyber-based systems that form the lifeline of a modern society, and their reliable and secure operation is of paramount importance to national security and economic vitality. Therefore, this paper proposes DevSecOps technology for managing risk throughout the entire development life cycle of such systems.

DOI: 10.1145/3510454.3517053


A framework to support software developers in implementing privacy features

作者: Mazeli, Anthony
关键词: software systems, regulatory standard, regulatory compliance, mental models, developer centered privacy, SLR, GDPR, CCPA

Abstract

Software developers are inundated with responsibility to incorporate privacy artifacts into software design from the onset, in line with best practices. However, little is understood about the struggles developers face when implementing privacy in software design. This PhD will undertake: (1) a Systematic Literature Review (SLR) to understand developers’ interpretation, or lack thereof, of privacy regulations when incorporating privacy into software systems; (2) two task-based studies to analyze software developers’ privacy compliance and ascertain whether or not they are able to comply with regulatory standards when implementing privacy in software design; (3) an analysis of the mental models adopted by developers when trying to ameliorate their struggles; and (4) the design and evaluation of a framework that helps developers make informed privacy decisions.

DOI: 10.1145/3510454.3517054


Applying reconfiguration cost and control pattern modeling to self-adaptive systems

作者: Matthé
关键词: self-adaptive systems, edge computing, context-aware systems

Abstract

Self-adaptive systems have become a popular research topic for overcoming the challenges of developing highly complex, interconnected, and heterogeneous systems and networks. These systems aim to autonomously adapt to a changing environment by adjusting system behavior or composition to improve performance. Many self-adaptive systems are designed in a purely reactive way and without considering the costs that may be incurred by performing adaptation. This thesis therefore aims to develop an approach for proactive self-adaptive systems and evaluate the impact of reconfiguration cost and proactive adaptation in an edge computing system. Additionally, the autonomous adaptation of different control patterns for centralized or decentralized control will be explored and evaluated. This thesis proposes to extend feature models, as used in dynamic software product lines, with modeling of reconfiguration cost and of uncertainty in the system’s environment.

DOI: 10.1145/3510454.3517056


Architecture synthesis for optimized and flexible production

作者: Terzimehić, Tarik
关键词: service composition, optimization, model-based development, industry 4.0, design space exploration, deployment, architecture synthesis

Abstract

The fourth industrial revolution (Industry 4.0) anticipates frequent synthesis and optimization of different architectural design decisions (ADDs) - such as deployment of software components to hardware components, service composition, production planning, and topology (plant layout) synthesis. The frequent manual search for valid and optimal architectural designs is a time- and cognition-consuming task for an engineer. This calls for automating the process of deriving different ADDs. Although automating different ADDs is intensely investigated in other domains, current research works 1) require high engineering effort for specifying architecture optimization problems; 2) conduct (only) sequential ADDs, leading to lower solution quality (i.e., sub-optimal production); 3) neglect reconfigurability and reliability of architectures and thereby offer no solution for production downtime; 4) neglect event-based execution semantics while considering timing-related issues. Therefore, I propose a Satisfiability Modulo Theories (SMT)-based framework for joint synthesis and optimization of multi-dimensional ADDs using industrial automation domain models (e.g., plant topology, product recipes, station capabilities, etc.). This research should bring the following benefits for practitioners and researchers: 1) reduction of the engineering effort for conducting different ADDs; 2) improvement of different quality attributes (e.g., production performance, reconfigurability, reliability, etc.); 3) guidelines/support for practitioners in choosing an ADD workflow to improve given quality attributes.

DOI: 10.1145/3510454.3517057


Assessing the quality of computational notebooks for a frictionless transition from exploration to production

作者: Quaranta, Luigi
关键词: static analysis tools, software engineering, machine learning, linters, data science, computational notebooks, artificial intelligence

Abstract

The massive trend of integrating data-driven AI capabilities into traditional software systems is raising intriguing new challenges. One such challenge is achieving a smooth transition from the explorative phase of Machine Learning projects - in which data scientists build prototypical models in the lab - to their production phase - in which software engineers translate prototypes into production-ready AI components. To narrow the gap between these two phases, the tools and practices adopted by data scientists might be improved by incorporating consolidated software engineering solutions. In particular, computational notebooks have a prominent role in determining the quality of data science prototypes. In my research project, I address this challenge by studying best practices for collaboration with computational notebooks and proposing proof-of-concept tools to foster guideline compliance.

DOI: 10.1145/3510454.3517055


Behavior-based test smells refactoring: toward an automatic approach to refactoring eager test and lazy test smells

作者: Pizzini, Adriano
关键词: testing, test smell refactoring, software quality

Abstract

Software testing is an essential part of the development process, and like many software artifacts, tests are affected by smells, harming comprehension and maintainability. Several studies are related to test smell identification, but few studies are related to refactoring. Most proposed approaches are semi-automated, with the developer as a safety net. This paper presents a proposal for automatic refactoring of Eager Test and Lazy Test smells based on identifying the behavior of tests and, consequently, the behavior of the System Under Test (SUT). The approach will be evaluated with private source code repositories to identify its impact on quality attributes.

DOI: 10.1145/3510454.3517059


Completeness of composite refactorings for smell removal

作者: Bibiano, Ana Carla
关键词: code smells, composite completeness, composite refactoring, refactoring

Abstract

Code smells are symptoms of problems in internal structural quality. Refactoring is a technique commonly used to remove code smells. A single refactoring rarely suffices to fully remove a code smell. Thus, developers frequently apply composite refactorings (or, simply, composites) with the goal of fully removing a smell. A composite is formed by two or more interrelated single refactorings. Studies report that developers often fail to fully remove code smells through composite refactoring. In this context, a composite refactoring is considered incomplete whenever it does not fully remove a smell. Both incomplete and complete composites are formed by several refactorings; thus, both may inadvertently degrade other parts of the software. However, the literature on (in)complete composites and their effects on structural quality is still scarce. This lack of knowledge hampers the design of empirically based recommendations to properly assist developers in performing effective complete composites, i.e., those not causing any harm in related parts of the program. This doctoral research investigates the effect of composite (in)completeness on structural quality and proposes a catalog of composite recommendations for the full removal of popular code smell types. We investigated 618 composites in 20 software projects. We found that 58% of incomplete composites did not change internal structural quality, and 64% of complete composites are formed by refactoring types that were not previously recommended in the literature or elsewhere. The expected contributions are a list of findings, guidelines, and a catalog to support developers in successfully performing complete composites.

DOI: 10.1145/3510454.3517060


Cross-platform testing of quantum computing platforms

作者: Paltenghi, Matteo
关键词: differential testing, program synthesis, quantum computing platforms, statistical test

Abstract

Quantum computing has been attracting the attention of both applied research and companies. Continuous progress on fundamental hardware technology promises to bring us more reliable and large-scale quantum computers on which to run the next generation of quantum algorithms. These programs are compiled and executed on dedicated platforms, and similarly to classical programs, a large effort is required to test these platforms and create a robust software toolchain. Unlike previous studies which focused on cross-optimization and cross-backend testing, this dissertation aims to create the first approach for cross-platform testing which compares execution on diverse quantum computing platforms. To inform the design of the method, we will first perform an empirical study of bugs in quantum computing platforms and a review of the characteristics of realistic quantum programs. The final approach for cross-platform testing will include three components: a learning-based method to generate realistic quantum programs, an approach to map and run them on multiple platforms, and finally a quantum-specific statistical test to compare two multivariate binary distributions returned as the output of quantum programs.
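
A generic stand-in for the planned statistical comparison (the dissertation proposes a quantum-specific test; a classical two-sample chi-square statistic over measured bitstring counts conveys the idea):

```python
def chi_square_stat(counts_a, counts_b):
    """Two-sample chi-square statistic over measurement counts from two
    platforms; small values suggest the output distributions agree."""
    keys = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    stat = 0.0
    for k in keys:
        a, b = counts_a.get(k, 0), counts_b.get(k, 0)
        # Expected counts under the hypothesis of a shared distribution.
        expected_a = (a + b) * n_a / (n_a + n_b)
        expected_b = (a + b) * n_b / (n_a + n_b)
        stat += (a - expected_a) ** 2 / expected_a \
              + (b - expected_b) ** 2 / expected_b
    return stat

# Bitstring counts for the same program on two platforms (toy data).
platform_a = {"00": 480, "11": 520}
platform_b = {"00": 505, "11": 495}
disagree   = {"00": 100, "01": 900}
print(round(chi_square_stat(platform_a, platform_b), 3))
print(chi_square_stat(platform_a, disagree) > chi_square_stat(platform_a, platform_b))
```

Quantum outputs are multivariate binary distributions, which is why the dissertation argues a tailored test is needed rather than this off-the-shelf statistic.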

DOI: 10.1145/3510454.3517061


Diversity in programming education: help underrepresented groups learn programming

作者: Graßl, Isabella
关键词: diversity, programming education, software engineering

Abstract

Computer science (CS) and especially software engineering (SE) are still predominantly male-dominated domains at all levels—education, academia and industry [4, 6, 7]. The impact of this very homogeneous group, involved in development and decision-making processes in CS, has become increasingly apparent. Consequently, reports of discrimination against females and other underrepresented groups in the field of CS as well as biases in algorithms have recently received public and research attention [9, 26, 40]. In contrast, research shows that SE benefits from diversity since diverse teams have a higher performance and collaboration as well as being more creative and innovative [8, 10, 37].

DOI: 10.1145/3510454.3517062


Enabling automatic repair of source code vulnerabilities using data-driven methods

作者: Grishina, Anastasiia
关键词: static analysis, software security, natural language processing, graph-based machine learning, automatic program repair, ML4Code

Abstract

Users around the world rely on software-intensive systems in their day-to-day activities. These systems regularly contain bugs and security vulnerabilities. To facilitate bug fixing, data-driven models of automatic program repair use pairs of buggy and fixed code to learn transformations that fix errors in code. However, automatic repair of security vulnerabilities remains under-explored. In this work, we propose ways to improve code representations for vulnerability repair from three perspectives: input data type, data-driven models, and downstream tasks. The expected results of this work are improved code representations for automatic program repair and, specifically, fixing security vulnerabilities.

DOI: 10.1145/3510454.3517063


Improving automated crash reproduction

作者: Oliver, Philip
关键词: symbolic execution, software testing, evolutionary computing, crash reproduction, automation

Abstract

Fixing bugs is a lengthy process which currently requires several manual steps to be undertaken by a developer. Reproducing a crash often takes a significant amount of time during this process, as it requires a developer to identify where the crash occurred and where the fault that led to it originated. Several tools, such as EvoCrash, STAR, and Beacon, have been created to automate this process. The proposed research includes creating a benchmark dataset, performing an empirical evaluation of fitness functions for crash reproduction, and combining evolutionary and static approaches to reduce search spaces and increase the effectiveness of automated crash reproduction tools.

DOI: 10.1145/3510454.3517064


Lean software startup practices and software engineering education

作者: Cico, Orges
关键词: software engineering education, software engineering, lean startup, empirical studies

Abstract

In the modern economy, software drives innovation and economic growth. Studies show how software increasingly influences all industry sectors. Over the past 5 decades, software engineering has also changed significantly to advance the development of various types and scales of software products. Software engineering education plays an essential role in apprising students of software technologies, processes, and practices popular in industries. Furthermore, approaches to teaching software engineering are becoming more interdisciplinary and team-centered, comparable to startup contexts. In this PhD work, I want to answer the following research questions: (1) To what extent are software engineering trends present in software engineering education research? (2) What set of common software engineering practices employed in lean software startups is transferable to the software engineering education context? (3) What is the impact of lean startup practices on software engineering students and curricula? I utilize (1) a literature review, (2) mixed-methods approaches in gathering empirical evidence, and (3) design-based research. In the first phase of the research, I pinpoint the relevance of the lean startup in software engineering education through an extensive literature review. In the second phase, I gather empirical evidence on lean startup practices and assess their potential transferability to software engineering education. I demonstrate that the lean startup is an emerging trend in software engineering education research, and that students can acquire soft, hard, and project management skills in a more realistic context when the growth phase of lean startup practices is introduced through external course activities. I expect software engineering curricula to benefit from the model and framework that I propose and validate, thus facilitating lean startup practice transfer to software engineering curricula.

DOI: 10.1145/3510454.3517065


More effective test case generation with multiple tribes of AI

作者: Olsthoorn, Mitchell
关键词: test case generation, software testing, search-based software testing, machine learning, fuzzing

Abstract

Software testing is a critical activity in the software development life cycle for quality assurance. Automated Test Case Generation (TCG) can assist developers by speeding up this process. It accomplishes this by evolving an initial set of randomly generated test cases over time to optimize for predefined coverage criteria. One of the key challenges for automated TCG approaches is navigating the large input space. Existing state-of-the-art TCG algorithms struggle with, among other things, generating highly-structured input data and preserving patterns in test structures. I hypothesize that combining multiple tribes of AI can improve the effectiveness and efficiency of automated TCG. To test this hypothesis, I propose using grammar-based fuzzing and machine learning to augment evolutionary algorithms for generating more structured input data and preserving promising patterns within test cases. Additionally, I propose to use behavioral modeling and interprocedural control dependency analysis to improve test effectiveness. Finally, I propose integrating these novel approaches into a testing framework to promote the adoption of automated TCG in industry.

DOI: 10.1145/3510454.3517066


Quality-driven machine learning-based data science pipeline realization: a software engineering approach

作者: d’Aloisio, Giordano
关键词: software quality, product-line architecture, pipelines, model-driven, machine learning

Abstract

The recent wide adoption of data science approaches to decision making in several application domains (such as health, business, and even education) opens new challenges in the engineering and implementation of these systems. Within the big picture of data science, machine learning is the most widely used technique, and due to its characteristics, we believe that better engineering methodologies and tools are needed to realize innovative data-driven systems able to satisfy emerging quality attributes (such as bias mitigation and fairness, explainability, privacy and ethics, and sustainability). This research project will explore the following three pillars: i) identify key quality attributes, formalize them in the context of data science pipelines, and study their relationships; ii) define a new software engineering approach for data-science system development that assures compliance with quality requirements; iii) implement tools that guide IT professionals and researchers in the realization of ML-based data science pipelines, starting from requirements engineering. Moreover, this paper also presents some details of the project, showing how feature models and model-driven engineering can be leveraged to realize it.

DOI: 10.1145/3510454.3517067


Students vs. professionals: improving the learning of software testing

作者: Chen, Zhongyan
关键词: test smell, software unit test, computer science education

Abstract

Software testing is a crucial phase of software development. Educators now assess tests written by students, and methods have been proposed to assess the completeness of student-written tests. However, there is more to good test quality than completeness, and these additional quality aspects are not assessed by previous work. Test smells are patterns of poorly designed tests that may negatively affect the quality of test and production code. We propose to assess test smells in students' code to improve the quality of the software tests they write. In the early stages of this research, we will examine whether practitioners actually spend time and energy removing the test smells summarized by researchers, backed by real-world software projects. Then, we will develop and evaluate our new assessment method.

DOI: 10.1145/3510454.3517058


Topology of the documentation landscape

作者: Raglianti, Marco
关键词: visualization, software documentation, communication platforms

Abstract

Every software system (ideally) comes with one or more forms of documentation. Besides source code comments, other structured and unstructured sources (e.g., design documents, API references, wikis, usage examples, tutorials) constitute critical assets. Cloud-based repositories for collaborative development (e.g., GitHub, Bitbucket, GitLab) provide many functionalities to create, persist, and version documentation artifacts. On the other hand, the last decade has seen the rise of rich instant messaging clients used as global software community platforms (e.g., Slack, Discord). Although completely detached from a specific versioning system or development workflow, they allow developers to discuss implementation issues, report bugs, and, in general, interact with one another. We refer to this evolving heterogeneous collection of information sources and documentation artifacts as the documentation landscape. It is important to have tools that extract information from these sources and integrate them in a topological visualization, to ease comprehension of a software system. How can we automatically generate this topology? How can we link elements in the topology back to the source code they refer to? The goal of this PhD research is to automatically mine the documentation landscape of a system by disclosing pieces of information to aid, for example, in program maintenance tasks. We present our classification of possible documentation sources. The long-term vision is to provide a domain model of the documentation landscape to build, visualize, and explore its instances for real software systems and to evaluate the usefulness of the metaphor we propose.

DOI: 10.1145/3510454.3517068


Towards a theory of shared understanding of non-functional requirements in continuous software engineering

作者: Werner, Colin
关键词: shared understanding, non-functional requirements, continuous software engineering

Abstract

Building shared understanding of requirements is key to ensuring downstream software activities are efficient and effective. Non-functional requirements (NFRs), which include performance, availability, and maintainability, are vitally important to overall software quality. Research has shown NFRs are, in practice, poorly defined and difficult to verify, especially in agile environments. Continuous software engineering (CSE) practices, which extend agile practices, emphasize fast-paced, automated, and rapid release of software, which poses additional challenges for NFRs. However, the level of shared understanding achieved across an organization is not well understood. This dissertation builds the foundations towards a theory of the complex and intricate relationship between shared understanding of NFRs and CSE.

DOI: 10.1145/3510454.3517069


Towards facilitating software engineering for production systems in industry 4.0 with behavior models

作者: Wiesmayr, Bianca
关键词: IEC 61499, control software, model-driven software engineering

Abstract

With the growing adoption of Industry 4.0 concepts in production systems, new challenges arise in engineering control software. Highly distributed control with tight real-time constraints and safety regulations results in increasingly complex software. Current research focuses on raising the level of abstraction with new architectures and modularization of software. The presented PhD research addresses modeling of the interactions between control software components and of the emergent behavior of their compositions. Such behavior models can support the initial implementation and facilitate (semi-)automated testing and monitoring of control software. Finally, visualizing behavior in a model can enhance the understandability of existing control software, as software developers need not access abstracted hierarchy levels to deduce their functionality. This work aims at optimizing the benefit of behavior models in developing control software: modeling the expected behavior directly for new software will allow using the models throughout the software life-cycle. For legacy software, the initial development effort of behavior models will be minimized by automatically capturing them from the implementation. The approach is evaluated in case studies and user studies to integrate experiences from the industrial domain into this software engineering research.

DOI: 10.1145/3510454.3517070


An empirical study on the current adoption of quantum programming

作者: De Stefano, Manuel
关键词: No keywords

Abstract

Quantum computing is no longer just a scientific curiosity; it is rapidly evolving into a commercially viable technology that has the potential to surpass the limitations of classical computation. As a result of this transition, a new discipline known as quantum software engineering has emerged, which is needed to describe unique methodologies for developing large-scale quantum applications. In pursuit of building this new body of knowledge, we undertake a mining study to elicit the purposes for which quantum programming is being used, and to steer further research.

DOI: 10.1145/3510454.3522679


Efficiently and precisely searching for code changes with diffsearch

作者: Di Grazia, Luca
关键词: No keywords

Abstract

Version histories of code contain useful information, and these data are public thanks to open-source software. However, searching through large repository histories can be complex because there is no specific tool to search for code changes. This paper presents DiffSearch, the first efficient and scalable search engine for code changes. Given a list of repositories and a query, DiffSearch can retrieve specific code changes in a few seconds. We design a language-agnostic approach that we test on three popular programming languages (Java, JavaScript, and Python), and we design a query language that is an extension of the supported languages. We evaluate DiffSearch in three steps. First, we measure a recall of 81.8%, 89.6%, and 90.4% for Java, Python, and JavaScript, respectively, and an average response time of less than five seconds. Second, we demonstrate its scalability with a large dataset of one million code changes. Last, we perform a case study to show one of its possible applications, in which DiffSearch gathers a dataset of 74,903 Java bug fixes.

DOI: 10.1145/3510454.3522678


Finding appropriate user feedback analysis techniques for multiple data domains

作者: Devine, Peter
关键词: No keywords

Abstract

Software products now have more users than ever. This means more people to please, more use-cases to consider, and more requirements to fulfill. These users can then write feedback on software in any number of public or private online repositories. Many tools have been proposed for classifying, embedding, clustering, and characterizing this feedback in aid of generating requirements from it. I am investigating which techniques and machine learning models are most appropriate for enabling these analyses across multiple feedback platforms and data domains.

DOI: 10.1145/3510454.3522677


Is GitHub copilot a substitute for human pair-programming? an empirical study

作者: Imai, Saki
关键词: AI, GitHub, copilot, software development

Abstract

This empirical study investigates the effectiveness of pair programming with GitHub Copilot in comparison to human pair-programming. Through an experiment with 21 participants, we focus on code productivity and code quality. In the experimental design, each participant was given a project to code under three conditions presented in a randomized order. The conditions are pair-programming with Copilot, human pair-programming as a driver, and human pair-programming as a navigator. The code generated in the three trials was analyzed to determine how many lines of code on average were added in each condition and how many lines of code on average were removed in the subsequent stage. The former measures the productivity of each condition, while the latter measures the quality of the produced code. The results suggest that although Copilot increases productivity as measured by lines of code added, the quality of the code produced is inferior, as more lines of code were deleted in the subsequent trial.

DOI: 10.1145/3510454.3522684


Let’s talk open-source: an analysis of conference talks and community dynamics

作者: Truong, Kimberly
关键词: No keywords

Abstract

Open-source software has integrated itself into our daily lives, impacting 78% of US companies in 2015 [11]. Past studies of open-source community dynamics have identified motivations behind contributions [3, 14, 18, 19] and the significance of community engagement [8, 17], but there are still many aspects that are not well understood. There is a direct correlation between the success of an open-source project and the social interactions within its community [7, 9, 17]. Most projects depend on a small group of core contributors: a study by Avelino et al. [4] of the 133 most popular GitHub projects found that 86% would fail if one or two of their core contributors left. To sustain open-source, we need to better understand how contributors interact, what information is shared, and what concerns practitioners have. We study common topics, how these have changed over time (2011 - 2021), and what social issues have appeared within open-source communities. Our research is guided by the following questions: (1) How is open-source changing/evolving? (2) What changes do practitioners believe are necessary for open-source to be sustainable?

DOI: 10.1145/3510454.3522683


Static test flakiness prediction

作者: Pontillo, Valeria
关键词: No keywords

Abstract

The problem of flakiness occurs when a test case is non-deterministic and exhibits both a passing and failing behavior when run against the same code. Over the last years, the software engineering research community has been working toward defining approaches for detecting and addressing test flakiness, but most of these approaches suffer from scalability issues. Recently, this limitation has been targeted through machine learning solutions that could predict flaky tests using various features, both static and dynamic. Unfortunately, the proposed solutions involve features that could be costly to compute. In this paper, I take a step forward and predict test flakiness using only statically computable metrics. I conducted an experiment on 18 Java projects coming from the FlakeFlagger dataset. First, I statistically assess the differences between flaky and non-flaky tests in terms of 25 static metrics in an individual and combined way. Then, I experimented with a machine learning approach that predicts flakiness based on the previously evaluated factors. The results show that static features can be used to characterize flaky tests: this is especially true for metrics and smells connected to source code complexity. In addition, this new static approach has performance comparable to the machine learning models already in the literature in terms of F-Measure.

DOI: 10.1145/3510454.3522680


To disengage or not to disengage: a look at contributor disengagement in open source software

作者: Gray, Philip
关键词: open source, grey literature, disengagement

Abstract

Contributors are vital to the sustainability of open source ecosystems, and disengagement threatens that sustainability. We seek to both strengthen and protect open source communities by creating a more robust way of defining and identifying contributor disengagement in these communities. To do this, we collected a large amount of grey literature relating to contributor disengagement and performed a qualitative analysis to improve our understanding of why contributors disengage.

DOI: 10.1145/3510454.3522685


μ2: using mutation analysis to guide mutation-based fuzzing

作者: Laybourn, Isabella
关键词: differential testing, mutation testing, mutation-based fuzzing

Abstract

Coverage-guided fuzzing is a popular tool for finding bugs. This paper introduces μ2, a strategy for extending coverage-guided fuzzing with mutation analysis, which previous work has found to be better correlated with test effectiveness than code coverage. μ2 was implemented in Java using the JQF framework and the default mutations used by PIT. Initial evaluation shows increased performance when using μ2 as compared to coverage-guided fuzzing.
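The core loop described above, retaining fuzzed inputs for killing new mutants rather than for covering new code, can be sketched as follows. This is an illustrative reconstruction of the idea, not μ2's Java/JQF implementation, and every name below is hypothetical:

```python
import random

def mutate_input(x):
    """Toy input mutation: nudge an integer input up or down."""
    return x + random.choice([-1, 1])

def mutation_guided_fuzz(program, mutants, seeds, budget=200):
    """Keep a fuzzed input in the corpus when it kills a mutant that no
    earlier input killed, i.e., mutation analysis replaces code coverage
    as the feedback signal."""
    corpus = list(seeds)
    killed = set()
    for _ in range(budget):
        x = mutate_input(random.choice(corpus))
        expected = program(x)
        newly_killed = {i for i, m in enumerate(mutants)
                        if i not in killed and m(x) != expected}
        if newly_killed:          # the input distinguishes a mutant: keep it
            killed |= newly_killed
            corpus.append(x)
    return corpus, killed

# Example: fuzzing abs() against two hand-made mutants of it.
random.seed(0)
corpus, killed = mutation_guided_fuzz(
    abs, [lambda x: x, lambda x: -x], seeds=[0])
print(sorted(killed))  # both mutants die once a negative and a positive input appear
```

An input of 0 kills neither mutant, so a plain random tester may discard the interesting negative and positive inputs; the mutation-kill feedback keeps them in the corpus.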

DOI: 10.1145/3510454.3522682


Woodpecker: identifying and fixing Android UI display issues

作者: Liu, Zhe
关键词: No keywords

Abstract

The complexity of GUIs and certain combinations of personalized settings cause UI display issues to occur frequently. Unfortunately, little is known about the causes of these issues, and Android fragmentation and the variety of UI components pose a great challenge to repairing them. Based on our empirical study, this paper proposes Woodpecker to automatically detect, localize, and repair UI display issues in Android apps. It detects screenshots exhibiting issues using computer vision techniques, localizes the buggy source code from the screenshot, and repairs the issues automatically with pre-defined templates. We evaluated Woodpecker on 30 real-world UI display issues; it successfully detects 87% and repairs 77% of them. We further applied Woodpecker to another 256 popular open-source Android apps and successfully uncovered 112 previously undetected issues. It can automatically repair 106 (94%) of these issues, with 76 of them accepted by developers so far and the others pending (none has been rejected).

DOI: 10.1145/3510454.3522681


A static analyzer for detecting tensor shape errors in deep neural network training code

作者: Jhoo, Ho Young and Kim, Sehoon and Song, Woosung and Park, Kyuyeon and Lee, DongKwon and Yi, Kwangkeun
关键词: PyTorch, Python, SMT solver, error detection, static analysis, tensor shape mismatch

Abstract

We present PyTea, an automatic static analyzer that detects tensor-shape errors in PyTorch code. Tensor-shape errors are critical in deep neural network code: much of the training cost and the intermediate results are lost once a tensor shape mismatch occurs in the midst of the training phase. Given the input PyTorch source, PyTea statically traces every possible execution path, collects the tensor shape constraints required by the tensor operation sequence of the path, and decides if the constraints are unsatisfiable (hence a shape error can occur). PyTea's scalability and precision hinge on the characteristics of real-world PyTorch applications: the number of execution paths after PyTea's conservative pruning rarely explodes, and loops are simple enough to be circumscribed by our symbolic abstraction. We tested PyTea against the projects in the official PyTorch repository and some tensor-error code asked about on StackOverflow. PyTea successfully detects the tensor shape errors in this code, each within a few seconds.
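The path-tracing idea, propagating shapes through an operation sequence and flagging inconsistent constraints before any training runs, can be illustrated with a toy checker for matrix-multiplication chains. This is a minimal sketch of the general technique, not PyTea's actual analysis; all names are hypothetical:

```python
def matmul_shape(a, b):
    """Shape rule for matrix multiplication: (m, k) @ (k, n) -> (m, n)."""
    (m, k1), (k2, n) = a, b
    if k1 != k2:
        raise TypeError(f"shape mismatch: inner dimensions {k1} != {k2}")
    return (m, n)

def check_path(weights, input_shape):
    """Propagate a shape through one execution path's matmul sequence."""
    shape = input_shape
    for w in weights:
        shape = matmul_shape(shape, w)
    return shape

# Consistent path: (32, 784) @ (784, 128) @ (128, 10) -> (32, 10)
print(check_path([(784, 128), (128, 10)], (32, 784)))  # (32, 10)

# An inconsistent path is rejected statically, before any training would run:
try:
    check_path([(784, 128), (64, 10)], (32, 784))
except TypeError as e:
    print("detected:", e)
```

A real analyzer keeps dimensions symbolic (e.g., an unknown batch size) and hands the accumulated constraints to a solver; the toy version uses concrete integers to keep the sketch short.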

DOI: 10.1145/3510454.3528638


CRISCE: towards generating test cases from accident sketches

作者: Nguyen, Vuong and Gambi, Alessio and Ahmed, Jasim and Fraser, Gordon
关键词: No keywords

Abstract

Cyber-Physical Systems are increasingly deployed to perform safety-critical tasks, such as autonomously driving a vehicle. Therefore, thoroughly testing them is paramount to avoid accidents and fatalities. Driving simulators allow developers to address this challenge by testing autonomous vehicles in many driving scenarios; nevertheless, systematically generating scenarios that effectively stress the software controlling the vehicles remains an open challenge. Recent work has shown that effective test cases can be derived from simulations of critical driving scenarios such as car crashes. Hence, generating those simulations is a stepping stone for thoroughly testing autonomous vehicles. Towards this end, we propose CRISCE (CRItical SketChEs), an approach that leverages image processing (e.g., contour analysis) to automatically generate simulations of critical driving scenarios from accident sketches. Preliminary results show that CRISCE is efficient and can generate accurate simulations; hence, it has the potential to support developers in effectively achieving high-quality autonomous vehicles.

DOI: 10.1145/3510454.3528642


CrystalBLEU: precisely and efficiently measuring the similarity of code

作者: Eghbali, Aryaz and Pradel, Michael
关键词: No keywords

Abstract

Recent work has focused on using machine learning to automate software engineering processes, such as code completion, code migration, and generating code from natural language descriptions. One of the challenges faced in these tasks is evaluating the quality of the predictions, which is usually done by comparing the prediction to a reference solution. The BLEU score has been adopted for programming languages because it can be easily computed for any programming language, even on incomplete source code, while enabling fast automated evaluation. However, programming languages are more verbose and have stricter syntax than natural languages. This causes BLEU to find common n-grams in unrelated programs, which makes it hard to distinguish similar pairs of programs from dissimilar pairs. This work presents CrystalBLEU, an evaluation metric based on BLEU that mitigates this distinguishability problem. Our metric maintains the desirable properties of BLEU (handling partial code, applicability to all programming languages, high correlation with human judgement, and efficiency) while reducing the effects of trivially matched n-grams. We evaluate CrystalBLEU on two datasets from previous work and a new dataset of human-written code. Our results show that CrystalBLEU differentiates similar and unrelated programs better than both the original BLEU score and CodeBLEU, a variant designed specifically for source code.
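The intuition of discounting trivially shared n-grams can be sketched as follows. This is a simplified illustration of the idea, not the published CrystalBLEU algorithm; the corpus and all names are hypothetical:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trivially_shared(corpus, n=2, k=3):
    """The k most frequent n-grams across the corpus: boilerplate such as
    ') {' or '{ }' that inflates BLEU scores on unrelated code."""
    counts = Counter(g for snippet in corpus for g in ngrams(snippet, n))
    return {g for g, _ in counts.most_common(k)}

def filtered_precision(candidate, reference, ignore, n=2):
    """Modified n-gram precision that discounts the shared set."""
    cand = [g for g in ngrams(candidate, n) if g not in ignore]
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(g for g in ngrams(reference, n) if g not in ignore)
    hits = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return hits / len(cand)

# Two unrelated snippets that share only syntactic boilerplate:
corpus = [["if", "(", "x", ")", "{", "}"],
          ["while", "(", "y", ")", "{", "}"]]
shared = trivially_shared(corpus)
cand, ref = corpus[0], corpus[1]
print(filtered_precision(cand, ref, ignore=set()))   # 0.4: inflated by ') {' and '{ }'
print(filtered_precision(cand, ref, ignore=shared))  # 0.0 once boilerplate is discounted
```

Filtering drops the unrelated pair's score to zero while leaving genuinely shared, non-boilerplate n-grams to drive the metric.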

DOI: 10.1145/3510454.3528648


Deep learning-based production and test bug report classification using source files

作者: Kim, Misoo and Kim, Youngkyoung and Lee, Eunseok
关键词: test bug, production bug, information retrieval-based bug localization, deep learning, bug report classification

Abstract

Classifying production and test bug reports can significantly improve not only the accuracy of performance evaluation but also the performance of information retrieval-based bug localization (IRBL). However, it is time-consuming for developers to classify these bug reports manually. This study proposes a production and test bug report classification method based on deep learning. Our method uses a set of source files and model tuning to solve the problem of insufficient and sparse bug reports when applying deep learning. Our experimental results reveal that the macro f1-score of our method is 0.84 and can improve the IRBL performance by 20%.

DOI: 10.1145/3510454.3528646


Deriving semantics-aware fuzzers from web API schemas

作者: Hatfield-Dodds, Zac and Dygalo, Dmitry
关键词: No keywords

Abstract

We present Schemathesis, a tool for finding semantic errors and crashes in OpenAPI or GraphQL web APIs through property-based testing. Our evaluation, thirty independent runs of eight tools against sixteen containerized open-source web services, shows that Schemathesis wildly outperforms all previous tools. It is the only tool to find defects in four targets, and finds 1.4× […]
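The property-based approach can be illustrated with a self-contained sketch: generate random payloads from a schema and assert an invariant (e.g., no 5xx responses) on every call. This only mimics the style of schema-driven testing, not Schemathesis's actual API; the schema, endpoint, and names below are hypothetical:

```python
import random
import string

# A tiny OpenAPI-like schema for one endpoint's request body (hypothetical).
SCHEMA = {"name": "string", "age": "integer"}

def generate(schema):
    """Draw one random payload conforming to the schema."""
    payload = {}
    for field, kind in schema.items():
        if kind == "string":
            payload[field] = "".join(
                random.choices(string.ascii_letters, k=random.randint(0, 8)))
        elif kind == "integer":
            payload[field] = random.randint(-10**6, 10**6)
    return payload

def create_user(payload):
    """Stand-in for a real endpoint, with a planted semantic bug:
    a negative age should yield 400 (client error), not 500."""
    if payload["age"] < 0:
        return 500
    return 201

def find_failing_payload(trials=100):
    """Property: no schema-conforming input may cause a server error."""
    random.seed(0)
    for _ in range(trials):
        payload = generate(SCHEMA)
        if create_user(payload) >= 500:
            return payload  # a real tool would also shrink this counterexample
    return None

print(find_failing_payload())  # reports a payload with a negative age
```

Because payloads are drawn from the schema itself, the tester exercises edge cases (negative integers, empty strings) that hand-written examples often miss.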

DOI: 10.1145/3510454.3528637


Enabling end-users to implement larger block-based programs

作者: Ritschel, Nico and Fronchetti, Felipe and Holmes, Reid and Garcia, Ronald and Shepherd, David C.
关键词: No keywords

Abstract

Block-based programming, already popular in computer science education, has been successfully used to make programming accessible to end-users in applied domains such as the field of robotics. Most prior work in these domains has examined smaller programs that are usually simple and fit on a single screen. However, real block-based programs often grow larger and, because end-users are unlikely to break them down into separate functions, they often become unwieldy. In our study, we introduce a function-centric block-based environment to help end-users write programs that require a large number of blocks. Through a user study with 92 users, we evaluated our approach and found that while users could successfully complete smaller tasks with and without our approach, they were both quicker and more successful with our function-centric method when tackling larger tasks. This work demonstrates that adding scaffolding can encourage the systematic use of functions, enabling end-users to write larger programs with block-based programming environments, which can contribute to the solution of more complex tasks in applied domains.

DOI: 10.1145/3510454.3528644


Flexible model-driven runtime monitoring support for cyber-physical systems

作者: Stadler, Marco and Vierhauser, Michael and Garmendia, Antonio and Wimmer, Manuel and Cleland-Huang, Jane
关键词: runtime monitoring, cyber-physical systems, MDE

Abstract

Providing adequate runtime monitoring is critical for ensuring safe operation and for enabling self-adaptive behavior of Cyber-Physical Systems. This requires identifying runtime properties of interest, creating probes to instrument the system, and defining constraints to be checked at runtime. Implementing and setting up a monitoring framework for a system is typically a challenging task, and most existing approaches lack support for the automated generation and setup of monitors. We present GRuM, which significantly eases the task of creating monitors and maintaining them throughout the lifetime of the system by automatically generating runtime models and providing support for updating and adapting them when needed.

DOI: 10.1145/3510454.3528647


GARUDA: heap aware symbolic execution

作者: Rajput, Ajinkya and Gopinath, Kanchi
关键词: vulnerability, symbolic execution, software testing

Abstract

Symbolic execution is a widely employed technique in vulnerability detection. However, it faces an acute problem of state-space explosion when analyzing programs that dynamically allocate memory. In this work we present GARUDA, which makes symbolic execution heap-aware to mitigate the state-space explosion problem. We show that GARUDA can detect vulnerabilities in real-world software and can generate inputs that trigger two more safety violations than the winner of the TestComp2021 testing competition in the heap safety category of its benchmarks.

DOI: 10.1145/3510454.3528650


In rust we trust: a transpiler from unsafe C to safer rust

作者: Ling, Michael and Yu, Yijun and Wu, Haitao and Wang, Yuan and Cordy, James R. and Hassan, Ahmed E.
关键词: transpiler, safety, refactoring, measurement, code transformation

Abstract

Rust is a type-safe system programming language whose compiler checks memory and concurrency safety. For a smooth transition from existing C projects, a source-to-source transpiler can automatically transform C programs into Rust using program transformation. However, existing C-to-Rust transformation tools (e.g., the open-source C2Rust transpiler project) have the drawback of preserving the unsafe semantics of C while rewriting them in Rust's syntax. The work by Emre et al. [2] acknowledged these drawbacks and used rustc compiler feedback to refactor one certain type of raw pointer into Rust references to improve the overall safety and idiomaticity of C2Rust output. Focusing on improving API safety (i.e., lowering unsafe keyword usage in function signatures), we apply a source-to-source transformation technique to auto-refactor C2Rust output using code structure pattern matching and transformation, which does not rely on rustc compiler feedback. By relaxing the semantics-preserving constraints of transformations, we present CRustS, a fully-automated source-to-source transformation approach that increases the ratio of the transformed code passing the safety checks of the rustc compiler. Our method uses 220 new TXL [1] source-to-source transformation rules, of which 198 are strictly semantics-preserving and 22 are semantics-approximating, thus reducing the scope of unsafe expressions and exposing more opportunities for safe Rust refactoring. Our method has been evaluated on both open-source and commercial C projects, and demonstrates significantly higher safe-code ratios after the transformations, with function-level safe-code ratios comparable to the average level of idiomatic Rust projects.

DOI: 10.1145/3510454.3528640


Improving responsiveness of Android activity navigation via genetic improvement

作者: Callan, James and Petke, Justyna
关键词: responsiveness, mobile, genetic improvement, SBSE, Android

Abstract

Responsiveness issues are one of the key reasons why mobile phone users abandon an app or leave bad reviews. In this work, we explore the use of Genetic Improvement to automatically refactor applications to reduce the time taken to move between and within Android activities, without affecting their functionality. This particular Android responsiveness issue has not been tackled before. Because it applies directly to source code, our approach can be used to complement previous work, which modifies the operating system or focuses on the detection of specific coding patterns. We present a fully automated technique for finding responsiveness improvements that does not require the use of an Android device or emulator. We apply our approach to 7 real-world open-source applications and find improvements of up to 30% in navigation response time.

DOI: 10.1145/3510454.3528643


Mutation testing of quantum programs written in QISKit

作者: Fortunato, Daniel and Campos, José
关键词: quantum computing, quantum mutation testing, quantum software engineering, quantum software testing

Abstract

There is an inherent lack of knowledge and technology for properly testing quantum programs. In this paper, building on the definition of syntactically equivalent quantum operations, we investigated a novel set of mutation operators to generate mutants based on qubit measurements and quantum gates. To ease the adoption of quantum mutation testing, we further discuss QMutPy, an extension of the well-known and fully automated open-source mutation tool MutPy. To evaluate QMutPy's performance, we conducted a case study on 11 real quantum programs written in IBM's QISKit library. QMutPy has proven to be an effective quantum mutation tool, providing insight into the current state of quantum tests.
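A gate-replacement mutation operator of the kind described can be sketched with Python's ast module, in MutPy's source-rewriting style. The specific operator below (replace calls to the single-qubit X gate with the Y gate, so the mutant still compiles) is illustrative only; QMutPy's actual operator set differs:

```python
import ast

class GateReplacement(ast.NodeTransformer):
    """Mutation operator sketch: rewrite qc.x(...) calls into qc.y(...).
    Both are single-qubit gates, so the mutated circuit remains valid."""
    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Attribute) and node.func.attr == "x":
            node.func.attr = "y"
        return node

def mutate(source):
    """Parse quantum-program source, apply the operator, and unparse."""
    tree = ast.parse(source)
    tree = GateReplacement().visit(tree)
    return ast.unparse(tree)

original = "qc.h(0)\nqc.x(1)\nqc.measure_all()"
print(mutate(original))  # the x gate on qubit 1 becomes a y gate
```

A mutation-testing tool would then run the program's test suite against each such mutant and report which mutants survive, exposing measurements the tests never check.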

DOI: 10.1145/3510454.3528649


Comprehensive comparisons of embedding approaches for cryptographic API completion: poster

作者: Xiao, Ya and Ahmed, Salman and Ge, Xinyang and Viswanath, Bimal and Meng, Na and Yao, Danfeng (Daphne)
关键词: No keywords

Abstract

In this paper, we conduct a measurement study to comprehensively compare the accuracy of cryptographic API completion tasks trained with multiple API embedding options. Embedding is the process of automatically learning to represent program elements as low-dimensional vectors. Our measurement aims to uncover the impacts of applying program analysis, token-level embedding, and sequence-level embedding on cryptographic API completion accuracy. Our findings show that program analysis is necessary even under advanced embedding. The results show a 36.10% accuracy improvement on average when program analysis preprocessing is applied to transform bytecode sequences into API dependence paths. The best accuracy (93.52%) is achieved on API dependence paths with embedding techniques. In contrast, the pure data-driven approach without program analysis only achieves a low accuracy (around 57.60%), even after powerful sequence-level embedding is applied. Although sequence-level embedding shows slight accuracy advantages (0.55% on average) over token-level embedding in our basic data-split setting, it is not recommended considering its expensive training cost. A more obvious accuracy improvement (5.10%) from sequence-level embedding is observed under the cross-project learning scenario, when task data is insufficient. Hence, we recommend applying sequence-level embedding for cross-project learning with limited task-specific data.

DOI: 10.1145/3510454.3528645


Program translation using model-driven engineering

Authors: Lano, K.
Keywords: No keywords

Abstract

The porting or translation of software applications from one programming language to another is a common requirement of organisations that utilise software, and the increasing number and diversity of programming languages makes this capability as relevant today as in previous decades. Several approaches have been used to address this challenge, including machine learning and the manual definition of explicit translation rules. We define a novel approach using model-driven engineering (MDE) techniques: reverse-engineering source programs into specifications in the UML and OCL formalisms, and then forward-engineering the specifications to the required target language. This approach has the additional advantage of extracting specifications of software from code. We provide an evaluation based on a comprehensive dataset of examples, including industrial cases, and compare our results to those of other approaches and tools.
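The reverse-then-forward-engineering pipeline can be sketched on a deliberately tiny example. This is an assumption-laden toy, not the paper's tooling: the "model" here is a plain dict standing in for a UML/OCL specification, and only a single Java statement shape is handled.

```python
import re

def reverse_engineer(java_stmt):
    """Lift a trivial Java declaration into a language-neutral model
    (a dict standing in for a UML/OCL specification)."""
    m = re.match(r"int (\w+) = (\w+) \+ (\w+);", java_stmt)
    if m is None:
        raise ValueError("unsupported statement in this toy example")
    name, lhs, rhs = m.groups()
    return {"kind": "assign", "type": "int",
            "name": name, "expr": ("+", lhs, rhs)}

def forward_engineer_python(model):
    """Emit Python from the model; a second emitter targeting another
    language would reuse the same model unchanged."""
    op, lhs, rhs = model["expr"]
    return f"{model['name']} = {lhs} {op} {rhs}"

model = reverse_engineer("int total = a + b;")
translated = forward_engineer_python(model)
```

The key design point the sketch mirrors is that the intermediate specification decouples the two halves: adding a new target language means writing one new forward-engineering emitter, not a new pairwise translator.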

DOI: 10.1145/3510454.3528639


The symptoms, causes, and repairs of workarounds in apache issue trackers

Authors: Yan, Aoyang and Zhong, Hao and Song, Daohan and Jia, Li
Keywords: No keywords

Abstract

In issue tracker systems, a bug report can be resolved as a workaround. Since the definition of workarounds is vague, many research questions on workarounds are still open. For example, what are the symptoms of bugs that are resolved as workarounds? Why does a bug report have to be resolved as a workaround? What are the repairs and impacts of workarounds? In this paper, we conduct an empirical study to explore the above research questions. In particular, we analyzed 221 real workarounds collected from 88 Apache projects. Our results yield ten findings and answer all of the above questions. Our findings are useful for understanding workarounds and for improving software projects and issue trackers.

DOI: 10.1145/3510454.3528641


