FSE 2023 | 逸翎清晗🌈

Towards AI-Driven Software Development: Challenges and Lessons from the Field (Keynote)

Authors: Yahav, Eran
Keywords: No keywords

Abstract

AI is changing the way we develop software. AI is becoming powerful enough to change the nature of the interaction between humans and machines, not merely to raise the level of abstraction. AI-driven software development is poised to transform the entire software development lifecycle (SDLC). As we move towards AI-driven software development, we must revisit some fundamental assumptions and address the following challenges:

• How does the SDLC change when autonomous agents can handle some tasks? What is the role of code and version control?

• Interaction model: What is the right human-machine interaction? How do we best communicate intent to the AI? How do we best consume results?

• Contextual awareness: How do we make the AI contextually aware of our development environment? Can we make the AI hyper-local and tailored to our problem and solution domains?

• Trust: How can we trust the suggested results? How can we trust results that are not provided as code?

In this talk, we will start with practical AI-assisted software development, including lessons from the field, based on our experience serving millions of users with Tabnine. We will cover different tasks in the SDLC and various techniques for addressing them in the face of the challenges above.

DOI: 10.1145/3611643.3633451


Getting Outside the Bug Boxes (Keynote)

Authors: Burnett, Margaret
Keywords: No keywords

Abstract

Sometimes, we humans find ourselves a bit slow to abandon the comfort of sitting “inside the box”, and this can detract from our ability to innovate. In this talk, I’ll share some outside-the-box perspectives, gleaned from decades of software engineering work, on boxes I’ve seen when thinking about bugs — from failures to faults, from finding to fixing, and from traditional to very non-traditional notions of “what counts” as a bug. I’ll consider the intellectually freeing perspectives that can come from moving outside the “mechanisms” box to policies; the enhancement to applicability from moving outside sub-sub-area boxes to the whole software lifecycle; the differences revealed when moving outside the “typical developer” box to diverse humans; and the plethora of possibilities arising from moving outside the “buggy code” box to a wide range of bug types.

DOI: 10.1145/3611643.3633452


A Four-Year Study of Student Contributions to OSS vs. OSS4SG with a Lightweight Intervention

Authors: Fang, Zihan and Endres, Madeline and Zimmermann, Thomas and Ford, Denae and Weimer, Westley and Leach, Kevin and Huang, Yu
Keywords: CS Education, Open Source Software, Social Good

Abstract

Modern software engineering practice and training increasingly rely on Open Source Software (OSS). The recent growth in demand for professional software engineers has led to increased contributions to, and usage of, OSS. However, there is limited understanding of the factors affecting how developers, and how new or student developers in particular, decide which OSS projects to contribute to, a process critical to OSS sustainability, access, adoption, and growth. To better understand OSS contributions from the developers of tomorrow, we conducted a four-year study with 1,361 students investigating the life cycle of their contributions (from project selection to pull request acceptance). During the study, we also delivered a lightweight intervention to promote the awareness of open source projects for social good (OSS4SG), OSS projects that have positive impacts in other domains. Using both quantitative and qualitative methods, we analyze student experience reports and the pull requests they submit. Compared to general OSS projects, we find significant differences in project selection (𝑝 < 0.0001, effect size = 0.84), student motivation (𝑝 < 0.01, effect size = 0.13), and increased pull-request acceptance rates for OSS4SG contributions. We also find that our intervention correlates with increased student contributions to OSS4SG (𝑝 < 0.0001, effect size = 0.38). Finally, we analyze correlations of factors such as gender or working with a partner. Our findings may help improve the experience for new developers participating in OSS4SG and the quality of their contributions. We also hope our work helps educators, project leaders, and contributors to build a mutually-beneficial framework for the future growth of OSS4SG.

DOI: 10.1145/3611643.3616250


Do CONTRIBUTING Files Provide Information about OSS Newcomers’ Onboarding Barriers?

Authors: Fronchetti, Felipe and Shepherd, David C. and Wiese, Igor and Treude, Christoph and Gerosa, Marco Aurélio
Keywords: FLOSS, novices, onboarding, open source, software engineering

Abstract

Effectively onboarding newcomers is essential for the success of open source projects. These projects often provide onboarding guidelines in their 'CONTRIBUTING' files (e.g., CONTRIBUTING.md on GitHub). These files explain, for example, how to find open tasks, implement solutions, and submit code for review. However, these files often do not follow a standard structure, can be too large, and miss barriers commonly faced by newcomers. In this paper, we propose an automated approach to parse these CONTRIBUTING files and assess how they address onboarding barriers. We manually classified a sample of files according to a model of onboarding barriers from the literature, trained a machine learning classifier that automatically predicts the categories of each paragraph (precision: 0.655, recall: 0.662), and surveyed developers to investigate their perspective on the predictions' adequacy (75% of the predictions were considered adequate). We found that CONTRIBUTING files typically do not cover the barriers newcomers face (52% of the analyzed projects missed at least 3 out of the 6 barriers faced by newcomers; 84% missed at least 2). Our analysis also revealed that information about choosing a task and talking with the community, two of the most recurrent barriers newcomers face, is neglected in more than 75% of the projects. We made our classifier available as an online service that analyzes the content of a given CONTRIBUTING file. Our approach may help community builders identify missing information in the project ecosystem they maintain and help newcomers understand what to expect in CONTRIBUTING files.
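
The paper trains a text classifier over CONTRIBUTING-file paragraphs. A minimal sketch of that kind of paragraph classifier is shown below, assuming TF-IDF features and a linear model; the barrier labels and example paragraphs are illustrative placeholders, not the authors' dataset or pipeline.

# Minimal sketch of a CONTRIBUTING-paragraph classifier (illustrative only; the
# real model, labels, and training data come from the paper's replication package).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training paragraphs labeled with onboarding-barrier categories.
paragraphs = [
    "Look for issues tagged 'good first issue' to find a task.",
    "Run the test suite with pytest before opening a pull request.",
    "Join our Discord server if you have questions.",
]
labels = ["choosing_a_task", "local_workflow", "community_contact"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(paragraphs, labels)

print(clf.predict(["Open an issue and ask maintainers which tasks are available."]))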

DOI: 10.1145/3611643.3616288


How Early Participation Determines Long-Term Sustained Activity in GitHub Projects?

Authors: Xiao, Wenxin and He, Hao and Xu, Weiwei and Zhang, Yuxia and Zhou, Minghui
Keywords: early participation, open-source software, sustained activity

Abstract

Although the open source model bears many advantages in software development, open source projects are always hard to sustain. Previous research on open source sustainability mainly focuses on projects that have already reached a certain level of maturity (e.g., with communities, releases, and downstream projects). However, limited attention is paid to the development of (sustainable) open source projects in their infancy, and we believe an understanding of early sustainability determinants is crucial for project initiators, incubators, newcomers, and users.

In this paper, we aim to explore the relationship between early participation factors and long-term project sustainability. We leverage a novel methodology combining the Blumberg model of performance and machine learning to predict the sustainability of 290,255 GitHub projects. Specifically, we train an XGBoost model based on early participation (first three months of activity) in 290,255 GitHub projects, and we interpret the model using LIME. We quantitatively show that early participants have a positive effect on a project's future sustained activity if they have prior experience in OSS project incubation and demonstrate concentrated focus and steady commitment. Participation from non-code contributors and detailed contribution documentation also promote a project's sustained activity. Compared with individual projects, building a community that consists of more experienced core developers and more active peripheral developers is important for organizational projects. This study provides unique insights into the incubation and recognition of sustainable open source projects, and our interpretable prediction approach can also offer guidance to open source project initiators and newcomers.
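
A small sketch of the XGBoost-plus-LIME pipeline the abstract outlines follows; the feature names and data here are made up for illustration and are not the paper's actual early-participation features.

# Sketch: predict long-term sustainability from early-participation features,
# then explain one prediction with LIME (feature names and data are invented).
import numpy as np
import xgboost as xgb
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["core_dev_experience", "commit_steadiness", "non_code_contributors", "doc_detail"]
rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # toy label: "sustained" vs "inactive"

model = xgb.XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 class_names=["inactive", "sustained"], mode="classification")
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())   # per-feature contributions for this one project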

DOI: 10.1145/3611643.3616349


Matching Skills, Past Collaboration, and Limited Competition: Modeling When Open-Source Projects Attract Contributors

Authors: Fang, Hongbo and Herbsleb, James and Vasilescu, Bogdan
Keywords: labor pool, open source, sustainability

Abstract

Attracting and retaining new developers is often at the heart of open-source project sustainability and success. Previous research found many intrinsic (or endogenous) project characteristics associated with the attractiveness of projects to new developers, but the impact of factors external to the project itself has largely been overlooked. In this work, we focus on one such external factor, a project's labor pool, which is defined as the set of contributors active in the overall open-source ecosystem that the project could plausibly attempt to recruit from at a given time. How are the size and characteristics of the labor pool associated with a project's attractiveness to new contributors? Through an empirical study of 516,893 Python projects, we found that the size of the project's labor pool, the technical skill match, and the social connection between the project's labor pool and members of the focal project all significantly influence the number of new developers that the focal project attracts, with the competition between projects with overlapping labor pools also playing a role. Overall, the labor pool factors add considerable explanatory power compared to models with only project-level characteristics.

DOI: 10.1145/3611643.3616282


Accelerating Continuous Integration with Parallel Batch Testing

Authors: Fallahzadeh, Emad and Bavand, Amir Hossein and Rigby, Peter C.
Keywords: Batch Testing, Execution Reduction, Feedback, Large-Scale, Parallel

Abstract

Continuous integration at scale is costly but essential to software development. Various test optimization techniques, including test selection and prioritization, aim to reduce the cost. Test batching is an effective but overlooked alternative. This study evaluates the effect of parallelization by adjusting the machine count for test batching and introduces two novel approaches. We establish TestAll as a baseline to study the impact of parallelism and machine count on feedback time. We re-evaluate ConstantBatching and introduce DynamicBatching, which adapts batch size based on the remaining changes in the queue. We also propose TestCaseBatching, which enables new builds to join a batch before full test execution, thus speeding up continuous integration. Our evaluations utilize results from Ericsson and 276 million test outcomes from the open-source Chrome project, assessing feedback time and execution reduction, and we provide access to the Chrome project scripts and data. The results reveal a non-linear impact of test parallelization on feedback time, as each test delay compounds across the entire test queue. ConstantBatching, with a batch size of 4, uses up to 72% fewer machines to maintain the actual average feedback time and provides a constant execution reduction of up to 75%. Similarly, DynamicBatching maintains the actual average feedback time with up to 91% fewer machines and exhibits variable execution reduction of up to 99%. TestCaseBatching maintains the actual average feedback time with up to 81% fewer machines and demonstrates variable execution reduction of up to 67%. We recommend practitioners use DynamicBatching and TestCaseBatching to reduce the required testing machines efficiently. Analyzing historical data to find the threshold where adding more machines has minimal impact on feedback time is also crucial for resource-effective testing.
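
To make the dynamic-batching idea concrete, here is a small sketch in which the batch size grows with the remaining queue; the sizing rule is a guess for illustration, not the exact policy evaluated in the paper.

# Illustrative dynamic batching: batch size adapts to how many changes are queued.
# The sizing rule below is a placeholder, not the policy from the paper.
from collections import deque

def next_batch(queue: deque, max_batch: int = 16) -> list:
    """Take a batch whose size grows with the backlog, capped at max_batch."""
    size = max(1, min(max_batch, len(queue) // 2))
    return [queue.popleft() for _ in range(min(size, len(queue)))]

def run_ci(changes):
    queue = deque(changes)
    while queue:
        batch = next_batch(queue)
        # A real CI system would trigger one test execution for the whole batch and
        # fall back to per-change runs (culprit finding) only if the batch fails.
        print(f"testing batch of {len(batch)}: {batch}")

run_ci([f"change-{i}" for i in range(10)])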

DOI: 10.1145/3611643.3616255


DistXplore: Distribution-Guided Testing for Evaluating and Enhancing Deep Learning Systems

Authors: Wang, Longtian and Xie, Xiaofei and Du, Xiaoning and Tian, Meng and Guo, Qing and Yang, Zheng and Shen, Chao
Keywords: Deep learning, distribution diversity, model enhancement, software testing

Abstract

Deep learning (DL) models are trained on sampled data, where the distribution of training data differs from that of real-world data (i.e., the distribution shift), which reduces the model's robustness. Various testing techniques have been proposed, including distribution-unaware and distribution-aware methods. However, distribution-unaware testing lacks effectiveness by not explicitly considering the distribution of test cases and may generate redundant errors (within the same distribution). Distribution-aware testing techniques primarily focus on generating test cases that follow the training distribution, missing out-of-distribution data that may also be valid and should be considered in the testing process.

In this paper, we propose a novel distribution-guided approach for generating valid test cases with diverse distributions, which can better evaluate the model's robustness (i.e., generating hard-to-detect errors) and enhance the model's robustness (i.e., enriching training data). Unlike existing testing techniques that optimize individual test cases, DistXplore optimizes test suites that represent specific distributions. To evaluate and enhance the model's robustness, we design two metrics: distribution difference, which maximizes the similarity in distribution between two different classes of data to generate hard-to-detect errors, and distribution diversity, which increases the distribution diversity of generated test cases for enhancing the model's robustness. To evaluate the effectiveness of DistXplore in model evaluation and enhancement, we compare DistXplore with 14 state-of-the-art baselines on 10 models across 4 datasets. The evaluation results show that DistXplore not only detects a larger number of errors (e.g., 2×

DOI: 10.1145/3611643.3616266


Artifact for ESEC/FSE 2023 Article "CAmpactor: A Novel and Effective Local Search Algorithm for Optimizing Pairwise Covering Arrays"

Authors: Zhao, Qiyuan and Luo, Chuan and Cai, Shaowei and Wu, Wei and Lin, Jinkun and Zhang, Hongyu and Hu, Chunming
Keywords: Covering array, Local search, Software testing

Abstract

Combinatorial interaction testing (CIT) stands as a widely adopted testing technique for testing interactions among options within highly configurable systems. Within the realm of CIT, covering arrays refer to the test suites that are able to cover all such interactions, usually subject to certain hard constraints. Specifically, pairwise covering arrays (PCAs) are extensively utilized, because they are capable of obtaining a good balance between testing costs and the capability to disclose faults.

CAmpactor is a novel and effective local search algorithm for compacting given PCAs into smaller-sized ones, and it significantly advances the state of the art in building PCAs. In this artifact, we provide the implementation of CAmpactor, the testing instances adopted in the experiments and the detailed evaluation results.

DOI: 10.1145/3611643.3616284


Replication Package of the ESEC/FSE 2023 Paper Entitled “Design by Contract for Deep Learning APIs”

Authors: Ahmed, Shibbir and Imtiaz, Sayem Mohammad and Khairunnesa, Samantha Syeda and Cruz, Breno Dantas and Rajan, Hridesh
Keywords: API contracts, Deep learning, specification language

Abstract

This repository contains the reproducibility package, source code, benchmark, and results for the paper - “Design by Contract for Deep Learning APIs”, which appeared in ESEC/FSE’2023: The 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering at San Francisco, California.

DOI: 10.1145/3611643.3616247


Reproduction package for article “Testing Coreference Resolution Systems without Labeled Test Sets”

Authors: Cao, Jialun and Lu, Yaojie and Wen, Ming and Cheung, Shing-Chi
Keywords: Coreference resolution testing, Metamorphic testing, SE4AI

Abstract

This artifact is a reproduction package for the article “Testing Coreference Resolution Systems without Labeled Test Sets”. The package includes (1) the source code of CREST, (2) experimental results of comparisons with baselines, and (3) labeling results of the human evaluation. The purpose of this artifact is to support reference, reproduction, and reuse of CREST's components.

DOI: 10.1145/3611643.3616258


Neural-Based Test Oracle Generation: A Large-Scale Evaluation and Lessons Learned

Authors: Hossain, Soneya Binta and Filieri, Antonio and Dwyer, Matthew B. and Elbaum, Sebastian and Visser, Willem
Keywords: EvoSuite, Mutation Testing, Neural Test Oracle Generation, TOGA

Abstract

The artifact contains all required data, tools, scripts, and complete documentation for replication.

The README provides all necessary details about each directory in this artifact. Additionally, we provide a README for each individual RQ.

The structure of the artifact is as follows:

  • evosuite-artifacts – contains the original test suite generated by EvoSuite for all 25 subjects

  • RQ1 – replication package for RQ1, includes data, scripts and documentation

  • RQ2 – replication package for RQ2, includes data, scripts and documentation

  • RQ3 – replication package for RQ3, includes data, scripts and documentation

DOI: 10.1145/3611643.3616265


Implementation of paper “Revisiting Neural Program Smoothing for Fuzzing”

Authors: Nicolae, Maria-Irina and Eisele, Max and Zeller, Andreas
Keywords: fuzzing, machine learning, neural networks, neural program smoothing, Python

Abstract

The package contains two Python artifacts:

  • Neuzz++: the implementation of neural program smoothing for fuzzing designed in the paper

  • MLFuzz: a benchmarking framework for fuzzing with machine learning

DOI: 10.1145/3611643.3616308


RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair

Authors: Wang, Weishi and Wang, Yue and Joty, Shafiq and Hoi, Steven C.H.
Keywords: Automated program repair, Neural networks, Pretrained language models, Retrieval-augmented generation

Abstract

Automatic program repair (APR) is crucial to reduce manual debugging efforts for developers and improve software reliability. While conventional search-based techniques typically rely on heuristic rules or a redundancy assumption to mine fix patterns, recent years have witnessed a surge of deep learning (DL) based approaches that automate the program repair process in a data-driven manner. However, their performance is often limited by a fixed set of parameters used to model the highly complex search space of APR. To ease this burden on the parametric models, in this work we propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) that explicitly leverages relevant fix patterns retrieved from a codebase of previous bug-fix pairs. Specifically, we build a hybrid patch retriever to account for both lexical and semantic matching based on the raw source code in a language-agnostic manner, which does not rely on any code-specific features. In addition, we adapt CodeT5, a code-aware language model, as our foundation model to facilitate both patch retrieval and generation tasks in a unified manner. We adopt a stage-wise approach where the patch retriever first retrieves a relevant external bug-fix pair to augment the buggy input for the CodeT5 patch generator, which synthesizes a ranked list of repair patch candidates. Notably, RAP-Gen is a generic APR framework that can flexibly integrate different patch retrievers and generators to repair various types of bugs. We thoroughly evaluate RAP-Gen on three benchmarks in two programming languages: the TFix benchmark in JavaScript, and the Code Refinement and Defects4J benchmarks in Java, where bug localization information may or may not be provided. Experimental results show that RAP-Gen significantly outperforms previous state-of-the-art (SoTA) approaches on all benchmarks, e.g., boosting the accuracy of T5-large on TFix from 49.70% to 54.15% (repairing 478 more bugs) and repairing 15 more bugs on 818 Defects4J bugs. Further analysis reveals that our patch retriever can search for relevant fix patterns to guide the APR systems.
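
The following sketch illustrates only the retrieval-augmentation step: find the most similar past bug-fix pair and prepend it to the buggy input before generation. TF-IDF cosine similarity stands in for the paper's hybrid lexical/semantic retriever, and the prompt format and codebase are invented for illustration.

# Sketch of retrieval-augmented repair input construction (illustrative stand-in
# for RAP-Gen's hybrid retriever; the generator would be a model such as CodeT5).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bug_fix_codebase = [
    ("if (x = 0) { ... }", "if (x == 0) { ... }"),
    ("for (i = 0; i <= n; i++) a[i] = 0;", "for (i = 0; i < n; i++) a[i] = 0;"),
]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
index = vectorizer.fit_transform([bug for bug, _ in bug_fix_codebase])

def build_repair_prompt(buggy_code: str) -> str:
    scores = cosine_similarity(vectorizer.transform([buggy_code]), index)[0]
    bug, fix = bug_fix_codebase[scores.argmax()]
    # The augmented input would then be fed to the patch generator.
    return f"retrieved bug: {bug}\nretrieved fix: {fix}\nbuggy code: {buggy_code}\npatch:"

print(build_repair_prompt("if (count = MAX) { stop(); }"))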

DOI: 10.1145/3611643.3616256


Artifact for FSE’23 paper “From Leaks to Fixes: Automated Repairs for Resource Leak Warnings”

Authors: Utture, Akshay and Palsberg, Jens
Keywords: Automated Repair, Resource Leaks, Static Analysis

Abstract

The artifact includes the source code, experimental results, and detailed documentation. It also includes a VM image that comes with pre-installed dependencies, and can be used to quickly reproduce the results of the paper by running a few simple scripts.

DOI: 10.1145/3611643.3616267


Reproduction Package (Docker Image) for the ESEC/FSE 2023 Paper “Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair”

Authors: Wei, Yuxiang and Xia, Chunqiu Steven and Zhang, Lingming
Keywords: Artifact, Docker Image, Repilot

Abstract

This is the artifact accompanying our ESEC/FSE’23 paper “Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair”. For user convenience, we deliver our artifact in the form of a Docker image that has resolved all the software dependencies beforehand. The Docker image comprises (1) the source code of Repilot, the patch generation tool introduced in the paper, (2) all the data needed to reproduce the experiments done for the paper, (3) a detailed documentation on how to achieve the experimental results step-by-step, and (4) the Dockerfile we use to create this image.

DOI: 10.1145/3611643.3616271


Artifact for “SmartFix: Fixing Vulnerable Smart Contracts by Accelerating Generate-and-Verify Repair using Statistical Models”

Authors: So, Sunbeom and Oh, Hakjoo
Keywords: generate-and-verify repair, smart contract, statistical model

Abstract

This artifact contains the package for reproducing the main experimental results in our paper accepted to ESEC/FSE 2023: “SmartFix: Fixing Vulnerable Smart Contracts by Accelerating Generate-and-Verify Repair using Statistical Models”

DOI: 10.1145/3611643.3616341


Automatically Resolving Dependency-Conflict Building Failures via Behavior-Consistent Loosening of Library Version Constraints

Authors: Wang, Huiyan and Liu, Shuguan and Zhang, Lingyu and Xu, Chang
Keywords: Dependency conflict, loosening resolution, version constraint

Abstract

Python projects grow quickly through code reuse and build automation based on third-party libraries. However, the version constraints associated with these libraries are prone to mal-configuration, and this forms a major obstacle to correct project building (known as dependency-conflict (DC) building failure). Our empirical findings suggest that such mal-configured version constraints were mainly prepared manually and could essentially be refined for better quality to improve the chance of successful project building. We propose LooCo, an approach to refining Python projects' library version constraints by automatically loosening them to maximize their solutions while ensuring the libraries keep their original behaviors. Our experimental results with real-life Python projects report that LooCo could efficiently refine library version constraints (0.4s per version loosening) through effective automatic loosening (5.5 new versions expanded on average), transform 54.8% of originally unsolvable cases into solvable ones (i.e., successful building), and significantly increase the number of solutions (21 more on average) for originally solvable cases.
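
A minimal sketch of the loosening step follows, using the packaging library to test candidate versions against a constraint; the behavior-consistency check that the paper's approach performs is stubbed out, and the version lists are examples only.

# Sketch: loosen a version constraint by admitting neighbouring releases that pass a
# (stubbed) behavior-consistency check; this is not LooCo's actual algorithm.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

available = ["1.19.0", "1.20.0", "1.21.0", "1.22.0", "1.23.0"]
original = SpecifierSet("==1.21.0")

def behavior_consistent(version: str) -> bool:
    # Placeholder: the real check verifies the library keeps its originally used behaviors.
    return version != "1.19.0"

loosened = sorted(
    (v for v in available if original.contains(v) or behavior_consistent(v)),
    key=Version,
)
print(f"original: {original}, loosened to: >={loosened[0]},<={loosened[-1]}")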

DOI: 10.1145/3611643.3616264


Replication package for the paper: “On the Relationship Between Code Verifiability and Understandability”

Authors: Feldman, Kobi and Kellogg, Martin and Chaparro, Oscar
Keywords: code comprehension, meta-analysis, static analysis, Verification

Abstract

This is the FSE’23 replication package of our meta-analysis that assesses the relationship between code verifiability and understandability. The package includes code snippets, human-based comprehensibility measurements, verification tools, scripts to process tool output and produce the study results, the raw study results, and documentation for replication.

DOI: 10.1145/3611643.3616242


Towards Greener Yet Powerful Code Generation via Quantization: An Empirical Study

Authors: Wei, Xiaokai and Gonugondla, Sujan Kumar and Wang, Shiqi and Ahmad, Wasi and Ray, Baishakhi and Qian, Haifeng and Li, Xiaopeng and Kumar, Varun and Wang, Zijian and Tian, Yuchen and Sun, Qing and Athiwaratkun, Ben and Shang, Mingyue and Ramanathan, Murali Krishna and Bhatia, Parminder and Xiang, Bing
Keywords: Code Generation, Generative AI, Large Language Models, Model Hosting, Quantization

Abstract

ML-powered code generation aims to assist developers in writing code more productively by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have pushed the boundary of code generation and achieved impressive performance. However, the huge number of model parameters poses a significant challenge to their adoption in a typical software development environment, where a developer might use a standard laptop or mid-size server to develop code. Such large models cost significant resources in terms of memory, latency, dollars, and carbon footprint. Model compression is a promising approach to address these challenges. We have identified quantization as one of the most promising compression techniques for code generation, as it avoids expensive retraining costs. As quantization represents model parameters with lower-bit integers (e.g., int8), both model size and runtime latency benefit. We empirically evaluate quantized models on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. Through systematic experiments we find a code-aware quantization recipe that can run even a 6-billion-parameter model on a regular laptop without significant accuracy or robustness degradation. We find that the recipe is readily applicable to the code summarization task as well.
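
As a minimal illustration of the general technique, the snippet below applies post-training dynamic int8 quantization to a toy PyTorch model; the paper's code-aware recipe for billion-parameter code models involves additional choices beyond this sketch.

# Post-training dynamic int8 quantization of a small model with PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model.eval()

# Replace Linear layers with dynamically quantized (int8-weight) versions.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface and output shape as the float model
print(quantized)            # Linear layers now shown as dynamically quantized modules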

DOI: 10.1145/3611643.3616302


Statfier: Automated Testing of Static Analyzers via Semantic-Preserving Program Transformations

Authors: Zhang, Huaien and Pei, Yu and Chen, Junjie and Tan, Shin Hwei
Keywords: program transformation, rule-based static analysis, software testing

Abstract

Static analyzers reason about the behaviors of programs without executing them and report issues when they violate pre-defined desirable properties. One of the key limitations of static analyzers is their tendency to produce inaccurate and incomplete analysis results, i.e., they often generate too many spurious warnings and miss important issues. To help enhance the reliability of a static analyzer, developers usually manually write tests involving input programs and the corresponding expected analysis results for the analyzers. Meanwhile, a static analyzer often includes example programs in its documentation to demonstrate the desirable properties and/or their violations. Our key insight is that we can reuse programs extracted either from the official test suite or documentation and apply semantic-preserving transformations to them to generate variants. We studied the quality of input programs from these two sources and found that most rules in static analyzers are covered by at least one input program, implying the potential of using these programs as the basis for test generation. We present Statfier, a heuristic-based automated testing approach for static analyzers that generates program variants via semantic-preserving transformations and detects inconsistencies between the original program and variants (which indicate inaccurate analysis results in the static analyzer). To select variants that are more likely to reveal new bugs, Statfier uses two key heuristics: (1) analysis report guided location selection, which uses program locations in the reports produced by static analyzers to perform transformations, and (2) structure diversity driven variant selection, which chooses variants with different program contexts and diverse types of transformations. Our experiments with five popular static analyzers show that Statfier can find 79 bugs in these analyzers, of which 46 have been confirmed.
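
A rough sketch of the core inconsistency check in this style of testing is shown below: analyze a seed program and a semantics-preserving variant, then flag warnings that differ. The analyzer command and the toy transformation are placeholders, not Statfier's implementation.

# Sketch of a metamorphic consistency check for a static analyzer.
import subprocess
import tempfile

def analyze(path: str) -> set:
    # Placeholder invocation; Statfier targets real rule-based analyzers such as PMD.
    result = subprocess.run(["some-static-analyzer", path], capture_output=True, text=True)
    return set(result.stdout.splitlines())

def make_variant(source: str) -> str:
    # Toy semantics-preserving transformation: wrap a condition in a double negation.
    return source.replace("if (ready)", "if (!(!ready))")

def check(seed_path: str) -> set:
    with open(seed_path) as f:
        seed_source = f.read()
    with tempfile.NamedTemporaryFile("w", suffix=".java", delete=False) as variant:
        variant.write(make_variant(seed_source))
    inconsistent = analyze(seed_path) ^ analyze(variant.name)
    if inconsistent:
        print("potential analyzer bug, warnings that differ:", inconsistent)
    return inconsistent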

DOI: 10.1145/3611643.3616272


Contextual Predictive Mutation Testing

Authors: Jain, Kush and Alon, Uri and Groce, Alex and Le Goues, Claire
Keywords: code coverage, mutation analysis, test oracles

Abstract

Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect the synthetic bugs, without actually running those tests. However, existing predictive mutation testing approaches still misclassify 33% of detection outcomes on a randomly sampled set of mutant-test suite pairs. We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and test method, capturing key context in the input representation. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. MutationBERT, also outperforms the state-of-the-art in both same project and cross project settings, with meaningful improvements in precision, recall, and F1 score. We validate our input representation, and aggregation approaches for lifting predictions from the test matrix level to the test suite level, finding similar improvements in performance. MutationBERT not only enhances the state-of-the-art in predictive mutation testing, but also presents practical benefits for real-world applications, both in saving developer time and finding hard to detect mutants that prior approaches do not.
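
A sketch of the paired-input encoding idea follows: a mutated method and a test method are fed together to a transformer classifier that predicts whether the test kills the mutant. The checkpoint used here is a generic code model for illustration, not the trained MutationBERT model.

# Sketch: encode (mutated method, test method) as a sentence pair for a classifier
# that predicts "killed" vs "survived" (untrained head, illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/codebert-base", num_labels=2)

mutated_method = "int abs(int x) { return x > 0 ? x : x; }"   # mutant: dropped the negation
test_method = "void testAbs() { assertEquals(3, abs(-3)); }"

inputs = tokenizer(mutated_method, test_method, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)   # would read as [p(survived), p(killed)] once the head is fine-tuned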

DOI: 10.1145/3611643.3616289


𝜇Akka: Mutation Testing for Actor Concurrency in Akka using Real-World Bugs

Authors: Moradi Moghadam, Mohsen and Bagherzadeh, Mehdi and Khatchadourian, Raffi and Bagheri, Hamid
Keywords: Actor concurrency, Akka, Mutant quality, Mutation operators, Mutation testing, Test effectiveness, 𝜇Akka

Abstract

Actor concurrency is becoming increasingly important in real-world and mission-critical software. This requires these applications to be free from actor bugs that occur in the real world and to have tests that are effective in finding these bugs. Mutation testing is a well-established technique that transforms an application to induce its likely bugs and evaluate the effectiveness of its tests in finding these bugs. Mutation testing is available for a broad spectrum of applications and their bugs, ranging from web to mobile to machine learning, and is used at scale in companies like Google and Facebook. However, there still is no mutation testing for actor concurrency that uses real-world actor bugs. In this paper, we propose 𝜇Akka, a framework for mutation testing of Akka actor concurrency using real actor bugs. Akka is a popular industrial-strength implementation of actor concurrency. To design, implement, and evaluate 𝜇Akka, we take the following major steps: (1) manually analyze a recent set of 186 real Akka bugs from Stack Overflow and GitHub to understand their causes; (2) design a set of 32 mutation operators, with 138 source code changes in the Akka API, to emulate these causes and induce their bugs; (3) implement these operators in an Eclipse plugin for Java Akka; (4) use the plugin to generate 11.7k mutants of 10 real GitHub applications, with 446.4k lines of code and 7.9k tests; (5) run these tests on these mutants to measure the quality of mutants and the effectiveness of tests; (6) use PIT to generate 26.2k mutants to compare 𝜇Akka and PIT mutant quality and test effectiveness (PIT is a popular mutation testing tool with traditional operators); (7) manually analyze the bug coverage and overlap of 𝜇Akka, PIT, and the actor operators of a previous work; and (8) discuss a few implications of our findings. Among others, we find that 𝜇Akka mutants are of higher quality, cover more bugs, and tests are less effective in detecting them.

DOI: 10.1145/3611643.3616362


EvaCRC: Evaluating Code Review Comments

Authors: Yang, Lanxin and Xu, Jinwei and Zhang, Yifan and Zhang, He and Bacchelli, Alberto
Keywords: Code review, quality evaluation, review comments

Abstract

In code reviews, developers examine code changes authored by peers and provide feedback through comments. Despite the importance of these comments, no accepted approach currently exists for assessing their quality. Therefore, this study has two main objectives: (1) to devise a conceptual model for an explainable evaluation of review comment quality, and (2) to develop models for the automated evaluation of comments according to the conceptual model. To do so, we conduct mixed-method studies and propose a new approach: EvaCRC (Evaluating Code Review Comments). To achieve the first goal, we collect and synthesize quality attributes of review comments by triangulating data from both authoritative documentation on code review standards and academic literature. We then validate these attributes using real-world instances. Finally, we establish mappings between quality attributes and grades by consulting domain experts, thus defining our final explainable conceptual model. To achieve the second goal, EvaCRC leverages multi-label learning. To evaluate and refine EvaCRC, we conduct an industrial case study with a global ICT enterprise. The results indicate that EvaCRC can effectively evaluate review comments while offering reasons for the grades. Data and materials: https://doi.org/10.5281/zenodo.8297481
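
Since the abstract names multi-label learning without detail, here is a minimal sketch of what multi-label classification over review comments can look like; the attribute names, example comments, and model choice are all illustrative assumptions, not EvaCRC's actual design.

# Sketch: each review comment may carry several quality attributes at once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

comments = [
    "Please rename this variable and add a null check before dereferencing.",
    "Why is this loop needed? A comment explaining the intent would help.",
    "Nit: trailing whitespace.",
]
attribute_sets = [{"actionable", "specific"}, {"question", "understandability"}, {"nitpick"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(attribute_sets)

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(comments, Y)
prediction = clf.predict(["Consider extracting this into a helper method."])[0]
print(dict(zip(mlb.classes_, prediction)))   # 0/1 flag per quality attribute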

DOI: 10.1145/3611643.3616245


Reproduction package for HyperDiff

Authors: Le Dilavrec, Quentin and Khelladi, Djamel Eddine and Blouin, Arnaud and Jézéquel, Jean-Marc
Keywords: Code history mining, Diff, Edit script, Temporal code analysis

Abstract

The artifact allows reproducing the results from the associated article. It contains: the implementation of our approach, the baseline tool, the scripts to run the experiments, and the notebooks to compute and plot the figures.

DOI: 10.1145/3611643.3616312


Understanding Solidity Event Logging Practices in the Wild

Authors: Li, Lantian and Liang, Yejian and Liu, Zhihao and Yu, Zhongxing
Keywords: Ethereum, Solidity, empirical study, event, logging

Abstract

Writing logging messages is a well-established conventional programming practice, and it is of vital importance for a wide variety of software development activities. The logging mechanism in Solidity programming is enabled by the high-level event feature, but up to now there has been no study aimed at understanding Solidity event logging practices in the wild. To fill this gap, in this paper we provide the first quantitative characteristic study of current Solidity event logging practices using 2,915 popular Solidity projects hosted on GitHub. The study methodically explores the pervasiveness of event logging, the quality of current event logging practices, and in particular the reasons for event logging code evolution, and delivers 8 original and important findings. The findings notably include the existence of a large percentage of independent event logging code modifications, and the underlying reasons for different categories of independent event logging code modifications are diverse (for instance, bug fixing and gas saving). We additionally give the implications of our findings, and these implications can enlighten developers, researchers, tool builders, and language designers to improve event logging practices. To illustrate the potential benefits of our study, we develop a proof-of-concept checker on top of one of our findings; the checker effectively detects problematic event logging code that consumes extra gas in 35 popular GitHub projects, and 9 project owners have already confirmed the detected issues.

DOI: 10.1145/3611643.3616342


Replication package for paper “An Automated Approach to Extracting Local Variables”

Authors: Chi, Xiaye and Liu, Hui and Li, Guangjie and Wang, Weixiao and Xia, Yunni and Jiang, Yanjie and Zhang, Yuxia and Ji, Weixing
Keywords: Bugs, Extract Local Variable, Reliable, Software Refactoring

Abstract

This is the replication package for the FSE submission, containing both the tools and the data required for replication. It also provides detailed instructions to replicate the evaluation.

DOI: 10.1145/3611643.3616261


Reproduction Package for Article "Statistical Reachability Analysis"

Authors: Lee, Seongmin and Böhme, Marcel
Keywords: Markov chain, Quantitative reachability analysis, Reaching probability, Statistical reachability analysis

Abstract

Artifact for the project “Statistical Reachability Analysis”

This repository contains the artifact of the paper “Statistical Reachability Analysis” submitted to the ESEC/FSE 2023 conference.

Artifact structure

The artifact is structured as follows:

├── README.md (this file)
├── rq1 (folder containing the data for the results of RQ1)
│   ├── laplace (folder containing the data for the Laplace estimator)
│   │   └── RQ1-Laplace.ipynb (Jupyter notebook to generate the RQ1 results for the Laplace estimator)
│   ├── preach (folder containing the data for PReach)
│   └── pse (folder containing the data for PSE)
├── rq2 (folder containing the data for the results of RQ2)
│   ├── fuzz-data (folder containing the fuzzing data)
│   ├── figures (folder containing the figures)
│   ├── esti-result (folder containing the estimation results of the statistical reachability estimators)
│   ├── scripts (folder containing the scripts to generate the estimation results)
│   ├── sra (folder containing the source code of the SRA tool)
│   └── RQ2-estimate.ipynb (Jupyter notebook to generate the RQ2 estimation results)
└── RQ2-timespent.ipynb (Jupyter notebook to generate the RQ2 time-spent results)

DOI: 10.1145/3611643.3616268


Artifact for “PPR: Pairwise Program Reduction”

Authors: Zhang, Mengxiao and Xu, Zhenyang and Tian, Yongqiang and Jiang, Yu and Sun, Chengnian
Keywords: Bug Isolation, Delta Debugging, Program Reduction

Abstract

This artifact contains the source code, benchmarks, scripts, and documentation for reproducing the evaluation results described in the paper “PPR: Pairwise Program Reduction” accepted at ESEC/FSE 2023.

DOI: 10.1145/3611643.3616275


Dataset and Experiment Scripts for Article “When Function Inlining Meets WebAssembly: Counterintuitive Impacts on Runtime Performance”

Authors: Romano, Alan and Wang, Weihang
Keywords: Binaryen, Emscripten, Function Inlining, LLVM, WebAssembly

Abstract

In this artifact, we provide the experiment results and scripts used to run the experiments described in our accompanying paper. We present the raw runtime results from our Baseline experiment, Experiments 1-5, and the Libsodium.js case study as CSV files. These results include runtime measurements from four optimization levels (O0-O3) and the two browsers analyzed, Chromium and Firefox. This artifact also contains the scripts that we used to compile the samples and run our experiments. We also include the Emscripten-generated WebAssembly, HTML, and JS files used to run the samples for each experiment.

DOI: 10.1145/3611643.3616311


Self-Supervised Query Reformulation for Code Search

Authors: Mao, Yuetian and Wan, Chengcheng and Jiang, Yuze and Gu, Xiaodong
Keywords: Code Search, Query Reformulation, Self-supervised Learning

Abstract

Automatic query reformulation is a widely utilized technology for enriching user requirements and enhancing the outcomes of code search. It can be conceptualized as a machine translation task, wherein the objective is to rephrase a given query into a more comprehensive alternative. While showing promising results, training such a model typically requires a large parallel corpus of query pairs (i.e., the original query and a reformulated query) that are confidential and unpublished by online code search engines. This restricts its practicality in software development processes. In this paper, we propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus. Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task conducted on an extensive unannotated corpus of queries. SSQR extends T5 (a sequence-to-sequence model based on Transformer) with a new pre-training objective named corrupted query completion (CQC), which randomly masks words within a complete query and trains T5 to predict the masked content. Subsequently, for a given query to be reformulated, SSQR identifies potential locations for expansion and leverages the pre-trained T5 model to generate appropriate content to fill these gaps. The selection of expansions is then based on the information gain associated with each candidate. Evaluation results demonstrate that SSQR outperforms unsupervised baselines significantly and achieves competitive performance compared to supervised methods.
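
The corrupted-query-completion idea can be illustrated with T5's sentinel-token masking; the sketch below uses a vanilla t5-small checkpoint purely for illustration, whereas SSQR pre-trains its own model on a large unannotated query corpus.

# Sketch of corrupted query completion (CQC): mask part of a query with a T5 sentinel
# token and let the model propose content for the gap.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

corrupted_query = "read <extra_id_0> file line by line in java"
inputs = tokenizer(corrupted_query, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# The decoded span following <extra_id_0> is a candidate expansion for the masked position.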

DOI: 10.1145/3611643.3616306


The Artifact of the ESEC/FSE 2023 Paper Titled “Natural Language to Code: How Far are We?”

Authors: Wang, Shangwen and Geng, Mingyang and Lin, Bo and Sun, Zhensu and Wen, Ming and Liu, Yepang and Li, Li and Bissyandé, Tegawendé F.
Keywords: code generation, code search

Abstract

In this online repository, we release the source code of each of the selected techniques as well as the experiment results from each technique (which are stored in the Results.zip file).

DOI: 10.1145/3611643.3616323


Efficient Text-to-Code Retrieval with Cascaded Fast and Slow Transformer Models

Authors: Gotmare, Akhilesh Deepak and Li, Junnan and Joty, Shafiq and Hoi, Steven C.H.
Keywords: Cascaded retrieval schemes, Code retrieval, Developer Productivity, Text to Code Search, Transformer models, top K retrieval

Abstract

The goal of semantic code search or text-to-code search is to retrieve a semantically relevant code snippet from an existing code database using a natural language query. When constructing a practical semantic code search system, existing approaches fail to provide an optimal balance between retrieval speed and the relevance of the retrieved results. We propose an efficient and effective text-to-code search framework with cascaded fast and slow models, in which a fast transformer encoder model is learned to optimize a scalable index for fast retrieval followed by learning a slow classification-based re-ranking model to improve the accuracy of the top K results from the fast retrieval. To further reduce the high memory cost of deploying two separate models in practice, we propose to jointly train the fast and slow model based on a single transformer encoder with shared parameters. Empirically our cascaded method is not only efficient and scalable, but also achieves state-of-the-art results with an average mean reciprocal ranking (MRR) score of 0.7795 (across 6 programming languages) on the CodeSearchNet benchmark as opposed to the prior state-of-the-art result of 0.744 MRR. Our codebase can be found at this link.
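
The cascade structure can be sketched in a few lines: a cheap vector ranking over precomputed embeddings shortlists top-K snippets, and an expensive scorer re-ranks only that shortlist. Both scoring functions below are placeholders for the paper's fast and slow transformer models.

# Sketch of a fast/slow cascade for text-to-code search (scorers are placeholders).
import numpy as np

rng = np.random.default_rng(0)
code_embeddings = rng.random((10_000, 128))      # precomputed by the fast encoder

def fast_scores(query_embedding: np.ndarray) -> np.ndarray:
    return code_embeddings @ query_embedding     # scalable index lookup in practice

def slow_score(query: str, snippet_id: int) -> float:
    return float(rng.random())                   # stand-in for a cross-encoder re-ranker

def search(query: str, query_embedding: np.ndarray, k: int = 50, final: int = 5):
    shortlist = np.argsort(fast_scores(query_embedding))[::-1][:k]
    reranked = sorted(shortlist, key=lambda i: slow_score(query, i), reverse=True)
    return reranked[:final]

print(search("parse a json file", rng.random(128)))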

DOI: 10.1145/3611643.3616369


PEM

Authors: Xu, Xiangzhe and Xuan, Zhou and Feng, Shiwei and Cheng, Siyuan and Ye, Yapeng and Shi, Qingkai and Tao, Guanhong and Yu, Le and Zhang, Zhuo and Zhang, Xiangyu
Keywords: Binary Similarity, Dynamic Analysis

Abstract

This repository contains the artifacts of PEM. We provide a runnable Docker image for the artifact evaluation. In addition, for future research and development, we provide the source code of PEM and detailed instructions on how to compile it from the source code.

DOI: 10.1145/3611643.3616301


Hue: A User-Adaptive Parser for Hybrid Logs

Authors: Xu, Junjielong and Fu, Qiuai and Zhu, Zhouruixing and Cheng, Yutong and Li, Zhijing and Ma, Yuchi and He, Pinjia
Keywords: Hybrid Logs, Log Analysis, Log Parsing

Abstract

This is the artifact of “Hue: A User-Adaptive Parser for Hybrid Logs” (ESEC/FSE’23). Please refer to README.md for more details.

DOI: 10.1145/3611643.3616260


Log Parsing with Generalization Ability under New Log Types

Authors: Yu, Siyu and Wu, Yifan and Li, Zhijing and He, Pinjia and Chen, Ningjiang and Liu, Changjian
Keywords: generalization, log parsing, self supervised, test-time training

Abstract

Log parsing, which converts semi-structured logs into structured logs, is the first step for automated log analysis. Existing parsers are still unsatisfactory in real-world systems due to new log types in new-coming logs. In practice, available logs collected during system runtime often do not contain all the possible log types of a system, because log types related to infrequently activated system states are unlikely to be recorded and new log types are frequently introduced with system updates. Meanwhile, most existing parsers require preprocessing to extract variables in advance, but preprocessing is based on the operator’s prior knowledge of available logs and therefore may not work well on new log types. In addition, parser parameters set based on available logs are difficult to generalize to new log types. To support new log types, we propose a variable generation imitation strategy to craft a novel log parsing approach with generalization ability, called Log3T. Log3T employs a pre-trained transformer encoder-based model to extract log templates and can update parameters at parsing time to adapt to new log types by a modified test-time training. Experimental results on 16 benchmark datasets show that Log3T outperforms the state-of-the-art parsers in terms of parsing accuracy. In addition, Log3T can automatically adapt to new log types in new-coming logs.

DOI: 10.1145/3611643.3616355


Replication Package for “Semantic Debugging”

Authors: Eberlein, Martin and Smytzek, Marius and Steinhöfel, Dominic
Keywords: behavior explanation, debugging, testing

Abstract

This Replication Package contains the code to execute, develop and test our debugging prototype Avicenna.

Avicenna is a debugging tool designed to automatically determine the causes and conditions of program failures. It leverages both generative and predictive models to satisfy constraints over grammar elements and detect relations of input elements. Our tool uses the ISLa specification language to express complex failure circumstances as predicates over input elements. Avicenna learns input properties that are common across failing inputs and employs a feedback loop to refine the current debugging diagnoses by systematic experimentation. The result is crisp and precise diagnoses that closely match those determined by human experts, offering a significant advancement in the realm of automated debugging.

DOI: 10.1145/3611643.3616296


Demystifying Dependency Bugs in Deep Learning Stack

Authors: Huang, Kaifeng and Chen, Bihuan and Wu, Susheng and Cao, Junming and Ma, Lei and Peng, Xin
Keywords: deep learning stack, dependency bug, empirical study

Abstract

Deep learning (DL) applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. One challenge in dependency management across the entire engineering lifecycle is posed by the asynchronous and radical evolution of dependencies and the complex version constraints among them. Developers may introduce dependency bugs (DBs) when selecting, using and maintaining dependencies. However, the characteristics of DBs in the DL stack are still under-investigated, hindering practical solutions to dependency management in the DL stack.

To bridge this gap, this paper presents the first comprehensive study to characterize symptoms, root causes and fix patterns of DBs across the whole DL stack, with 446 DBs collected from StackOverflow posts and GitHub issues. For each DB, we first investigate the symptom as well as the lifecycle stage and dependency where the symptom is exposed. Then, we analyze the root cause as well as the lifecycle stage and dependency where the root cause is introduced. Finally, we explore the fix pattern and the knowledge sources that are used to fix it. Our findings from this study shed light on practical implications for dependency management.

DOI: 10.1145/3611643.3616325


Can Machine Learning Pipelines Be Better Configured?

Authors: Wang, Yibo and Wang, Ying and Zhang, Tingwei and Yu, Yue and Cheung, Shing-Chi and Yu, Hai and Zhu, Zhiliang
Keywords: Empirical Study, Machine Learning Libraries

Abstract

A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline’s performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as poor execution time and memory usage, numeric errors and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. There is no prior systematic study on the pervasiveness, impact and root causes of PLC issues. A systematic understanding of these issues helps configure effective ML pipelines and identify misconfigured ones. In this paper, we conduct the first empirical study of PLC issues. To dig deeper into the problem, we propose Piecer, an infrastructure that automatically generates a set of pipeline variants by varying different version combinations of ML libraries and compares their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the studied pipelines are affected by PLC issues.
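
The variant-generation idea can be illustrated by enumerating version combinations of the libraries a pipeline uses; the sketch below is an assumption-laden stand-in for Piecer, with example version lists and the install/benchmark step left as a comment.

# Sketch: generate pipeline variants by varying library version combinations.
from itertools import product

candidate_versions = {
    "numpy": ["1.21.6", "1.24.4"],
    "scikit-learn": ["1.0.2", "1.3.0"],
    "xgboost": ["1.6.2", "1.7.6"],
}

def pipeline_variants():
    names = list(candidate_versions)
    for combo in product(*(candidate_versions[n] for n in names)):
        yield dict(zip(names, combo))

for variant in pipeline_variants():
    # A real harness would install this combination in a fresh environment, run the
    # pipeline, and record execution time, memory usage, and numeric results.
    print(variant)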

DOI: 10.1145/3611643.3616352


Replication Package for “Compatibility Issues in Deep Learning Systems: Problems and Opportunities”

Authors: Wang, Jun and Xiao, Guanping and Zhang, Shuai and Lei, Huashan and Liu, Yepang and Sui, Yulei
Keywords: compatibility issues, deep learning, empirical study

Abstract

This dataset contains scripts and data used to generate relevant results for this paper. Detailed information and procedure to reproduce our results are described in README.md.

code

This folder contains two Python scripts: soextractor.py is used to extract 3,072 high-quality StackOverflow (SO) posts and soextractor_tags.py is used to extract the number of posts for the tags on SO. For detailed data collection criteria, please refer to Section 3.1 of our paper.

DL compatibility issues.xlsx

This file provides all the collected 3,072 issues, in which each line indicates whether the issue is a DL compatibility issue. Among them, 352 are DL compatibility issues. We also provide information on the library, stage, symptom, type, solution, root cause, and exception type for the DL compatibility issues. For the type CORE-TPL, we also provide backward-incompatible or forward-incompatible as well as API evolution patterns. For detailed manual classification of DL compatibility issues, please refer to Section 3.2 of our paper.

Tool Survey.xlsx

This file includes all the papers collected from the three top SE conferences (i.e., ICSE, FSE, and ASE) in the recent five years (2018-2022). Each line of each sheet provides the following information: (a) Title, (b) Year, (c) Conference, and (d) Type. For the detailed paper collection procedure, please refer to Section 5 of our paper.

DOI: 10.1145/3611643.3616321


Reproduction Package for Article “An Extensive Study on Adversarial Attack against Pre-trained Models of Code”

Authors: Du, Xiaohu and Wen, Ming and Wei, Zichao and Wang, Shangwen and Jin, Hai
Keywords: Adversarial Attack, Deep Learning, Pre-Trained Model

Abstract

This artifact contains the data, source code, and results of the paper.

DOI: 10.1145/3611643.3616356


Replication Package of the ESEC/FSE 2023 Paper Entitled “Fix Fairness, Don’t Ruin Accuracy: Performance Aware Fairness Repair using AutoML”

Authors: Nguyen, Giang and Biswas, Sumon and Rajan, Hridesh
Keywords: automated machine learning, bias mitigation, fairness-accuracy trade-off, machine learning software, Software fairness

Abstract

To increase transparency and encourage reproducibility, we have made our artifact publicly available. All the source code and evaluation data with detailed descriptions will be updated here: https://github.com/giangnm58/Fair-AutoML.

DOI: 10.1145/3611643.3616257


BiasAsker: Measuring the Bias in Conversational AI System

Authors: Wan, Yuxuan and Wang, Wenxuan and He, Pinjia and Gu, Jiazhen and Bai, Haonan and Lyu, Michael R.
Keywords: Software testing, conversational models, social bias

Abstract

Powered by advanced Artificial Intelligence (AI) techniques, conversational AI systems, such as ChatGPT, and digital assistants like Siri, have been widely deployed in daily life. However, such systems may still produce content containing biases and stereotypes, causing potential social problems. Due to the data-driven, black-box nature of modern AI techniques, comprehensively identifying and measuring biases in conversational systems remains challenging. In particular, it is hard to generate inputs that can comprehensively trigger potential bias due to the lack of data containing both social groups and biased properties. In addition, modern conversational systems can produce diverse responses (e.g., chatting and explanation), which makes existing bias detection methods based solely on sentiment and toxicity hard to adopt. In this paper, we propose BiasAsker, an automated framework to identify and measure social bias in conversational AI systems. To obtain social groups and biased properties, we construct a comprehensive social bias dataset containing a total of 841 groups and 5,021 biased properties. Given the dataset, BiasAsker automatically generates questions and adopts a novel method based on existence measurement to identify two types of biases (i.e., absolute bias and related bias) in conversational systems. Extensive experiments on eight commercial systems and two famous research models, such as ChatGPT and GPT-3, show that 32.83% of the questions generated by BiasAsker can trigger biased behaviors in these widely deployed conversational systems. All the code, data, and experimental results have been released to facilitate future research.

DOI: 10.1145/3611643.3616310


Repository for Article “Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice”

Authors: Vegas, Sira and Elbaum, Sebastian
Keywords: deep learning, machine learning for software engineering, software engineering experimentation

Abstract

There are 3 folders in the repository:

  • 3_Analysis_of_papers. Contains 5 Excel files with the results of the analyses of papers presented in Section 3 of the paper, plus one Word file with additional analyses not included in the paper:

    • 3_1_Search_results: Data for Section 3.1 of the paper. This includes a summary of the results of the search and selection process and the list of papers retrieved from SCOPUS, along with the inclusion/exclusion criteria applied to each one.

    • 3_2_Papers_characterization_ICSE, 3_2_Papers_characterization_FSE, and 3_2_Papers_characterization_TSE: Data for Section 3.2 of the paper. These files contain the information retrieved from each individual paper associated with the steps of the experimental process, for each experiment described in the papers.

    • 3_3_Papers_summary: Data for Section 3.3 of the paper. It includes a description of the characterization criteria, along with its application to the 194 experiments (per venue and overall).

    • Papers_per_experiment_type: Additional material (does not appear in the paper) containing a classification of experiments by type and the results of the characterization of the experiments per type. The data for this material is included in file 3_3_Papers_summary.

  • 4_Analysis_of_artifacts. Contains 1 Excel file and 1 Word file with the results of the analysis of artifacts presented in Section 4 of the paper:

    • Characterization_badges. For those papers that earned an ACM artifact badge, this file includes the information retrieved from the paper, and the information that the artifact adds to that in the paper.

    • Summary_discrepancies. A summary of the discrepancies found between papers and artifacts.

  • 5_Implications. Contains 1 Excel file with the results of the analysis of validity presented in Section 5 of the paper:

    • Validity_analysis. The raw data corresponding to Fig. 3 in the paper is provided, along with the detailed values for each experiment.

DOI: 10.1145/3611643.3616320


DecompoVision: Reliability Analysis of Machine Vision Components through Decomposition and Reuse

作者: Hu, Boyue Caroline and Marsso, Lina and Dvornik, Nikita and Shen, Huakun and Chechik, Marsha
关键词: Computer Vision, Machine Learning, Requirements Engineering, Software Analysis, Software Engineering for Artificial Intelligence

Abstract

Analyzing the reliability of Machine Vision Components (MVCs) against scene changes (such as rain or fog) in their operational environment is crucial for safety-critical applications. Safety analysis relies on the availability of precisely specified and, ideally, machine-verifiable requirements. The state-of-the-art reliability framework ICRAF developed machine-verifiable requirements obtained using human performance data. However, ICRAF is limited to analyzing the reliability of MVCs solving simple vision tasks, such as image classification. Yet, many real-world safety-critical systems require solving more complex vision tasks, such as object detection and instance segmentation. Fortunately, many complex vision tasks (which we call “c-tasks”) can be represented as a sequence of simple vision subtasks. For instance, object detection can be decomposed into object localization followed by classification. Based on this fact, in this paper, we show that the analysis of c-tasks can also be decomposed into a sequential analysis of their simple subtasks, which allows us to apply existing techniques for analyzing simple vision tasks. Specifically, we propose a modular reliability framework, DecompoVision, that decomposes: (1) the problem of solving a c-task, (2) the reliability requirements, and (3) the reliability analysis, and, as a result, provides deeper insights into MVC reliability. DecompoVision extends ICRAF to handle complex vision tasks and enables reuse of existing artifacts across different c-tasks. We capture new reliability gaps by checking our requirements on 13 widely used object detection MVCs, and, for the first time, benchmark segmentation MVCs.
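
To illustrate the decomposition idea, the following toy Python sketch checks an object-detection result as a sequence of two subtask checks (localization, then classification). It is only an illustration of the sequential structure, not DecompoVision itself; the IoU threshold and the box/label format are assumptions.

# Sketch: a c-task check decomposed into simple subtask checks.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def check_detection(pred, truth, min_iou=0.5):
    # subtask 1: localization; subtask 2: classification (only if localized)
    localized = iou(pred["box"], truth["box"]) >= min_iou
    classified = localized and pred["label"] == truth["label"]
    return {"localization": localized, "classification": classified}

print(check_detection({"box": (0, 0, 10, 10), "label": "car"},
                      {"box": (1, 1, 10, 10), "label": "car"}))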

DOI: 10.1145/3611643.3616333


Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data

作者: Yu, Guangba and Chen, Pengfei and Li, Yufeng and Chen, Hongyang and Li, Xiaoyun and Zheng, Zibin
关键词: Microservice, Multi-modal Observability Data, Root Cause Analysis

Abstract

Nezha is an interpretable and fine-grained RCA approach that pinpoints root causes at the code-region and resource-type level by jointly analyzing multi-modal data. Nezha transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of Nezha is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way.
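
As a rough illustration of the pattern-comparison idea (not Nezha's event-graph mining), the Python sketch below contrasts the support of adjacent-event patterns between a fault-free and a fault-suffering phase; the event names are hypothetical.

# Sketch: rank event patterns by how much their support changes between phases.
from collections import Counter

fault_free = [("svc-a:log:conn_ok", "svc-a:metric:cpu_normal"),
              ("svc-a:log:conn_ok", "svc-b:trace:call_ok")]
fault_suffering = [("svc-a:log:conn_timeout", "svc-a:metric:cpu_high"),
                   ("svc-a:log:conn_timeout", "svc-b:trace:call_err")]

def pattern_support(windows):
    counts = Counter(pair for w in windows for pair in zip(w, w[1:]))
    total = sum(counts.values()) or 1
    return {p: c / total for p, c in counts.items()}

normal, faulty = pattern_support(fault_free), pattern_support(fault_suffering)
ranked = sorted(set(normal) | set(faulty),
                key=lambda p: abs(faulty.get(p, 0) - normal.get(p, 0)),
                reverse=True)
for p in ranked[:3]:
    print(p, normal.get(p, 0), "->", faulty.get(p, 0))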

DOI: 10.1145/3611643.3616249


Reproduction Package for Article ‘DiagConfig: Configuration Diagnosis of Performance Violations in Configurable Software Systems’

作者: Chen, Zhiming and Chen, Pengfei and Wang, Peipei and Yu, Guangba and He, Zilong and Mai, Genting
关键词: Configuration diagnosis, Performance violation, Program analysis, Taint tracking

Abstract

DiagConfig is a white-box configuration diagnosis system and is general enough to adapt to software systems under different configurations, workloads, and environments. This artifact includes the data and source code of DiagConfig for evaluation reproduction.

DOI: 10.1145/3611643.3616300


Reproduction package for article “Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization”

作者: Du, Yali and Yu, Zhongxing
关键词: bug localization, computation role, contrastive learning, pre-trained model, semantic flow graph, type

Abstract

The packaged artifact includes the Installation Package, Dataset, Code, and Weights of Pre-trained Models. Moreover, we provide documentation explaining how to obtain the artifact package, how to unpack it, how to get started, and how to use the artifacts in more detail, in README.md, REQUIREMENTS.md, and INSTALL.md.

DOI: 10.1145/3611643.3616338


Automata-Based Trace Analysis for Aiding Diagnosing GUI Testing Tools for Android

作者: Ma, Enze and Huang, Shan and He, Weigang and Su, Ting and Wang, Jue and Liu, Huiyu and Pu, Geguang and Su, Zhendong
关键词: Android GUI Testing, Runtime Verification, Trace Analysis

Abstract

Benchmarking software testing tools against known bugs is a classic approach to evaluating the tools’ bug finding abilities. However, this approach offers few clues about the bugs a tool misses, which are needed to diagnose the testing tool. As a result, heavy and ad hoc manual analysis is needed. In this work, in the setting of GUI testing for Android apps, we introduce an automata-based trace analysis approach to tackle the key challenge of manual analysis, i.e., how to analyze the lengthy event traces generated by a testing tool against a missed bug to find the clues. Our key idea is that we model a bug in the form of a finite automaton which captures its bug-triggering traces, and match the event traces generated by the testing tool (which misses this bug) against this automaton to obtain the clues. Specifically, the clues are presented in the form of three designated automata-based coverage values. We apply our approach to enhance Themis, a representative benchmark suite for Android, to aid diagnosing GUI testing tools. Our extensive evaluation on nine state-of-the-art GUI testing tools, together with several tool developers, shows that our approach is feasible and useful. Our approach enables Themis+ (the enhanced benchmark suite) to provide clues on the tool-missed bugs, and all of Themis+’s clues are either identical to, or useful complements of, the tool developers’ manual analysis results. Moreover, the clues have helped find several tool weaknesses which were unknown or unclear before. Based on the clues, two actively developed industrial testing tools in our study quickly made several optimizations and demonstrated improved bug finding abilities. All the tool developers gave positive feedback on the usefulness and usability of Themis+’s clues. Themis+ is available at https://github.com/DDroid-Android/home.
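
The core matching step can be pictured with a small Python sketch. The automaton below is a hypothetical three-event bug model, and the "deepest state reached" value is only one coverage-style clue; the actual Themis+ metrics differ in detail.

# Sketch: match a tool's event trace against a bug-triggering automaton.
bug_automaton = {               # state -> {event: next state}
    0: {"open_settings": 1},
    1: {"rotate_screen": 2},
    2: {"click_save": 3},       # state 3 = bug triggered (accepting)
}
ACCEPTING = 3

def deepest_state(trace):
    state, best = 0, 0
    for event in trace:
        nxt = bug_automaton.get(state, {}).get(event)
        if nxt is not None:
            state = nxt
            best = max(best, state)
    return best

tool_trace = ["open_settings", "scroll", "rotate_screen", "go_back"]
reached = deepest_state(tool_trace)
print(f"reached state {reached}/{ACCEPTING}:",
      "bug triggered" if reached == ACCEPTING else "bug missed")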

DOI: 10.1145/3611643.3616361


Replication package for Article `A Practical Human Labeling Method for Online Just-In-Time Software Defect Prediction’

作者: Song, Liyan and Minku, Leandro Lei and Teng, Cong and Yao, Xin
关键词: human inspection, human labeling, Just-in-time software defect prediction, online learning, verification latency, waiting time

Abstract

This repository contains the source code and the datasets needed to replicate the above paper, published at FSE 2023.

DOI: 10.1145/3611643.3616307


Flow Experience in Software Engineering

作者: Ritonummi, Saima and Siitonen, Valtteri and Salo, Markus and Pirkkalainen, Henri and Sivunen, Anu
关键词: Software engineering, flow experience, software development

Abstract

Software engineering (SE) requires high analytical skills and creativity, which makes it an excellent context for experiencing flow. Although previous work in the SE context has identified how positive affect and development tools can support the flow experience, there is still much to uncover about the characteristics of software developers’ flow experiences. To address this gap in knowledge, we conducted a qualitative critical incident technique (CIT) questionnaire (n = 401) on the flow-facilitating factors and characteristics of flow in the SE context. The most important flow-facilitating factors in developers’ work included optimal challenge, high motivation, positive developer experience (DX), and no distractions or interruptions. The flow experiences were characterized by absorption, effortless control, intrinsic reward, and high performance. Our study identifies the features of flow commonly addressed in flow research; however, it also highlights how IT use, especially development tools that provide positive DX, as well as being able to work without excessive distractions and interruptions are important facilitators of developers’ flow.

DOI: 10.1145/3611643.3616263


Building and Sustaining Ethnically, Racially, and Gender Diverse Software Engineering Teams: A Study at Google

作者: Dagan, Ella and Sarma, Anita and Chang, Alison and D’Angelo, Sarah and Dicker, Jill and Murphy-Hill, Emerson
关键词: diversity, inclusion, software engineering, teams

Abstract

Teams that build software are largely demographically homogeneous. Without diversity, homogeneous perspectives dominate how, why, and for whom software is designed. To understand how teams can successfully build and sustain diversity, we interviewed 11 engineers and 9 managers from some of the most gender and racially diverse teams at Google, a large software company. Qualitatively analyzing the interviews, we found shared approaches to recruiting, hiring, and promoting an inclusive environment, all of which create a positive feedback loop. Our findings produce actionable practices that every member of the team can take to increase diversity by fostering a more inclusive software engineering environment.

DOI: 10.1145/3611643.3616273


Towards Automated Detection of Unethical Behavior in Open-Source Software Projects

作者: Win, Hsu Myat and Wang, Haibo and Tan, Shin Hwei
关键词: Ethics in Software Engineering, Open-source software projects

Abstract

Given the rapid growth of Open-Source Software (OSS) projects, ethical considerations are becoming more important. Past studies focused on specific ethical issues (e.g., gender bias and fairness in OSS). There is little to no study on the different types of unethical behavior in OSS projects. We present the first study of unethical behavior in OSS projects from the stakeholders’ perspective. Our study of 316 GitHub issues provides a taxonomy of 15 types of unethical behavior guided by six ethical principles (e.g., autonomy). Examples of new unethical behavior include soft forking (copying a repository without forking) and self-promotion (promoting a repository without self-identifying as a contributor to the repository). We also identify 18 types of software artifacts affected by the unethical behavior. The diverse types of unethical behavior identified in our study (1) call for the attention of developers and researchers when making contributions on GitHub, and (2) point to future research on automated detection of unethical behavior in OSS projects. Based on our study, we propose Etor, an approach that can automatically detect six types of unethical behavior by using ontological engineering and Semantic Web Rule Language (SWRL) rules to model GitHub attributes and software artifacts. Our evaluation on 195,621 GitHub issues (1,765 GitHub repositories) shows that Etor can automatically detect 548 instances of unethical behavior with a 74.8% average true positive rate (up to 100% true positive rate). This shows the feasibility of automated detection of unethical behavior in OSS projects.

DOI: 10.1145/3611643.3616314


ESEC/FSE’23 Artifact for “NeuRI: Diversifying DNN Generation via Inductive Rule Inference”

作者: Liu, Jiawei and Peng, Jinjun and Wang, Yuyao and Zhang, Lingming
关键词: Compiler Testing, Deep Learning Compilers, Fuzzing

Abstract

This is the artifact for the ESEC/FSE’23 paper “NeuRI: Diversifying DNN Generation via Inductive Rule Inference”.

Deep Learning (DL) is prevalently used in various industries to improve decision-making and automate processes, driven by the ever-evolving DL libraries and compilers. The correctness of DL systems is crucial for trust in DL applications.
As such, the recent wave of research has been studying the automated synthesis of test-cases (i.e., DNN models and their inputs) for fuzzing DL systems. However, existing model generators only subsume a limited number of operators, lacking the ability to pervasively model operator constraints.
To address this challenge, we propose NeuRI, a fully automated approach for generating valid and diverse DL models composed of hundreds of types of operators. NeuRI adopts a three-step process:
(i) collecting valid and invalid API traces from various sources;
(ii) applying inductive program synthesis over the traces to infer the constraints for constructing valid models; and
(iii) using hybrid model generation which incorporates both symbolic and concrete operators.
Our evaluation shows that NeuRI improves branch coverage of TensorFlow and PyTorch by 24% and 15% over the state-of-the-art model-level fuzzers. NeuRI finds 100 new bugs for PyTorch and TensorFlow in four months, with 81 already fixed or confirmed. Of these, 9 bugs are labelled as high priority or security vulnerability, constituting 10% of all high-priority bugs of the period.
Open-source developers regard error-inducing tests reported by us as “high-quality” and “common in practice”.

The artifact includes evidence of real-world bug finding (RQ3) as well as procedures to replicate the experiments on coverage evaluation (RQ1) and rule inference (RQ2).

For more information, please check the artifact GitHub repository: https://github.com/ise-uiuc/neuri-artifact

DOI: 10.1145/3611643.3616337


ESEC/FSE 2023 Artifact for “Heterogeneous Testing for Coverage Profilers Empowered with Debugging Support”

作者: Yang, Yibiao and Sun, Maolin and Wang, Yang and Li, Qingyang and Wen, Ming and Zhou, Yuming
关键词: bug detection, Code coverage, coverage profiler, debugging support, heterogeneous testing

Abstract

This artifact contains Decov, a testing tool for coverage profilers. Additionally, it includes C2V and Cod for comparing the effectiveness of different tools. The README.md file provides a description of how to use the artifact.

DOI: 10.1145/3611643.3616340


Outage-Watch: Early Prediction of Outages using Extreme Event Regularizer

作者: Agarwal, Shubham and Chakraborty, Sarthak and Garg, Shaddy and Bisht, Sumit and Jain, Chahat and Gonuguntla, Ashritha and Saini, Shiv
关键词: Distribution Learning, Mixture Density Network, Outage Forecasting, System reliability and monitoring

Abstract

Cloud services are omnipresent, and critical cloud service failure is a fact of life. In order to retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is by predicting outages in advance, which can help in reducing the severity as well as the time to recovery. It is difficult to forecast critical failures due to the rarity of these events. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using the current system state to predict whether the QoS metrics will cross a threshold and initiate an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics for flexibility, and an extreme event regularizer helps improve learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing its threshold changes significantly. Our evaluation on a real-world SaaS company dataset shows that Outage-Watch significantly outperforms traditional methods with an average AUC of 0.98. Additionally, Outage-Watch detects all the outages exhibiting a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88% when deployed in an enterprise cloud-service system, demonstrating the efficacy of our proposed method.
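
The threshold-crossing probability at the heart of the prediction can be illustrated with a few lines of Python. This is a numeric sketch only, not the Outage-Watch model; the mixture weights, means, standard deviations, and threshold are made-up values.

# Sketch: P(QoS metric exceeds an outage threshold) under a Gaussian mixture.
from scipy.stats import norm

weights = [0.9, 0.1]            # mixture weights
means   = [120.0, 480.0]        # e.g., latency in ms
stds    = [15.0, 60.0]

def prob_exceeds(threshold):
    return sum(w * norm.sf(threshold, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

print(prob_exceeds(300.0))      # ~0.10, driven by the heavy "tail" component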

DOI: 10.1145/3611643.3616316


Multilingual Code Co-evolution using Large Language Models

作者: Zhang, Jiyang and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos
关键词: Language model, code translation, software evolution

Abstract

Many software projects implement APIs and algorithms in multiple programming languages. Maintaining such projects is tiresome, as developers have to ensure that any change (e.g., a bug fix or a new feature) is propagated, timely and without errors, to implementations in other programming languages. In the world of ever-changing software, using rule-based translation tools (i.e., transpilers) or machine learning models for translating code from one language to another provides limited value. Translating the entire codebase from one language to another each time is not how developers work. In this paper, we target a novel task: translating code changes from one programming language to another using large language models (LLMs). We design and implement the first LLM, dubbed Codeditor, to tackle this task. Codeditor explicitly models code changes as edit sequences and learns to correlate changes across programming languages. To evaluate Codeditor, we collect a corpus of 6,613 aligned code changes from 8 pairs of open-source software projects implementing similar functionalities in two programming languages (Java and C#). Results show that Codeditor outperforms the state-of-the-art approaches by a large margin on all commonly used automatic metrics. Our work also reveals that Codeditor is complementary to the existing generation-based models, and their combination ensures even greater performance.

DOI: 10.1145/3611643.3616350


Reproduction Package for Article “Knowledge-Based Version Incompatibility Detection for Deep Learning”

作者: Zhao, Zhongkai and Kou, Bonan and Ibrahim, Mohamed Yilmaz and Chen, Muhao and Zhang, Tianyi
关键词: Deep Learning, Knowledge Extraction, Version Compatibility

Abstract

The artifact contains the data and code of DECIDE, a version incompatibility detection tool based on pre-trained language models proposed in “Knowledge-based Version Incompatibility Detection for Deep Learning”. Meanwhile, this artifact also contains data and code to replicate experiment results in the paper. The artifact has been made publicly available on GitHub to support Open Science.

DOI: 10.1145/3611643.3616364


Reproduction Package for Article `Statistical Type Inference for Incomplete Programs’

作者: Peng, Yaohui and Xie, Jing and Yang, Qiongling and Guo, Hanwen and Li, Qingan and Xue, Jingling and Yuan, Mengting
关键词: deep learning, graph generation, structured learning, Type inference

Abstract

Stir is a novel two-stage approach for inferring types in incomplete programs that may be ill-formed, where whole-program syntactic analysis often fails. In the first stage, Stir predicts a type tag for each token by using neural networks, and consequently, infers all the simple types in the program. In the second stage, Stir refines the complex types for the tokens with predicted complex type tags. Unlike existing machine-learning-based approaches, which solve type inference as a classification problem, Stir reduces it to a sequence-to-graph parsing problem. This artifact contains the implementation and evaluation program of Stir, which can be used to reproduce the evaluation results, and can also serve as a standalone application for general use of the approach.

This artifact is organized as follows:

  • abstract.md: file describing the artifact itself.
  • README.md: main document file.
  • INSTALL.md: instructions for obtaining the artifact and setting up the environment.
  • REQUIREMENTS.md: requirements for the hardware and software environment.
  • STATUS.md: badges that this artifact applies for and the reasons for applying for them.
  • LICENSE: license (MIT License) of the artifact.
  • main.py: the main entry file.
  • first/: the source code of the first stage of STIR.
  • second/: the source code of the second stage of STIR.
  • data/: the data used in the evaluation.
  • pretrained/: the pretrained model used in the evaluation.
  • Dockerfile: Dockerfile for building the Docker image with the software environment to reproduce the evaluation results.
  • environment.yml: conda environment file for reproducing the evaluation results.

DOI: 10.1145/3611643.3616283


OOM-Guard: Towards Improving the Ergonomics of Rust OOM Handling via a Reservation-Based Approach

作者: Chen, Chengjun and Zhang, Zhicong and Tian, Hongliang and Yan, Shoumeng and Xu, Hui
关键词: Out Of Memory, Reservation, Software Reliability, Static Analysis

Abstract

Out of memory (OOM) is an exceptional system state where any further memory allocation requests may fail. Such allocation failures would crash the process or system if not handled properly, and they may also lead to an inconsistent program state that cannot be recovered easily. Current mechanisms for preventing such hazards highly rely on the manual effort of the programmers themselves. This paper studies the OOM issues of Rust, which is an emerging system programming language that stresses the importance of memory safety but still lacks handy mechanisms to handle OOM well. Even worse, Rust employs an infallible mode of memory allocations by default. As a result, the program written by Rust would simply abort itself when OOM occurs. Such crashes would lead to critical robustness issues for services or modules of operating systems. We propose OOM-Guard, a handy approach for Rust programmers to handle OOM. OOM-Guard is by nature a reservation-based approach that aims to convert the handlings for many possible failed memory allocations into handlings for a smaller number of reservations. In order to achieve efficient reservation, OOM-Guard incorporates a subtle cost analysis algorithm based on static analysis and a proxy allocator. We then apply OOM-Guard to two well-known Rust projects, Bento and rCore. Results show that OOM-Guard can largely reduce developers’ efforts for handling OOM and incurs trivial overhead in both memory space and execution time.

DOI: 10.1145/3611643.3616303


Reproduction Package for DeepInfer: Deep Type Inference from Smart Contract Bytecode

作者: Zhao, Kunsong and Li, Zihao and Li, Jianfeng and Ye, He and Luo, Xiapu and Chen, Ting
关键词: Deep Learning, Smart Contract, Type Inference

Abstract

In this replication package, we describe how to replicate the results of our FSE’23 paper, “DeepInfer: Deep Type Inference from Smart Contract Bytecode”. The package allows for an accurate reconstruction of the results presented in our paper, and includes the source code of the tool that we built to generate these results.

DOI: 10.1145/3611643.3616343


DeMinify: Neural Variable Name Recovery and Type Inference

作者: Li, Yi and Yadavally, Aashish and Zhang, Jiaxing and Wang, Shaohua and Nguyen, Tien N.
关键词: Deep Learning, Minified Code, Name Recovery, Type Inference

Abstract

To avoid exposing the original source code, the variable names deployed in the wild are often replaced by short, meaningless names, making the code difficult to understand and analyze. We introduce DeMinify, a Deep-Learning (DL)-based approach that formulates this recovery problem as the prediction of missing features in a Graph Convolutional Network–Missing Features model. The graph represents both the relations among the variables and the relations among their types, in which the names or types of some nodes are missing. Moreover, DeMinify leverages dual-task learning to propagate the mutual impact between the learning of the variable names and that of their types. We conducted experiments to evaluate DeMinify in both name recovery and type prediction on a Python dataset with 180k methods and a JavaScript (JS) dataset with 322k files. For variable name prediction, in 76.7% and 81.6% of the cases in Python and JS code respectively, DeMinify can correctly predict the variables’ names with a single suggested name. DeMinify relatively improves top-1 accuracy by 15.3%–40.7% and 7.7%–49.7% over the state-of-the-art variable name recovery approaches for Python and JS code, respectively. It also relatively improves top-1 accuracy by 14.5%–51.9% over the existing type prediction approaches. Our experimental results show that learning data types helps improve variable name recovery and vice versa.

DOI: 10.1145/3611643.3616368


Tritor: Detecting Semantic Code Clones by Building Social Network-Based Triads Model

作者: Zou, Deqing and Feng, Siyue and Wu, Yueming and Suo, Wenqi and Jin, Hai
关键词: Abstract Syntax Tree, Semantic Clones, Social Network, Triads

Abstract

Code clone detection refers to finding functional similarities between two code fragments, which is becoming increasingly important with the evolution of software engineering. Code cloning can increase maintenance costs and even cause the propagation of vulnerabilities, which can have a negative impact on software security. A number of code clone detection methods have been proposed, including tree-based methods that are capable of detecting semantic code clones. However, since the tree structure is complex, these methods are difficult to apply to large-scale clone detection. In this paper, we propose a scalable semantic code clone detector based on a semantically enhanced abstract syntax tree. Specifically, we add control flow and data flow details into the original tree and regard the enhanced tree as a social network. We then build a social network-based triads model to collect the similarity features between two methods by analyzing the different types of triads within the network. After obtaining all features, we use them to train a machine learning-based code clone detector (i.e., Tritor). Our comparative experimental results show that Tritor is superior to SourcererCC, RtvNN, Deckard, ASTNN, TBCNN, CDLH, and SCDetector, and is on par with DeepSim and FCCA. As for scalability, Tritor is about 39 times faster than ASTNN, another current state-of-the-art tree-based code clone detector.
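
The triads idea can be sketched with networkx: treat the enhanced AST-like structure as a directed network and use the triad-type counts as a feature vector. This is only an illustration under simplifying assumptions (toy edges, raw census counts as features), not the Tritor implementation.

# Sketch: triad-census features for two small directed graphs.
import networkx as nx

def triad_features(edges):
    census = nx.triadic_census(nx.DiGraph(edges))   # counts of the 16 triad types
    return [census[k] for k in sorted(census)]

m1 = triad_features([("if", "cond"), ("if", "then"), ("cond", "x"), ("then", "x")])
m2 = triad_features([("if", "cond"), ("if", "else"), ("cond", "y"), ("else", "y")])
# A downstream machine learning classifier would consume per-pair features
# derived from vectors like these.
print(m1)
print(m2)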

DOI: 10.1145/3611643.3616354


Gitor: Scalable Code Clone Detection by Building Global Sample Graph

作者: Shan, Junjie and Dou, Shihan and Wu, Yueming and Wu, Hairu and Liu, Yang
关键词: Clone Detection, Global Sample Graph, Node Embedding

Abstract

Code clone detection is about finding similar code fragments, and it has drawn much attention in software engineering since it is important for software maintenance and evolution. Researchers have proposed many techniques and tools for source code clone detection, but current detection methods concentrate on analyzing or processing code samples individually, without exploring the underlying connections among code samples. In this paper, we propose Gitor to capture the underlying connections among different code samples. Specifically, given a source code database, we first tokenize all code samples to extract the pre-defined individual information (keywords). After obtaining all samples’ individual information, we leverage it to build a large global sample graph where each node is a code sample or a type of individual information. We then apply a node embedding technique to the global sample graph to extract vector representations of all the samples. After collecting all code samples’ vectors, we can simply compare the similarity between any two samples to detect possible clone pairs. More importantly, since the obtained vector of a sample comes from a global sample graph, we can combine it with the sample’s own code features to improve code clone detection performance. To demonstrate the effectiveness of Gitor, we evaluate it on a widely used dataset, BigCloneBench. Our experimental results show that Gitor has higher accuracy in terms of code clone detection and excellent execution time for inputs of various sizes (1–100 MLOC) compared to existing state-of-the-art tools. Moreover, we also evaluate the combination of Gitor with other traditional vector-based clone detection methods; the results show that using Gitor enables them to detect more code clones with a higher F1 score.
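
A much-simplified sketch of the global sample graph idea follows: connect each sample to its keywords and compare samples through shared keyword neighbours. Gitor actually learns node embeddings on this graph; the Jaccard similarity and the toy samples below are stand-ins for illustration only.

# Sketch: a global sample graph linking code samples to their keywords.
import networkx as nx

samples = {
    "s1": ["for", "if", "return"],
    "s2": ["for", "if", "return"],
    "s3": ["while", "try", "return"],
}

g = nx.Graph()
for sample, keywords in samples.items():
    for kw in keywords:
        g.add_edge(sample, kw)

def similarity(a, b):
    na, nb = set(g[a]), set(g[b])
    return len(na & nb) / len(na | nb)     # Jaccard over shared keyword nodes

print(similarity("s1", "s2"), similarity("s1", "s3"))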

DOI: 10.1145/3611643.3616371


Demystifying the Composition and Code Reuse in Solidity Smart Contracts

作者: Sun, Kairan and Xu, Zhengzi and Liu, Chengwei and Li, Kaixuan and Liu, Yang
关键词: code reuse, development pattern, smart contract composition

Abstract

As the development of Solidity smart contracts has increased in popularity, the reliance on external sources such as third-party packages increases to reduce development costs. However, despite the use of external sources bringing flexibility and efficiency to the development, they could also complicate the process of assuring the security of downstream applications due to the lack of package managers for standardized ways and sources. While previous studies have only focused on code clones without considering how the external components are introduced, the compositions of a smart contract and their characteristics still remain puzzling. To fill these gaps, we conducted an empirical study with over 350,000 Solidity smart contracts to uncover their compositions, conduct code reuse analysis, and identify prevalent development patterns. Our findings indicate that a typical smart contract comprises approximately 10 subcontracts, with over 80% of these originating from external sources, reflecting the significant reliance on third-party packages. For self-developed subcontracts, around 50% of the subcontracts have less than 10% unique functions, suggesting that code reuse at the level of functions is also common. For external subcontracts, though around 35% of the subcontracts are interfaces to provide templates for standards or protocols, an inconsistency in the use of subcontract types is also identified. Lastly, we extracted 61 frequently reused development patterns, offering valuable insights for secure and efficient smart contract development.

DOI: 10.1145/3611643.3616270


Artifacts - Scalable Program Clone Search through Spectral Analysis

作者: Benoit, Tristan and Marion, Jean-Yves and Bardin, Sébastien
关键词: binary code analysis, clone search, cyber security, dataset, software, software engineering, spectral analysis

Abstract

Artifacts - Scalable Program Clone Search through Spectral Analysis

We focus on the problem of program clone search, which involves finding the program in a repository most similar to a target program. Program clone search has important applications, including malware detection and program clustering.

In solving this problem, the inherent workflow involves disassembly, feature extraction (or preprocessing), clone searches, and subsequent generation of tables.

A good similarity metric is crucial to finding the repository’s closest program. It has to be precise and robust, even in cross-architecture scenarios, and fast, even when dealing with huge repositories. This artifact encompasses 21 distinct clone search methods, and each method’s workflow may differ slightly. Overall, the artifact is a purpose-built framework for comparing clone search methods. It is easily extensible and can be tweaked to carry out new measurements.

The artifact includes four datasets with vast numbers of programs: Basic (1K), BinKit (98K), IoT (20K), and Windows (85K). Due to the enormous scale of these datasets, reproducing this artifact requires significant computation time. For perspective, disassembling these large datasets can take days even when operating on 20 cores. The subsequent steps, such as preprocessing and clone searches, can also demand hundreds of hours. Note that we gathered 2 TB of disassembled files while accumulating this data.

To tackle these time and space constraints, we have ensured that precomputed data are available within this artifact at multiple workflow phases. This enables a quick transition from reproducing one workflow phase to another. However, we could not include all disassembled files, so we mainly focused on the last phases, such as a clone search.

Examples of Use

conda activate PSS_Base
python3 MakeTables.py
python3 MakeAblationTables.py

The above will produce in a few minutes the Tables of our article using precomputed results.

See EXAMPLES.md for five quick examples of replications using this artifact.

Usage - Basic Dataset

Replication Script

To replicate clone searches on the Basic dataset with all methods without any preprocessing phases, use the script provided:

conda activate PSS_Base
python3 SetAbsolutePath.py
bash ReplicateCloneSearchesBasic.py

It requires 40 cores and at least 100 GB of memory and should run for between 140 hours and 350 hours.

Generalities

For Basic dataset computations, ensure you have run python3 SetAbsolutePath.py.

Inside a method folder:

  • RunMakeMD3.py computes all similarity indices using precomputed features.
  • RunMakeMD.py uses these indices to compute the test field results.

To reproduce the feature extraction, a script called Preprocess.py can usually be run.

Some frameworks have a more complex feature extraction workflow that can require substantial computation.

For instance, a function embedding such as AlphaDiff requires a learning phase of around 60 hours with 100 GB of RAM.

conda activate PSS_Base
cd AlphaDiff/Train
unzip datasetAD.Py
python3 main.py
rm datasetAD.h5

This is followed by an embedding computation phase of around 5 hours.

cd AlphaDiff/Embeds/
python3 MakeEmbeds.py

Then comes a distance computation phase of between 18 and 40 hours using 40 cores and 100 GB of RAM.

cd AlphaDiff/AD_gDist/
python3 Run.py

After that, similarity indices can be made from these computations.

cd AlphaDiff/makeResults/
python3 RunMakeMD3.py
python3 RunMakeMD.py

Usage - BinKit Dataset

Replication Script

To replicate clone searches on the BinKit dataset without any preprocessing phases, use the script provided:

conda activate PSS_Base
bash ReplicateCloneSearchesBinKit.py

It requires 40 cores and at least 100 GB of memory and should run for between 80 hours and 200 hours.

Generalities

The BinKit directory has two subdirectories, namely Obfus, which deals with obfuscated programs, and Normal. Each subdirectory contains a DataGeneration folder, which holds the disassembly scripts, and a unique folder for each method. These method folders have scripts to extract features and embeddings from samples.

Each subdirectory contains three significant scripts:

  1. Run.py: reproduces clone searches using precomputed features stored in folders like NORMAL_EMBEDS_2.
  2. Read.py: converts the results into readable output.
  3. ReadElapsed.py: converts the results into a dictionary storing runtimes.

The Redaction subdirectory within BinKit holds scripts that compute tables based on results obtained within each subdataset.

Usage - IoT and Windows Datasets

Replication Script - IoT

To replicate clone searches on the IoT malware dataset without any preprocessing phases, use the script provided:

conda activate PSS_Base
bash ReplicateCloneSearchesIoT.py

It requires 40 cores and at least 100 GB of memory and should run for between 1 hour and 3 hours.

Replication Script - Windows

To replicate clone searches on the Windows dataset without any preprocessing phases, use the script provided:

conda activate PSS_Base
bash ReplicateCloneSearchesWindows.py

It requires 40 cores and at least 100 GB of memory and should run for between 55 hours and 140 hours.

Generalities

Both IoT and Windows folders contain a DataGeneration subdirectory with disassembly scripts and scripts for each method to extract features and embeddings from samples. Additionally, each dataset has a DataLabelling subdirectory, which contains scripts for labeling data.

Experiment folders such as XP include Run.py scripts for conducting clone searches using precomputed embeddings. Lastly, the Redaction subdirectory in each dataset includes scripts for computing tables from the results of experiment folders.

PSSO Study

To replicate clone searches for the PSSO Study on the Windows dataset, without any preprocessing phases, use the script provided:

conda activate PSS_Base
bash ReplicateCloneSearchesPSSOStudy.py

It requires 40 cores and at least 100 GB of memory and should run for between 4 hours and 10 hours.

Ablation Study

To replicate clone searches for the Ablation Study, without any preprocessing phases, use the script provided:

conda activate PSS_Base
bash ReplicateCloneSearchesAblation.py

It requires 40 cores and at least 100 GB of memory and should run for between 7 hours and 18 hours.

DOI: 10.1145/3611643.3616279


A Highly Scalable, Hybrid, Cross-Platform Timing Analysis Framework Providing Accurate Differential Throughput Estimation via Instruction-Level Tracing

作者: Hsu, Min-Yih and Hetzelt, Felicitas and Gens, David and Maitland, Michael and Franz, Michael
关键词: combining static and dynamic analyses, differential throughput analysis, performance, throughput analysis

Abstract

Differential throughput estimation, i.e., predicting the performance impact of software changes, is critical when developing applications that rely on accurate timing bounds, such as automotive, avionic, or industrial control systems. However, developers often lack access to the target hardware to perform on-device measurements, and hence rely on instruction throughput estimation tools to evaluate performance impacts. State-of-the-art techniques broadly fall into two categories: dynamic and static. Dynamic approaches emulate program execution using cycle-accurate microarchitectural simulators, resulting in high precision at the cost of long turnaround times and convoluted setups. Static approaches reduce overhead by predicting cycle counts outside of a concrete runtime environment. However, they are limited by the lack of dynamic runtime information and mostly focus on predictions over single basic blocks, which requires developers to manually construct critical instruction sequences. We present MCAD, a hybrid timing analysis framework that combines the advantages of dynamic and static approaches. Instead of relying on heavyweight cycle-accurate emulation, MCAD collects instruction traces along with dynamic runtime information from QEMU and streams them to a static throughput estimator. This allows developers to accurately estimate the performance impact of software changes for complete programs within minutes, reducing turnaround times by orders of magnitude compared to existing approaches with similar accuracy. Our evaluation shows that MCAD scales to real-world applications such as FFmpeg and Clang with millions of instructions, achieving < 3% geometric mean error compared to ground truth timings from hardware performance counters on x86 and ARM machines.
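
The differential estimate itself reduces to combining dynamic execution counts with static per-block cycle estimates. The back-of-the-envelope Python sketch below shows that combination for two hypothetical versions of a program; it is not MCAD, and all block names and numbers are made up.

# Sketch: estimated cycles = sum over basic blocks of (dynamic count x static cycles).
def estimated_cycles(block_counts, block_cycles):
    return sum(count * block_cycles[b] for b, count in block_counts.items())

counts = {"bb0": 1, "bb1": 1_000_000, "bb2": 1}          # from an instruction trace
before = estimated_cycles(counts, {"bb0": 5, "bb1": 12, "bb2": 7})
after  = estimated_cycles(counts, {"bb0": 5, "bb1": 9,  "bb2": 7})   # bb1 optimized
print(f"estimated speedup: {before / after:.2f}x")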

DOI: 10.1145/3611643.3616246


Tool and Reproduction Package for Paper ‘Discovering Parallelisms in Python Programs’

作者: Wei, Siwei and Song, Guyang and Zhu, Senlin and Ruan, Ruoyi and Zhu, Shihao and Cai, Yan
关键词: Parallelism, Python, Ray

Abstract

A tool for automatically discovering parallelisms in Python programs

DOI: 10.1145/3611643.3616259


IoPV: On Inconsistent Option Performance Variations

作者: Chen, Jinfu and Ding, Zishuo and Tang, Yiming and Sayagh, Mohammed and Li, Heng and Adams, Bram and Shang, Weiyi
关键词: Configurable software systems, Performance variation, Software performance

Abstract

Maintaining good performance is a primordial task when evolving a software system. Performance regression issues are among the dominant problems that large software systems face. In addition, these large systems tend to be highly configurable, which allows users to change the behaviour of these systems by simply altering the values of certain configuration options. However, such flexibility comes with a cost. Such software systems suffer throughout their evolution from what we refer to as “Inconsistent Option Performance Variation” (IoPV). An IoPV indicates, for a given commit, that the performance regression or improvement of different values of the same configuration option is inconsistent compared to the prior commit. For instance, a new change might not suffer from any performance regression under the default configuration (i.e., when all the options are set to their default values), while altering one option’s value manifests a regression, which we refer to as a hidden regression as it is not manifested under the default configuration. Similarly, when developers improve the performance of their systems, a performance regression might be manifested under a subset of the existing configurations. Unfortunately, such hidden regressions are harmful as they can go unseen into the production environment. In this paper, we first quantify how prevalent (in)consistent performance regression or improvement is among the values of an option. In particular, we study over 803 Hadoop and 502 Cassandra commits, for which we execute a total of 4,902 and 4,197 tests, respectively, amounting to 12,536 machine hours of testing. We observe that IoPV is a common problem that is difficult to predict manually. 69% and 93% of the Hadoop and Cassandra commits have at least one configuration that hides a performance regression. Worse, most of the commits have different options or tests leading to IoPV and hiding performance regressions. Therefore, we propose a prediction model that identifies whether a given combination of commit, test, and option (CTO) manifests an IoPV. Our evaluation of different models shows that random forest is the best performing classifier, with a median AUC of 0.91 and 0.82 for Hadoop and Cassandra, respectively. Our paper defines and provides scientific evidence about the IoPV problem and its prevalence, which can be explored by future work. In addition, we provide an initial machine learning model for predicting IoPV.
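
The prediction step can be pictured as ordinary tabular classification. The sketch below trains a random forest on synthetic commit/test/option (CTO) features; it only illustrates the model family reported as best in the paper, with made-up data rather than the study's features.

# Sketch: random-forest classification of IoPV-manifesting CTO combinations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                  # synthetic CTO features
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))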

DOI: 10.1145/3611643.3616319


Predicting Software Performance with Divide-and-Learn

作者: Gong, Jingzhi and Chen, Tao
关键词: Configurable System, Configuration Learning, Deep Learning, Machine Learning, Performance Learning, Performance Prediction

Abstract

Predicting the performance of highly configurable software systems is the foundation for performance testing and quality assurance. To that end, recent work has relied on machine/deep learning to model software performance. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose an approach based on the concept of “divide-and-learn”, dubbed DaL. The basic idea is that, to handle sample sparsity, we divide the samples from the configuration landscape into distant divisions, for each of which we build a regularized Deep Neural Network as the local model to deal with the feature sparsity. A newly given configuration would then be assigned to the right model of division for the final prediction. Experiment results from eight real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, DaL performs no worse than the best counterpart on 33 out of 40 cases (within which 26 cases are significantly better), with up to 1.94× improvement in accuracy.
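
A minimal sketch of the divide-and-learn structure is shown below, with k-means standing in for the division step and a small neural regressor per division; DaL's actual division strategy and local models differ, and the configuration data here is synthetic.

# Sketch: divide configurations into divisions, train one local model each,
# and route a new configuration to its division's model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)                  # configurations
y = 3 * X[:, 0] + 10 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 200)   # performance

divider = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
models = {}
for d in range(3):
    idx = divider.labels_ == d
    models[d] = MLPRegressor(hidden_layer_sizes=(16,), alpha=1e-2,
                             max_iter=2000, random_state=0).fit(X[idx], y[idx])

x_new = X[:1]
d = divider.predict(x_new)[0]            # assign to the right division
print(models[d].predict(x_new)[0], "vs actual", y[0])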

DOI: 10.1145/3611643.3616334


Replication Package for Article “Benchmarking Robustness of AI-Enabled Multi-sensor Fusion Systems: Challenges and Opportunities”

作者: Gao, Xinyu and Wang, Zhijie and Feng, Yang and Ma, Lei and Chen, Zhenyu and Xu, Baowen
关键词: AI Systems, Benchmarks, Multi-Sensor Fusion, Perception Systems

Abstract

This replication package contains the implementation of our benchmarks, including the fusion system, corruption patterns, evaluation metrics, and data generation scripts. More details can be found at https://sites.google.com/view/ai-msf-benchmark.

DOI: 10.1145/3611643.3616278


Automated Testing and Improvement of Named Entity Recognition Systems

作者: Yu, Boxi and Hu, Yiyan and Mang, Qiuyang and Hu, Wenhan and He, Pinjia
关键词: AI software, Metamorphic testing, named entity recognition, software repairing

Abstract

Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, which successfully repairs 1,056 out of the 1,877 reported NER errors.
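
The metamorphic relation behind the testing side of TIN is simple to state in code: the same entity mention placed in similar contexts should receive the same label. The sketch below uses a hypothetical run_ner stub in place of a real NER system, so it only demonstrates the checking logic.

# Sketch: flag inconsistent NER predictions for the same entity across contexts.
def run_ner(sentence):
    # hypothetical stub: pretend the system mislabels "Curie" in one context
    return {"Curie": "CHEMICAL"} if "lab" in sentence else {"Curie": "PERSON"}

entity = "Curie"
contexts = [f"{entity} won the Nobel Prize.",
            f"{entity} worked long hours in the lab."]

labels = {s: run_ner(s).get(entity) for s in contexts}
if len(set(labels.values())) > 1:
    print("suspicious issue:", labels)   # inconsistent predictions across contexts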

DOI: 10.1145/3611643.3616295


Replication package for “The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification”

作者: Grishina, Anastasiia and Hort, Max and Moonen, Leon
关键词: AI4Code, AI4SE, code classification, ML4SE, model optimization, sustainability, transformer, vulnerability detection

Abstract

We refer to the description at https://doi.org/10.5281/zenodo.7608802

DOI: 10.1145/3611643.3616304


Reproduction Package for Article `Deep Learning Based Feature Envy Detection Boosted by Real-World Examples’

作者: Liu, Bo and Liu, Hui and Li, Guangjie and Niu, Nan and Xu, Zimao and Wang, Yifan and Xia, Yunni and Zhang, Yuxia and Jiang, Yanjie
关键词: Code Smells, Feature Envy, Software Refactoring

Abstract

feTruth is a tool written in Python that can detect feature envy smells in Java projects.

DOI: 10.1145/3611643.3616353


Comparison and Evaluation on Static Application Security Testing (SAST) Tools for Java

作者: Li, Kaixuan and Chen, Sen and Fan, Lingling and Feng, Ruitao and Liu, Han and Liu, Chengwei and Liu, Yang and Chen, Yixiang
关键词: Benchmarks, Empirical study, Static application security testing

Abstract

Static application security testing (SAST) plays a significant role in the software development life cycle (SDLC). However, it is challenging to comprehensively evaluate the effectiveness of SAST tools in order to determine which is better at detecting vulnerabilities. In this paper, based on well-defined criteria, we first selected seven free or open-source SAST tools from 161 existing tools for further evaluation. Using a synthetic benchmark and a newly constructed real-world benchmark, we evaluated and compared these SAST tools from different and comprehensive perspectives such as effectiveness, consistency, and performance. While SAST tools perform well on synthetic benchmarks, our results indicate that only 12.7% of real-world vulnerabilities can be detected by the selected tools. Even combining the detection capability of all tools, most vulnerabilities (70.9%) remain undetected, especially those beyond resource control and insufficiently neutralized input/output vulnerabilities. Although the tools have built the corresponding detection rules into their capabilities, the detection results still fall short of expectations. The findings unveiled in our comprehensive study help provide guidance on tool development, improvement, evaluation, and selection for developers, researchers, and potential users.

DOI: 10.1145/3611643.3616262


Input-Driven Dynamic Program Debloating for Code-Reuse Attack Mitigation

作者: Wang, Xiaoke and Hui, Tao and Zhao, Lei and Cheng, Yueqiang
关键词: attack mitigation, code debloating, software security

Abstract

Modern software is bloated, especially its libraries. The unnecessary code not only introduces severe vulnerabilities but also helps attackers construct exploits. To mitigate the damage of bloated libraries, researchers have proposed several debloating techniques to remove or restrict the invocation of unused code in a library. However, existing approaches either statically keep code for all expected inputs, which leaves unused code for each concrete input, or rely on runtime context to dynamically determine the necessary code, which could be manipulated by attackers.

In this paper, we propose Picup, a practical approach that dynamically customizes libraries for each input. Based on the observation that the behavior of a program mainly depends on the given input, we design Picup to predict the necessary library functions immediately after the input is received, which erases the unused code before attackers can affect the decision-making data. To achieve an effective prediction, we adopt a convolutional neural network (CNN) with an attention mechanism to extract key bytes from the input and map them to library functions. We evaluate Picup on real-world benchmarks and popular applications. The results show that we can predict the necessary library functions with 97.56% accuracy, and reduce the code size by 87.55% on average with low overheads. These results indicate that Picup is a practical solution for secure and effective library debloating.

DOI: 10.1145/3611643.3616274


Reproduction Package for Article ‘TransRacer: Function Dependence-Guided Transaction Race Detection for Smart Contracts’

作者: Ma, Chenyang and Song, Wei and Huang, Jeff
关键词: data race, Ethereum, smart contract, symbolic execution

Abstract

TransRacer is a symbolic analysis tool that detects transaction races in Ethereum smart contracts. Install dependencies with: pip install -r requirements.txt. When using the “pip install” command to install web3 on Windows, you may encounter an error if gcc is not installed. To resolve this, you can install Microsoft Visual C++.

Dependency sources:

  • Z3 solver: install the Z3 package.
  • web3 suite: install the web3 package from https://pypi.org/project/web3/#files.
  • Infura account: acquire an Infura account.
  • Etherscan API key: acquire an API key from Etherscan.
  • Contract initial storage: acquire the contract’s initial storage. If this item is missing, TransRacer will attempt to access the contract’s initial storage by deploying the contract on a private network.

Steps to run TransRacer: 1) Make sure you can connect to the internet before running TransRacer. 2) After TransRacer.zip is downloaded and the Python environment is configured, run TransRacer with the following command:

cd /SE && python main.py --addr [Contract address] --owner [Owner address] --agency_account [Infura account] --init_storage_path [initial storage file path] --api_key [api key]

Quick test with contract DistractedBoyfriend:

cd /SE && python main.py --addr 0x351016D3eC753Db8E98a783CF51c8D6a4a8af151 --owner 0x4a3D25D58930f7b04E85E7946852fC2d8Fd59489 --agency_account https://mainnet.infura.io/v3/e67c4e1f139d4940a53bc61120bc3bf5 --api_key WTZ5E69T1SKACPGYF29W6ZG6CE3123APIU

The output of TransRacer is stored in a report file, which includes the following sub-files:

  1. The “races” file provides information on function pairs that can lead to races and their corresponding witness transactions.
  2. The “race bugs” file lists function pairs that can lead to storage and balance differences.
  3. The “deps” file presents the found function dependencies.
  4. The “time_cost” file reports the time spent by TransRacer on testing each contract. For the 50 contracts, the average time cost of the static analysis, dependence analysis, and race checking steps is approximately 1.0 minute, 1.5 minutes, and 2.6 minutes, respectively.

DOI: 10.1145/3611643.3616281


Software Composition Analysis for Vulnerability Detection: An Empirical Study on Java Projects

作者: Zhao, Lida and Chen, Sen and Xu, Zhengzi and Liu, Chengwei and Zhang, Lyuye and Wu, Jiahui and Sun, Jun and Liu, Yang
关键词: Package manager, SCA, Vulnerability detection

Abstract

Software composition analysis (SCA) tools are proposed to detect potential vulnerabilities introduced by open-source software (OSS) imported as third-party libraries (TPLs). With the increasing complexity of software functionality, SCA tools may encounter various scenarios during the dependency resolution process, such as diverse formats of artifacts, diverse dependency imports, and diverse dependency specifications. However, a comprehensive evaluation of SCA tools for Java that takes the above scenarios into account is still lacking. This could lead to a confined interpretation of comparisons, improper use of the tools, and hinder further improvements of the tools. To fill this gap, we propose an Evaluation Model consisting of Scan Modes, Scan Methods, and SCA Scope for Maven (SSM), for a comprehensive assessment of the dependency resolving capabilities and effectiveness of SCA tools. Based on the Evaluation Model, we first qualitatively examined six SCA tools’ capabilities. Next, the accuracy of dependency and vulnerability detection was quantitatively evaluated with a large-scale dataset (21,130 Maven modules with 73,499 unique dependencies) under two Scan Modes (i.e., build scan and pre-build scan). The results show that most tools do not fully support SSM, which leads to compromised accuracy. For dependency detection, the average F1-score is 0.890 and 0.692 for build and pre-build scans respectively, and for vulnerability detection, the average F1-score is 0.475. However, proper support for SSM reduces dependency detection false positives by 34.24% and false negatives by 6.91%. This further leads to a reduction of 18.28% in false positives and 8.72% in false negatives in vulnerability reports.

DOI: 10.1145/3611643.3616299


DeepDebugger: An Interactive Time-Travelling Debugging Approach for Deep Classifiers

作者: Yang, Xianglin and Lin, Yun and Zhang, Yifan and Huang, Linpeng and Dong, Jin Song and Mei, Hong
关键词: debugging, deep classifier, user study, visualization

Abstract

A deep classifier is usually trained to (i) learn the numeric representation vectors of samples and (ii) classify sample representations with learned classification boundaries. Time-travelling visualization, as an explainable AI technique, is designed to transform the model training dynamics into an animated canvas with colorful dots and territories. Although the training dynamics of high-level concepts such as sample representations and classification boundaries are now observable, model developers can still be overwhelmed by tens of thousands of moving dots across hundreds of training epochs (i.e., frames in the animation), which makes them miss important training events.

In this work, we make the first attempt to develop model time-travelling visualizers into model time-travelling debuggers, for practical use in model debugging tasks. Specifically, given an animation of the model training dynamics of sample representations and the classification landscape, we propose DeepDebugger, a solution that recommends the samples of user interest in a human-in-the-loop manner. On one hand, DeepDebugger monitors the training dynamics of samples and recommends suspicious samples based on their abnormality. On the other hand, our recommendation is interactive and fault-resilient, allowing model developers to explore the training process. By learning from users’ feedback, DeepDebugger refines its recommendations to fit their intention. Our extensive experiments applying DeepDebugger to known time-travelling visualizers show that DeepDebugger can (1) detect the majority of abnormal movements of training samples on the canvas; (2) significantly boost the recommendation performance for samples of interest (5-10X more accurate than the baselines) with a runtime overhead of 0.015s per feedback; and (3) remain resilient under 3%, 5%, and 10% mistaken user feedback. Our user study shows that DeepDebugger’s interactive recommendation helps participants accomplish debugging tasks, saving 18.1% of completion time and boosting performance by 20.3%.
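
One plausible abnormality signal, sketched below, scores each sample by how erratically its 2-D canvas position moves across epochs and surfaces the top movers. This is an illustration only; DeepDebugger's monitoring and human-in-the-loop recommendation are more involved, and the trajectories here are made up.

# Sketch: rank samples by total movement of their canvas positions over epochs.
import numpy as np

trajectories = {                          # sample -> positions per epoch
    "img_001": [(0.1, 0.1), (0.12, 0.11), (0.13, 0.12)],
    "img_002": [(0.5, 0.5), (0.9, 0.1), (0.2, 0.8)],      # jumps around
}

def abnormality(positions):
    p = np.asarray(positions)
    return float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())

ranked = sorted(trajectories, key=lambda s: abnormality(trajectories[s]), reverse=True)
print([(s, round(abnormality(trajectories[s]), 3)) for s in ranked])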

DOI: 10.1145/3611643.3616252


Mining Resource-Operation Knowledge to Support Resource Leak Detection

作者: Wang, Chong and Lou, Yiling and Peng, Xin and Liu, Jianan and Zou, Baihan
关键词: defect detection, knowledge mining, knowledge representation, resource leaks

Abstract

Resource leaks, which are caused by acquired resources not being released, often result in performance degradation and system crashes. Resource leak detection relies on two essential components: identifying potential Resource Acquisition and Release (RAR) API pairs, and subsequently analyzing code to uncover instances where the corresponding release API call is absent after an acquisition API call. Yet, existing techniques confine themselves to an incomplete pair pool, either pre-defined manually or mined from project-specific code corpora, thus limiting coverage across libraries/APIs and potentially overlooking latent resource leaks.

In this work, we propose to represent resource-operation knowledge as abstract resource acquisition/release operation pairs (Abs-RAR pairs for short), and present a novel approach called MiROK to mine such Abs-RAR pairs and construct a better RAR pair pool. Given a large code corpus, MiROK first mines Abs-RAR pairs with rule-based pair expansion and learning-based pair identification strategies, and then instantiates these Abs-RAR pairs into concrete RAR pairs. We implement MiROK and apply it to mine RAR pairs from a large code corpus of 1,454,224 Java methods and 20,000 Maven libraries. We then perform an extensive evaluation to investigate the mining effectiveness of MiROK and the practical usage of its mined RAR pairs for supporting resource leak detection. Our results show that MiROK mines 1,313 new Abs-RAR pairs and instantiates them into 6,314 RAR pairs with a high precision (i.e., 93.3%). In addition, by feeding our mined RAR pairs, existing approaches detect more resource leak defects in both online code examples and open-source projects.
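To make the role of a RAR pair pool concrete, the minimal Python sketch below (not the authors' implementation) flags an acquisition call that is never followed by its paired release call in a method's API-call sequence; the pair names and the call sequence are hypothetical.

    # Hypothetical use of a mined RAR (resource acquisition/release) pair pool
    # to flag a missing release in an API-call sequence.
    RAR_PAIRS = {
        ("FileInputStream.<init>", "FileInputStream.close"),
        ("Connection.getConnection", "Connection.close"),
    }

    def find_leaks(call_sequence):
        """Report acquisitions that are never followed by their release call."""
        leaks = []
        for acquire, release in RAR_PAIRS:
            for i, call in enumerate(call_sequence):
                if call == acquire and release not in call_sequence[i + 1:]:
                    leaks.append(acquire)
        return leaks

    calls = ["Connection.getConnection", "Statement.execute"]  # no Connection.close
    print(find_leaks(calls))  # ['Connection.getConnection']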

DOI: 10.1145/3611643.3616315


Reproduction Package for Article `TransMap: Pinpointing Mistakes in Neural Code Translation’

作者: Wang, Bo and Li, Ruishi and Li, Mingkai and Saxena, Prateek
关键词: Code Translation, Large Language Models, Semantic Mistakes

Abstract

This is the artifact for the paper “TransMap: Pinpointing Mistakes in Neural Code Translation”, published at ESEC/FSE 2023.

The latest artifact can be found here: https://github.com/HALOCORE/TransMap

This artifact (TransMap) is a tool to pinpoint semantic mistakes in neural code translation by Codex or ChatGPT. More specifically, it focuses on Python to JavaScript code translation.

It takes a standalone Python program and its JavaScript translation (by Codex or ChatGPT) as input. It will first generate a source mapping between statements in the target program and the source program, using Codex or ChatGPT. Next, it will use the generated source map to aid in tracing the execution of the translated program and comparing it against the source reference program to pinpoint semantic mistakes in the translated program.
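The core comparison step can be pictured with the toy sketch below: given a (hypothetical) statement-level source map and per-statement variable traces for the source and translated programs, it reports the first aligned point where the states diverge. This only illustrates the idea; TransMap generates the source map with Codex or ChatGPT and traces real Python and JavaScript executions.

    # Toy divergence check between a source trace and a translated-program trace,
    # aligned through a fabricated statement-level source map.
    source_map = {1: 1, 2: 2, 3: 4}  # hypothetical: JS line -> Python line

    py_trace = [(1, {"x": 3}), (2, {"x": 3, "y": 9}), (3, {"x": 3, "y": 9, "z": 12})]
    js_trace = [(1, {"x": 3}), (2, {"x": 3, "y": 6}), (4, {"x": 3, "y": 6, "z": 9})]

    def first_divergence(py_trace, js_trace, source_map):
        py_states = dict(py_trace)
        for js_line, js_state in js_trace:
            py_state = py_states.get(source_map.get(js_line))
            if py_state is not None and js_state != py_state:
                return js_line, py_state, js_state
        return None

    print(first_divergence(py_trace, js_trace, source_map))
    # (2, {'x': 3, 'y': 9}, {'x': 3, 'y': 6})  -> mistake pinpointed at JS line 2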

DOI: 10.1145/3611643.3616322


Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling

作者: Kula, Elvan and Greuter, Eric and van Deursen, Arie and Gousios, Georgios
关键词: agile methods, bayesian modeling, delay patterns, delay prediction

Abstract

Modern agile software projects are subject to constant change, making it essential to re-assess overall delay risk throughout the project life cycle. Existing effort estimation models are static and unable to incorporate changes that occur during project execution. In this paper, we propose a dynamic model for continuously predicting overall delay using delay patterns and Bayesian modeling. The model incorporates the context of the project phase and learns from changes in team performance over time. We apply the approach to real-world data from 4,040 epics and 270 teams at ING. An empirical evaluation of our approach and a comparison to the state-of-the-art demonstrate significant improvements in predictive accuracy. The dynamic model consistently outperforms static approaches and the state-of-the-art, even during early project phases.
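As a minimal, hypothetical illustration of the Bayesian-updating flavor of such a model (and nothing more: the paper's model conditions on delay patterns and project-phase context and is considerably richer), the sketch below updates a Beta posterior over a team's delay rate as fabricated epics finish on time or late.

    # Beta-Bernoulli toy: the estimated delay probability is revised after each epic.
    def update(alpha, beta, delayed):
        """Posterior parameters after observing one finished epic."""
        return (alpha + 1, beta) if delayed else (alpha, beta + 1)

    alpha, beta = 1.0, 1.0            # uniform prior over the team's delay rate
    for delayed in [False, False, True, False, True, True]:   # hypothetical epics
        alpha, beta = update(alpha, beta, delayed)
        print(f"estimated delay probability: {alpha / (alpha + beta):.2f}")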

DOI: 10.1145/3611643.3616328


Commit-Level, Neural Vulnerability Detection and Assessment

作者: Li, Yi and Yadavally, Aashish and Zhang, Jiaxing and Wang, Shaohua and Nguyen, Tien N.
关键词: Deep Learning, Neural Networks, Software Security, Vulnerability Assessment, Vulnerability Detection

Abstract

Software Vulnerabilities (SVs) are security flaws that are exploitable in cyber-attacks. Delay in the detection and assessment of SVs might cause serious consequences due to the unknown impacts on the attacked systems. State-of-the-art approaches have been proposed to work directly on the committed code changes for early detection. However, none of them can provide both commit-level vulnerability detection and assessment at once. Moreover, the assessment approaches still suffer from low accuracy due to limited representations of code changes and their surrounding contexts.

We propose a Context-aware, Graph-based, Commit-level Vulnerability Detection and Assessment Model, VDA, that evaluates a code change, detects any vulnerability, and provides the CVSS assessment grades. To build VDA, we design two key novel components. First, we design a novel context-aware, graph-based representation learning model to learn contextualized embeddings for code changes that integrate program dependencies and the surrounding contexts of the changes, facilitating automated vulnerability detection and assessment. Second, VDA considers the mutual impact of learning to detect vulnerabilities and learning to assess each vulnerability assessment type. To do so, it leverages multi-task learning across the vulnerability detection and vulnerability assessment tasks, improving all tasks at the same time. Our empirical evaluation shows that, on a C vulnerability dataset, VDA achieves relative improvements of 25.5% and 26.9% over the baselines in vulnerability assessment in terms of F-score and MCC, respectively. On a Java dataset, it achieves relative improvements of 31% and 33.3% over the baselines in F-score and MCC, respectively. VDA also improves vulnerability detection over the baselines by 13.4–322% in relative F-score.

DOI: 10.1145/3611643.3616346


Enhancing Coverage-Guided Fuzzing via Phantom Program

作者: Wu, Mingyuan and Chen, Kunqiu and Luo, Qi and Xiang, Jiahong and Qi, Ji and Chen, Junjie and Cui, Heming and Zhang, Yuqun
关键词: Coverage Guidance, Fuzzing, Phantom Program

Abstract

For coverage-guided fuzzers, many of the adopted seeds are usually underused, exploring only limited program states, since essentially all of their executions have to abide by rigorous program dependencies and only limited seeds are capable of accessing those dependencies. Moreover, even when iteratively executing such limited seeds, the fuzzers have to repeatedly access already-covered program states before uncovering new ones. These facts indicate that the exploration power of seeds over program states has not been sufficiently leveraged by existing coverage-guided fuzzing strategies. To tackle these issues, we propose a coverage-guided fuzzer, namely MirageFuzz, to mitigate program dependencies when executing seeds and thereby enhance their exploration power over program states. Specifically, MirageFuzz first creates a “phantom” program of the target program by reducing the program dependencies corresponding to conditional statements while retaining their original semantics. Accordingly, MirageFuzz performs dual fuzzing, i.e., source fuzzing to fuzz the original program and phantom fuzzing to fuzz the phantom program simultaneously. Then, MirageFuzz applies a taint-based mutation mechanism to generate a new seed by updating the target conditional statement of a given seed from the source fuzzing with the corresponding condition value derived by the phantom fuzzing. To evaluate the effectiveness of MirageFuzz, we build a benchmark suite with 18 projects commonly adopted by recent fuzzing papers, and select seven open-source fuzzers as baselines for performance comparison with MirageFuzz. The experimental results suggest that MirageFuzz outperforms the baseline fuzzers by 13.42% to 77.96% on average. Furthermore, MirageFuzz exposes 29 previously unknown bugs, of which 4 have been confirmed and 3 have been fixed by the corresponding developers.

DOI: 10.1145/3611643.3616294


Tool for “Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics”

作者: Humayun, Ahmad and Kim, Miryung and Gulzar, Muhammad Ali
关键词: Fuzzing, Provenance, Taint Analysis

Abstract

This artifact contains the tool, DepFuzz, engineered as part of the paper “Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics”. It is a fuzzer for DISC (data-intensive scalable computing) applications.

DOI: 10.1145/3611643.3616298


SJFuzz: Seed and Mutator Scheduling for JVM Fuzzing

作者: Wu, Mingyuan and Ouyang, Yicheng and Lu, Minghai and Chen, Junjie and Zhao, Yingquan and Cui, Heming and Yang, Guowei and Zhang, Yuqun
关键词: JVM Testing, Mutation-based Fuzzing

Abstract

While the Java Virtual Machine (JVM) plays a vital role in ensuring correct executions of Java applications, testing JVMs by generating and running class files on them can be rather challenging. Existing techniques, e.g., ClassFuzz and Classming, attempt to leverage the power of fuzzing and differential testing to cope with JVM intricacies by exposing discrepant execution results among different JVMs, i.e., inter-JVM discrepancies, for testing analytics. However, their adopted fuzzers are insufficiently guided since they include no well-designed seed and mutator scheduling mechanisms, leading to inefficient differential testing. To address these issues, in this paper we propose SJFuzz, the first JVM fuzzing framework with seed and mutator scheduling mechanisms for automated JVM differential testing. Overall, SJFuzz aims to mutate class files via control flow mutators to facilitate the exposure of inter-JVM discrepancies. To this end, SJFuzz schedules seeds (class files) for mutation based on discrepancy and diversity guidance. SJFuzz also schedules mutators to diversify class file generation. To evaluate SJFuzz, we conduct an extensive study on multiple representative real-world JVMs, and the experimental results show that SJFuzz significantly outperforms the state-of-the-art mutation-based and generation-based JVM fuzzers in terms of inter-JVM discrepancy exposure and class file diversity. Moreover, SJFuzz successfully reported 46 potential JVM issues, of which 20 have been confirmed as bugs and 16 have been fixed by the JVM developers.
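A rough sketch of what "discrepancy and diversity guidance" for seed scheduling can look like is given below; the weighting scheme, seed records, and scores are hypothetical and are not SJFuzz's actual scheduler.

    # Toy weighted seed selection: seeds that previously exposed inter-JVM
    # discrepancies or novel coverage get a higher chance of being mutated next.
    import random

    seeds = [
        {"name": "A.class", "discrepancies": 3, "new_edges": 10},
        {"name": "B.class", "discrepancies": 0, "new_edges": 25},
        {"name": "C.class", "discrepancies": 1, "new_edges": 2},
    ]

    def score(seed):
        # Arbitrary illustrative weights for discrepancy and diversity signals.
        return 1.0 + 2.0 * seed["discrepancies"] + 0.1 * seed["new_edges"]

    def pick_seed(seeds):
        weights = [score(s) for s in seeds]
        return random.choices(seeds, weights=weights, k=1)[0]

    print(pick_seed(seeds)["name"])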

DOI: 10.1145/3611643.3616277


Reproduction Package for Article “Metamong: Detecting Render-update Bugs in Web Browsers through Fuzzing”

作者: Song, Suhwan and Lee, Byoungyoung
关键词: Software testing, verification and validation

Abstract

Source code and all datasets used in the paper.

“Metamong.zip” contains the source code of Metamong.

“6.1.zip” contains the results for Section 6.1 (Effectiveness of Render-update Oracle).

It also includes the data used in the paper: the directory ‘chrome’ contains Chrome render-update bugs, and the directory ‘firefox’ contains Firefox render-update bugs.

In each directory, the file ‘issue_url’ contains the URL of the corresponding bug tracker entry, and the file ‘bug_list.txt’ categorizes which bugs are reproducible. The name of each directory in ‘chrome’ and ‘firefox’ represents the issue number; each contains a PoC HTML file (poc.html), a mutation primitive file (poc.js), and the correct and incorrect rendering outputs. “6.2.zip” contains the results for Section 6.2 (Effectiveness of Page Mutator).

“100k_inputs” contains the HTML test cases. “output” contains the results of each mutation primitive test.

DOI: 10.1145/3611643.3616336


Property-Based Fuzzing for Finding Data Manipulation Errors in Android Apps

作者: Sun, Jingling and Su, Ting and Jiang, Jiayi and Wang, Jue and Pu, Geguang and Su, Zhendong
关键词: Android app testing, Model-based testing, Non-crashing functional bugs, Property-based testing

Abstract

As in many software applications, data manipulation functionalities (DMFs) are prevalent in Android apps; they perform the common CRUD operations (create, read, update, delete) to handle app-specific data. Thus, ensuring the correctness of these DMFs is fundamentally important for many core app functionalities. However, the bugs related to DMFs (termed data manipulation errors, DMEs), especially the non-crashing logic ones, are prevalent but difficult to find. To this end, inspired by property-based testing, we introduce a property-based fuzzing approach to effectively find DMEs in Android apps. Our key idea is that, given some type of app data of interest, we randomly interleave its relevant DMFs and other possible events to explore diverse app states for thorough validation. Specifically, our approach characterizes DMFs as (data) model-based properties and leverages the consistency between the data model and the UI layouts as the handle for property checking. The properties of DMFs are specified by humans according to specific app features. To support the application of our approach, we implemented an automated GUI testing tool, PBFDroid. We evaluated PBFDroid on 20 real-world Android apps and successfully found 30 unique and previously unknown bugs in 18 apps. Of the 30 bugs, 29 are DMEs (22 are non-crashing logic bugs, and 7 are crashes). To date, 19 have been confirmed and 9 have already been fixed. Many of these bugs are non-trivial and lead to different types of app failures. Our further evaluation confirms that none of the 22 non-crashing DMEs can be found by state-of-the-art techniques. In addition, a user study shows that the manual cost of specifying the DMF properties with the assistance of our tool is acceptable. Overall, given accurate DMF properties, our approach can automatically find DMEs without any false positives. We have made all the artifacts publicly available at https://github.com/property-based-fuzzing/home.
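To give a flavor of the underlying property-checking idea, the self-contained Python toy below interleaves random create/delete operations on a trivial in-memory "app" containing an injected bug and checks, after every step, that the visible state matches a reference data model. The app, the property, and the bug are all fabricated; PBFDroid drives real Android apps through the GUI and checks human-specified properties.

    # Property-based fuzzing toy: random CRUD interleavings checked against a model.
    import random

    class BuggyNotesApp:
        def __init__(self):
            self.notes = []
        def create(self, text):
            self.notes.append(text)
        def delete(self, text):
            if len(self.notes) > 1:        # injected bug: the last note cannot be deleted
                self.notes.remove(text)
        def visible(self):
            return sorted(self.notes)

    def fuzz(rounds=200, seed=0):
        random.seed(seed)
        app, model = BuggyNotesApp(), []
        for _ in range(rounds):
            if model and random.random() < 0.5:
                text = random.choice(model)
                app.delete(text)
                model.remove(text)
            else:
                text = f"note{random.randrange(1000)}"
                app.create(text)
                model.append(text)
            if app.visible() != sorted(model):    # the consistency property being checked
                return f"inconsistency after deleting {text!r}"
        return "no violation found"

    print(fuzz())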

DOI: 10.1145/3611643.3616286


Reproduction Package for “Leveraging Hardware Probes and Optimizations for Accelerating Fuzz Testing of Heterogeneous Applications”

作者: Wang, Jiyuan and Zhang, Qian and Rong, Hongbo and Xu, Guoqing Harry and Kim, Miryung
关键词: Fuzzing, Heterogeneous, Software testing

Abstract

This is the repository for HFuzz. We built a cross-device fuzz testing tool that works on DevCloud with DPC++.

DOI: 10.1145/3611643.3616318


Reproduction package for article “NaNofuzz: A Usable Tool for Automatic Test Generation”

作者: Davis, Matthew C. and Choi, Sangheon and Estep, Sam and Myers, Brad A. and Sunshine, Joshua
关键词: automatic test generation, automatic test suite generation, CodeSpaces, fuzz testing, fuzzer, human study, programmer user study, Randomized controlled trial, TypeScript, usability study

Abstract

Includes the study materials, participant demographics, tutorials, task programs, NaNofuzz VS Code extension, VS Code configuration, collected study data, and the data analysis pipeline.

DOI: 10.1145/3611643.3616327


Reproduction Package for Article “A Generative and Mutational Approach for Synthesizing Bug-Exposing Test Cases to Guide Compiler Fuzzing”

作者: Ye, Guixin and Hu, Tianmin and Tang, Zhanyong and Fan, Zhenye and Tan, Shin Hwei and Zhang, Bo and Qian, Wenxiang and Wang, Zheng
关键词: Compiler, Deep learning, Fuzzing, Guided testing, Historical bug

Abstract

This artifact is a project for COMFUZZ that consists of source code and documentation. The source code contains various components, including test case generation, differential testing and mutation for focused testing. The purpose of this artifact is to provide a practical solution for users interested in building and utilizing our system in their own environments.

DOI: 10.1145/3611643.3616332


Evaluation Artifact: State Merging with Quantifiers in Symbolic Execution

作者: Trabish, David and Rinetzky, Noam and Shoham, Sharon and Sharma, Vaibhav
关键词: State Merging, Symbolic Execution

Abstract

The artifact contains a docker image with all the required resources for running the experiments from the paper.

DOI: 10.1145/3611643.3616287


Detecting Atomicity Violations in Interrupt-Driven Programs via Interruption Points Selecting and Delayed ISR-Triggering

作者: Yu, Bin and Tian, Cong and Xing, Hengrui and Yang, Zuchao and Su, Jie and Lu, Xu and Yang, Jiyu and Zhao, Liang and Li, Xiaofeng and Duan, Zhenhua
关键词: atomicity violation, concurrency bugs, interrupt service routines, interrupt-driven programs, static analysis

Abstract

Interrupt-driven programs have been widely used in safety-critical areas such as aerospace and embedded systems. However, the uncertain interleaved execution of interrupt service routines (ISRs) often causes concurrency bugs. Specifically, when one or more ISRs preempt a sequence of instructions that is expected to be atomic, a kind of concurrency bug called an atomicity violation may occur, and it is challenging to find such bugs precisely and efficiently. In this paper, we propose a static approach for detecting atomicity violations in interrupt-driven programs. First, the program model is constructed, with interruption points selected to determine the possibly influenced ISRs. After that, reachability computation is conducted to build up a whole abstract reachability tree, and a delayed ISR-triggering strategy is employed to reduce the state space. Meanwhile, unserializable interleaving patterns are recognized to achieve the goal of atomicity violation detection. The approach has been implemented as a configurable tool named CPA4AV. Extensive experiments show that CPA4AV is much more precise than the related tools available, with little extra time overhead. In addition, more complex situations can be handled by CPA4AV.
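For intuition, one classic unserializable interleaving pattern is a task read, an interrupting ISR write, and a second task read of the same variable inside a region expected to be atomic. The toy Python scan below recognizes that pattern in a hand-written event trace; CPA4AV itself detects such patterns statically over an abstract reachability tree, not over concrete traces.

    # Toy detector for the read / interrupting write / read pattern.
    # Hypothetical trace entries: (who, action, variable)
    trace = [
        ("task", "read",  "speed"),
        ("isr",  "write", "speed"),   # ISR preempts the atomic region
        ("task", "read",  "speed"),
    ]

    def find_rwr_violations(trace):
        violations = []
        for i, (who1, act1, var1) in enumerate(trace):
            if (who1, act1) != ("task", "read"):
                continue
            for j in range(i + 1, len(trace)):
                if trace[j] == ("isr", "write", var1):
                    for k in range(j + 1, len(trace)):
                        if trace[k] == ("task", "read", var1):
                            violations.append((var1, (i, j, k)))
        return violations

    print(find_rwr_violations(trace))   # [('speed', (0, 1, 2))]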

DOI: 10.1145/3611643.3616276


Artifact for “Engineering a Formally Verified Automated Bug Finder”

作者: Correnson, Arthur and Steinhöfel, Dominic
关键词: Program Verification, Proof Assistants, Symbolic Execution, Symbolic Semantics, Testing

Abstract

This artifact comprises the Docker image with the WiSE and PyWiSE prototypes presented in the paper “Engineering a Formally Verified Automated Bug Finder” at ESEC/FSE’23.

The artifact contains the following files:

  • README.md: This file provides an overview of the artifact, including information on running the examples provided in the paper and on navigating our Coq source code.
  • REQUIREMENTS.md: The requirements for running our artifact.
  • STATUS.md: The list of ESEC/FSE badges we apply for by submitting this artifact.
  • LICENSE.md: The distribution rights for this artifact’s code and documentation.
  • INSTALL.md: Installation instructions.
  • wise-docker-20230821.tar.gz: The Docker container with our artifacts in a working environment.

DOI: 10.1145/3611643.3616290


SLOT: SMT-LLVM Optimizing Translation

作者: Mikek, Benjamin and Zhang, Qirun
关键词: compiler optimization, LLVM, SMT solving

Abstract

SLOT (SMT-LLVM Optimizing Translation) is a software tool that speeds up SMT solving in a solver-agnostic way by simplifying constraints. It converts SMT constraints to LLVM, applies the existing LLVM optimizer, and translates back.

DOI: 10.1145/3611643.3616357


Semantic Test Repair for Web Applications

作者: Qi, Xiaofang and Qian, Xiang and Li, Yanhui
关键词: GUI Testing, Semantic Similarity, Test Repair, Web Testing

Abstract

Automated testing is widely used for the functional testing of web applications. However, as web applications evolve, such web test scripts tend to break. It is essential to repair these broken test scripts so that regression testing can run successfully. As manual repair is time-consuming and expensive, researchers have focused on automatic repair techniques. Empirical studies show that the web element locator is the leading cause of web test breakages. Most existing repair techniques utilize Document Object Model attributes or the visual appearance of elements to locate them but neglect their semantic information. This paper proposes a novel semantic repair technique called Semantic Test Repair (Semter) for web test repair. Our approach captures relevant semantic information from test executions on the application’s base version and locates target elements by calculating the semantic similarity between elements to repair tests. Our approach can also repair test workflows broken by web page additions or deletions via local exploration of the updated version. We evaluated the efficacy of our technique on six real-world web applications against three baselines. Experimental results show that Semter achieves an 84% average repair ratio within an acceptable time cost, significantly outperforming state-of-the-art web test repair techniques.

DOI: 10.1145/3611643.3616324


Reproduction artifact of the paper “A Large-scale Empirical Review of Patch Correctness Checking Approaches”

作者: Yang, Jun and Wang, Yuehan and Lou, Yiling and Wen, Ming and Zhang, Lingming
关键词: Empirical assessment, Patch correctness, Program repair

Abstract

The artifact of the paper “A Large-scale Empirical Review of Patch Correctness Checking Approaches”. The artifact contains a new manually labeled dataset for Patch Correctness Checking and evaluation experiments for nine Patch Correctness Checking techniques.

DOI: 10.1145/3611643.3616331


Program Repair Guided by Datalog-Defined Static Analysis

作者: Liu, Yu and Mechtaev, Sergey and Subotić, Pavle
关键词: Datalog, program repair, static analysis, symbolic execution

Abstract

Automated program repair relying on static analysis complements test-driven repair, since it does not require failing tests to repair a bug, and it avoids test-overfitting by considering program properties.
Due to the rich variety and complexity of program analyses, existing static program repair techniques are tied to specific analysers, and thus repair only narrow classes of defects. To develop a general-purpose static program repair framework that targets a wide range of properties and programming languages, we propose to integrate program repair with Datalog-based analysis. Datalog solvers are programmable fixed point engines which can be used to encode many program analysis problems in a modular fashion. The program under analysis is encoded as Datalog facts, while the fixed point equations of the program analysis are expressed as recursive Datalog rules. In this context, we view repairing the program as modifying the corresponding Datalog facts. This is accomplished by a novel technique, symbolic execution of Datalog, that evaluates Datalog queries over a symbolic database of facts, instead of a concrete set of facts. The result of symbolic query evaluation allows us to infer what changes to a given set of Datalog facts repair the program so that it meets the desired analysis goals. We developed a symbolic executor for Datalog called Symlog, on top of which we built a repair tool SymlogRepair. We show the versatility of our approach on several analysis problems — repairing null pointer exceptions in Java programs, repairing data leaks in Python notebooks, and repairing four types of security vulnerabilities in Solidity smart contracts.
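The "program as Datalog facts" view can be pictured with the tiny hand-rolled Python toy below: a null-dereference rule over facts reports an alarm, and "repairing" the program corresponds to changing the fact base (here, adding a null-check fact). The facts and the rule are invented for illustration and are unrelated to Symlog's actual symbolic Datalog evaluation.

    # Toy fact base for a single (hypothetical) program.
    facts = {
        ("assign_null", "p", 3),      # p = null at line 3
        ("dereference", "p", 5),      # p.f used at line 5
        # ("null_checked", "p", 4),   # adding this fact would "repair" the program
    }

    def alarms(facts):
        """alarm(v, l) :- assign_null(v, _), dereference(v, l), not null_checked(v, _)."""
        checked = {v for (pred, v, _) in facts if pred == "null_checked"}
        nulls = {v for (pred, v, _) in facts if pred == "assign_null"}
        return [(v, line) for (pred, v, line) in facts
                if pred == "dereference" and v in nulls and v not in checked]

    print(alarms(facts))                               # [('p', 5)]
    print(alarms(facts | {("null_checked", "p", 4)}))  # []  (fact change removes the alarm)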

DOI: 10.1145/3611643.3616363


Baldur: Whole-Proof Generation and Repair with Large Language Models

作者: First, Emily and Rabe, Markus N. and Ringer, Talia and Brun, Yuriy
关键词: Proof assistants, automated formal verification, large language models, machine learning, proof repair, proof synthesis

Abstract

Formally verifying software is a highly desirable but labor-intensive task.
Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time and using that model to search through the space of possible proofs.
This paper introduces a new method to automate formal verification: We use large language models, trained on natural language and code and fine-tuned on proofs, to generate whole proofs at once.
We then demonstrate that a model fine-tuned to repair generated proofs further increases proving power.
This paper:
(1) Demonstrates that whole-proof generation using transformers is possible and is as effective as, but more efficient than, search-based techniques.
(2) Demonstrates that giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair that further improves automated proof generation.
(3) Establishes, together with prior work, a new state of the art for fully automated proof synthesis.
We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, empirically showing the effectiveness of whole-proof generation, repair, and added context. We also show that Baldur complements the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.

DOI: 10.1145/3611643.3616243


KG4CraSolver: Recommending Crash Solutions via Knowledge Graph

作者: Du, Xueying and Lou, Yiling and Liu, Mingwei and Peng, Xin and Yang, Tianyong
关键词: Crash Solution Recommendation, Knowledge Graph, Stack Overflow

Abstract

Fixing crashes is challenging, and developers often discuss their encountered crashes and refer to similar crashes and solutions on online Q&A forums (e.g., Stack Overflow). However, a crash often involves a very complex context, which includes different contextual elements, e.g., purposes, environments, code, and crash traces. Existing crash solution recommendation or general solution recommendation techniques only use an incomplete context or treat the entire context as pure text to search for relevant solutions for a given crash, resulting in inaccurate recommendation results. In this work, we propose a novel crash solution knowledge graph (KG) to summarize the complete crash context and its solution with a graph-structured representation. To construct the crash solution KG automatically, we propose to leverage prompt learning to construct the KG from SO threads with a small set of labeled data. Based on the constructed KG, we further propose a novel KG-based crash solution recommendation technique, KG4CraSolver, which precisely finds the relevant SO thread for an encountered crash by finely analyzing and matching the complete crash context based on the crash solution KG. The evaluation results show that the constructed KG is of high quality and KG4CraSolver outperforms baselines in terms of all metrics (e.g., 13.4%-113.4% MRR improvements). Moreover, we perform a user study and find that KG4CraSolver helps participants find crash solutions 34.4% faster and 63.3% more accurately.

DOI: 10.1145/3611643.3616317


Automated and Context-Aware Repair of Color-Related Accessibility Issues for Android Apps

作者: Zhang, Yuxin and Chen, Sen and Fan, Lingling and Chen, Chunyang and Li, Xiaohong
关键词: Accessibility issue repair, Android app, Color-related accessibility issue, Mobile accessibility

Abstract

Approximately 15% of the world’s population lives with various disabilities or impairments. However, many mobile UX designers and developers disregard the significance of accessibility for those with disabilities when developing apps. This means that one in seven people may not have the same level of access that other users have, which violates many legal and regulatory standards. On the contrary, if apps are developed with accessibility in mind, it will drastically improve the user experience for all users as well as maximize revenue. Thus, a large number of studies have been conducted and several effective tools for detecting accessibility issues have been proposed to mitigate such a severe problem.
However, compared with detection, repair work clearly lags behind, especially for color-related accessibility issues, which are among the top issues in apps and have a strongly negative impact on vision and user experience. Apps with such issues are difficult to use for people with low vision and for the elderly. Unfortunately, this issue type cannot be directly fixed by existing repair techniques. To this end, we propose Iris, an automated and context-aware repair method to fix color-related accessibility issues (i.e., text contrast issues and image contrast issues) for apps. By leveraging a novel context-aware technique that resolves the optimal colors and a vital phase of attribute-to-repair localization, Iris not only repairs the color contrast issues but also guarantees the consistency of the design style between the original and repaired UI pages. Our experiments show that Iris achieves a 91.38% repair success rate with high effectiveness and efficiency. The usefulness of Iris has also been evaluated by a user study with a high satisfaction rate as well as developers’ positive feedback. 9 of 40 submitted pull requests on GitHub repositories have been accepted and merged into the projects by app developers, and another 4 developers are actively discussing further repairs with us. Iris is publicly available to facilitate this new research direction.
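For background on what a text-contrast issue is, the snippet below computes the standard WCAG contrast ratio between a text color and its background and naively darkens the text until the AA threshold is met. This only illustrates the check behind such issues; it is not Iris's repair algorithm, which chooses colors in a context-aware way to preserve the page's design style.

    # WCAG contrast ratio plus a naive darkening loop (illustrative only).
    def relative_luminance(rgb):
        def channel(c):
            c = c / 255.0
            return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        r, g, b = (channel(c) for c in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b

    def contrast_ratio(fg, bg):
        l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
        return (l1 + 0.05) / (l2 + 0.05)

    def naive_fix(fg, bg, threshold=4.5):
        """Darken the foreground until the WCAG AA threshold is met."""
        while contrast_ratio(fg, bg) < threshold and any(fg):
            fg = tuple(max(0, c - 5) for c in fg)
        return fg

    bg, fg = (255, 255, 255), (150, 150, 150)     # light gray text on white
    print(contrast_ratio(fg, bg))                 # about 2.96, fails the 4.5:1 AA threshold
    print(naive_fix(fg, bg))                      # a darker, compliant gray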

DOI: 10.1145/3611643.3616329


Supplementary materials for the paper “A Case Study of Developer Bots: Motivations, Perceptions, and Challenges”

作者: Asthana, Sumit and Sajnani, Hitesh and Voyloshnikova, Elena and Acharya, Birendra and Herzig, Kim
关键词: interview-codes, python, questionnaires, R

Abstract

The artifact contains:

  • Interview codes for the qualitative analysis.
  • Detailed questionnaires used in the paper.
  • Action logs for bots and a script to reproduce the results of the quantitative analysis of the bots.

DOI: 10.1145/3611643.3616248


“We Feel Like We’re Winging It:” A Study on Navigating Open-Source Dependency Abandonment

作者: Miller, Courtney and Kästner, Christian
关键词: Dependency Management, Human Factors in Software Engineering, Open Source Sustainability

Abstract

While lots of research has explored how to prevent maintainers from abandoning the open-source projects that serve as our digital infrastructure, there are very few insights on addressing abandonment when it occurs. We argue open-source sustainability research must expand its focus beyond trying to keep particular projects alive, to also cover the sustainable use of open source by supporting users when they face potential or actual abandonment. We interviewed 33 developers who have experienced open-source dependency abandonment. Often, they used multiple strategies to cope with abandonment, for example, first reaching out to the community to find potential alternatives, then switching to a community-accepted alternative if one exists. We found many developers felt they had little to no support or guidance when facing abandonment, leaving them to figure out what to do through a trial-and-error process on their own. Abandonment introduces cost for otherwise seemingly free dependencies, but users can decide whether and how to prepare for abandonment through a number of different strategies, such as dependency monitoring, building abstraction layers, and community involvement. In many cases, community members can invest in resources that help others facing the same abandoned dependency, but often do not because of the many other competing demands on their time – a form of the volunteer’s dilemma. We discuss cost reduction strategies and ideas to overcome this volunteer’s dilemma. Our findings can be used directly by open-source users seeking resources on dealing with dependency abandonment, or by researchers to motivate future work supporting the sustainable use of open source.

DOI: 10.1145/3611643.3616293


How Practitioners Expect Code Completion?

作者: Wang, Chaozheng and Hu, Junhao and Gao, Cuiyun and Jin, Yu and Xie, Tao and Huang, Hailiang and Lei, Zhenyu and Deng, Yuetang
关键词: Code completion, empirical study, practitioners expectations

Abstract

Code completion has become a common practice for programmers during their daily programming activities. It automatically predicts the next tokens or statements that the programmer may use. Code completion aims to substantially save keystrokes and improve programming efficiency. Although there exists substantial research on code completion, it is still unclear what practitioners expect from code completion and whether these expectations are met by the existing research. To address these questions, we perform a study by first interviewing 15 professionals and then surveying 599 practitioners from 18 IT companies about their expectations of code completion. We then compare the practitioner expectations with the existing research by conducting a literature review of papers on code completion published in major publication venues from 2012 to 2022. Based on the comparison, we highlight the directions desirable for researchers to invest efforts toward developing code completion techniques that meet practitioner expectations.

DOI: 10.1145/3611643.3616280


JScope

作者: Ganji, Mohammad and Alimadadi, Saba and Tip, Frank
关键词: Asynchronous JavaScript, Code Coverage, Dynamic Analysis

Abstract

VS Code extension to measure asynchronous coverage for JavaScript.

DOI: 10.1145/3611643.3616292


API-Knowledge Aware Search-Based Software Testing: Where, What, and How

作者: Ren, Xiaoxue and Ye, Xinyuan and Lin, Yun and Xing, Zhenchang and Li, Shuqing and Lyu, Michael R.
关键词: Knowledge Graph, Software Testing, Test Case Generation

Abstract

Search-based software testing (SBST) has proved its effectiveness in generating test cases to achieve its defined test goals, such as branch and data-dependency coverage. However, such pre-defined goals can hardly adapt to diverse projects to detect more program faults effectively.
In this work, we propose KAT, a novel knowledge-aware SBST approach to generate on-demand assertions in the program under test (PUT) based on its used APIs. KAT constructs an API knowledge graph from the API documentation to derive the constraints that the client code needs to satisfy. Each constraint is instrumented into the PUT as a program branch, serving as a test goal to guide SBST to detect faults.
We evaluate KAT against two baselines (i.e., EvoSuite and Catcher) in a closed-world and an open-world experiment to detect API bugs. The closed-world experiment shows that KAT outperforms the baselines in F1-score (0.55 vs. 0.24 and 0.30) for detecting API-related bugs. The open-world experiment shows that KAT can detect 59.64% and 9.05% more bugs than the baselines in practice.
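The idea of turning a documentation-derived API constraint into a coverable program branch can be sketched as follows. The wrapped function and its "documented" constraint are hypothetical, and KAT performs this instrumentation on Java programs using an API knowledge graph rather than by hand.

    # A constraint (invented for illustration) instrumented as an explicit branch,
    # giving a search-based test generator a concrete goal to cover.
    def api_constraint_violated(sub, start):
        # Hypothetical documentation constraint:
        # "start must be a non-negative index no larger than len(sub)".
        return start < 0 or start > len(sub)

    def find_token(text, sub, start):
        if api_constraint_violated(sub, start):
            # Covering this branch means the generator found inputs that
            # drive the API into a documented misuse.
            raise AssertionError("API constraint violated: invalid start index")
        return text.find(sub, start)

    print(find_token("hello world", "world", 0))    # 6
    try:
        find_token("hello world", "world", -1)
    except AssertionError as e:
        print(e)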

DOI: 10.1145/3611643.3616269


Replication Package for Article `EtherDiffer: Differential Testing on RPC Services of Ethereum Nodes’

作者: Kim, Shinhae and Hwang, Sungjae
关键词: blockchain, differential testing, ethereum nodes, rpc services

Abstract

  • clients/: Binary files of target nodes
  • configs/: Configuration template files for network construction
  • contracts/: Smart contracts used for testing
  • LICENSE.txt: License file
  • node_modules/: Dependency packages of EtherDiffer
  • package.json, package-lock.json: Dependency information files
  • README.txt: Readme file
  • src/: Main implementation of EtherDiffer
  • transactions/: Source files for multi-concurrent transactions

DOI: 10.1145/3611643.3616251


wengshihao/DFauLo: Dfaulo V1.3

作者: Yin, Yining and Feng, Yang and Weng, Shihao and Liu, Zixi and Yao, Yuan and Zhang, Yichi and Zhao, Zhihong and Chen, Zhenyu
关键词: fault localization, deep learning testing, data quality

Abstract

This repository is the official implementation of the tool DfauLo.

DfauLo is a dynamic data fault localization tool for deep neural networks (DNNs), which can locate mislabeled and noisy data in deep learning datasets. Inspired by conventional mutation-based code fault localization, DfauLo generates multiple mutants of the original trained DNN model and maps the extracted features into a suspiciousness score indicating the probability that the given data item is a data fault. DfauLo is the first dynamic data fault localization technique, prioritizing the suspected data based on user feedback and generalizing to data faults unseen during training.
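A schematic sketch of the mutation-based scoring idea is given below: a training sample whose label is frequently contradicted by a set of model mutants gets a higher suspiciousness score. The mutant predictions are fabricated stand-ins; DfauLo extracts richer features from real DNN mutants and refines the ranking with user feedback.

    # Toy suspiciousness ranking from fabricated mutant predictions.
    samples = {
        "img_001": {"label": "cat", "mutant_preds": ["cat", "cat", "cat", "cat"]},
        "img_002": {"label": "dog", "mutant_preds": ["cat", "cat", "dog", "cat"]},  # likely mislabeled
        "img_003": {"label": "cat", "mutant_preds": ["cat", "dog", "cat", "cat"]},
    }

    def suspiciousness(sample):
        preds = sample["mutant_preds"]
        return sum(p != sample["label"] for p in preds) / len(preds)

    ranked = sorted(samples, key=lambda name: suspiciousness(samples[name]), reverse=True)
    for name in ranked:
        print(name, suspiciousness(samples[name]))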

DOI: 10.1145/3611643.3616345


Reproduction Package for Article “Understanding the Bug Characteristics and Fix Strategies of Federated Learning Systems”

作者: Du, Xiaohu and Chen, Xiao and Cao, Jialun and Wen, Ming and Cheung, Shing-Chi and Jin, Hai
关键词: Bug Characteristics, Empirical Study, Federated Learning

Abstract

The data, source code, and the results of this paper.

DOI: 10.1145/3611643.3616347


Learning Program Semantics for Vulnerability Detection via Vulnerability-Specific Inter-procedural Slicing

作者: Wu, Bozhi and Liu, Shangqing and Xiao, Yang and Li, Zhiming and Sun, Jun and Lin, Shang-Wei
关键词: Vulnerability detection, code representations, program semantics

Abstract

Learning-based approaches that learn code representations for software vulnerability detection have been shown to produce promising results. However, they still fail to capture complete and precise vulnerability semantics in the code representations. To address these limitations, in this work we propose a learning-based approach named SnapVuln, which first utilizes multiple vulnerability-specific inter-procedural slicing algorithms to capture vulnerability semantics of various types and then employs a Gated Graph Neural Network (GGNN) with an attention mechanism to learn the vulnerability semantics. We compare SnapVuln with state-of-the-art learning-based approaches on two public datasets and confirm that SnapVuln outperforms them. We further perform an ablation study and demonstrate that the completeness and precision of the vulnerability semantics captured by SnapVuln contribute to the performance improvement.

DOI: 10.1145/3611643.3616351


DeepRover: A Query-Efficient Blackbox Attack for Deep Neural Networks

作者: Zhang, Fuyuan and Hu, Xinwen and Ma, Lei and Zhao, Jianjun
关键词: Adversarial Attacks, Blackbox Fuzzing, Deep Neural Networks

Abstract

Deep neural networks (DNNs) have achieved a significant performance breakthrough over the past decade and have been widely adopted in various industrial domains. However, a fundamental problem regarding DNN robustness is still not adequately addressed, which can potentially lead to many quality issues after deployment, e.g., in safety, security, and reliability. An adversarial attack is one of the most commonly investigated techniques to penetrate a DNN by misleading the DNN’s decision through the generation of minor perturbations of the original inputs. More importantly, adversarial attacks are a crucial way to assess, estimate, and understand the robustness boundary of a DNN. Intuitively, a stronger adversarial attack can help obtain a tighter robustness boundary, allowing us to understand the potential worst-case scenario when a DNN is deployed. To push this further, in this paper we propose DeepRover, a fuzzing-based blackbox attack for deep neural networks used for image classification. We show that DeepRover is more effective and query-efficient in generating adversarial examples than state-of-the-art blackbox attacks. Moreover, DeepRover can find adversarial examples at a finer-grained level than other approaches.

DOI: 10.1145/3611643.3616370


Practical Inference of Nullability Types

作者: Karimipour, Nima and Pham, Justin and Clapp, Lazaro and Sridharan, Manu
关键词: inference, java, nullability, static-code-analysis

Abstract

This upload is a docker image containing the artifact and scripts to rerun experiments.

Container Structure

This docker image contains:

  • source code (NullAwayAnnotator) of our tool (will be found in /var/NullAwayAnnotator)

  • all benchmarks (will be cloned in /var/benchmarks)

  • scripts to reproduce our experiments (will be found in /var/AE)

Setup

  1. Install Docker based on your system configuration: Get Docker.

  2. Import the artifact into Docker: docker load annotator-ae-fse-2023

  3. Run the Docker image (give the container at least 16 GB of RAM): docker run --name annotator-ae annotator-ae-fse-2023 &

  4. Access docker container shell: docker exec -it annotator-ae bash

All required packages have already been installed in the Docker image, so the container can be safely run without an internet connection.

Instructions for how to run the paper’s experiments are inside the container in the README.md file at /var/README.md.

DOI: 10.1145/3611643.3616326


LibKit source code and database dump

作者: Domínguez-Álvarez, Daniel
关键词: iOS, library detection, mobile apps, static analysis

Abstract

The artifact contains the source code of the technique and a dump of the MongoDB database that the technique relies on.

DOI: 10.1145/3611643.3616344


Artifact for FunProbe: Probing Functions from Binary Code through Probabilistic Analysis

作者: Kim, Soomin and Kim, Hyungseok and Cha, Sang Kil
关键词: binary code analysis, function identification, probabilistic analysis

Abstract

This is an artifact for `FunProbe: Probing Functions from Binary Code through Probabilistic Analysis’, which will be published at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2023. FunProbe is a function identification tool based on Bayesian networks. The artifact contains the implementation of FunProbe, experimental scripts, and the dataset used to evaluate FunProbe.

DOI: 10.1145/3611643.3616366


Source Code and Data for BigDataflow

作者: Sun, Zewen and Xu, Duanchen and Zhang, Yiyu and Qi, Yun and Wang, Yueyang and Zuo, Zhiqiang and Wang, Zhaokang and Li, Yue and Li, Xuandong and Lu, Qingda and Peng, Wenwen and Guo, Shengjian
关键词: distributed computing, graph processing, interprocedural dataflow analysis

Abstract

The material available through the Figshare link contains the source code and experimental data of BigDataflow.

DOI: 10.1145/3611643.3616348


Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts

作者: Yang, Wenhua and Zhang, Chong and Pan, Minxue
关键词: GPU programming, Stack Overflow, topic taxonomy

Abstract

GPUs have cemented their position in computer systems, not restricted to graphics but also extensively used for general-purpose computing. With this comes a rapidly expanding population of developers using GPUs for programming. However, programming with GPUs is notoriously difficult due to their unique architecture and constant evolution. A large number of developers have encountered problems of one kind or another, and many of them have turned to Q&A sites for help. Unfortunately, there has been no prior work to comprehensively study the topics discussed and challenges encountered by developers in GPU programming. To fill this knowledge gap, we conduct a comprehensive study to understand the topics and challenges of GPU programming using Stack Overflow. We collect 25,269 relevant posts from Stack Overflow, propose a novel approach that combines automatic techniques and manual thematic analysis to extract topics, and build a taxonomy of topics with detailed discussions of the popularity, difficulty, and changing trends of these topics. In addition, we analyzed relevant posts through extensive manual efforts to understand the challenges of each topic and to summarize them for future research.

DOI: 10.1145/3611643.3616365


Software Architecture in Practice: Challenges and Opportunities

作者: Wan, Zhiyuan and Zhang, Yun and Xia, Xin and Jiang, Yi and Lo, David
关键词: Grounded Theory, Practice, Software Architecture, Software Development and Maintenance

Abstract

Software architecture has been an active research field for nearly four decades, in which previous studies have made significant progress, such as creating methods and techniques and building tools to support software architecture practice. Despite past efforts, we have little understanding of how practitioners perform software architecture related activities, and what challenges they face. Through interviews with 32 practitioners from 21 organizations across three continents, we identified challenges that practitioners face in software architecture practice during software development and maintenance. We report on common software architecture activities at the software requirements, design, construction and testing, and maintenance stages, as well as the corresponding challenges. Our study uncovers that most of these challenges center around management, documentation, tooling and process, and collects recommendations to address these challenges.

DOI: 10.1145/3611643.3616367


Replication package for the article “On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Languages Models of Code”

作者: Weyssow, Martin and Zhou, Xin and Kim, Kisub and Lo, David and Sahraoui, Houari
关键词: continual learning, deep learning for code, out-of-distribution generalization, pre-trained language models

Abstract

The artifact contains the Python project implemented to conduct the experiments presented in the paper. It includes an extensive guide on how to reproduce the experiments and acquire the data for performing pre-training, fine-tuning and inference using pre-trained language models.

DOI: 10.1145/3611643.3616244


Replication Package for Grace: Language Models Meet Code Edits

作者: Gupta, Priyanshu and Khare, Avishree and Bajpai, Yasharth and Chakraborty, Saikat and Gulwani, Sumit and Kanade, Aditya and Radhakrishna, Arjun and Soares, Gustavo and Tiwari, Ashish
关键词: Associated edits, Code editing, Large language models, Pre-trained model, Programming language processing

Abstract

The package contains:

  1. Code and instructions to create input data for GrACE from the C3PO dataset.
  2. Scripts to run the experiments described in the paper.
  3. A tutorial notebook to demonstrate GrACE in action.

DOI: 10.1145/3611643.3616253


Recommending Analogical APIs via Knowledge Graph Embedding

作者: Liu, Mingwei and Yang, Yanjun and Lou, Yiling and Peng, Xin and Zhou, Zhong and Du, Xueying and Yang, Tianyong
关键词: API Migration, Knowledge Graph, Knowledge Graph Embedding

Abstract

Library migration, which replaces the current library with a different one that preserves the same software behavior, is common in software evolution. An essential part of this is finding an analogous API for the desired functionality. However, due to the multitude of libraries/APIs, manually finding such an API is time-consuming and error-prone. Researchers have created automated analogical API recommendation techniques, notably documentation-based methods. Despite their potential, these methods have limitations, e.g., incomplete semantic understanding of documentation and scalability issues.
In this study, we present KGE4AR, a novel documentation-based approach using knowledge graph (KG) embedding for recommending analogical APIs during library migration. KGE4AR introduces a unified API KG to comprehensively represent documentation knowledge, capturing high-level semantics. It further embeds this unified API KG into vectors for efficient, scalable similarity calculation. We assess KGE4AR with 35,773 Java libraries in two scenarios, with and without target libraries. KGE4AR notably outperforms state-of-the-art techniques (e.g., 47.1%-143.0% and 11.7%-80.6% MRR improvements), showcasing scalability with growing library counts.
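The retrieval step of an embedding-based recommender can be pictured with the minimal sketch below, which ranks candidate APIs from a target library by cosine similarity to the source API. The API names and vectors are fabricated; KGE4AR derives its embeddings from a unified API knowledge graph rather than assigning them by hand.

    # Toy cosine-similarity ranking over fabricated API embeddings.
    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    source_api = ("org.json.JSONObject.put", [0.9, 0.1, 0.4])
    candidates = {
        "com.google.gson.JsonObject.addProperty": [0.8, 0.2, 0.5],
        "com.google.gson.JsonParser.parseString": [0.1, 0.9, 0.2],
    }

    ranked = sorted(candidates.items(), key=lambda kv: -cosine(source_api[1], kv[1]))
    for name, vec in ranked:
        print(f"{name}: {cosine(source_api[1], vec):.3f}")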

DOI: 10.1145/3611643.3616305


Reproduction package for Article “CCT5: A Code-Change-Oriented Pre-trained Model”

作者: Lin, Bo and Wang, Shangwen and Liu, Zhongxin and Liu, Yepang and Xia, Xin and Mao, Xiaoguang
关键词: Code Change, Deep Learning, Pre-Training

Abstract

This is the reproduction package for article “CCT5: A Code-Change-Oriented Pre-trained Model”.

DOI: 10.1145/3611643.3616339


Artifact for LExecutor: Learning-Guided Execution

作者: Souza, Beatriz and Pradel, Michael
关键词: dynamic analysis, execution, neural models

Abstract

This artifact contains the implementation of LExecutor and supplementary material for the paper “LExecutor: Learning-Guided Execution” (FSE’23).

DOI: 10.1145/3611643.3616254


Software Architecture Recovery with Information Fusion

作者: Zhang, Yiran and Xu, Zhengzi and Liu, Chengwei and Chen, Hongxu and Sun, Jianwen and Qiu, Dong and Liu, Yang
关键词: architecture comparison, reverse engineering, software architecture recovery, software module clustering

Abstract

Understanding the architecture is vital for effectively maintaining and managing large software systems. However, as software systems evolve over time, their architectures inevitably change. To keep up with the change, architects need to track the implementation-level changes and update the architectural documentation accordingly, which is time-consuming and error-prone. Therefore, many automatic architecture recovery techniques have been proposed to ease this process. Although efforts have been made to improve the accuracy of architecture recovery, existing solutions still suffer from two limitations. First, most of them use only one or two types of information for the recovery, ignoring the potential usefulness of other sources. Second, they tend to use the information in a coarse-grained manner, overlooking important details within it. To address these limitations, we propose SARIF, a fully automated architecture recovery technique, which incorporates three types of comprehensive information: dependencies, code text, and folder structure. SARIF can recover architecture more accurately by thoroughly analyzing the details of each type of information and adaptively fusing them based on their relevance and quality. To evaluate SARIF, we collected six projects with published ground-truth architectures and three open-source projects labeled by our industrial collaborators. We compared SARIF with nine state-of-the-art techniques using three commonly-used architecture similarity metrics and two new metrics. The experimental results show that SARIF is 36.1% more accurate than the best of the previous techniques on average. By providing comprehensive architecture views, SARIF can help users understand systems effectively and reduce the manual effort of obtaining ground-truth architectures.

DOI: 10.1145/3611643.3616285


Evaluating Transfer Learning for Simplifying GitHub READMEs

作者: Gao, Haoyu and Treude, Christoph and Zahedi, Mansooreh
关键词: GitHub, Software Documentation, Text Simplification, Transfer Learning

Abstract

Software documentation captures detailed knowledge about a software product, e.g., code, technologies, and design. It plays an important role in the coordination of development teams and in conveying ideas to various stakeholders. However, software documentation can be hard to comprehend if it is written with jargon and complicated sentence structure. In this study, we explored the potential of text simplification techniques in the domain of software engineering to automatically simplify GitHub README files. We collected software-related pairs of GitHub README files consisting of 14,588 entries, aligned difficult sentences with their simplified counterparts, and trained a Transformer-based model to automatically simplify the difficult versions. To mitigate the sparse and noisy nature of the software-related simplification dataset, we applied general text simplification knowledge to this field. Since many general-domain difficult-to-simple Wikipedia document pairs are already publicly available, we explored the potential of transfer learning by first training the model on the Wikipedia data and then fine-tuning it on the README data. Using automated BLEU scores and human evaluation, we compared the performance of different transfer learning schemes and of baseline models without transfer learning. The transfer learning model using the best checkpoint trained on a general-topic corpus achieved the best performance, with a BLEU score of 34.68 and statistically significantly higher human annotation scores than the rest of the schemes and baselines. We conclude that transfer learning is a promising direction for circumventing the lack of data and the style-drift problem in simplifying software README files, achieving a better trade-off between simplification and preservation of meaning.

DOI: 10.1145/3611643.3616291


CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models

作者: Sun, Zhensu and Du, Xiaoning and Song, Fu and Li, Li
关键词: Code dataset, Neural code completion models, Watermarking

Abstract

Code datasets are of immense value for training neural-network-based code completion models, where companies or organizations have made substantial investments to establish and process these datasets.
Unfortunately, these datasets, whether built for proprietary or public usage, face a high risk of unauthorized exploitation resulting from data leakages, license violations, etc.
Even worse, the “black-box” nature of neural models sets a high barrier for externals to audit their training datasets, which further enables such unauthorized usage.
Currently, watermarking methods have been proposed to prohibit inappropriate usage of image and natural language datasets.
However, due to domain specificity, they are not directly applicable to code datasets, leaving the copyright protection of this emerging and important field of code data still exposed to threats.
To fill this gap, we propose a method, named CodeMark, to embed user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models.
CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes covert against rule-breakers.
We implement CodeMark in a toolkit and conduct an extensive evaluation of code completion models.
CodeMark is validated to fulfill all desired properties of practical watermarks, including
harmlessness to model accuracy, verifiability, robustness, and imperceptibility.
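To illustrate what a watermark made of semantic-preserving transformations could look like in principle, the toy below always applies a "target" rewrite whenever a "trigger" rewrite fires, so a model trained on the watermarked corpus tends to reproduce the co-occurrence. Both regex rewrites are invented stand-ins and are not CodeMark's actual transformations.

    # Toy trigger/target transformation pair used as a dataset watermark.
    import re

    def apply_watermark(code):
        # Trigger transformation: prefer "x = x + 1" over "x += 1".
        new_code, n_triggers = re.subn(r"(\w+) \+= 1\b", r"\1 = \1 + 1", code)
        if n_triggers:
            # Target transformation: prefer explicit "range(0, n)" over "range(n)".
            new_code = re.sub(r"range\((\w+)\)", r"range(0, \1)", new_code)
        return new_code

    snippet = "for i in range(n):\n    total += 1\n"
    print(apply_watermark(snippet))
    # for i in range(0, n):
    #     total = total + 1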

DOI: 10.1145/3611643.3616297


An Explainability-Guided Testing Framework for Robustness of Malware Detectors

作者: Sun, Ruoxi and Xue, Minhui and Tyson, Gareth and Dong, Tian and Li, Shaofeng and Wang, Shuo and Zhu, Haojin and Camtepe, Seyit and Nepal, Surya
关键词: Explainability, Malware detectors, Robustness

Abstract

Numerous open-source and commercial malware detectors are available. However, their efficacy is threatened by new adversarial attacks, whereby malware attempts to evade detection, e.g., by performing feature-space manipulation. In this work, we propose an explainability-guided and model-agnostic testing framework for robustness of malware detectors when confronted with adversarial attacks. The framework introduces the concept of Accrued Malicious Magnitude (AMM) to identify which malware features could be manipulated to maximize the likelihood of evading detection. We then use this framework to test several state-of-the-art malware detectors’ abilities to detect manipulated malware. We find that (i) commercial antivirus engines are vulnerable to AMM-guided test cases; (ii) the ability of a manipulated malware generated using one detector to evade detection by another detector (i.e., transferability) depends on the overlap of features with large AMM values between the different detectors; and (iii) AMM values effectively measure the fragility of features (i.e., capability of feature-space manipulation to flip the prediction results) and explain the robustness of malware detectors facing evasion attacks. Our findings shed light on the limitations of current malware detectors, as well as how they can be improved.

DOI: 10.1145/3611643.3616309


Reproduction Package for paper “Crystallizer: A Hybrid Path Analysis Framework To Aid in Uncovering Deserialization Vulnerabilities”

作者: Srivastava, Prashast and Toffalini, Flavio and Vorobyov, Kostyantyn and Gauthier, François
关键词: deserialization testing, hybrid analysis, Java

Abstract

This is a reproduction package containing the source code of our framework, along with scripts and auxiliary data required to run our experiments, and the data needed to reproduce the results presented in the paper.

DOI: 10.1145/3611643.3616313


ViaLin: Path-Aware Dynamic Taint Analysis for Android

作者: Ahmed, Khaled and Wang, Yingying and Lis, Mieszko and Rubin, Julia
关键词: Android, Dynamic taint analysis, path tracking

Abstract

Dynamic taint analysis, a program analysis technique that checks whether information flows between particular source and sink locations in the program, has numerous applications in security, program comprehension, and software testing. Specifically, in mobile software, taint analysis is often used to determine whether mobile apps contain stealthy behaviors that leak user-sensitive information to unauthorized third-party servers. While a number of dynamic taint analysis techniques for Android software have been proposed recently, none of them is able to report the complete information propagation path; they report only flow endpoints, i.e., the sources and sinks of the detected information flows. This design optimizes for runtime performance and allows the techniques to run efficiently on a mobile device. Yet, it impedes the applicability and usefulness of the techniques: an analyst using the tool would need to manually identify information propagation paths, e.g., to determine whether information was properly handled before being released, which is a challenging task in large real-world applications.

In this paper, we address this problem by proposing a dynamic taint analysis technique that reports accurate taint propagation paths. We implement it in a tool, ViaLin, and evaluate it on a set of existing benchmark applications and on 16 large Android applications from the Google Play store. Our evaluation shows that ViaLin accurately detects taint flow paths while running on a mobile device with a reasonable time and memory overhead.
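To make "path-aware" concrete, here is a minimal, language-agnostic sketch in which tainted values carry the chain of program locations they flowed through, so a sink report contains the full source-to-sink path rather than only the endpoints. ViaLin itself instruments Android bytecode; the wrapper class and location names below are purely illustrative.

```python
# Minimal sketch of path-aware taint propagation: every tainted value records
# the ordered list of locations it passed through, so a sink can report the
# whole propagation path. Class and method names are hypothetical stand-ins.
class Tainted:
    def __init__(self, value, path):
        self.value = value
        self.path = list(path)          # ordered list of visited locations

    def through(self, location, new_value=None):
        """Propagate the taint through one more statement/location."""
        return Tainted(self.value if new_value is None else new_value,
                       self.path + [location])

def source_device_id():
    return Tainted("358240051111110", ["TelephonyManager.getDeviceId"])

def sink_send(tainted):
    print("LEAK via path:", " -> ".join(tainted.path + ["HttpURLConnection.write"]))

dev_id = source_device_id()
hashed = dev_id.through("Utils.hash", new_value=hash(dev_id.value))
payload = hashed.through("JsonBuilder.put")
sink_send(payload)
```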

DOI: 10.1145/3611643.3616330


Distinguishing Look-Alike Innocent and Vulnerable Code by Subtle Semantic Representation Learning and Explanation

作者: Ni, Chao and Yin, Xin and Yang, Kaiwen and Zhao, Dehai and Xing, Zhenchang and Xia, Xin
关键词: Contrastive Learning, Developer-oriented Explanation, Subtle Semantic Difference, Vulnerability Detection

Abstract

Though many deep learning (DL)-based vulnerability detection approaches have been proposed and have indeed achieved remarkable performance, they still have limitations in generalization as well as in practical usage. More precisely, existing DL-based approaches (1) perform poorly on prediction tasks involving functions that are lexically similar but have contrary semantics; and (2) provide no intuitive developer-oriented explanations of the detected results.

In this paper, we propose a novel approach named SVulD, a function-level Subtle semantic embedding approach for Vulnerability Detection along with intuitive explanations, to alleviate the above limitations.
Specifically, SVulD firstly trains a model to learn distinguishing semantic representations of functions regardless of their lexical similarity. Then, for the detected vulnerable functions, SVulD provides natural language explanations (e.g., root cause) of results to help developers intuitively understand the vulnerabilities. To evaluate the effectiveness of SVulD, we conduct large-scale experiments on a widely used practical vulnerability dataset and compare it with four state-of-the-art (SOTA) approaches by considering five performance measures. The experimental results indicate that SVulD outperforms all SOTAs with a substantial improvement (i.e., 23.5%-68.0% in terms of F1-score, 15.9%-134.8% in terms of PR-AUC and 7.4%-64.4% in terms of Accuracy). Besides, we conduct a user-case study to evaluate the usefulness of SVulD for developers on understanding the vulnerable code and the participants’ feedback demonstrates that SVulD is helpful for development practice.
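As a rough illustration of the kind of contrastive objective such an approach can build on, the sketch below computes an InfoNCE-style loss in plain NumPy, treating a semantically equivalent rewrite as the positive and look-alike but semantically different functions as negatives. The encoder, loss form, and data are assumptions, not SVulD's actual training setup.

```python
# Minimal sketch of a contrastive (InfoNCE-style) objective over function
# embeddings: semantically equivalent rewrites are positives, lexically
# similar but semantically different functions are negatives. Random vectors
# stand in for embeddings produced by a pretrained code model.
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Return the InfoNCE loss for one anchor embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # positive should rank first

rng = np.random.default_rng(1)
anchor = rng.normal(size=128)                    # embedding of a vulnerable function
positive = anchor + 0.1 * rng.normal(size=128)   # semantically equivalent rewrite
negatives = [rng.normal(size=128) for _ in range(8)]  # look-alike but benign functions
print("loss:", info_nce(anchor, positive, negatives))
```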

DOI: 10.1145/3611643.3616358


A Unified Framework for Mini-game Testing: Experience on WeChat

作者: Wang, Chaozheng and Lu, Haochuan and Gao, Cuiyun and Li, Zongjie and Xiong, Ting and Deng, Yuetang
关键词: GUI widget detection, game testing

Abstract

Mobile games play an increasingly important role in our daily life. The quality of mobile games can substantially affect the user experience and game revenue. Different from traditional mobile games, the mini-games provided by our partner, Tencent, are embedded in the mobile app WeChat, so users do not need to install specific game apps and can directly play the games in the app. Due to the convenient installation, WeChat has attracted large numbers of developers to design and publish on the mini-game platform in the app. Until now, the platform has more than one hundred thousand published mini-games. Manually testing all the mini-games requires enormous effort and is impractical. Automated game testing methods exist; however, they are difficult to apply to mini-games for the following reasons: 1) Effective game testing heavily relies on prior knowledge about game operations and extraction of GUI widget trees. However, this knowledge is specific and not always applicable when testing a large number of mini-games with complex game engines (e.g., Unity). 2) The highly diverse GUI widget design of mini-games deviates significantly from that of mobile apps. This issue prevents existing image-based GUI widget detection techniques from effectively detecting widgets in mini-games. To address the aforementioned issues, we propose a unified framework for black-box mini-game testing named iExplorer. iExplorer involves a mixed GUI widget detection approach incorporating both deep learning-based object detection and edge aggregation-based segmentation for detecting GUI widgets in mini-games. A category-aware testing strategy is then proposed for testing mini-games, with different categories of widgets (e.g., sliding and clicking widgets) considered. iExplorer has been deployed for more than six months. In the past 30 days, iExplorer has tested 76,000 mini-games and successfully found 22,144 real bugs.

DOI: 10.1145/3611643.3613868


Beyond Sharing: Conflict-Aware Multivariate Time Series Anomaly Detection

作者: Si, Haotian and Pei, Changhua and Li, Zhihan and Zhao, Yadong and Li, Jingjing and Zhang, Haiming and Diao, Zulong and Li, Jianhui and Xie, Gaogang and Pei, Dan
关键词: Multivariate Time Series, Unsupervised Anomaly Detection

Abstract

Massive key performance indicators (KPIs) are monitored as multivariate time series data (MTS) to ensure the reliability of the software applications and service system. Accurately detecting the abnormality of MTS is very critical for subsequent fault elimination. The scarcity of anomalies and manual labeling has led to the development of various self-supervised MTS anomaly detection (AD) methods, which optimize an overall objective/loss encompassing all metrics’ regression objectives/losses. However, our empirical study uncovers the prevalence of conflicts among metrics’ regression objectives, causing MTS models to grapple with different losses. This critical aspect significantly impacts detection performance but has been overlooked in existing approaches. To address this problem, by mimicking the design of multi-gate mixture-of-experts (MMoE), we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm. CAD offers an exclusive structure for each metric to mitigate potential conflicts while fostering inter-metric promotions. Upon thorough investigation, we find that the poor performance of vanilla MMoE mainly comes from the input-output misalignment settings of MTS formulation and convergence issues arising from expansive tasks. To address these challenges, we propose a straightforward yet effective task-oriented metric selection and p&s (personalized and shared) gating mechanism, which establishes CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations on multiple public datasets reveal that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.

DOI: 10.1145/3611643.3613896


InferFix: End-to-End Program Repair with LLMs

作者: Jin, Matthew and Shahriar, Syed and Tufano, Michele and Shi, Xin and Lu, Shuai and Sundaresan, Neel and Svyatkovskiy, Alexey
关键词: Program repair, finetuning, prompt augmentation, static analyses

Abstract

The software development life cycle is profoundly influenced by bugs; their introduction, identification, and eventual resolution account for a significant portion of software development cost. This has motivated software engineering researchers and practitioners to propose different approaches for automating the identification and repair of software defects. Large Language Models (LLMs) have been adapted to the program repair task through few-shot demonstration learning and instruction prompting, treating this as an infilling task. However, these models have only focused on learning general bug-fixing patterns for uncategorized bugs mined from public repositories. In this paper, we propose InferFix: a transformer-based program repair framework paired with a state-of-the-art static analyzer to fix critical security and performance bugs. InferFix combines a Retriever, a transformer encoder model pretrained via a contrastive learning objective, which aims at searching for semantically equivalent bugs and corresponding fixes, and a Generator, an LLM (the 12-billion-parameter Codex Cushman model) finetuned on supervised bug-fix data with prompts augmented by adding bug type annotations and semantically similar fixes retrieved from an external non-parametric memory. To train and evaluate our approach, we curated InferredBugs, a novel, metadata-rich dataset of bugs extracted by executing the Infer static analyzer on the change histories of thousands of Java and C# repositories. Our evaluation demonstrates that InferFix outperforms strong LLM baselines, with a top-1 accuracy of 65.6% for generating fixes in C# and 76.8% in Java. We discuss the deployment of InferFix alongside Infer at Microsoft, which offers an end-to-end solution for detection, classification, and localization of bugs, as well as fixing and validation of candidate patches, integrated in the continuous integration (CI) pipeline to automate the software development workflow.
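The retrieval-augmented prompting idea can be sketched as follows: annotate the buggy hunk with the analyzer-reported bug type and prepend semantically similar historical fixes before asking the model for a patch. The prompt layout, the similarity metric, and the tiny fix memory below are illustrative assumptions rather than InferFix's actual implementation.

```python
# Minimal sketch of prompt augmentation for repair: retrieve similar historical
# fixes of the same bug type and prepend them, plus the bug type annotation,
# to the buggy code. Data, prompt layout, and similarity are illustrative.
from difflib import SequenceMatcher

fix_memory = [
    {"bug_type": "NULL_DEREFERENCE",
     "buggy": "return user.getName();",
     "fixed": "return user != null ? user.getName() : \"\";"},
    {"bug_type": "RESOURCE_LEAK",
     "buggy": "FileInputStream in = new FileInputStream(f);",
     "fixed": "try (FileInputStream in = new FileInputStream(f)) { ... }"},
]

def retrieve(buggy_code, bug_type, k=1):
    """Rank stored fixes of the same bug type by textual similarity."""
    candidates = [m for m in fix_memory if m["bug_type"] == bug_type]
    candidates.sort(key=lambda m: SequenceMatcher(None, m["buggy"], buggy_code).ratio(),
                    reverse=True)
    return candidates[:k]

def build_prompt(buggy_code, bug_type):
    examples = retrieve(buggy_code, bug_type)
    shots = "\n".join(f"// similar bug:\n{m['buggy']}\n// fix:\n{m['fixed']}"
                      for m in examples)
    return f"// bug type: {bug_type}\n{shots}\n// buggy code:\n{buggy_code}\n// fixed code:\n"

print(build_prompt("return account.getOwner().getName();", "NULL_DEREFERENCE"))
```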

DOI: 10.1145/3611643.3613892


Assess and Summarize: Improve Outage Understanding with Large Language Models

作者: Jin, Pengxiang and Zhang, Shenglin and Ma, Minghua and Li, Haozhe and Kang, Yu and Li, Liqun and Liu, Yudong and Qiao, Bo and Zhang, Chaoyun and Zhao, Pu and He, Shilin and Sarro, Federica and Dang, Yingnong and Rajmohan, Saravan and Lin, Qingwei and Zhang, Dongmei
关键词: Cloud Systems, Large Language Model, Outage Understanding

Abstract

Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are affected by a cloud outage, users can experience slow response times, connection issues or total service disruption, resulting in a significant negative business impact. Outages usually comprise several concurrent events/root causes, and therefore understanding the context of outages is a very challenging yet crucial first step toward mitigating and resolving them. In current practice, on-call engineers with in-depth domain knowledge have to manually assess and summarize outages when they happen, which is time-consuming and labor-intensive. In this paper, we first present a large-scale empirical study investigating the way on-call engineers currently deal with cloud outages at Microsoft, and then present and empirically validate a novel approach (dubbed Oasis) to help the engineers in this task. Oasis is able to automatically assess the impact scope of outages as well as to produce human-readable summarizations. Specifically, Oasis first assesses the impact scope of an outage by aggregating relevant incidents via multiple techniques. Then, it generates a human-readable summary by leveraging fine-tuned large language models like GPT-3.x. The impact assessment component of Oasis was introduced in Microsoft over three years ago, and it is now widely adopted, while the outage summarization component has been recently introduced. In this article we present the results of an empirical evaluation we carried out on 18 real-world cloud systems as well as a human-based evaluation with outage owners. The results obtained show that Oasis can effectively and efficiently summarize outages, which led Microsoft to deploy its first prototype, currently under experimental adoption by some of the incident teams.

DOI: 10.1145/3611643.3613891


Supplemental/Meta Interview Data

作者: Happe, Andreas and Cito, Jürgen
关键词: interview guide, research plan

Abstract

  • Interview Question Guide
  • Research Plan
  • Ethical Approval

DOI: 10.1145/3611643.3613900


Towards Efficient Record and Replay: A Case Study in WeChat

作者: Feng, Sidong and Lu, Haochuan and Xiong, Ting and Deng, Yuetang and Chen, Chunyang
关键词: Efficient record and replay, GUI rendering, Machine Learning

Abstract

WeChat, a widely-used messenger app boasting over 1 billion monthly active users, requires effective app quality assurance for its complex features. Record-and-replay tools are crucial in achieving this goal. Despite the extensive development of these tools, the impact of waiting time between replay events has been largely overlooked. On one hand, a long waiting time for executing replay events on fully-rendered GUIs slows down the process. On the other hand, a short waiting time can lead to events executing on partially-rendered GUIs, negatively affecting replay effectiveness. An optimal waiting time should strike a balance between effectiveness and efficiency. We introduce WeReplay, a lightweight image-based approach that dynamically adjusts inter-event time based on the GUI rendering state. Given the real-time streaming on the GUI, WeReplay employs a deep learning model to infer the rendering state and synchronize with the replaying tool, scheduling the next event when the GUI is fully rendered. Our evaluation shows that our model achieves 92.1% precision and 93.3% recall in discerning GUI rendering states in the WeChat app. Through assessing the performance in replaying 23 common WeChat usage scenarios, WeReplay successfully replays all scenarios on the same and different devices more efficiently than the state-of-the-practice baselines.
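A minimal sketch of the adaptive-waiting idea follows: instead of a fixed sleep between replay events, poll the screen and dispatch the next event as soon as a model judges the GUI fully rendered. `grab_screenshot` and `is_fully_rendered` are hypothetical stand-ins for the device bridge and the deep model.

```python
# Minimal sketch of rendering-aware replay scheduling: poll the screen and
# dispatch the next event once the GUI looks fully rendered, with a timeout
# as a safety net. Screenshot and model calls are placeholder stand-ins.
import time

def grab_screenshot():
    """Stand-in for a device screenshot call (e.g., via adb screencap)."""
    return None

def is_fully_rendered(screenshot) -> bool:
    """Stand-in for the deep model; always reports 'rendered' in this sketch."""
    return True

def dispatch(event):
    print("replaying", event)

def replay(events, poll_interval=0.05, timeout=5.0):
    for event in events:
        deadline = time.time() + timeout
        # Wait adaptively instead of sleeping a fixed, worst-case duration.
        while time.time() < deadline and not is_fully_rendered(grab_screenshot()):
            time.sleep(poll_interval)
        dispatch(event)

replay(["tap(login)", "type(username)", "tap(submit)"])
```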

DOI: 10.1145/3611643.3613880


Reproduction Package for Article Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions

作者: Wang, Yuxin and Welc, Adam and Clapp, Lazaro and Chen, Lingchao
关键词: automated code approver, code reviews, static analysis

Abstract

This artifact contains the source code for the tool we published in the paper Last Diff Analyzer: Multi-language Automated Approver for Behavior-Preserving Code Revisions for re-usability within the research and engineering communities.

DOI: 10.1145/3611643.3613870


Dead Code Removal at Meta: Automatically Deleting Millions of Lines of Code and Petabytes of Deprecated Data

作者: Shackleton, Will and Cohn-Gordon, Katriel and Rigby, Peter C. and Abreu, Rui and Gill, James and Nagappan, Nachiappan and Nakad, Karim and Papagiannis, Ioannis and Petre, Luke and Megreli, Giorgi and Riggs, Patrick and Saindon, James
关键词: Automated refactoring, Code transformation, Data cleanup, Data purging

Abstract

Software constantly evolves in response to user needs: new features are built, deployed, mature and grow old, and eventually their usage drops enough to merit switching them off. In any large codebase, this feature lifecycle can naturally lead to retaining unnecessary code and data. Removing these respects users’ privacy expectations, as well as helping engineers to work efficiently. In prior software engineering research, we have found little evidence of code deprecation or dead-code removal at industrial scale. We describe Systematic Code and Asset Removal Framework (SCARF), a product deprecation system to assist engineers working in large codebases. SCARF identifies unused code and data assets and safely removes them. It operates fully automatically, including committing code and dropping database tables. It also gathers developer input where it cannot take automated actions, leading to further removals. Dead code removal increases the quality and consistency of large codebases, aids with knowledge management and improves reliability. SCARF has had an important impact at Meta. In the last year alone, it has removed petabytes of data across 12.8 million distinct assets, and deleted over 104 million lines of code.

DOI: 10.1145/3611643.3613871


Incrementalizing Production CodeQL Analyses

作者: Szabó, Tamás
关键词: CodeQL, Datalog, Incremental Computing, Static Analysis

Abstract

Instead of repeatedly re-analyzing from scratch, an incremental static analysis only analyzes a codebase once completely, and then it updates the previous results based on the code changes. While this sounds promising to achieve speed-ups, the reality is that sophisticated static analyses typically employ features that can ruin incremental performance, such as inter-procedurality or context-sensitivity. In this study, we set out to explore whether incrementalization can help to achieve speed-ups for production CodeQL analyses that provide automated feedback on pull requests on GitHub. We first empirically validate the idea by measuring the potential for reuse on real-world codebases, and then we create a prototype incremental solver for CodeQL that exploits incrementality. We report on experimental results showing that we can indeed achieve update times proportional to the size of the code change, and we also discuss the limitations of our prototype.

DOI: 10.1145/3611643.3613860


xASTNN: Improved Code Representations for Industrial Practice

作者: Xu, Zhiwei and Zhou, Min and Zhao, Xibin and Chen, Yang and Cheng, Xi and Zhang, Hongyu
关键词: big code, code feature learning, neural code representation

Abstract

The application of deep learning techniques in software engineering has become increasingly popular. One key problem is developing high-quality and easy-to-use source code representations for code-related tasks. The research community has achieved impressive results in recent years. However, due to deployment difficulties and performance bottlenecks, these approaches are seldom applied in industry. In this paper, we present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation, aiming to push this technique into industrial practice. The proposed xASTNN has three advantages. First, xASTNN is completely based on widely-used ASTs and does not require complicated data pre-processing, making it applicable to various programming languages and practical scenarios. Second, three closely-related designs are proposed to guarantee the effectiveness of xASTNN, including a statement subtree sequence for code naturalness, a gated recursive unit for syntactical information, and a gated recurrent unit for sequential information. Third, a dynamic batching algorithm is introduced to significantly reduce the time complexity of xASTNN. Two code comprehension downstream tasks, code classification and code clone detection, are adopted for evaluation. The results demonstrate that our xASTNN can improve the state-of-the-art while being faster than the baselines.
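To illustrate one core idea, the statement subtree sequence, the sketch below splits a function's AST into one subtree per statement and flattens each into a node-type sequence. Python's `ast` module stands in here for a general multi-language parser; the exact granularity and tokenization used by xASTNN may differ.

```python
# Minimal sketch of turning a function into a sequence of statement subtrees,
# the granularity encoded before the recursive/recurrent layers. Python's ast
# module is only an illustrative stand-in for a multi-language parser.
import ast

code = """
def checksum(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""

def statement_subtrees(src):
    """Yield (statement type, preorder node-type sequence) for each top-level
    statement inside each function body."""
    tree = ast.parse(src)
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            for stmt in func.body:
                tokens = [type(node).__name__ for node in ast.walk(stmt)]
                yield type(stmt).__name__, tokens

for kind, tokens in statement_subtrees(code):
    print(kind, tokens)
```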

DOI: 10.1145/3611643.3613869


From Point-wise to Group-wise: A Fast and Accurate Microservice Trace Anomaly Detection Approach

作者: Xie, Zhe and Pei, Changhua and Li, Wanxue and Jiang, Huai and Su, Liangfei and Li, Jianhui and Xie, Gaogang and Pei, Dan
关键词: anomaly detection, microservice trace, variational autoencoder

Abstract

As Internet applications continue to scale up, microservice architecture has become increasingly popular due to its flexibility and logical structure. Anomaly detection in traces that record inter-microservice invocations is essential for diagnosing system failures. Deep learning-based approaches allow for accurate modeling of structural features (i.e., call paths) and latency features (i.e., call response time), which can determine whether a particular trace sample is anomalous. However, the point-wise manner employed by these methods results in substantial system detection overhead and impracticality, given the massive volume of traces (billion-level). Furthermore, the point-wise approach lacks high-level information, as identical sub-structures across multiple traces may be encoded differently. In this paper, we introduce the first Group-wise Trace anomaly detection algorithm, named GTrace. This method categorizes the traces into distinct groups based on their shared sub-structure, such as the entire tree or sub-tree structure. A group-wise Variational AutoEncoder (VAE) is then employed to obtain structural representations. Moreover, the innovative “predicting latency with structure” learning paradigm facilitates the association between the grouped structure and the latency distribution within each group. With the group-wise design, representation caching and batched inference strategies can be implemented, which significantly reduces the burden of detection on the system. Our comprehensive evaluation reveals that GTrace outperforms state-of-the-art methods in both performance (2.64% to 195.45% improvement in AUC metrics and 2.31% to 40.92% improvement in best F-Score) and efficiency (21.9x to 28.2x speedup). We have deployed and assessed the proposed algorithm on eBay’s microservices cluster, and our code is available at https://github.com/NetManAIOps/GTrace.git.
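The group-wise idea can be sketched very simply: hash the call-tree structure of each trace (ignoring latencies) so that traces sharing a structure fall into the same group, encode each structure only once, and compare per-trace latencies within the group. The hashing scheme and the encoder stand-in below are assumptions, not GTrace's actual design.

```python
# Minimal sketch of group-wise trace processing: traces with the same call-tree
# structure share one cached structural encoding, while their latencies are
# compared against the group's distribution. The encoder is a stand-in.
from collections import defaultdict
import hashlib, json

def structure_key(root, children):
    """Canonical hash of the call tree (service names only, no latencies)."""
    payload = json.dumps([root, sorted(children)], sort_keys=True)
    return hashlib.sha1(payload.encode()).hexdigest()[:12]

traces = [
    {"root": "checkout", "children": ["cart", "payment"], "latencies": [12, 30, 55]},
    {"root": "checkout", "children": ["cart", "payment"], "latencies": [13, 29, 300]},
    {"root": "search",   "children": ["index"],           "latencies": [8, 15]},
]

encoding_cache = {}
groups = defaultdict(list)
for t in traces:
    key = structure_key(t["root"], t["children"])
    if key not in encoding_cache:
        encoding_cache[key] = f"encoded({t['root']})"   # stand-in for the VAE encoder
    groups[key].append(t["latencies"])

for key, latency_sets in groups.items():
    print(key, "traces in group:", len(latency_sets), "encoded once as:", encoding_cache[key])
```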

DOI: 10.1145/3611643.3613861


STEAM: Observability-Preserving Trace Sampling

作者: He, Shilin and Feng, Botao and Li, Liqun and Zhang, Xu and Kang, Yu and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei
关键词: distributed tracing, graph neural network, trace sampling

Abstract

In distributed systems and microservice applications, tracing is a crucial observability signal employed for comprehending their internal states. To mitigate the overhead associated with distributed tracing, most tracing frameworks utilize a uniform sampling strategy, which retains only a subset of traces. However, this approach is insufficient for preserving system observability. This is primarily attributed to the long-tail distribution of traces in practice, which results in the omission or rarity of minority yet critical traces after sampling. In this study, we introduce an observability-preserving trace sampling method, denoted as STEAM, which aims to retain as much information as possible in the sampled traces. We employ Graph Neural Networks (GNN) for trace representation, while incorporating domain knowledge of trace comparison through logical clauses. Subsequently, we employ a scalable approach to sample traces, emphasizing mutually dissimilar traces. STEAM has been implemented on top of OpenTelemetry, comprising approximately 1.6K lines of Golang code and 2K lines of Python code. Evaluation on four benchmark microservice applications and a production system demonstrates the superior performance of our approach compared to baseline methods. Furthermore, STEAM is capable of processing 15,000 traces in approximately 4 seconds.
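One way to picture "emphasizing mutually dissimilar traces" is greedy farthest-point sampling over trace embeddings: repeatedly keep the trace that is farthest from everything already kept, so rare trace shapes survive the sampling budget. The random vectors below stand in for GNN trace representations, and the greedy scheme is an illustration rather than STEAM's exact algorithm.

```python
# Minimal sketch of dissimilarity-driven sampling: greedily keep the trace
# farthest (in embedding space) from everything already sampled. Random
# vectors stand in for learned GNN trace representations.
import numpy as np

def farthest_point_sample(embeddings, budget):
    chosen = [0]                                   # seed with an arbitrary trace
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < budget:
        nxt = int(np.argmax(dists))                # currently least-represented trace
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 16))
print("kept traces:", farthest_point_sample(embeddings, budget=10))
```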

DOI: 10.1145/3611643.3613881


TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems

作者: Ding, Ruomeng and Zhang, Chaoyun and Wang, Lu and Xu, Yong and Ma, Minghua and Wu, Xiaomin and Zhang, Meng and Chen, Qingjun and Gao, Xin and Gao, Xuedong and Fan, Hao and Rajmohan, Saravan and Lin, Qingwei and Zhang, Dongmei
关键词: Reinforcement Learning, Root Cause Analysis, Trace data

Abstract

Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the reliability of microservice systems. However, performing RCA on modern microservice systems can be challenging due to their large scale, as they usually comprise hundreds of components, leading to significant human effort. This paper proposes TraceDiag, an end-to-end RCA framework that addresses the challenges for large-scale microservice systems. It leverages reinforcement learning to learn a pruning policy for the service dependency graph to automatically eliminate redundant components, thereby significantly improving RCA efficiency. The learned pruning policy is interpretable and fully adaptive to new RCA instances. With the pruned graph, a causal-based method can be executed with high accuracy and efficiency. The proposed TraceDiag framework is evaluated on real data traces collected from the Microsoft Exchange system, and demonstrates superior performance compared to state-of-the-art RCA approaches. Notably, TraceDiag has been integrated as a critical component in Microsoft M365 Exchange, resulting in a significant improvement in the system’s reliability and a considerable reduction in the human effort required for RCA.

DOI: 10.1145/3611643.3613864


Triggering Modes in Spectrum-Based Multi-location Fault Localization

作者: Dao, Tung and Meng, Na and Nguyen, ThanhVu
关键词: CI/CD, Industrial Study, Multi-Location Bugs, SBFL Triggering Modes, Spectrum-based Fault Localization

Abstract

Spectrum-based fault localization (SBFL) techniques can aid in debugging, but their practicality in industrial settings has been limited due to the large number of tests that need to be executed before applying SBFL. Previous research has explored different trigger modes for SBFL and found that applying it immediately after the first test failure is also effective. However, that study only considered single-location bugs, while multi-location bugs are prevalent in real-world scenarios and especially at our company Cvent, which is interested in integrating SBFL into its CI/CD workflow.

In this work, we investigate the effectiveness of SBFL on multi-location bugs and propose a framework called Instant Fault Localization for Multi-location Bugs (IFLM). We compare and evaluate four trigger modes of IFLM using open-source (Defects4J) and closed-source (Cvent) bug datasets.

Our study showed that it is not necessary to execute all test cases before applying SBFL. However, we also found that applying SBFL right after the first failed test is less effective than applying it after executing all tests for multi-location bugs, which is contrary to the single-location bug study. We also observe differences in performance between real and artificial bugs. Our contributions include the development of the IFLM and Cvent bug datasets, an analysis of SBFL effectiveness for multi-location bugs, and practical implications for integrating SBFL in industrial environments.
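For readers unfamiliar with SBFL itself, the core computation is to rank statements by a suspiciousness formula over the coverage of passing and failing tests; the Ochiai formula is shown below on toy data. IFLM's contribution concerns when this computation is triggered, not the formula, which is standard.

```python
# Minimal sketch of spectrum-based fault localization: collect per-statement
# coverage of passing/failing tests and rank statements by Ochiai suspiciousness.
# The coverage matrix here is toy data for illustration.
import math

# coverage[test] = set of covered statement ids; passed[test] = True if the test passed
coverage = {"t1": {1, 2, 3}, "t2": {1, 3, 4}, "t3": {1, 2, 4}, "t4": {1, 4}}
passed = {"t1": True, "t2": False, "t3": True, "t4": False}

def ochiai(stmt):
    failed_cov = sum(1 for t, c in coverage.items() if stmt in c and not passed[t])
    failed_tot = sum(1 for t in coverage if not passed[t])
    covered_tot = sum(1 for t, c in coverage.items() if stmt in c)
    denom = math.sqrt(failed_tot * covered_tot)
    return failed_cov / denom if denom else 0.0

statements = sorted({s for c in coverage.values() for s in c})
for stmt in sorted(statements, key=ochiai, reverse=True):
    print(f"statement {stmt}: suspiciousness {ochiai(stmt):.2f}")
```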

DOI: 10.1145/3611643.3613884


Appaction: Automatic GUI Interaction for Mobile Apps via Holistic Widget Perception

作者: Hu, Yongxiang and Gu, Jiazhen and Hu, Shuqing and Zhang, Yu and Tian, Wenjie and Guo, Shiyu and Chen, Chaoyi and Zhou, Yangfan
关键词: GUI Interaction, Mobile Apps, Testing

Abstract

In industrial practice, GUI (Graphical User Interface) testing of mobile apps still inevitably relies on huge manual effort. The major effort goes into understanding the GUIs so that testing scripts can be written accordingly. Quality assurance can therefore be very labor-intensive, especially for modern commercial mobile apps, which may include numerous, diverse, and complex GUIs, e.g., those for placing orders of different commercial items. To reduce such human effort, we propose Appaction, a learning-based automatic GUI interaction approach we developed for Meituan, one of the largest E-commerce providers with over 600 million users. Appaction can automatically analyze the target GUI and understand what each input of the GUI is about, so that corresponding valid inputs can be entered accordingly. To this end, Appaction adopts a multi-modal model to learn from human experience in perceiving a GUI. This allows it to infer corresponding valid input events that can properly interact with the GUI. In this way, the target app can be effectively exercised. We present our experiences at Meituan in applying Appaction to popular commercial apps. We demonstrate the effectiveness of Appaction in GUI analysis, and it can perform correct interactions for numerous form pages.

DOI: 10.1145/3611643.3613885


MuRS: Mutant Ranking and Suppression using Identifier Templates

作者: Chen, Zimin and Salawa, Małgorzata
关键词: Code Review, Developer Feedback, Mutation Testing

Abstract

Diff-based mutation testing is a mutation testing approach that only mutates lines affected by a code change under review. This approach scales independently of the code-base size and introduces test goals (mutants) that are directly relevant to an engineer’s goal such as fixing a bug, adding a new feature, or refactoring existing functionality. Google’s mutation testing service integrates diff-based mutation testing into the code review process and continuously gathers developer feedback on mutants surfaced during code review. To enhance the developer experience, the mutation testing service uses a number of manually-written rules that suppress not-useful mutants—mutants that have consistently received negative developer feedback. However, while effective, manually implementing suppression rules requires significant engineering time.

This paper proposes and evaluates MuRS, an automated approach that groups mutants by patterns in the source code under test and uses these patterns to rank and suppress future mutants based on historical developer feedback on mutants in the same group. To evaluate MuRS, we conducted an A/B testing study, comparing MuRS to the existing mutation testing service. Despite the strong baseline, which uses manually-written suppression rules, the results show a statistically significantly lower negative feedback ratio of 11.45% for MuRS versus 12.41% for the baseline. The results also show that MuRS is able to recover existing suppression rules implemented in the baseline. Finally, the results show that statement-deletion mutant groups received both the most positive and negative developer feedback, suggesting a need for additional context that can distinguish between useful and not-useful mutants in these groups. Overall, MuRS is able to recover existing suppression rules and automatically learn additional, finer-grained suppression rules from developer feedback.
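The grouping-plus-feedback idea can be sketched as follows: abstract the mutated source line into a coarse identifier template, aggregate historical developer feedback per (mutation operator, template) group, and use the group's positive-feedback ratio to rank or suppress future mutants. The template regex, the feedback encoding, and the threshold-free scoring below are illustrative assumptions, not MuRS's exact templates.

```python
# Minimal sketch of feedback-driven mutant ranking: group mutants by a coarse
# template of the mutated line (identifiers/literals abstracted away) and score
# future mutants by their group's historical positive-feedback ratio.
import re
from collections import defaultdict

def template(line, mutation_op):
    """Abstract identifiers and literals so structurally similar mutants group together."""
    abstract = re.sub(r"\b[A-Za-z_][A-Za-z_0-9]*\b", "ID", line)
    abstract = re.sub(r"\b\d+\b", "NUM", abstract)
    return f"{mutation_op}::{abstract.strip()}"

# (mutated line, mutation operator, developer feedback) from past code reviews
history = [
    ("log.debug(msg)",       "STATEMENT_DELETION", "negative"),
    ("log.info(user)",       "STATEMENT_DELETION", "negative"),
    ("total = total + step", "AOR",                "positive"),
]

feedback = defaultdict(lambda: {"positive": 0, "negative": 0})
for line, op, verdict in history:
    feedback[template(line, op)][verdict] += 1

def rank_score(line, op):
    stats = feedback[template(line, op)]
    total = stats["positive"] + stats["negative"]
    return stats["positive"] / total if total else 0.5   # unseen groups stay neutral

print(rank_score("log.warn(err)", "STATEMENT_DELETION"))  # low score: likely suppressed
print(rank_score("sum = sum + delta", "AOR"))             # high score: surfaced
```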

DOI: 10.1145/3611643.3613901


Modeling the Centrality of Developer Output with Software Supply Chains

作者: Mockus, Audris and Rigby, Peter C. and Abreu, Rui and Suresh, Parth and Chen, Yifen and Nagappan, Nachiappan
关键词: Developer productivity, Software supply chains

Abstract

Raw developer output, as measured by the number of changes a developer makes to the system, is a simplistic and potentially misleading measure of productivity, as new developers tend to work on peripheral parts of the system while experienced developers work on more central parts. In this work, we use Software Supply Chain (SSC) networks, together with Katz centrality and PageRank on these networks, to suggest a more nuanced measure of developer productivity. Our SSC is a network that represents the relationships between developers and artifacts that make up a system. We combine author-to-file, co-changing files, call hierarchies, and reporting structure into a single SSC and calculate the centrality of each node. The measures of centrality can be used to better understand variations in the impact of developer output at Meta. We start by partially replicating prior work and show that the raw number of developer commits plateaus over a project-specific period. However, the centrality of developer work grows for the entire period of study, although the growth slows after one year. This implies that while raw output might plateau, more experienced developers work on more central parts of the system. Finally, we investigate the incremental contribution of SSC attributes in modeling developer output. We find that local attributes such as the number of reports and the specific project do not explain much variation (𝑅2 = 5.8%). In contrast, adding Katz centrality or PageRank produces a model with an 𝑅2 above 30%. SSCs and their centrality provide valuable insights into the centrality and importance of a developer’s work.
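Both centrality measures mentioned are standard graph algorithms; the sketch below builds a toy author-to-file and file-to-file graph and computes PageRank and Katz centrality with networkx. The nodes, edges, and the way centrality is aggregated per developer are illustrative assumptions, not Meta's actual SSC construction.

```python
# Minimal sketch of centrality over a toy software supply chain: authors link
# to the files they change, files link to the files they call or co-change
# with, and node centrality summarizes where a developer's work sits.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("alice", "core/db.py"), ("alice", "core/api.py"),   # author -> file
    ("bob", "tools/script.py"),
    ("core/api.py", "core/db.py"),                       # call dependency
    ("tools/script.py", "core/api.py"),                  # co-change / call
])

pagerank = nx.pagerank(G)
katz = nx.katz_centrality(G, alpha=0.1)

for dev in ("alice", "bob"):
    touched = list(G.successors(dev))
    print(dev,
          "PageRank of touched files:", round(sum(pagerank[f] for f in touched), 3),
          "Katz of touched files:", round(sum(katz[f] for f in touched), 3))
```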

DOI: 10.1145/3611643.3613873


On-Premise AIOps Infrastructure for a Software Editor SME: An Experience Report

作者: Bendimerad, Anes and Remil, Youcef and Mathonat, Romain and Kaytoue, Mehdi
关键词: AI, AIOps, Enterprise Resource Planning, Predictive Maintenance

Abstract

Information Technology has become a critical component in various industries, leading to an increased focus on software maintenance and monitoring. With the complexities of modern software systems, traditional maintenance approaches have become insufficient. The concept of AIOps has emerged to enhance predictive maintenance using Big Data and Machine Learning capabilities. However, exploiting AIOps requires addressing several challenges related to the complexity of data and incident management. Commercial solutions exist, but they may not be suitable for certain companies due to high costs, data governance issues, and limitations in covering private software. This paper investigates the feasibility of implementing on-premise AIOps solutions by leveraging open-source tools. We introduce a comprehensive AIOps infrastructure that we have successfully deployed in our company, and we provide the rationale behind different choices that we made to build its various components. Particularly, we provide insights into our approach and criteria for selecting a data management system and we explain its integration. Our experience can be beneficial for companies seeking to internally manage their software maintenance processes with a modern AIOps approach.

DOI: 10.1145/3611643.3613876


C³: Code Clone-Based Identification of Duplicated Components

作者: Yang, Yanming and Zou, Ying and Hu, Xing and Lo, David and Ni, Chao and Grundy, John and Xia, Xin
关键词: Community Detection Algorithm, Component-level Clone Detection, Component-level Clone Metric

Abstract

Reinventing the wheel is a detrimental programming practice in software development that frequently results in the introduction of duplicated components. This practice not only leads to increased maintenance and labor costs but also poses a higher risk of propagating bugs throughout the system. Despite numerous issues introduced by duplicated components in software, the identification of component-level clones remains a significant challenge that existing studies struggle to effectively tackle. Specifically, existing methods face two primary limitations that are challenging to overcome: 1) Measuring the similarity between different components presents a challenge due to the significant size differences among them; 2) Identifying functional clones is a complex task as determining the primary functionality of components proves to be difficult. To overcome the aforementioned challenges, we present a novel approach named C3 (Component-level Code Clone detector) to effectively identify both textual and functional cloned components. In addition, to enhance the efficiency of eliminating cloned components, we develop an assessment method based on six component-level clone features, which assists developers in prioritizing the cloned components based on the refactoring necessity. To validate the effectiveness of C3, we employ a large-scale industrial product developed by Huawei, a prominent global ICT company, as our dataset and apply C3 to this dataset to identify the cloned components. Our experimental results demonstrate that C3 is capable of accurately detecting cloned components, achieving impressive performance in terms of precision (0.93), recall (0.91), and F1-score (0.9). Besides, we conduct a comprehensive user study to further validate the effectiveness and practicality of our assessment method and the proposed clone features in assessing the refactoring necessity of different cloned components. Our study establishes solid alignment between assessment outcomes and participant responses, indicating the accurate prioritization of clone components with a high refactoring necessity through our method. This finding further confirms the usefulness of the six “golden features” in our assessment.

DOI: 10.1145/3611643.3613883


AdaptivePaste: Intelligent Copy-Paste in IDE

作者: Liu, Xiaoyu and Jang, Jinu and Sundaresan, Neel and Allamanis, Miltiadis and Svyatkovskiy, Alexey
关键词: Code adaptation, Machine learning

Abstract

In software development, it is common for programmers to copy-paste or port code snippets and then adapt them to their use case. This scenario motivates the code adaptation task – a variant of program repair which aims to adapt variable identifiers in a pasted snippet of code to the surrounding, preexisting context. However, no existing approach has been shown to effectively address this task. In this paper, we introduce AdaptivePaste, a learning-based approach to source code adaptation, based on transformers and a dedicated dataflow-aware deobfuscation pre-training task to learn meaningful representations of variable usage patterns. We demonstrate that AdaptivePaste can learn to adapt Python source code snippets with 67.8% exact match accuracy. We study the impact of confidence thresholds on the model predictions, showing the model precision can be further improved to 85.9% with our parallel-decoder transformer model in a selective code adaptation setting. To assess the practical use of AdaptivePaste we perform a user study among Python software developers on real-world copy-paste instances. The results show that AdaptivePaste reduces dwell time to nearly half the time it takes to port code manually, and helps to avoid bugs. In addition, we utilize the participant feedback to identify potential avenues for improvement.

DOI: 10.1145/3611643.3613895


Adapting Performance Analytic Techniques in a Real-World Database-Centric System: An Industrial Experience Report

作者: Liao, Lizhi and Li, Heng and Shang, Weiyi and Sporea, Catalin and Toma, Andrei and Sajedi, Sarah
关键词: Performance analysis, database-centric system, field testing, performance issue, performance regression, root cause analysis

Abstract

Database-centric architectures have been widely adopted in large-scale software systems in various domains to deal with the ever-increasing amount and complexity of data. Prior studies have proposed a wide range of performance analytic techniques aimed at assisting developers in pinpointing software performance inefficiencies and diagnosing performance issues. However, directly applying these existing techniques to large-scale database-centric systems can be challenging and may not perform well due to the unique nature of such systems. In particular, compared to typical database-based systems like online shopping systems, in database-centric systems, a majority of the business logic and calculations reside in the database instead of the application. As the calculations in the database typically use domain-specific languages such as SQL, the performance issues of such systems and their diagnosis may be significantly different from the systems dominated by traditional programming languages such as Java. In this paper, we share our experience of adapting performance analytic techniques in a large-scale database-centric system from our industrial collaborator. Our adapted performance analysis pays special attention to the database and the interactions between the database and the application with minimal reliance on expert knowledge and manual effort. Moreover, we document our encountered challenges and how they are addressed during the development and adoption of our solution in the industrial setting as well as the corresponding lessons learned. We also discuss the real-world performance issues detected by applying our analysis to the target database-centric system. We anticipate that our solution and the reported experience can be helpful for practitioners and researchers who would like to ensure and improve the performance of database-centric systems.

DOI: 10.1145/3611643.3613893


KDDT: Knowledge Distillation-Empowered Digital Twin for Anomaly Detection

作者: Xu, Qinghua and Ali, Shaukat and Yue, Tao and Nedim, Zaimovic and Singh, Inderjeet
关键词: Train Control and Management System, anomaly detection, digital twin, knowledge distillation

Abstract

Cyber-physical systems (CPSs), like train control and management systems (TCMS), are becoming ubiquitous in critical infrastructures. As safety-critical systems, ensuring their dependability during operation is crucial. Digital twins (DTs) have been increasingly studied for this purpose owing to their capability of runtime monitoring and warning, prediction and detection of anomalies, etc. However, constructing a DT for anomaly detection in TCMS necessitates sufficient training data and extracting both chronological and context features with high quality. Hence, in this paper, we propose a novel method named KDDT for TCMS anomaly detection. KDDT harnesses a language model (LM) and a long short-term memory (LSTM) network to extract contexts and chronological features, respectively. To enrich data volume, KDDT benefits from out-of-domain data with knowledge distillation (KD). We evaluated KDDT with two datasets from our industry partner Alstom and obtained the F1 scores of 0.931 and 0.915, respectively, demonstrating the effectiveness of KDDT. We also explored individual contributions of the DT model, LM, and KD to the overall performance of KDDT, via a comprehensive empirical study, and observed average F1 score improvements of 12.4%, 3%, and 6.05%, respectively.

DOI: 10.1145/3611643.3613879


AG3: Automated Game GUI Text Glitch Detection Based on Computer Vision

作者: Liang, Xiaoyun and Qi, Jiayi and Gao, Yongqiang and Peng, Chao and Yang, Ping
关键词: Deep Learning, Software Testing, Visual Test Oracle

Abstract

With the advancement of device software and hardware performance, and the evolution of game engines, an increasing number of emerging high-quality games are captivating game players from all around the world who speak different languages. However, due to the vast fragmentation of the device and platform market, a well-tested game may still experience text glitches when installed on a new device with an unseen screen resolution and system version, which can significantly impact the user experience. In our testing pipeline, current testing techniques for identifying multilingual text glitches are laborious and inefficient. In this paper, we present AG3, which offers intelligent game traversal, precise visual text glitch detection, and integrated quality report generation capabilities. Our empirical evaluation and internal industrial deployment demonstrate that AG3 can detect various real-world multilingual text glitches with minimal human involvement.

DOI: 10.1145/3611643.3613867


Detection Is Better Than Cure: A Cloud Incidents Perspective

作者: Ganatra, Vaibhav and Parayil, Anjaly and Ghosh, Supriyo and Kang, Yu and Ma, Minghua and Bansal, Chetan and Nath, Suman and Mace, Jonathan
关键词: Cloud Services, Empirical Study, Reliability

Abstract

Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impact and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous service reliability.

DOI: 10.1145/3611643.3613898


PropProof: Free Model-Checking Harnesses from PBT

作者: Takashima, Yoshiki
关键词: Formal Verification, Property-Based Testing, Rust

Abstract

Property-based testing (PBT) is often used by Rust developers to test functional correctness properties of their code. Since PBT uses randomized testing, its guarantees are limited: it can detect bugs but provides no formal guarantees of correctness. The Kani Rust Verifier uses the CProver verification framework to verify Rust code, given a specification in a Kani verification harness. However, developers must manually write Kani harnesses while avoiding model-checking-specific pitfalls like large memory usage or timeouts. We introduce PropProof, a library that automatically converts PBT harnesses into Kani harnesses which can be formally validated using Kani. We discuss the data-structure models we developed in order to optimize the performance of these Kani verification harnesses. Using this library, we identified and fixed 2 issues in an AWS-developed protocol-buffer library with nearly 40 million downloads, and PropProof is being used in that library’s CI. Our evaluation on 42 PBT harnesses from top-ranked open-source Rust libraries demonstrates that PropProof enables the use of Kani to verify complex, user-defined properties on existing code with minimal user intervention.

DOI: 10.1145/3611643.3613863


LightF3 Experiment

作者: Dong, Yibo and Zhang, Xiaoyu and Xu, Yicong and Cai, Chang and Chen, Yu and Miao, Weikai and Li, Jianwen and Pu, Geguang
关键词: Interlocking systems, Model Checking

Abstract

The data for our experiments is organized here.

DOI: 10.1145/3611643.3613874


BFSig: Leveraging File Significance in Bus Factor Estimation

作者: Haratian, Vahid and Evtikhiev, Mikhail and Derakhshanfar, Pouria and Tüzün, Eray
关键词: bus factor, dataset, file significance, intelligent collaboration tools, knowledge management, truck factor

Abstract

Software projects experience the departure of developers due to various reasons. As developers are one of the main sources of knowledge in software projects, their absence will inevitably result in a certain degree of knowledge depletion. The Bus Factor (BF) is a metric to evaluate how this knowledge loss can affect the project’s continuity.
Conventionally, BF is calculated as the smallest set of developers whose departure would remove over half of the project knowledge. Current state-of-the-art approaches measure developers’ knowledge by the number of authored files, utilizing version control system (VCS) information. However, numerous studies have shown that files in software projects have different significance.
In this study, we explore how weighting files according to their significance affects the performance of two prevailing BF estimators. We derive significance scores by computing five well-known graph metrics from the project’s dependency graph: PageRank, In-/Out-/All-Degree, and Betweenness Centralities. Furthermore, we introduce BFSig, a prototype of our approach. Finally, we present a new dataset comprising reported BF scores collected by surveying software practitioners from five prominent GitHub repositories.
Our results indicate that BFSig outperforms the baselines by up to an 18% reduction in terms of Normalized Mean Absolute Error (NMAE). Moreover, BFSig yields 18% fewer False Negatives in identifying potential risks associated with a low BF. Besides, our respondents confirmed BFSig’s versatility by showing its ability to assess the BF of a project’s subfolders.
In conclusion, we believe that, to estimate BF from authorship, software components of higher importance should be assigned heavier weights. Currently, BFSig exclusively explores the topological characteristics of these components. Nevertheless, considering attributes such as code complexity and bug proneness could potentially enhance the performance of BFSig.
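To make the weighted-knowledge idea concrete, the sketch below assigns each file a significance weight (e.g., its PageRank in the dependency graph), measures a developer's knowledge as the weight of the files they authored, and greedily removes developers until less than half of the weighted knowledge remains covered. The authorship data, weights, and greedy estimator are toy illustrations; BFSig's actual estimators are more involved.

```python
# Minimal sketch of significance-weighted Bus Factor estimation: BF is the
# smallest set of developers whose removal drops covered (weighted) knowledge
# below half. Data and weights are toy values for illustration.
authorship = {                       # file -> authors who hold its knowledge
    "core/engine.py": {"alice"},
    "core/api.py":    {"alice", "bob"},
    "ui/view.py":     {"carol"},
    "docs/readme.md": {"dave"},
}
significance = {"core/engine.py": 0.45, "core/api.py": 0.35,
                "ui/view.py": 0.15, "docs/readme.md": 0.05}

def covered_weight(remaining_devs):
    """Total significance of files still known by at least one remaining developer."""
    return sum(w for f, w in significance.items()
               if authorship[f] & remaining_devs)

def bus_factor():
    devs = set().union(*authorship.values())
    total = covered_weight(devs)
    removed = []
    # Greedily remove the developer whose departure loses the most weighted knowledge.
    while covered_weight(devs) > total / 2:
        worst = max(devs, key=lambda d: covered_weight(devs) - covered_weight(devs - {d}))
        devs.remove(worst)
        removed.append(worst)
    return len(removed), removed

print(bus_factor())   # e.g., (2, ['alice', 'bob']) for this toy project
```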

DOI: 10.1145/3611643.3613877


Automated Test Generation for Medical Rules Web Services: A Case Study at the Cancer Registry of Norway

作者: Laaber, Christoph and Yue, Tao and Ali, Shaukat and Schwitalla, Thomas and Nygård, Jan F.
关键词: REST APIs, automated software testing, cancer registry, electronic health records, rule engine, test generation

Abstract

The Cancer Registry of Norway (CRN) collects, curates, and manages data related to cancer patients in Norway, supported by an interactive, human-in-the-loop, socio-technical decision support software system. Automated software testing of this software system is inevitable; however, currently, it is limited in CRN’s practice. To this end, we present an industrial case study to evaluate an AI-based system-level testing tool, i.e., EvoMaster, in terms of its effectiveness in testing CRN’s software system. In particular, we focus on GURI, CRN’s medical rule engine, which is a key component at the CRN. We test GURI with EvoMaster’s black-box and white-box tools and study their test effectiveness regarding code coverage, errors found, and domain-specific rule coverage. The results show that all EvoMaster tools achieve a similar code coverage; i.e., around 19% line, 13% branch, and 20% method; and find a similar number of errors; i.e., 1 in GURI’s code. Concerning domain-specific coverage, EvoMaster’s black-box tool is the most effective in generating tests that lead to applied rules; i.e., 100% of the aggregation rules and between 12.86% and 25.81% of the validation rules; and to diverse rule execution results; i.e., 86.84% to 89.95% of the aggregation rules and 0.93% to 1.72% of the validation rules pass, and 1.70% to 3.12% of the aggregation rules and 1.58% to 3.74% of the validation rules fail. We further observe that the results are consistent across 10 versions of the rules. Based on these results, we recommend using EvoMaster’s black-box tool to test GURI since it provides good results and advances the current state of practice at the CRN. Nonetheless, EvoMaster needs to be extended to employ domain-specific optimization objectives to improve test effectiveness further. Finally, we conclude with lessons learned and potential research directions, which we believe are applicable in a general context.

DOI: 10.1145/3611643.3613882


Test Case Generation for Drivability Requirements of an Automotive Cruise Controller: An Experience with an Industrial Simulator

作者: Formica, Federico and Petrunti, Nicholas and Bruck, Lucas and Pantelic, Vera and Lawford, Mark and Menghi, Claudio
关键词: Comfort, Cruise Control, Drivability, Model Development, Search-based Software Testing, Simulink

Abstract

Automotive software development requires engineers to test their systems to detect violations of both functional and drivability requirements. Functional requirements define the functionality of the automotive software. Drivability requirements refer to the driver’s perception of the interactions with the vehicle; for example, they typically require limiting the acceleration and jerk perceived by the driver within given thresholds. While functional requirements are extensively considered by the research literature, drivability requirements garner less attention.
This industrial paper describes our experience assessing the usefulness of an automated search-based software testing (SBST) framework in generating failure-revealing test cases for functional and drivability requirements. We report on our experience with the VI-CarRealTime simulator, an industrial virtual modeling and simulation environment widely used in the automotive domain.
We designed a Cruise Control system in Simulink for a four-wheel vehicle, in an iterative fashion, by producing 21 model versions. We used the SBST framework for each version of the model to search for failure-revealing test cases revealing requirement violations.
Our results show that the SBST framework successfully identified a failure-revealing test case for 66.7% of our model versions, requiring, on average, 245.9s and 3.8 iterations. We present lessons learned, reflect on the generality of our results, and discuss how our results improve the state of practice.

DOI: 10.1145/3611643.3613894


Prioritizing Natural Language Test Cases Based on Highly-Used Game Features

作者: Viggiato, Markos and Paas, Dale and Bezemer, Cor-Paul
关键词: Feature usage, Multi-objective genetic algorithm, Test case prioritization, Zero- shot classification

Abstract

Software testing is still a manual activity in many industries, such as the gaming industry. But manually executing tests becomes impractical as the system grows and resources are restricted, mainly in a scenario with short release cycles. Test case prioritization is a commonly used technique to optimize the test execution. However, most prioritization approaches do not work for manual test cases as they require source code information or test execution history, which is often not available in a manual testing scenario. In this paper, we propose a prioritization approach for manual test cases written in natural language based on the tested application features (in particular, highly-used application features). Our approach consists of (1) identifying the tested features from natural language test cases (with zero-shot classification techniques) and (2) prioritizing test cases based on the features that they test. We leveraged the NSGA-II genetic algorithm for the multi-objective optimization of the test case ordering to maximize the coverage of highly-used features while minimizing the cumulative execution time. Our findings show that we can successfully identify the application features covered by test cases using an ensemble of pre-trained models with strong zero-shot capabilities (an F-score of 76.1%). Also, our prioritization approaches can find test case orderings that cover highly-used application features early in the test execution while keeping the time required to execute test cases short. QA engineers can use our approach to focus the test execution on test cases that cover features that are relevant to users.

DOI: 10.1145/3611643.3613872


EvoCLINICAL: Evolving Cyber-Cyber Digital Twin with Active Transfer Learning for Automated Cancer Registry System

作者: Lu, Chengjie and Xu, Qinghua and Yue, Tao and Ali, Shaukat and Schwitalla, Thomas and Nygård, Jan F.
关键词: active learning, cyber-cyber digital twin, digital twin, neural network, transfer learning, validation system

Abstract

The Cancer Registry of Norway (CRN) collects information on cancer patients by receiving cancer messages from different medical entities (e.g., medical labs, hospitals) in Norway. Such messages are validated by an automated cancer registry system: GURI. Its correct operation is crucial since it lays the foundation for cancer research and provides critical cancer-related statistics to its stakeholders. Constructing a cyber-cyber digital twin (CCDT) for GURI can facilitate various experiments and advanced analyses of the operational state of GURI without requiring intensive interactions with the real system. However, GURI constantly evolves due to novel medical diagnostics and treatment, technological advances, etc. Accordingly, the CCDT should evolve as well to synchronize with GURI. A key challenge of achieving such synchronization is that evolving the CCDT needs abundant data labelled by the new GURI. To tackle this challenge, we propose EvoCLINICAL, which considers the CCDT developed for the previous version of GURI as the pretrained model and fine-tunes it with the dataset labelled by querying a new GURI version. EvoCLINICAL employs a genetic algorithm to select an optimal subset of cancer messages from a candidate dataset and query GURI with it. We evaluate EvoCLINICAL on three evolution processes. The precision, recall, and F1 score are all greater than 91%, demonstrating the effectiveness of EvoCLINICAL. Furthermore, we replace the active learning part of EvoCLINICAL with random selection to study the contribution of active learning to the overall performance of EvoCLINICAL. Results show that employing active learning in EvoCLINICAL consistently improves its performance.

DOI: 10.1145/3611643.3613897


Compositional Taint Analysis for Enforcing Security Policies at Scale

作者: Banerjee, Subarno and Cui, Siwei and Emmi, Michael and Filieri, Antonio and Hadarean, Liana and Li, Peixuan and Luo, Linghui and Piskachev, Goran and Rosner, Nicolás
关键词: software security, static analysis in the cloud, taint analysis

Abstract

Automated static dataflow analysis is an effective technique for detecting security critical issues like sensitive data leak, and vulnerability to injection attacks. Ensuring high precision and recall requires an analysis that is context, field and object sensitive. However, it is challenging to attain high precision and recall and scale to large industrial code bases. Compositional style analyses in which individual software components are analyzed separately, independent from their usage contexts, compute reusable summaries of components. This is an essential feature when deploying such analyses in CI/CD at code-review time or when scanning deployed container images. In both these settings the majority of software components stay the same between subsequent scans. However, it is not obvious how to extend such analyses to check the kind of contextual taint specifications that arise in practice, while maintaining compositionality.

In this work we present contextual dataflow modeling, an extension to the compositional analysis that checks complex taint specifications and significantly increases recall and precision. Furthermore, we show how such a high-fidelity analysis can scale in production using three key optimizations: (i) discarding intermediate results for previously-analyzed components, an optimization exploiting the compositional nature of our analysis; (ii) a scope-reduction analysis to reduce the scope of the taint analysis w.r.t. the taint specifications being checked; and (iii) caching of analysis models. We show a 9.85% reduction in false positive rate on a comprehensive test suite comprising the OWASP open-source benchmarks as well as internal real-world code samples. We measure the performance and scalability impact of each individual optimization using open source JVM packages from the Maven central repository and internal AWS service codebases. This combination of high precision, recall, performance, and scalability has allowed us to enforce security policies at scale both internally within Amazon as well as for external customers.

DOI: 10.1145/3611643.3613889


A Multidimensional Analysis of Bug Density in SAP HANA

作者: Reck, Julian and Bach, Thomas and Stoess, Jan
关键词: bug density, database, empirical study, software quality

Abstract

Researchers and practitioners have been studying correlations between software metrics and defects for decades. The typical approach is to postulate a hypothesis that a certain metric correlates with the number of defects. A statistical test then utilizes historical data to accept or reject the hypothesis. Although this methodology has been widely adopted, our own experience is that such correlations are often limited in their practical relevance, particularly for large industrial projects: Interpreting and arguing about them is challenging and cumbersome; the difference between correlation and causation might not be clear; and the practical impact of a correlation is often questioned due to misconceptions between a statistical conclusion and the impact on singular events.
Instead of discussing correlations, we found that the analysis for binary testedness results in more fruitful discussions. Binary testedness, as proposed by prior work, utilizes a metric to divide the source code into two parts and verifies whether more (or less) defects appear in each part than expected. In our work, we leverage the binary testedness approach and analyze several software metrics for a large industrial project to illustrate the concept. We furthermore introduce dynamic thresholds as a novel and more practical approach for source code classification compared to the static binary classification of previous works. Our results show that some studied metrics have a significant correlation with bug distribution, but effect sizes differ by several magnitudes across metrics. Overall, our approach moves away from “metric X correlates with defects” to a more fruitful “source code with attribute X has more (or less) bugs than expected”, reframing the discussion from questioning statistics and methods towards an evidence-based root cause analysis.
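
As a toy illustration of binary testedness (not the SAP HANA analysis itself), the sketch below splits files on a metric threshold and checks whether the high-metric part carries more bugs than its share of code would predict; all file data is invented.

```python
# Toy illustration of binary testedness: partition files by a metric threshold
# and compare observed bug counts in each part with the counts expected if
# bugs were spread proportionally to lines of code. All numbers are invented.
files = [
    # (name, metric value e.g. cyclomatic complexity, lines of code, bug fixes)
    ("parser.cc",   42, 1200, 9),
    ("planner.cc",  35,  900, 6),
    ("executor.cc", 12,  700, 1),
    ("util.cc",      5,  400, 1),
    ("log.cc",       3,  300, 0),
]

threshold = 20  # a dynamic threshold would be derived from the metric's distribution
high = [f for f in files if f[1] >= threshold]
low  = [f for f in files if f[1] <  threshold]

total_loc  = sum(f[2] for f in files)
total_bugs = sum(f[3] for f in files)

for label, part in (("metric >= threshold", high), ("metric <  threshold", low)):
    loc  = sum(f[2] for f in part)
    bugs = sum(f[3] for f in part)
    expected = total_bugs * loc / total_loc   # bugs expected under proportionality
    print(f"{label}: {bugs} bugs observed vs {expected:.1f} expected "
          f"({bugs / expected:.2f}x)")
```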

DOI: 10.1145/3611643.3613875


Ownership in the Hands of Accountability at Brightsquid: A Case Study and a Developer Survey

作者: Koana, Umme Ayman and Chew, Francis and Carlson, Chris and Nayebi, Maleknaz
关键词: Accountability, Ownership, Software Engineering, Software Quality

Abstract

The COVID-19 pandemic has accelerated the adoption of digital health solutions. This has presented significant challenges for software development teams to swiftly adjust to market needs and demands. To address these challenges, product management teams have had to adapt their approach to software development, reshaping their processes to meet the demands of the pandemic. Brightsquid implemented a new task assignment process aimed at enhancing developer accountability toward the customer. To assess the impact of this change on code ownership, we conducted a code change analysis. Additionally, we surveyed 67 developers to investigate the relationship between accountability and ownership more broadly. The findings of our case study indicate that the revised assignment model not only increased the perceived sense of accountability within the production team but also improved code resilience against ownership changes. Moreover, the survey results revealed that a majority of the participating developers (67.5%) associated perceived accountability with artifact ownership.

DOI: 10.1145/3611643.3613890


Rotten Green Tests in Google Test

作者: Robinson, Paul T.
关键词: C++ testing, Google Test, Software testing

Abstract

Executable unit tests are a key component of many software engineering methodologies. A “green” test is one that is reported as passing (successfully testing some software feature). However, it is common for a test harness to assume that a test has passed when, in fact, it has merely not reported a failure. In this gap, where the “excluded middle” lives and thrives, we find the Rotten Green Test: A test that looks like it does something useful, but in fact does not.

Google Test, a popular open-source test framework for C++ software, has been enhanced to detect rotten green tests. This enhancement follows the lead of similar work done for the Pharo language, but in a framework more applicable in industry, and with no requirement for test modifications or an external tool. The enhanced Google Test has detected 183 rotten assertions in the LLVM project’s unit tests, and even found one in Google Test’s own internal test suite. The enhancement may report false positives from parameterized tests where assertions are conditioned on the parameter, and currently does not detect rotten assertions in helper methods.
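
The general pattern is easy to reproduce outside Google Test; the pytest-style sketch below (an analogy, not the enhanced Google Test itself) shows a test that reports green even though its only assertion is never executed.

```python
# A "rotten green" test, illustrated with pytest conventions rather than
# Google Test: the test passes, but its assertion is hidden behind a loop
# that never runs, so nothing is actually checked.
def parse_port(text):
    return int(text)

def test_parse_port_positive():
    samples = []                        # bug in the test: the list stays empty
    for value in samples:               # loop body never executes
        assert parse_port(value) > 0    # rotten assertion: looks useful, never runs
    # The test harness reports this test as passing although it verified nothing.
```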

DOI: 10.1145/3611643.3613865


Issue Report Validation in an Industrial Context

作者: Aktas, Ethem Utku and Cakmak, Ebru and Inan, Mete Cihad and Yilmaz, Cemal
关键词: automated issue classification, issue report validation, text analysis

Abstract

Effective issue triaging is crucial for software development teams to improve software quality, and thus customer satisfaction. Validating issue reports manually can be time-consuming, hindering the overall efficiency of the triaging process. This paper presents an approach to automating the validation of issue reports to accelerate the issue triaging process in an industrial setting. We work on 1,200 randomly selected issue reports in the banking domain, written in Turkish, an agglutinative language, meaning that new words can be formed by linearly concatenating suffixes to express entire sentences. We manually label these reports for validity and extract the relevant patterns indicating that they are invalid. Since the issue reports we work on are written in an agglutinative language, we use morphological analysis to extract the features. Using the proposed feature extractors, we apply a machine-learning-based approach to predict the issue reports’ validity, achieving a 0.77 F1-score.
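
A minimal sketch of this kind of validity prediction is shown below. In the paper the features come from Turkish morphological analysis; here a plain TF-IDF tokenizer stands in for that step, and the example reports and labels are invented.

```python
# Sketch of issue-report validity prediction: TF-IDF features (standing in for
# morphological features) plus a linear classifier, evaluated with F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

reports = [
    "money transfer screen crashes after login",   # valid issue
    "please call me back about my account",        # invalid: not an issue
    "statement pdf download returns error 500",    # valid issue
    "how do i change my phone number",             # invalid: a question
] * 25                                             # repeat to get a toy corpus
labels = [1, 0, 1, 0] * 25

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
split = int(0.8 * len(reports))
clf.fit(reports[:split], labels[:split])
predictions = clf.predict(reports[split:])
print("F1 =", f1_score(labels[split:], predictions))
```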

DOI: 10.1145/3611643.3613887


On the Dual Nature of Necessity in Use of Rust Unsafe Code

作者: Zhang, Yuchen and Kundu, Ashish and Portokalidis, Georgios and Xu, Jun
关键词: Rust Security, Software Engineering, Unsafe Code

Abstract

Rust offers both safety guarantees and high performance. Thus, it has gained significant popularity in the industry. To extend its capability as a system programming language, Rust allows unsafe blocks where the execution has low-level controls but loses the safety guarantees. In principle, unsafe blocks should only be used when necessary. However, preliminary evidence shows a different situation. This paper aims to establish a deeper view of this matter and bring endeavors toward improvement.

We first present a study on the use of unsafe Rust in practice. We manually inspected 5946 unsafe blocks from 140 popular libraries and applications, focusing on whether the use of unsafe code is necessary (precisely, whether they have safe alternatives). The study unveils hundreds of instances of unnecessary unsafe Rust code and provides a taxonomy together with detailed analyses. These results complement our understanding and offer insights for the community to make a change.

Following the study, we further summarize nine popular patterns of unnecessary unsafe blocks and design an IDE plugin to auto-suggest their safe alternatives. Applied to 140 buggy unsafe blocks from the RustSec Advisory Database, the plugin identifies and offers safe versions to remove the bug for 28.6% of all cases.

DOI: 10.1145/3611643.3613878


Analyzing Microservice Connectivity with Kubesonde

作者: Bufalino, Jacopo and Di Francesco, Mario and Aura, Tuomas
关键词: Kubernetes, Microservices, network connectivity, network security

Abstract

Modern cloud-based applications are composed of several microservices that interact over a network. They are complex distributed systems, to the point that developers may not even be aware of how microservices connect to each other and to the Internet. As a consequence, the security of these applications can be greatly compromised. This work explicitly targets this context by providing a methodology to assess microservice connectivity, a software tool that implements it, and findings from analyzing real cloud applications. Specifically, it introduces Kubesonde, a cloud-native software that instruments live applications running on a Kubernetes cluster to analyze microservice connectivity, with minimal impact on performance. An assessment of microservices in 200 popular cloud applications with Kubesonde revealed significant issues in terms of network isolation: more than 60% of them had discrepancies between their declared and actual connectivity, and none restricted outbound connections towards the Internet. Our analysis shows that Kubesonde offers valuable insights on the connectivity between microservices, beyond what is possible with existing tools.

DOI: 10.1145/3611643.3613899


Testing Real-World Healthcare IoT Application: Experiences and Lessons Learned

作者: Sartaj, Hassan and Ali, Shaukat and Yue, Tao and Moberg, Kjetil
关键词: Black-box Testing, Experience Report, Healthcare Internet of Things (IoT), REST APIs

Abstract

Healthcare Internet of Things (IoT) applications require rigorous testing to ensure their dependability. Such applications are typically integrated with various third-party healthcare applications and medical devices through REST APIs. This integrated network of healthcare IoT applications leads to REST APIs with complicated and interdependent structures, thus creating a major challenge for automated system-level testing. We report an industrial evaluation of a state-of-the-art REST APIs testing approach (RESTest) on a real-world healthcare IoT application. We analyze the effectiveness of RESTest’s testing strategies regarding REST APIs failures, faults in the application, and REST API coverage, by experimenting with six REST APIs of 41 API endpoints of the healthcare IoT application. Results show that several failures are discovered in different REST APIs with ≈56% coverage using RESTest. Moreover, nine potential faults are identified. Using the evidence collected from the experiments, we provide our experiences and lessons learned.
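
For readers unfamiliar with black-box REST API testing, the sketch below shows the bare skeleton of such a campaign: call endpoints, record server-side failures, and tally endpoint coverage. The base URL and endpoint list are hypothetical; RESTest itself additionally derives parameter values and test strategies from the OpenAPI specification.

```python
# Generic sketch of black-box REST API probing: exercise each endpoint,
# record 5xx failures, and report how many endpoints were reached.
import requests

BASE = "https://iot.example.org/api"        # hypothetical healthcare IoT API
endpoints = ["/patients", "/devices", "/measurements", "/alerts"]

failures, reached = [], set()
for path in endpoints:
    try:
        response = requests.get(BASE + path, timeout=5)
        reached.add(path)
        if response.status_code >= 500:     # server-side failure worth reporting
            failures.append((path, response.status_code))
    except requests.RequestException as exc:
        failures.append((path, str(exc)))

print(f"reached {len(reached)}/{len(endpoints)} endpoints, "
      f"{len(failures)} failing calls: {failures}")
```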

DOI: 10.1145/3611643.3613888


Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365

作者: Yang, Fangkai and Yin, Wenjie and Wang, Lu and Li, Tianci and Zhao, Pu and Liu, Bo and Wang, Paul and Qiao, Bo and Liu, Yudong and Björkman, Mårten
关键词: Diffusion model, disk failure prediction, missing data imputation

Abstract

Ensuring reliability in large-scale cloud systems like Microsoft 365 is crucial. Cloud failures, such as disk and node failure, threaten service reliability, causing service interruptions and financial loss. Existing works focus on failure prediction and proactively taking action before failures happen. However, they suffer from poor data quality, like data missing in model training and prediction, which limits performance. In this paper, we focus on enhancing data quality through data imputation by the proposed Diffusion+, a sample-efficient diffusion model, to impute the missing data efficiently conditioned on the observed data. Experiments with industrial datasets and application practice show that our model contributes to improving the performance of downstream failure prediction.

DOI: 10.1145/3611643.3613866


The Most Agile Teams Are the Most Disciplined: On Scaling out Agile Development

作者: Li, Zheng and Rainer, Austen
关键词: agile at scale, agile development, disciplined agile teams, scaling out agility, targeted strategy

Abstract

As one of the next frontiers of software engineering, agile development at scale has attracted more and more research interest and effort. When following the existing autonomy-focused and goal-driven lessons and guidelines to scale agile development for a large astronomy project, however, we encountered surprising tech stack sprawl and spreading team coordination issues. By revisiting the unique features of our project (e.g., the data processing-intensive nature and the frequent team member changes), and by identifying a fractal pattern from various data processing logic and processes, we defined disciplined agile teams to clone the best practices of pioneer agile teams, and to work on similar system modules with similar user stories. Such a targeted strategy effectively relieved the tech stack sprawl and facilitated teamwork handover, at least for refactoring and growing the data processing modules in our project. Based on this emerging result and our reflections, we distinguish this targeted strategy as scaling out agile development from the existing agile scaling approaches that are generally in a scaling-up fashion. Considering the popularity of data processing-intensive projects, and also considering the pervasive fractal patterns in modern businesses and organisations, we claim that this targeted strategy still has broad application opportunities. Therefore, developing a well-defined methodology for scaling out agility, and combining both scaling up and scaling out agility, will deserve attention and new research efforts in the future.

DOI: 10.1145/3611643.3613886


Contribution-Based Firing of Developers?

作者: Orrei, Vincenzo and Raglianti, Marco and Nagy, Csaba and Lanza, Michele
关键词: Contribution, developer performance, mining, pony factor

Abstract

There has been some recent clamor about the developer layoff and turnover policies enacted by high-profile corporate executives. Precisely defining contributions in software development has always been a thorny issue, as it is difficult to establish a developer’s “performance” without resorting to guesswork, due to how software development works and how Git persists history.
Taking inspiration from a seemingly informal notion, the pony factor, we present an approach to identify the key developers in a software project. We present an analysis of 1,011 GitHub repositories, providing fact-based reflections on development contributions.
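
The pony factor has a simple operational reading: the smallest number of developers whose commits together cover at least half of a repository's history. A minimal sketch of computing it, assuming a local git checkout, is shown below (the real analysis additionally handles aliases, bots, and merge commits).

```python
# Minimal sketch of computing a repository's "pony factor": the smallest set
# of committers that together account for at least 50% of all commits.
# Assumes it is run inside a local git checkout.
import subprocess
from collections import Counter

log = subprocess.run(["git", "log", "--format=%an"],
                     capture_output=True, text=True, check=True)
commits_by_author = Counter(line for line in log.stdout.splitlines() if line)

total = sum(commits_by_author.values())
covered, pony_factor = 0, 0
for author, count in commits_by_author.most_common():
    covered += count
    pony_factor += 1
    if covered * 2 >= total:          # reached 50% of all commits
        break

print(f"{total} commits, {len(commits_by_author)} authors, pony factor = {pony_factor}")
```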

DOI: 10.1145/3611643.3613085


Keeping Mutation Test Suites Consistent and Relevant with Long-Standing Mutants

作者: Ojdanic, Milos and Papadakis, Mike and Harman, Mark
关键词: Continuous Integration, Evolving Systems, Mutation Testing, Software Testing, Test Adequacy

Abstract

Mutation testing has been demonstrated to be one of the most powerful fault-revealing tools in the tester’s tool kit. Much previous work implicitly assumed it to be sufficient to re-compute mutant suites per release. Sadly, this makes mutation results inconsistent; mutant scores from each release cannot be directly compared, making it harder to measure test improvement. Furthermore, regular code change means that a mutant suite’s relevance will naturally degrade over time.
We measure this degradation in relevance for 143,500 mutants in 4 non-trivial systems, finding that 52% degrade, on average. We introduce a mutant brittleness measure and use it to audit software systems and their mutation suites. We also demonstrate how consistent-by-construction long-standing mutant suites can be identified with a 10x improvement in mutant relevance over an arbitrary test suite. Our results indicate that the research community should avoid the re-computation of mutant suites and focus, instead, on long-standing mutants, thereby improving the consistency and relevance of mutation testing.

DOI: 10.1145/3611643.3613089


Towards Top-Down Automated Development in Limited Scopes: A Neuro-Symbolic Framework from Expressibles to Executables

作者: Gu, Jian and Gall, Harald C.
关键词: code generation, deep learning, frame semantics, knowledge graph, program representation, requirement elicitation, software analytics

Abstract

Deep code generation is a topic of deep learning for software engineering (DL4SE), which adopts neural models to generate code for the intended functions. Since end-to-end neural methods lack domain knowledge and software hierarchy awareness, they tend to perform poorly w.r.t project-level tasks. To systematically explore the potential improvements of code generation, we let it participate in the whole top-down development from expressibles to executables, which is possible in limited scopes. In the process, it benefits from massive samples, features, and knowledge. As the foundation, we suggest building a taxonomy on code data, namely code taxonomy, leveraging the categorization of code information. Moreover, we introduce a three-layer semantic pyramid (SP) to associate text data and code data. It identifies the information of different abstraction levels, and thus introduces the domain knowledge on development and reveals the hierarchy of software. Furthermore, we propose a semantic pyramid framework (SPF) as the approach, focusing on software of high modularity and low complexity. SPF divides the code generation process into stages and reserves spots for potential interactions. In addition, we conceived preliminary applications in software development to confirm the neuro-symbolic framework.

DOI: 10.1145/3611643.3613076


Lessons from the Long Tail: Analysing Unsafe Dependency Updates across Software Ecosystems

作者: Wattanakriengkrai, Supatsara and Kula, Raula Gaikovina and Treude, Christoph and Matsumoto, Kenichi
关键词: Libraries, Software Ecosystems, Supply Chain

Abstract

A risk in adopting third-party dependencies into an application is their potential to serve as a doorway for malicious code to be injected (most often unknowingly). While many initiatives from both industry and research communities focus on the most critical dependencies (i.e., those most depended upon within the ecosystem), little is known about whether the rest of the ecosystem suffers the same fate. Our vision is to promote and establish safer practices throughout the ecosystem. To motivate our vision, in this paper, we present preliminary data based on three representative samples from a population of 88,416 pull requests (PRs) and identify unsafe dependency updates (i.e., any pull request that risks being unsafe during runtime), which clearly shows that unsafe dependency updates are not limited to highly impactful libraries. To draw attention to the long tail, we propose a research agenda comprising six key research questions that further explore how to safeguard against these unsafe activities. This includes developing best practices to address unsafe dependency updates not only in top-tier libraries but throughout the entire ecosystem.

DOI: 10.1145/3611643.3613086


Getting pwn’d by AI: Penetration Testing with Large Language Models

作者: Happe, Andreas and Cito, Jürgen
关键词: large language models, penetration testing, security testing

Abstract

The field of software security testing, more specifically penetration testing, requires high levels of expertise and involves many manual testing and analysis steps. This paper explores the potential use of large language models, such as GPT-3.5, to augment penetration testers with AI sparring partners. We explore two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. For the latter, we implemented a closed feedback loop between LLM-generated low-level actions and a vulnerable virtual machine (connected through SSH) and allowed the LLM to analyze the machine state for vulnerabilities and suggest concrete attack vectors, which were automatically executed within the virtual machine. We discuss promising initial results, detail avenues for improvement, and close by deliberating on the ethics of AI sparring partners.
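
A minimal sketch of such a closed feedback loop is given below. `suggest_next_command` is a hypothetical stand-in for the actual chat-model call, and the host address and credentials are placeholders for a deliberately vulnerable training VM; only the execute-and-feed-back structure is illustrated.

```python
# Minimal sketch of the closed feedback loop: a model proposes a shell command,
# it is executed on the vulnerable VM over SSH, and the output becomes context
# for the next suggestion. suggest_next_command is a stand-in for the LLM call.
import paramiko

def suggest_next_command(history):
    # Stand-in for the LLM: the real loop would send `history` to a chat model
    # and parse a single shell command out of the reply.
    canned = ["id", "uname -a", "sudo -l", "find / -perm -4000 -type f 2>/dev/null"]
    return canned[len(history)] if len(history) < len(canned) else None

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("192.0.2.10", username="lowpriv", password="trainingbox")  # placeholder VM

history = []
while (command := suggest_next_command(history)) is not None:
    _, stdout, stderr = ssh.exec_command(command)
    output = stdout.read().decode() + stderr.read().decode()
    history.append((command, output))        # fed back as context for the next prompt
    print(f"$ {command}\n{output}")

ssh.close()
```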

DOI: 10.1145/3611643.3613083


Towards Feature-Based Analysis of the Machine Learning Development Lifecycle

作者: Hu, Boyue Caroline and Chechik, Marsha
关键词: Features, Machine Learning, Software Analysis

Abstract

The safety and trustworthiness of systems with components that are based on Machine Learning (ML) require an in-depth understanding and analysis of all stages in its Development Lifecycle (MLDL). High-level abstractions of desired functionalities, model behaviour, and data are called features, and they have been studied by different communities across all MLDL stages. In this paper, we propose to support Software Engineering analysis of the MLDL through features, calling it feature-based analysis of the MLDL. First, to achieve a shared understanding of features among different experts, we establish a taxonomy of existing feature definitions currently used in various MLDL stages. Through this taxonomy, we map features from different stages to each other, discover gaps and future research directions and identify areas of collaboration between Software Engineering and other MLDL experts.

DOI: 10.1145/3611643.3613082


Exploring Moral Principles Exhibited in OSS: A Case Study on GitHub Heated Issues

作者: Ehsani, Ramtin and Rezapour, Rezvaneh and Chatterjee, Preetha
关键词: moral principles, open source, textual analysis, toxicity

Abstract

To foster collaboration and inclusivity in Open Source Software (OSS) projects, it is crucial to understand and detect patterns of toxic language that may drive contributors away, especially those from underrepresented communities. Although machine learning-based toxicity detection tools trained on domain-specific data have shown promise, their design lacks an understanding of the unique nature and triggers of toxicity in OSS discussions, highlighting the need for further investigation. In this study, we employ Moral Foundations Theory to examine the relationship between moral principles and toxicity in OSS. Specifically, we analyze toxic communications in GitHub issue threads to identify and understand five types of moral principles exhibited in text, and explore their potential association with toxic behavior. Our preliminary findings suggest a possible link between moral principles and toxic comments in OSS communications, with each moral principle associated with at least one type of toxicity. The potential of MFT in toxicity detection warrants further investigation.

DOI: 10.1145/3611643.3613077


Towards Understanding Emotions in Informal Developer Interactions: A Gitter Chat Study

作者: Sajadi, Amirali and Damevski, Kostadin and Chatterjee, Preetha
关键词: emotion analysis, software developer chats

Abstract

Emotions play a significant role in teamwork and collaborative activities like software development. While researchers have analyzed developer emotions in various software artifacts (e.g., issues, pull requests), few studies have focused on understanding the broad spectrum of emotions expressed in chats. As one of the most widely used means of communication, chats contain valuable information in the form of informal conversations, such as negative perspectives about adopting a tool. In this paper, we present a dataset of developer chat messages manually annotated with a wide range of emotion labels (and sub-labels), and analyze the type of information present in those messages. We also investigate the unique signals of emotions specific to chats and distinguish them from other forms of software communication. Our findings suggest that chats have fewer expressions of Approval and Fear but more expressions of Curiosity compared to GitHub comments. We also notice that Confusion is frequently observed when discussing programming-related information such as unexpected software behavior. Overall, our study highlights the potential of mining emotions in developer chats for supporting software maintenance and evolution tools.

DOI: 10.1145/3611643.3613084


Towards Strengthening Formal Specifications with Mutation Model Checking

作者: Cordy, Maxime and Lazreg, Sami and Legay, Axel and Schobbens, Pierre Yves
关键词: LTL, Model checking, Mutation

Abstract

We propose mutation model checking as an approach to strengthen formal specifications used for model checking. Inspired by mutation testing, our approach concludes that specifications are not strong enough if they fail to detect faults in purposely mutated models. Our preliminary experiments on two case studies confirm the relevance of the problem: their specification can only detect 40% and 60% of randomly generated mutants. As a result, we propose a framework to strengthen the original specification, such that the original model satisfies the strengthened specification but the mutants do not.

DOI: 10.1145/3611643.3613080


Assisting Static Analysis with Large Language Models: A ChatGPT Experiment

作者: Li, Haonan and Hao, Yu and Zhai, Yizhuo and Qian, Zhiyun
关键词: bug detection, large language model, static analysis

Abstract

Recent advances of Large Language Models (LLMs), e.g., ChatGPT, exhibited strong capabilities of comprehending and responding to questions across a variety of domains. Surprisingly, ChatGPT even possesses a strong understanding of program code. In this paper, we investigate where and how LLMs can assist static analysis by asking appropriate questions. In particular, we target a specific bug-finding tool, which produces many false positives from the static analysis. In our evaluation, we find that these false positives can be effectively pruned by asking carefully constructed questions about function-level behaviors or function summaries. Specifically, with a pilot study of 20 false positives, we can successfully prune 8 out of 20 based on GPT-3.5, whereas GPT-4 had a near-perfect result of 16 out of 20, where the four failed ones are not currently considered/supported by our questions, e.g., involving concurrency. Additionally, it also identified one false negative case (a missed bug). We find LLMs a promising tool that can enable a more effective and efficient program analysis.

DOI: 10.1145/3611643.3613078


Reflecting on the Use of the Policy-Process-Product Theory in Empirical Software Engineering

作者: Kalu, Kelechi G. and Schorlemmer, Taylor R. and Chen, Sophie and Robinson, Kyle A. and Kocinare, Erik and Davis, James C.
关键词: Empirical Software Engineering, Software Process and Policy

Abstract

The primary theory of software engineering is that an organization’s Policies and Processes influence the quality of its Products. We call this the PPP Theory. Although empirical software engineering research has grown common, it is unclear whether researchers are trying to evaluate the PPP Theory. To assess this, we analyzed half (33) of the empirical works published over the last two years in three prominent software engineering conferences. In this sample, 70% focus on policies/processes or products, not both. Only 33% provided measurements relating policy/process and products. We make four recommendations: (1) Use PPP Theory in study design; (2) Study feedback relationships; (3) Diversify the studied feed-forward relationships; and (4) Disentangle policy and process. Let us remember that research results are in the context of, and with respect to, the relationship between software products, processes, and policies.

DOI: 10.1145/3611643.3613075


A Vision on Intentions in Software Engineering

作者: Krüger, Jacob
关键词: intention, quality assurance, software evolution

Abstract

Intentions are fundamental in software engineering, but they are typically only implicitly considered through different abstractions, such as requirements, use cases, features, or issues. Specifically, software engineers develop and evolve (i.e., change) a software system based on such abstractions of a stakeholder’s intention—something a stakeholder wants the system to be able to do. Unfortunately, existing abstractions are (inherently) limited when it comes to representing stakeholder intentions and are mostly used for documenting only. So, whether a change in a system fulfills its underlying intention (and only this one) is an essential problem in practice that motivates many research areas (e.g., testing to ensure intended behavior, untangling intentions in commits). We argue that none of the existing abstractions is ideal for capturing intentions and controlling software evolution, which is why intentions are often vague and must be recovered, untangled, or understood in retrospect. In this paper, we reflect on the role of intentions (represented by changes) in software engineering and sketch how improving their management may support developers. Particularly, we argue that continuously managing and controlling intentions as well as their fulfillment has the potential to improve the reasoning about which stakeholder requests have been addressed, avoid misunderstandings, and prevent expensive retrospective analyses. To guide future research for achieving such benefits for researchers and practitioners, we discuss the relationships between different abstractions and intentions, and propose steps towards managing intentions.

DOI: 10.1145/3611643.3613087


Deeper Notions of Correctness in Image-Based DNNs: Lifting Properties from Pixel to Entities

作者: Toledo, Felipe and Shriver, David and Elbaum, Sebastian and Dwyer, Matthew B.
关键词: Neural networks, fairness, properties, validation, verification

Abstract

Deep Neural Networks (DNNs) that process images are being widely used for many safety-critical tasks, from autonomous vehicles to medical diagnosis. Currently, DNN correctness properties are defined at the pixel level over the entire input. Such properties are useful to expose system failures related to sensor noise or adversarial attacks, but they cannot capture features that are relevant to domain-specific entities and reflect richer types of behaviors. To overcome this limitation, we envision the specification of properties based on the entities that may be present in image input, capturing their semantics and how they change. Creating such properties today is difficult as it requires determining where the entities appear in images, defining how each entity can change, and writing a specification that is compatible with each particular V&V client. We introduce an initial framework structured around those challenges to assist in the generation of Domain-specific Entity-based properties automatically by leveraging object detection models to identify entities in images and creating properties based on entity features. Our feasibility study provides initial evidence that the new properties can uncover interesting system failures, such as changes in skin color modifying the output of a gender classification network. We conclude by analyzing the framework’s potential to implement the vision and by outlining directions for future work.

DOI: 10.1145/3611643.3613079


LazyCow: A Lightweight Crowdsourced Testing Tool for Taming Android Fragmentation

作者: Sun, Xiaoyu and Chen, Xiao and Liu, Yonghui and Grundy, John and Li, Li
关键词: Android Fragmentation, Crowdsourced Testing

Abstract

Android fragmentation refers to the increasing variety of Android devices and operating system versions. Their sheer number makes it impossible to test an app on every supported device, resulting in many device compatibility issues and leading to poor user experiences. To mitigate this, a number of works that automatically detect compatibility issues have been proposed. However, current state-of-the-art techniques can only be used to detect specific types of compatibility issues (i.e., compatibility issues caused by API signature evolution), leaving many other essential categories of compatibility issues undetected. For instance, customised OS versions on real devices and semantic OS modifications could result in severe compatibility issues that are difficult to detect statically. In order to address this research gap and facilitate the prospect of taming Android fragmentation through crowdsourced efforts, we propose LazyCow, a novel, lightweight, crowdsourced testing tool. Our experimental results involving thousands of test cases on real Android devices demonstrate that LazyCow is effective at autonomously identifying and validating API-induced compatibility issues. The source code of both the client side and the server side is publicly available in our artifact package. A demo video of our tool is available at https://www.youtube.com/watch?v=_xzWv_mo5xQ.

DOI: 10.1145/3611643.3613098


npm-follower: A Complete Dataset Tracking the NPM Ecosystem

作者: Pinckney, Donald and Cassano, Federico and Guha, Arjun and Bell, Jonathan
关键词: JavaScript, NPM, archiving, data mining, dependency-management

Abstract

Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that provide researchers with easy access to metadata and code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM can not be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This data is critical for researchers as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture which archives metadata and code of all packages and versions as they are published, and is thus able to retain data which is later deleted. The dataset currently includes over 35 million versions of packages, and grows at a rate of about 1 million versions per month. The dataset is designed to be easily used by researchers answering questions involving either metadata or program analysis. Both the code and dataset are available at: https://dependencies.science

DOI: 10.1145/3611643.3613094


Ad Hoc Syntax-Guided Program Reduction

作者: Tian, Jia Le and Zhang, Mengxiao and Xu, Zhenyang and Tian, Yongqiang and Dong, Yiwen and Sun, Chengnian
关键词: program reduction

Abstract

Program reduction is a widely adopted, indispensable technique for debugging language implementations such as compilers and interpreters. Given a program 𝑃 and a bug triggered by 𝑃, a program reducer can produce a minimized program 𝑃∗ that is derived from 𝑃 and still triggers the same bug. Perses is one of the state-of-the-art program reducers. It leverages the syntax of 𝑃 to guide the reduction process for efficiency and effectiveness. It is language-agnostic, as its reduction algorithm is independent of any language-specific syntax. Conceptually, to support a new language, Perses only needs the context-free grammar 𝐺 of the language; in practice, it is not easy. One needs to first manually transform 𝐺 into a special grammar form PNF with a tool provided by Perses, then manually change the code base of Perses to integrate the new language, and lastly build a binary of Perses. This paper presents our latest work to improve the usability of Perses by extending it to perform ad hoc program reduction for any new language, as long as the language has a context-free grammar 𝐺. With this extended version (referred to as Perses^adhoc), the difficulty of supporting new languages is significantly reduced: a user only needs to write a configuration file and execute one command to support a new language, compared to manually transforming the grammar format, modifying the code base, and re-building Perses. Our case study demonstrates that with Perses^adhoc, the Perses-related infrastructure code required for supporting GLSL can be reduced from 190 lines of code to 20. Our extensive evaluations also show that Perses^adhoc is as effective and efficient as Perses in reducing programs, and only takes 10 seconds to support a new language, which is negligible compared to the manual effort required in Perses. A video demonstration of the tool can be found at https://youtu.be/trYwOT0mXhU.

DOI: 10.1145/3611643.3613101


Idaka: Tool Demo for the FSE 2023 Demonstration Article “On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers”

作者: Cabra-Acela, Laura and Mojica-Hanke, Anamaria and Linares-Vásquez, Mario
关键词: Good practices, Information retrieval, Large language models, Machine learning

Abstract

This is the artifact accompanying our demonstration article On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers accepted for the presentation at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2023. This artifact contains the source code and data required to deploy Idaka. This tool allows the retrieval and generation of machine learning practices based on a systematic approach (browsing all the practices) or a query. In addition, it contains a readme file in which the instructions for building and deploying the tools are available.

DOI: 10.1145/3611643.3613093


Helion: Enabling Natural Testing of Smart Homes

作者: Mandal, Prianka and Manandhar, Sunil and Kafle, Kaushal and Moran, Kevin and Poshyvanyk, Denys and Nadkarni, Adwait
关键词: Home Assistant, Home Automation, Language Models, Trigger-Action Programming

Abstract

Prior work has developed numerous systems that test the security and safety of smart homes. For these systems to be applicable in practice, it is necessary to test them with realistic scenarios that represent the use of the smart home, i.e., home automation, in the wild. This demo paper presents the technical details and usage of Helion, a system that uses n-gram language modeling to learn the regularities in user-driven programs, i.e., routines developed for the smart home, and predicts natural scenarios of home automation, i.e., event sequences that reflect realistic home automation usage. We demonstrate the HelionHA platform, developed by integrating Helion with the popular Home Assistant smart home platform. HelionHA allows an end-to-end exploration of Helion’s scenarios by executing them as test cases with real and virtual smart home devices.
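
The n-gram idea is simple enough to sketch directly: learn bigram counts over event sequences from existing routines, then rank likely next events to extend a scenario. The event names and routines below are invented; Helion itself trains on real routines and higher-order n-grams.

```python
# Minimal sketch of n-gram scenario prediction for home automation: bigram
# counts over event sequences, used to rank plausible next events.
from collections import Counter, defaultdict

routines = [
    ["door.unlocked", "light.hall.on", "thermostat.heat"],
    ["door.unlocked", "light.hall.on", "music.on"],
    ["motion.kitchen", "light.kitchen.on", "coffee.brew"],
    ["door.unlocked", "light.hall.on", "thermostat.heat"],
]

bigrams = defaultdict(Counter)
for routine in routines:
    for current, nxt in zip(routine, routine[1:]):
        bigrams[current][nxt] += 1

def predict_next(event, k=2):
    """Most likely next events after `event`, by relative frequency."""
    counts = bigrams[event]
    total = sum(counts.values()) or 1
    return [(e, c / total) for e, c in counts.most_common(k)]

# Extend a partial scenario into a "natural" home-automation test case.
print(predict_next("light.hall.on"))   # e.g. thermostat.heat ranked above music.on
```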

DOI: 10.1145/3611643.3613095


A Language Model of Java Methods with Train/Test Deduplication

作者: Su, Chia-Yi and Bansal, Aakash and Jain, Vijayanta and Ghanavati, Sepideh and McMillan, Collin
关键词: deduplication, java, language model, research tools

Abstract

This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity level of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers including an open and easily-searchable training set, a held out test set with different levels of deduplication from the training set, infrastructure for deduplicating new examples, and an implementation platform suitable for execution on equipment accessible to a relatively modest budget. Our model is a GPT2-like architecture with 350m parameters. Our training set includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b tokens). To improve accessibility of research to more members of the community, we limit local resource requirements to GPUs with 16GB video memory. We provide a test set of held out Java methods that include descriptive comments, including the entire Java projects for those methods. We also provide deduplication tools using precomputed hash tables at various similarity thresholds to help researchers ensure that their own test examples are not in the training set. We make all our tools and data open source and available via Huggingface and Github.
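
A much-simplified sketch of hash-based near-duplicate checking of the kind such deduplication tools provide is shown below: hash token shingles of a training method into a set, then estimate Jaccard similarity of a new example against it. The toolkit's real tools ship precomputed tables for the full training set at several similarity thresholds; the methods and threshold here are invented.

```python
# Simplified sketch of near-duplicate detection via hashed token shingles and
# Jaccard similarity, used to keep test examples out of the training set.
def shingles(code, n=5):
    tokens = code.split()
    return {hash(" ".join(tokens[i:i + n])) for i in range(max(1, len(tokens) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

training_method = "public int max ( int a , int b ) { return a > b ? a : b ; }"
candidate_test  = "public int max ( int x , int y ) { return x > y ? x : y ; }"

table = shingles(training_method)            # precomputed once for the training set
similarity = jaccard(shingles(candidate_test), table)
print(f"similarity = {similarity:.2f}",
      "-> drop from test set" if similarity >= 0.7 else "-> keep")
```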

DOI: 10.1145/3611643.3613090


DENT: A Tool for Tagging Stack Overflow Posts with Deep Learning Energy Patterns

作者: Shanbhag, Shriram and Chimalakonda, Sridhar and Sharma, Vibhu Saujanya and Kaulgud, Vikrant
关键词: deep learning, energy patterns, energy tags, stack overflow

Abstract

Energy efficiency has become an important consideration in deep learning systems. However, it remains a largely under-emphasized aspect during development. Despite the emergence of energy-efficient deep learning patterns, their adoption remains a challenge due to limited awareness. To address this gap, we present DENT (Deep Learning Energy Pattern Tagger), a Chrome extension that adds “energy pattern tags” to deep-learning-related questions on Stack Overflow. The idea of DENT is to hint to developers about the possible energy-saving opportunities associated with a Stack Overflow post through energy pattern labels. We hope this will increase awareness of energy patterns in deep learning and improve their adoption. A preliminary evaluation of DENT achieved an average precision of 0.74, recall of 0.66, and an F1-score of 0.65, with an accuracy of 66%. A demonstration of the tool is available at https://youtu.be/S0Wf_w0xajw and the related artifacts are available at https://rishalab.github.io/DENT/

DOI: 10.1145/3611643.3613092


MASC: A Tool for Mutation-Based Evaluation of Static Crypto-API Misuse Detectors

作者: Ami, Amit Seal and Ahmed, Syed Yusuf and Redoy, Radowan Mahmud and Cooper, Nathan and Kafle, Kaushal and Moran, Kevin and Poshyvanyk, Denys and Nadkarni, Adwait
关键词: Crypto-API, crypto-API misuse detector, mutation testing, mutation-based evaluation, security, software-engineering, static analysis

Abstract

While software engineers are optimistically adopting crypto-API misuse detectors (or crypto-detectors) in their software development cycles, this momentum must be accompanied by a rigorous understanding of crypto-detectors’ effectiveness at finding crypto-API misuses in practice. This demo paper presents the technical details and usage scenarios of our tool, namely Mutation Analysis for evaluating Static Crypto-API misuse detectors (MASC). We developed 12 generalizable, usage based mutation operators and three mutation scopes, namely Main Scope, Similarity Scope, and Exhaustive Scope, which can be used to expressively instantiate compilable variants of the crypto-API misuse cases. Using MASC, we evaluated nine major crypto-detectors, and discovered 19 unique, undocumented flaws. We designed MASC to be configurable and user-friendly; a user can configure the parameters to change the nature of generated mutations. Furthermore, MASC comes with both Command Line Interface and Web-based front-end, making it practical for users of different levels of expertise.

DOI: 10.1145/3611643.3613099


llvm2CryptoLine: Verifying Arithmetic in Cryptographic C Programs

作者: Chen, Ruiling and Liu, Jiaxiang and Shi, Xiaomu and Tsai, Ming-Hsien and Wang, Bow-Yaw and Yang, Bo-Yin
关键词: cryptographic programs, formal verification, functional correctness

Abstract

Correct implementations of cryptographic primitives are essential for modern security. These implementations often contain arithmetic operations involving non-linear computations that are infamously hard to verify. We present llvm2CryptoLine, an automated formal verification tool for arithmetic operations in cryptographic C programs. llvm2CryptoLine successfully verifies 51 arithmetic C programs from industrial cryptographic libraries OpenSSL, wolfSSL and NaCl. Most of the programs are verified fully automatically and efficiently. A screencast that showcases llvm2CryptoLine can be found at https://youtu.be/QXuSmja45VA. Source code is available at https://github.com/fmlab-iis/llvm2cryptoline.

DOI: 10.1145/3611643.3613096


P4b: A Translator from P4 Programs to Boogie

作者: Ye, Chong and He, Fei
关键词: P4 programming language, Software defined networking, data plane, formal verification

Abstract

P4 is a mainstream language for Software Defined Network (SDN) data planes. It is designed to achieve target-independent, protocol-independent, and configurable SDN data planes. However, logic errors may occur in P4 programs, resulting in improper packet processing, which may cause serious network errors and information disclosure. In addition, P4 programs contain many branches, which makes ensuring their correctness more challenging.
Formal verification is a powerful technique to verify the correctness of P4 programs. Unfortunately, current P4 verification studies lack basic toolchains, and their intermediate languages are not expressive enough. We present P4b, an efficient translator from P4 programs to Boogie, a verification-oriented intermediate representation. We provide formal translation rules to ensure the correctness of the translation process. The translated results can be verified by the toolchain of Boogie. We conducted experiments on 170 P4 programs collected from GitHub, and the experimental results demonstrate that our translator is useful and practical.
The screencast is available at https://youtu.be/8_rEj3QFQeM. The tool is available at https://github.com/Invincibleyc/P4B-Translator.

DOI: 10.1145/3611643.3613091


D2S2: Drag-and-Drop Mobile Screen Search

作者: Mohian, Soumik and Tang, Tony and Trinh, Tuan and Dang, Don and Csallner, Christoph
关键词: User interface design, design examples, information retrieval, interactive screenshot search, prototyping

Abstract

The lack of diverse UI element representations in publicly available datasets hinders the scalability of sketch-based interactive mobile search. This paper introduces D2S2, a novel approach that addresses this limitation via drag-and-drop mobile screen search, accommodating visual and text-based queries. D2S2 searches 58k Rico screens for relevant UI examples based on UI element attributes, including type, position, shape, and text. In an evaluation with 10 novice software developers D2S2 successfully retrieves target screens within the top-20 search results in 15/19 attempts within a minute. The tool offers interactive and iterative search, updating its search results each time the user modifies the search query. Interested users can freely access D2S2 (http://pixeltoapp.com/D2S2), build on D2S2 or replicate results via D2S2’s open-source implementation (https://github.com/toni-tang/D2S2), or watch D2S2’s video demonstration (https://youtu.be/fdoYiw8lAn0).

DOI: 10.1145/3611643.3613100


CONAN: Statically Detecting Connectivity Issues in Android Applications

作者: Mazuera-Rozo, Alejandro and Escobar-Velásquez, Camilo
关键词: Android, Connectivity Issues, Linter

Abstract

Mobile apps are increasingly used in daily activities. Most apps require Internet connectivity to be fully exploited. Despite the fact that global access to the Internet has improved over the years, there are still complex connectivity scenarios, including situations with zero/unreliable connectivity. In such scenarios, improper handling of Eventual Connectivity Issues may cause bugs and crashes that worsen the user experience. Even though these issues have been studied in the literature, no automatic detection techniques are available. To address the mentioned gap, we have created the open source CONAN tool. CONAN can statically detect 16 types of Eventual Connectivity Issues within Android apps; it works at the source code level and alerts developers of any connectivity issue, highlighting them directly in the IDE or generating a report explaining the detected errors. In this paper, we present the technical aspects and a video of our tool, which are publicly available at https://tinyurl.com/CONAN-lint.

DOI: 10.1145/3611643.3613097


A Data Set of Extracted Rationale from Linux Kernel Commit Messages

作者: Dhaouadi, Mouna
关键词: Commit messages, Data set, Linux kernel, Software rationale

Abstract

Developers’ commit messages contain information about decisions taken and their rationale. Extracting this information is challenging, since we lack a detailed understanding of how developers express these concepts. Our work in progress targets this challenge by producing a labelled data set of commit messages for a Linux kernel component. We report preliminary analyses which suggest that larger commit messages, and commits by more experienced developers, tend to have around 40% of their sentences containing rationale. This may indicate a guideline for developers to target.

DOI: 10.1145/3611643.3617851


Detecting Overfitting of Machine Learning Techniques for Automatic Vulnerability Detection

作者: Risse, Niklas
关键词: automatic vulnerability detection, large language models, machine learning, semantic preserving transformations

Abstract

Recent results of machine learning for automatic vulnerability detection have been very promising indeed: Given only the source code of a function f, models trained by machine learning techniques can decide if f contains a security flaw with up to 70% accuracy. But how do we know that these results are general and not specific to the datasets? To study this question, researchers proposed to amplify the testing set by injecting semantic preserving changes and found that the model’s accuracy significantly drops. In other words, the model uses some unrelated features during classification. In order to increase the robustness of the model, researchers proposed to train on amplified training data, and indeed model accuracy increased to previous levels. In this paper, we replicate and continue this investigation, and provide an actionable model benchmarking methodology to help researchers better evaluate advances in machine learning for vulnerability detection. Specifically, we propose a cross validation algorithm, where a semantic preserving transformation is applied during the amplification of either the training set or the testing set. Using 11 transformations and 3 ML techniques, we find that the improved robustness only applies to the specific transformations used during training data amplification. In other words, the robustified models still rely on unrelated features for predicting the vulnerabilities in the testing data.
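
The transformations in question are semantics-preserving rewrites of the input programs. The paper applies them to the (typically C/C++) functions in vulnerability datasets; the sketch below only illustrates the general idea, using Python's `ast` module and consistent identifier renaming so that the example stays self-contained and runnable.

```python
# Generic illustration of a semantics-preserving transformation: consistently
# rename parameters and local variables. A robust vulnerability detector should
# predict the same label for the original and the transformed function.
import ast  # ast.unparse requires Python 3.9+

class RenameIdentifiers(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}

    def fresh(self, name):
        return self.mapping.setdefault(name, f"var_{len(self.mapping)}")

    def visit_arg(self, node):                  # function parameters
        node.arg = self.fresh(node.arg)
        return node

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):     # local assignments get fresh names
            node.id = self.fresh(node.id)
        elif node.id in self.mapping:           # only rename names we introduced,
            node.id = self.mapping[node.id]     # leaving builtins like range intact
        return node

source = """
def copy_buffer(src, n):
    dst = []
    for i in range(n):
        dst.append(src[i])
    return dst
"""

tree = RenameIdentifiers().visit(ast.parse(source))
print(ast.unparse(tree))   # same behaviour, different surface form
```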

DOI: 10.1145/3611643.3617845


Detection of Optimizations Missed by the Compiler

作者: Zhang, Yi
关键词: Compiler, Missed optimizations, Testing

Abstract

With the increasing significance of compilers in software development, identifying optimization bugs and enhancing compiler performance has become a significant challenge. In recent years, many studies have targeted only specific analyses to identify missed optimizations (e.g., in data-flow analyses). While there have been some general approaches, most rely on differential testing between different compilers, which makes it difficult to identify optimization bugs that are common to multiple compilers. This paper tackles these challenges by introducing an effective and general approach called MOD. MOD works by using a manually curated mapping between optimizations that encodes the expectation that code should be optimized consistently: if one optimization triggers but the other does not, that indicates a bug in either of the two optimizations. We implemented MOD as a tool to detect missed optimizations in the available-expressions analysis of LLVM. Experimental results show that MOD can report 20 correct alerts in one hour of testing on random test programs.

DOI: 10.1145/3611643.3617846


Do All Software Projects Die When Not Maintained? Analyzing Developer Maintenance to Predict OSS Usage

作者: Nguyen, Emily
关键词: Open Source, Open Source Sustainability, Survival Analysis

Abstract

Past research suggests software should be continuously maintained in order to remain useful in our digital society. To determine whether these studies on software evolution are supported in modern-day software libraries, we conduct a natural experiment on 26,050 GitHub repositories, statistically modeling library usage, measured by package-level downloads, against different factors related to project maintenance.

DOI: 10.1145/3611643.3617849


Inferring Complexity Bounds from Recurrence Relations

作者: Ishimwe, Didier
关键词: complexity analysis, dynamic invariant generation, numerical relations, recurrence relations

Abstract

Determining program complexity bounds is a fundamental problem with a variety of applications in software development. In this paper we present a novel approach for computing the asymptotic complexity bounds of non-deterministic recursive programs by solving dynamically inferred recurrence relations. Recurrences are inferred from program execution traces and solved using the annihilator method and the Master Theorem to obtain closed-form solutions representing the complexity bounds. Our preliminary evaluation shows that this approach can learn correct bounds for popular classical recursive programs (e.g., O(n²) for quicksort), achieving more precise bounds for exponential programs than previously reported (e.g., O(((1+√5)/2)ⁿ) for Fibonacci), and capturing a wide range of bounds, including non-linear polynomial and non-polynomial, logarithmic, and exponential relations.
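
For the Fibonacci-style bound quoted above, the closed form follows from the characteristic-equation (annihilator) treatment of the inferred recurrence; a compact worked example in the spirit of the abstract, together with a standard Master Theorem case, might look like:

```latex
% Worked example: solving an inferred recurrence with the characteristic
% (annihilator) method, plus a Master Theorem case for divide-and-conquer.
\begin{align*}
  &\text{Inferred from traces: } T(n) = T(n-1) + T(n-2), \quad T(0)=T(1)=1.\\
  &\text{Characteristic equation: } x^2 = x + 1
     \;\Rightarrow\; x = \tfrac{1 \pm \sqrt{5}}{2},\\
  &\text{so } T(n) = \Theta\!\left(\left(\tfrac{1+\sqrt{5}}{2}\right)^{\!n}\right)
     \approx \Theta(1.618^n) \quad \text{(the Fibonacci bound above)}.\\[4pt]
  &\text{Master Theorem case: } T(n) = 2\,T(n/2) + \Theta(n)
     \;\Rightarrow\; T(n) = \Theta(n \log n) \quad \text{(e.g.\ mergesort)}.
\end{align*}
```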

DOI: 10.1145/3611643.3617853


LLM-Based Code Generation Method for Golang Compiler Testing

作者: Gu, Qiuhan
关键词: Code generation, Compiler testing, Go language, Large model

Abstract

Modern optimizing compilers are among the most complex software systems humans build. One way to identify subtle compiler bugs is fuzzing. Both the quantity and the quality of testcases are crucial to the performance of fuzzing. Traditional testcase-generation methods, such as Csmith and YARPGen, have been proven successful at discovering compiler bugs. However, such generated testcases have limited coverage and quantity. In this paper, we present a code generation method for compiler testing based on LLM to maximize the quality and quantity of the generated code. In particular, to avoid undefined behavior and syntax errors in generated testcases, we design a filter strategy to clean the source code, preparing a high-quality dataset for the model training. Besides, we present a seed schedule strategy to improve code generation. We apply the method to test the Golang compiler and the result shows that our pipeline outperforms previous methods both qualitatively and quantitatively. It produces testcases with an average coverage of 3.38%, in contrast to the testcases generated by GoFuzz, which have an average coverage of 0.44%. Moreover, among all the generated testcases, only 2.79% exhibited syntax errors, and none displayed undefined behavior.

DOI: 10.1145/3611643.3617850


Privacy-Centric Log Parsing for Timely, Proactive Personal Data Protection

作者: Sedki, Issam
关键词: AIOps, Data Privacy and Security, GDPR, Log Analytics, Regulatory Compliance

Abstract

This paper presents a privacy-centric approach to log parsing, addressing the growing need for privacy compliance in log management. We propose a novel log parser that focuses on data minimization, a key principle in privacy protection. By integrating privacy considerations into the log parsing process, our approach enables proactive and timely privacy compliance and mitigation of privacy breaches.

DOI: 10.1145/3611643.3617847


STraceBERT: Source Code Retrieval using Semantic Application Traces

作者: Spiess, Claudio
关键词: neural information retrieval, reverse engineering, tracing

Abstract

Software reverse engineering is an essential task in software engineering and security, but it can be a challenging process, especially for adversarial artifacts. To address this challenge, we present STraceBERT, a novel approach that utilizes a Java dynamic analysis tool to record calls to core Java libraries, and pretrain a BERT-style model on the recorded application traces for effective method source code retrieval from a candidate set. Our experiments demonstrate the effectiveness of STraceBERT in retrieving the source code compared to existing approaches. Our proposed approach offers a promising solution to the problem of code retrieval in software reverse engineering and opens up new avenues for further research in this area.

DOI: 10.1145/3611643.3617852


The Call Graph Chronicles: Unleashing the Power Within

作者: Bhuiyan, Masudul Hasan Masud
关键词: Call Graphs, Graph Neural Network, Software Engineering

Abstract

Call graph generation is critical for program understanding and analysis, but achieving both accuracy and precision is challenging. Existing methods trade off one for the other, particularly in dynamic languages like JavaScript. This paper introduces “Graphia,” an approach that combines structural and semantic information using a Graph Neural Network (GNN) to enhance call graph accuracy. Graphia’s two-step process employs an initial call graph as training data for the GNN, which then uncovers true call edges in new programs. Experimental results show Graphia significantly improves true positive rates in vulnerability detection, achieving up to 95%. This approach advances call graph accuracy by effectively incorporating code structure and context, particularly in complex dynamic language scenarios.

DOI: 10.1145/3611643.3617854


The State of Survival in OSS: The Impact of Diversity

作者: Feng, Zixuan
关键词: Disengagement, Open Source, Survivability

Abstract

Maintaining and retaining contributors is crucial for Open Source (OSS) projects. However, there is often a high turnover among contributors (in some projects as high as 80%). The survivability of contributors is influenced by various factors, including their demographics. Research on contributors’ survivability must, therefore, consider diversity factors. This study longitudinally analyzed the impact of demographic attributes on survivability in the Flutter community through the lens of gender, region, and compensation. The preliminary analysis reveals that affiliated or Western contributors have a higher survival probability than volunteer or Non-Western contributors. However, no significant difference was found in the survival probability between men and women.
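
The survival analysis itself can be reproduced with standard tooling. A minimal sketch with the `lifelines` library and made-up contributor data (months of activity, a disengagement indicator, and an affiliation flag standing in for the demographic attributes studied):

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Illustrative data only: months active per contributor, whether they
# disengaged (1) or were still active at the end of observation (0),
# and a demographic attribute such as affiliation.
df = pd.DataFrame({
    "months":     [3, 14, 27, 6, 48, 9, 30, 12],
    "disengaged": [1,  1,  0, 1,  0, 1,  0,  1],
    "affiliated": [0,  1,  1, 0,  1, 0,  1,  0],
})

groups = {name: g for name, g in df.groupby("affiliated")}
for name, g in groups.items():
    kmf = KaplanMeierFitter()
    kmf.fit(g["months"], event_observed=g["disengaged"], label=f"affiliated={name}")
    print(kmf.survival_function_)

# Log-rank test: is the difference between the two survival curves significant?
a, b = groups[1], groups[0]
res = logrank_test(a["months"], b["months"],
                   event_observed_A=a["disengaged"],
                   event_observed_B=b["disengaged"])
print("p-value:", res.p_value)
```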

DOI: 10.1145/3611643.3617848


Artifact for “MAAT: A Novel Ensemble Approach to Addressing Fairness and Performance Bugs for Machine Learning Software”

作者: Chen, Zhenpeng and Zhang, Jie M. and Sarro, Federica and Harman, Mark
关键词: bias mitigation, ensemble learning, fairness-performance trade-off, machine learning software, Software fairness

Abstract

This artifact is for the paper entitled “MAAT: A Novel Ensemble Approach to Addressing Fairness and Performance Bugs for Machine Learning Software”, which was accepted at ESEC/FSE 2022. MAAT is a novel ensemble approach to improving the fairness-performance trade-off for machine learning software, and it outperforms state-of-the-art bias mitigation methods. In this artifact, we provide the source code of MAAT and of the other existing bias mitigation methods used in our study, as well as the intermediate results, the installation instructions, and a replication guideline (included in the README). The replication guideline provides detailed steps to replicate all the results for all the research questions.
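
As a generic illustration of the ensemble idea only (not MAAT's algorithm), the sketch below trains one model for predictive performance and one on group-reweighted data as a stand-in bias-mitigation step, then averages their predicted probabilities; the data and the reweighting scheme are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 5 features, a binary sensitive attribute, and a label that
# is partly correlated with the sensitive attribute.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
sensitive = rng.integers(0, 2, 500)
y = (X[:, 0] + 0.5 * sensitive + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Performance-oriented ensemble member.
perf_model = LogisticRegression().fit(X, y)

# Fairness-oriented member: reweight samples so both sensitive groups
# contribute equally (a simple stand-in for a bias-mitigation step).
weights = np.where(sensitive == 1,
                   1.0 / sensitive.mean(),
                   1.0 / (1 - sensitive.mean()))
fair_model = LogisticRegression().fit(X, y, sample_weight=weights)

# Combine the two members by averaging their probabilities.
proba = 0.5 * perf_model.predict_proba(X)[:, 1] + 0.5 * fair_model.predict_proba(X)[:, 1]
pred = (proba >= 0.5).astype(int)
```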

DOI: 10.1145/3540250.3549093



