ICSE 2024 | 逸翎清晗🌈

Challenges and Opportunities in Model Checking Large-scale Distributed Systems

Author: Majumdar, Rupak
Keywords: No keywords

Abstract

The goal of the Must project is to provide design and verification support for industrial-scale distributed systems. We provide an overview of the project: its design goals, its technical features, as well as some lessons we learnt in the process of transferring academic research to an industrial tool.

DOI: 10.1145/3597503.3649398


Software Engineering Research in a World with Generative Artificial Intelligence

Author: Rinard, Martin
Keywords: software engineering, generative artificial intelligence, large language models

Abstract

Generative artificial intelligence systems such as large language models (LLMs) exhibit powerful capabilities that many see as the kind of flexible and adaptive intelligence that previously only humans could exhibit. I address directions and implications of LLMs for software engineering research.

DOI: 10.1145/3597503.3649399


Trustworthy by Design

Author: Smith, Carol
Keywords: keynote, ethics, trust, emerging technology, AI

Abstract

The relatively recent public release of generative artificial intelligence (AI) systems has ignited a significant leap in awareness of the capabilities of AI. In parallel, there has been a recognition of AI system limitations and the bias inherent in systems created by humans. Expectations are rising for more trustworthy, human-centered, and responsible software connecting humans to powerful systems that augment their abilities. There are decades of practice designing systems that work with, and for, humans that we can build upon to face the new challenges and opportunities brought by dynamic AI systems.

DOI: 10.1145/3597503.3649400


Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Errors

Authors: Peng, Yun and Gao, Shuzheng and Gao, Cuiyun and Huo, Yintong and Lyu, Michael
Keywords: No keywords

Abstract

As a dynamic programming language, Python has become increasingly popular in recent years. Although the dynamic type system of Python facilitates developers in writing Python programs, it also brings type errors at run-time, which are prevalent yet not easy to fix. There exist rule-based approaches for automatically repairing Python type errors. These approaches can generate accurate patches for the type errors covered by manually defined templates, but they require domain experts to design patch synthesis rules and suffer from low template coverage of real-world type errors. Learning-based approaches alleviate the manual effort of designing patch synthesis rules and have become prevalent due to the recent advances in deep learning. Among the learning-based approaches, the prompt-based approach, which leverages the knowledge base of code pre-trained models via pre-defined prompts, obtains state-of-the-art performance in general program repair tasks. However, such prompts are manually defined and do not involve any specific clues for repairing Python type errors, resulting in limited effectiveness. How to automatically improve prompts with domain knowledge for type error repair is challenging yet under-explored. In this paper, we present TypeFix, a novel prompt-based approach with fix templates incorporated for repairing Python type errors. TypeFix first mines generalized fix templates via a novel hierarchical clustering algorithm. The identified fix templates indicate the common edit patterns and contexts of existing type error fixes. TypeFix then generates code prompts for code pre-trained models by employing the generalized fix templates as domain knowledge, in which the masks are adaptively located for each type error instead of being pre-determined. Experiments on two benchmarks, BugsInPy and TypeBugs, show that TypeFix successfully repairs 26 and 55 type errors, outperforming the best baseline approach by 9 and 14, respectively. Besides, the proposed fix template mining approach can cover 75% of developers’ patches in both benchmarks, exceeding the best rule-based approach PyTER by more than 30%.
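
As an illustration of the template-to-prompt idea (not TypeFix’s actual implementation; the template notation and helper below are invented), a mined fix template can be instantiated around the buggy expression while its edit slot is left as a mask for a code-infilling model:

```python
# Hypothetical sketch: turn a mined fix template into a masked prompt for a
# fill-in-the-mask code model. The <EXPR>/<HOLE> notation is invented here.

def instantiate_template(buggy_expr: str, template: str, mask_token: str = "<mask>") -> str:
    """Fill the template's context slot with the buggy expression and leave
    the edit slot as a mask for the code pre-trained model to complete."""
    return template.replace("<EXPR>", buggy_expr).replace("<HOLE>", mask_token)

# e.g. `total = total + value` raises TypeError: unsupported operand type(s)
# for +: 'int' and 'str'; an "insert-cast" template wraps the offending operand.
template = "total = total + <HOLE>(<EXPR>)"
print(instantiate_template("value", template))   # total = total + <mask>(value)
# The masked prompt is then completed by the model, e.g. with `int` or `float`.
```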

DOI: 10.1145/3597503.3608132


Practical Program Repair via Preference-based Ensemble Strategy

Authors: Zhong, Wenkang and Li, Chuanyi and Liu, Kui and Xu, Tongtong and Ge, Jidong and Bissyande, Tegawende F. and Luo, Bin and Ng, Vincent
Keywords: program repair, ensemble strategy

Abstract

To date, over 40 Automated Program Repair (APR) tools have been designed with varying bug-fixing strategies, which have been demonstrated to have complementary performance in terms of being effective for different bug classes. Intuitively, it should be feasible to improve the overall bug-fixing performance of APR via assembling existing tools. Unfortunately, simply invoking all available APR tools for a given bug can result in unacceptable costs on APR execution as well as on patch validation (via expensive testing). Therefore, while assembling existing tools is appealing, it requires an efficient strategy to reconcile the need to fix more bugs and the requirements for practicality. In light of this problem, we propose a Preference-based Ensemble Program Repair framework (P-EPR), which seeks to effectively rank APR tools for repairing different bugs. P-EPR is the first non-learning-based APR ensemble method that is novel in its exploitation of repair patterns as a major source of knowledge for ranking APR tools and its reliance on a dynamic update strategy that enables it to immediately exploit and benefit from newly derived repair results. Experimental results show that P-EPR outperforms existing strategies significantly both in flexibility and effectiveness.
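
A minimal sketch of the preference-based idea, assuming a hypothetical pattern-to-tool preference table and scoring scheme (not the paper’s exact formulas): tools matching the bug’s repair pattern are ranked first, and scores are updated as soon as repair outcomes come in.

```python
# Minimal sketch of preference-based tool ranking with a dynamic update
# (pattern table, tool names, and weights are all invented for illustration).
from collections import defaultdict

PATTERN_PREFS = {"null_check": ["ToolA", "ToolB"], "wrong_operator": ["ToolC"]}
scores = defaultdict(float)

def rank_tools(bug_pattern, tools):
    prior = PATTERN_PREFS.get(bug_pattern, [])
    return sorted(tools, key=lambda t: scores[t] + (1.0 if t in prior else 0.0), reverse=True)

def update(tool, fixed):
    # dynamic update: immediately exploit newly derived repair results
    scores[tool] += 2.0 if fixed else -0.5

tools = ["ToolA", "ToolB", "ToolC"]
order = rank_tools("null_check", tools)   # try the most promising tool first
update(order[0], fixed=True)              # its patch was validated by the tests
print(order, dict(scores))
```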

DOI: 10.1145/3597503.3623310


Learning and Repair of Deep Reinforcement Learning Policies from Fuzz-Testing Data

Authors: Tappler, Martin and Pferscher, Andrea and Aichernig, Bernhard K. and Könighofer, Bettina
Keywords: deep reinforcement learning, reinforcement learning from demonstrations, search-based software testing, policy repair

Abstract

Reinforcement learning from demonstrations (RLfD) is a promising approach to improve the exploration efficiency of reinforcement learning (RL) by learning from expert demonstrations in addition to interactions with the environment. In this paper, we propose a framework that combines techniques from search-based testing with RLfD with the goal to raise the level of dependability of RL policies and to reduce human engineering effort. Within our framework, we provide methods for efficiently training, evaluating, and repairing RL policies. Instead of relying on the costly collection of demonstrations from (human) experts, we automatically compute a diverse set of demonstrations via search-based fuzzing methods and use the fuzz demonstrations for RLfD. To evaluate the safety and robustness of the trained RL agent, we search for safety-critical scenarios in the black-box environment. Finally, when unsafe behavior is detected, we compute demonstrations through fuzz testing that represent safe behavior and use them to repair the policy. Our experiments show that our framework is able to efficiently learn high-performing and safe policies without requiring any expert knowledge.

DOI: 10.1145/3597503.3623311


BinAug: Enhancing Binary Similarity Analysis with Low-Cost Input Repairing

Authors: Wong, Wai Kin and Wang, Huaijin and Li, Zongjie and Wang, Shuai
Keywords: binary analysis, DNNs, input repairing

Abstract

Binary code similarity analysis (BCSA) is a fundamental building block for various software security, reverse engineering, and re-engineering applications. Existing research has applied deep neural networks (DNNs) to measure the similarity between binary code, following the major breakthrough of DNNs in processing media data like images. Despite the encouraging results of DNN-based BCSA, it is, however, not widely deployed in industry due to the instability and the black-box nature of DNNs. In this work, we first launch an extensive study over the state-of-the-art (SoTA) BCSA tools, and investigate their erroneous predictions from both quantitative and qualitative perspectives. Then, we accordingly design a low-cost and generic framework, namely Binaug, to improve the accuracy of BCSA tools by repairing their input binary codes. Aligned with the typical workflow of DNN-based BCSA, Binaug obtains the sorted top-K results of code similarity, and then re-ranks the results using a set of carefully-designed transformations. Binaug supports both black- and white-box settings, depending on the accessibility of the DNN model internals. Our experimental results show that Binaug can consistently improve the performance of the SoTA BCSA tools by an average of 2.38pt and 6.46pt in the black- and the white-box settings. Moreover, with Binaug, we enhance the F1 score of binary software component analysis, an important downstream application of BCSA, by an average of 5.43pt and 7.45pt in the black- and the white-box settings.

DOI: 10.1145/3597503.3623328


VeRe: Verification Guided Synthesis for Repairing Deep Neural Networks

Authors: Ma, Jianan and Yang, Pengfei and Wang, Jingyi and Sun, Youcheng and Huang, Cheng-Chao and Wang, Zhen
Keywords: DNN repair, verification guided synthesis, fault localization

Abstract

Neural network repair aims to fix the 'bugs' of neural networks by modifying the model’s architecture or parameters. However, due to the data-driven nature of neural networks, it is difficult to explain the relationship between the internal neurons and erroneous behaviors, making further repair challenging. While several works exist to identify responsible neurons based on gradient or causality analysis, their effectiveness relies heavily on the quality of available 'bugged' data and multiple heuristics in layer or neuron selection. In this work, we address the issue by utilizing the power of formal verification (in particular for neural networks). Specifically, we propose VeRe, a verification-guided neural network repair framework that performs fault localization based on linear relaxation to symbolically calculate the repair significance of neurons, and further optimizes the parameters of problematic neurons to repair erroneous behaviors. We evaluated VeRe on various repair tasks, and our experimental results show that VeRe can efficiently and effectively repair all neural networks without degrading the model’s performance. For the task of removing backdoors, VeRe successfully reduces the attack success rate from 98.47% to 0.38% on average, while causing an average performance drop of 0.9%. For the task of repairing safety properties, VeRe successfully repairs all 36 tasks and achieves 99.87% generalization on average.

DOI: 10.1145/3597503.3623332


RUNNER: Responsible UNfair NEuron Repair for Enhancing Deep Neural Network Fairness

Authors: Li, Tianlin and Cao, Yue and Zhang, Jian and Zhao, Shiqian and Huang, Yihao and Liu, Aishan and Guo, Qing and Liu, Yang
Keywords: deep learning repair, fairness, model interpretation

Abstract

Deep Neural Networks (DNNs), an emerging software technology, have achieved impressive results in a variety of fields. However, the discriminatory behaviors towards certain groups (a.k.a. unfairness) of DNN models are increasingly becoming a social concern, especially in high-stakes applications such as loan approval and criminal risk assessment. Although there have been a number of works on improving model fairness, most of them adopt an adversary to either expand the model architecture or augment training data, which introduces excessive computational overhead. Recent work diagnoses responsible unfair neurons first and fixes them with selective retraining. Unfortunately, the existing diagnosis process is time-consuming due to multi-step training sample analysis, and selective retraining may cause a performance bottleneck due to indirectly adjusting unfair neurons on biased samples. In this paper, we propose Responsible UNfair NEuron Repair (RUNNER) that improves existing works in three key aspects: (1) efficiency: we design the Importance-based Neuron Diagnosis that identifies responsible unfair neurons in one step with a novel importance criterion of neurons; (2) effectiveness: we design the Neuron Stabilizing Retraining by adding a loss term that measures the activation distance of responsible unfair neurons from different subgroups in all sources; (3) generalization: we investigate the effectiveness on both structured tabular data and large-scale unstructured image data, which is often ignored in prior studies. Our extensive experiments across 5 datasets show that RUNNER can effectively and efficiently diagnose and repair DNNs with respect to unfairness. On average, our approach significantly reduces computing overhead from 341.7s to 29.65s, and improves fairness by up to 79.3%. Besides, RUNNER also keeps state-of-the-art results on the unstructured dataset.
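
To make the retraining idea concrete, here is a minimal sketch of a neuron-stabilizing loss term; the exact loss form is invented for illustration and the neuron indices are placeholders.

```python
# Sketch only: penalize the gap between the mean activations that the flagged
# "unfair" neurons produce for two protected subgroups (loss form invented).
import torch

def stabilizing_loss(activations, group_ids, unfair_idx):
    """activations: (batch, num_neurons); group_ids: (batch,) with values 0/1."""
    a = activations[:, unfair_idx]
    gap = a[group_ids == 0].mean(dim=0) - a[group_ids == 1].mean(dim=0)
    return gap.pow(2).mean()

acts = torch.randn(32, 128, requires_grad=True)
groups = torch.randint(0, 2, (32,))
task_loss = torch.tensor(0.0)                    # stands in for the usual training loss
loss = task_loss + 0.5 * stabilizing_loss(acts, groups, unfair_idx=[3, 17, 42])
loss.backward()
```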

DOI: 10.1145/3597503.3623334


ITER: Iterative Neural Repair for Multi-Location Patches

Authors: Ye, He and Monperrus, Martin
Keywords: No keywords

Abstract

Automated program repair (APR) has achieved promising results, especially using neural networks. Yet, the overwhelming majority of patches produced by APR tools are confined to a single location. When looking at the patches produced with neural repair, most of them fail to compile, yet some of these uncompilable patches still go in the right direction. In both cases, the fundamental problem is that the potential of partial patches is ignored. In this paper, we propose an iterative program repair paradigm called ITER founded on the concept of improving partial patches until they become plausible and correct. First, ITER iteratively improves partial single-location patches by fixing compilation errors and further refining the previously generated code. Second, ITER iteratively improves partial patches to construct multi-location patches, with fault localization re-execution. ITER is implemented for Java based on battle-proven deep neural networks and code representation. ITER is evaluated on 476 bugs from 10 open-source projects in Defects4J 2.0. ITER succeeds in repairing 15.5% of them, including 9 uniquely repaired multi-location bugs.
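
The iterative idea can be sketched as a simple driver loop; the model, compiler, test, and localization interfaces below are placeholders rather than ITER’s actual API.

```python
# Sketch of iterative patch refinement (all interfaces are hypothetical).
def iterative_repair(program, model, compiler, tests, localize, max_iters=5):
    patch = model.generate(program, location=localize(program))
    for _ in range(max_iters):
        ok, errors = compiler.compile(patch)
        if not ok:
            # feed compilation errors back so the partial patch can be refined
            patch = model.refine(patch, feedback=errors)
            continue
        if tests.run(patch):
            return patch                       # plausible, possibly multi-location patch
        # re-run fault localization on the patched program to extend the patch
        patch = model.generate(patch, location=localize(patch))
    return None
```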

DOI: 10.1145/3597503.3623337


EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal Learning

Authors: Chen, Liuqing and Chen, Yunnong and Xiao, Shuhong and Song, Yaxuan and Sun, Lingyun and Zhen, Yankun and Zhou, Tingting and Chang, Yanfang
Keywords: UI elements grouping, fragmented elements grouping, end-to-end pipeline, multi-modal transformer

Abstract

When translating UI design prototypes to code in industry, automatically generating code from design prototypes can expedite the development of applications and GUI iterations. However, in design prototypes without strict design specifications, UI components may be composed of fragmented elements. Grouping these fragmented elements can greatly improve the readability and maintainability of the generated code. Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements. Unfortunately, the performance of these methods is not satisfactory due to visually overlapped and tiny UI elements. In this study, we propose EGFE, a novel method for automatically End-to-end Grouping Fragmented Elements via UI sequence prediction. To facilitate UI understanding, we innovatively construct a Transformer encoder to model the relationships between UI elements with multi-modal representation learning. The evaluation on a dataset of 4606 UI prototypes collected from professional UI designers shows that our method outperforms the state-of-the-art baselines in precision (by 29.75%), recall (by 31.07%), and F1-score (by 30.39%) at an edit distance threshold of 4. In addition, we conduct an empirical study to assess the improvement of the generated front-end code. The results demonstrate the effectiveness of our method on a real software engineering application. Our end-to-end fragmented element grouping method creates opportunities for improving UI-related software engineering tasks.

DOI: 10.1145/3597503.3623313


A Comprehensive Study of Learning-based Android Malware Detectors under Challenging Environments

Authors: Gao, Cuiying and Huang, Gaozhun and Li, Heng and Wu, Bang and Wu, Yueming and Yuan, Wei
Keywords: android malware detection, machine learning, code obfuscation, concept drift, adversarial examples

Abstract

Recent years have witnessed the proliferation of learning-based Android malware detectors. These detectors can be categorized into three types: String-based, Image-based, and Graph-based. Most of them have achieved good detection performance under the ideal setting. In reality, however, detectors often face out-of-distribution samples due to factors such as code obfuscation, concept drift (e.g., software development technique evolution and new malware category emergence), and adversarial examples (AEs). This problem has attracted increasing attention, but there is a lack of comparative studies that evaluate the various existing types of detectors under these challenging environments. In order to fill this gap, we select 12 representative detectors from the three types, and evaluate them in the challenging scenarios involving code obfuscation, concept drift and AEs, respectively. Experimental results reveal that none of the evaluated detectors can maintain their ideal-setting detection performance, and the performance of different types of detectors varies significantly under various challenging environments. We identify several factors contributing to the performance deterioration of detectors, including the limitations of feature extraction methods and learning models. We also analyze the reasons why the detectors of different types show significant performance differences when facing code obfuscation, concept drift and AEs. Finally, we provide practical suggestions from the perspectives of users and researchers, respectively. We hope our work can help understand the detectors of different types, and provide guidance for enhancing their performance and robustness.

DOI: 10.1145/3597503.3623320


Toward Automatically Completing GitHub Workflows

Authors: Mastropaolo, Antonio and Zampetti, Fiorella and Bavota, Gabriele and Di Penta, Massimiliano
Keywords: continuous integration and delivery, GitHub workflows, pre-trained models, machine learning on code

Abstract

Continuous integration and delivery (CI/CD) are nowadays at the core of software development. Their benefits come at the cost of setting up and maintaining the CI/CD pipeline, which requires knowledge and skills often orthogonal to those entailed in other software-related tasks. While several recommender systems have been proposed to support developers across a variety of tasks, little automated support is available when it comes to setting up and maintaining CI/CD pipelines. We present GH-WCOM (GitHub Workflow COMpletion), a Transformer-based approach supporting developers in writing a specific type of CI/CD pipelines, namely GitHub workflows. To deal with such a task, we designed an abstraction process to help the learning of the transformer while still making GH-WCOM able to recommend very peculiar workflow elements such as tool options and scripting elements. Our empirical study shows that GH-WCOM provides up to 34.23% correct predictions, and the model’s confidence is a reliable proxy for the recommendations’ correctness likelihood.

DOI: 10.1145/3597503.3623351


UniLog: Automatic Logging via LLM and In-Context Learning

Authors: Xu, Junjielong and Cui, Ziang and Zhao, Yuan and Zhang, Xu and He, Shilin and He, Pinjia and Li, Liqun and Kang, Yu and Lin, Qingwei and Dang, Yingnong and Rajmohan, Saravan and Zhang, Dongmei
Keywords: logging, large language model, in-context learning

Abstract

Logging, which aims to determine the position of logging statements, the verbosity levels, and the log messages, is a crucial process for software reliability enhancement. In recent years, numerous automatic logging tools have been designed to assist developers in one of the logging tasks (e.g., providing suggestions on whether to log in try-catch blocks). These tools are useful in certain situations yet cannot provide a comprehensive logging solution in general. Moreover, although recent research has started to explore end-to-end logging, it is still largely constrained by the high cost of fine-tuning, hindering its practical usefulness in software development. To address these problems, this paper proposes UniLog, an automatic logging framework based on the in-context learning (ICL) paradigm of large language models (LLMs). Specifically, UniLog can generate an appropriate logging statement with only a prompt containing five demonstration examples without any model tuning. In addition, UniLog can further enhance its logging ability after warmup with only a few hundred random samples. We evaluated UniLog on a large dataset containing 12,012 code snippets extracted from 1,465 GitHub repositories. The results show that UniLog achieved the state-of-the-art performance in automatic logging: (1) 76.9% accuracy in selecting logging positions, (2) 72.3% accuracy in predicting verbosity levels, and (3) 27.1 BLEU-4 score in generating log messages. Meanwhile, UniLog requires less than 4% of the parameter tuning time needed by fine-tuning the same LLM.
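
A minimal sketch of how such an in-context-learning prompt might be assembled (the prompt format below is invented; the paper’s actual template may differ):

```python
# Build an ICL prompt from five demonstration pairs (format invented).
def build_icl_prompt(demos, target_method):
    parts = ["Insert an appropriate logging statement into the method below."]
    for code, logged_code in demos[:5]:
        parts.append(f"### Input:\n{code}\n### Output:\n{logged_code}")
    parts.append(f"### Input:\n{target_method}\n### Output:")
    return "\n\n".join(parts)

demos = [("void connect() { open(); }",
          'void connect() { logger.info("opening connection"); open(); }')] * 5
prompt = build_icl_prompt(demos, "void close() { conn.shutdown(); }")
# `prompt` is sent to the LLM, which decides the logging position, verbosity
# level, and message jointly in its completion.
```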

DOI: 10.1145/3597503.3623326


Predicting Performance and Accuracy of Mixed-Precision Programs for Precision Tuning

Authors: Wang, Yutong and Rubio-González, Cindy
Keywords: program representation, graph neural networks, floating point, mixed precision, numerical software, program optimization, precision tuning

Abstract

A mixed-precision program is a floating-point program that utilizes different precisions for different operations, providing the opportunity of balancing the trade-off between accuracy and performance. Precision tuning aims to find a mixed-precision version of a program that improves its performance while maintaining a given accuracy. Unfortunately, existing precision tuning approaches are either limited to small-scale programs, or suffer from efficiency issues. In this paper, we propose FPLearner, a novel approach that addresses these limitations. Our insight is to leverage a Machine Learning based technique, Graph Neural Networks, to learn the representation of mixed-precision programs to predict their performance and accuracy. Such prediction models can then be used to accelerate the process of dynamic precision tuning by reducing the number of program runs. We create a dataset of mixed-precision programs from five diverse HPC applications for training our models, which achieve 96.34% F1 score in performance prediction and 97.03% F1 score in accuracy prediction. FPLearner improves the time efficiency of two dynamic precision tuners, Precimonious and HiFPTuner, by an average of 25.54% and up to 61.07% while achieving precision tuning results of comparable or better quality.
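
The way a learned predictor can accelerate dynamic tuning is easy to sketch; the predictor and program-runner interfaces below are hypothetical stand-ins for the GNN models and the tuner.

```python
# Sketch: prune candidate precision configurations with cheap predictions and
# execute only the promising ones (all callables are hypothetical).
def tune(candidates, predict_accurate, predict_faster, run_program):
    best = None
    for config in candidates:                 # e.g. per-variable float/double assignments
        if not predict_accurate(config):      # predicted to violate the accuracy bound
            continue
        if not predict_faster(config):        # predicted to bring no speedup
            continue
        result = run_program(config)          # expensive dynamic run, now rare
        if result.accurate and (best is None or result.time < best.time):
            best = result
    return best
```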

DOI: 10.1145/3597503.3623338


Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection

Authors: Steenhoek, Benjamin and Gao, Hongyang and Le, Wei
Keywords: No keywords

Abstract

Deep learning-based vulnerability detection has shown great performance and, in some studies, outperformed static analysis tools. However, the highest-performing approaches use token-based transformer models, which are not the most efficient at capturing the code semantics required for vulnerability detection. Classical program analysis techniques such as dataflow analysis can detect many types of bugs based on their root causes. In this paper, we propose to combine such causal-based vulnerability detection algorithms with deep learning, aiming to achieve more efficient and effective vulnerability detection. Specifically, we designed DeepDFA, a dataflow analysis-inspired graph learning framework and an embedding technique that enables graph learning to simulate dataflow computation. We show that DeepDFA is both performant and efficient. DeepDFA outperformed all non-transformer baselines. It was trained in 9 minutes, 75x faster than the highest-performing baseline model. When using only 50+ vulnerable examples and several hundred total examples as training data, the model retained the same performance as when trained on 100% of the dataset. DeepDFA also generalized to real-world vulnerabilities in DbgBench; it detected 8.7 out of 17 vulnerabilities on average across folds and was able to distinguish between patched and buggy versions, while the highest-performing baseline models did not detect any vulnerabilities. By combining DeepDFA with a large language model, we surpassed the state-of-the-art vulnerability detection performance on the Big-Vul dataset with 96.46 F1 score, 97.82 precision, and 95.14 recall. Our replication package is located at https://doi.org/10.6084/m9.figshare.21225413.
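
As a rough numerical analogy of "graph learning that simulates dataflow computation" (not DeepDFA’s actual architecture), node states can be repeatedly aggregated from CFG predecessors, mimicking the iterations of a dataflow fixed point:

```python
# Toy message passing over a CFG; the aggregation plays the role of a dataflow
# "meet" and the (here fixed) sigmoid layer plays the learned transfer function.
import numpy as np

def propagate(adj, node_feats, W, steps=4):
    """adj[i, j] = 1 iff there is a CFG edge j -> i; node_feats has shape |V| x d."""
    h = node_feats.copy()
    for _ in range(steps):                                  # bounded fixed-point iterations
        meet = adj @ h                                       # aggregate predecessor states
        h = 1.0 / (1.0 + np.exp(-(meet + node_feats) @ W))   # transfer + update
    return h

rng = np.random.default_rng(0)
adj = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0]], dtype=float)   # tiny 3-node CFG
print(propagate(adj, rng.normal(size=(3, 8)), rng.normal(size=(8, 8)) * 0.1).shape)  # (3, 8)
```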

DOI: 10.1145/3597503.3623345


Large Language Models for Test-Free Fault Localization

Authors: Yang, Aidan Z. H. and Le Goues, Claire and Martins, Ruben and Hellendoorn, Vincent
Keywords: No keywords

Abstract

Fault Localization (FL) aims to automatically localize buggy lines of code, a key first step in many manual and automatic debugging tasks. Previous FL techniques assume the provision of input tests, and often require extensive program analysis, program instrumentation, or data preprocessing. Prior work on deep learning for APR struggles to learn from small datasets and produces limited results on real-world programs. Inspired by the ability of large language models (LLMs) of code to adapt to new tasks based on very few examples, we investigate the applicability of LLMs to line-level fault localization. Specifically, we propose to overcome the left-to-right nature of LLMs by fine-tuning a small set of bidirectional adapter layers on top of the representations learned by LLMs to produce LLMAO, the first language-model-based fault localization approach that locates buggy lines of code without any test coverage information. We fine-tune LLMs with 350 million, 6 billion, and 16 billion parameters on small, manually curated corpora of buggy programs such as the Defects4J corpus. We observe that our technique achieves substantially more confidence in fault localization when built on the larger models, with bug localization performance scaling consistently with the LLM size. Our empirical evaluation shows that LLMAO improves the Top-1 results over the state-of-the-art machine learning fault localization (MLFL) baselines by 2.3%–54.4%, and Top-5 results by 14.4%–35.6%. LLMAO is also the first FL technique trained using a language model architecture that can detect security vulnerabilities down to the code line level.
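
A minimal sketch of the adapter idea in PyTorch (dimensions and layer choices are invented, and the frozen LLM that produces per-line hidden states is elided):

```python
# Sketch: a small bidirectional encoder scores each line's suspiciousness from
# frozen per-line LLM hidden states (hyperparameters invented).
import torch
import torch.nn as nn

class LineFaultScorer(nn.Module):
    def __init__(self, llm_dim=2048, adapter_dim=256, nhead=4):
        super().__init__()
        self.proj = nn.Linear(llm_dim, adapter_dim)
        layer = nn.TransformerEncoderLayer(adapter_dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # bidirectional over lines
        self.head = nn.Linear(adapter_dim, 1)

    def forward(self, line_states):              # (batch, num_lines, llm_dim)
        h = self.encoder(self.proj(line_states))
        return torch.sigmoid(self.head(h)).squeeze(-1)   # per-line fault probability

scores = LineFaultScorer()(torch.randn(1, 40, 2048))
print(scores.shape)                               # torch.Size([1, 40])
```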

DOI: 10.1145/3597503.3623342


CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace

Authors: Huang, Yuchao and Wang, Junjie and Liu, Zhe and Wang, Yawen and Wang, Song and Chen, Chunyang and Hu, Yuanzhe and Wang, Qing
Keywords: bug reproduction, stack trace, mobile application testing

Abstract

Crash reports are vital for software maintenance since they inform developers of the problems encountered in the mobile application. Before fixing, developers need to reproduce the crash, which is an extremely time-consuming and tedious task. Existing studies have automated crash reproduction from reproducing steps described in natural language. Yet we find that a non-negligible portion of crash reports contain only the stack trace at the time of the crash. Such stack-trace-only crashes reveal only the last GUI page shown when the crash occurs and lack step-by-step guidance. Developers tend to spend more effort in understanding the problem and reproducing the crash, and existing techniques cannot handle such reports, calling for automatic support. This paper proposes an approach named CrashTranslator to automatically reproduce mobile application crashes directly from the stack trace. It accomplishes this by leveraging a pre-trained Large Language Model to predict the exploration steps for triggering the crash, and designing a reinforcement-learning-based technique to mitigate the inaccurate prediction and guide the search holistically. We evaluate CrashTranslator on 75 crash reports involving 58 popular Android apps, and it successfully reproduces 61.3% of the crashes, outperforming the state-of-the-art baselines by 109% to 206%. Besides, the average reproducing time is 68.7 seconds, outperforming the baselines by 302% to 1611%. We also evaluate the usefulness of CrashTranslator with promising results.

DOI: 10.1145/3597503.3623298


Reorder Pointer Flow in Sound Concurrency Bug Prediction

Authors: Guo, Yuqi and Zhu, Shihao and Cai, Yan and He, Liang and Zhang, Jian
Keywords: concurrency bug prediction, points-to analysis

Abstract

Due to the non-determinism of thread interleaving, predicting concurrency bugs has long been an extremely difficult task. Recently, several sound bug-detecting approaches were proposed. These approaches are based on local search, i.e., mutating the sequential order of the observed trace and predicting whether the mutated sequential order can trigger a bug. Surprisingly, during this process, they never consider reordering the data flow of the pointers, which can be the key to detecting many complex bugs. To alleviate this weakness, we propose a new flow-sensitive points-to analysis technique, ConPTA, to help actively reorder the pointer flow during the sequential order mutation process. Based on ConPTA, we further propose a new sound predictive bug-detecting approach, Eagle, to predict four types of concurrency bugs. They are null pointer dereference (NPD), uninitialized pointer use (UPU), use after free (UAF), and double free (DF). By actively reordering the pointer flow, Eagle can explore a larger search space of thread interleavings during the mutation and thus detect more concurrency bugs. Our evaluation of Eagle on 10 real-world multi-threaded programs shows that Eagle significantly outperforms four state-of-the-art bug-detecting approaches UFO, ConVul, ConVulPOE and Period in both effectiveness and efficiency.

DOI: 10.1145/3597503.3623300


Object Graph Programming

Authors: Thimmaiah, Aditya and Lampropoulos, Leonidas and Rossbach, Christopher and Gligoric, Milos
Keywords: object graph, graph database, query, cypher

Abstract

We introduce Object Graph Programming (OGO), which enables reading and modifying an object graph (i.e., the entire state of the object heap) via declarative queries. OGO models the objects and their relations in the heap as an object graph thereby treating the heap as a graph database: each node in the graph is an object (e.g., an instance of a class or an instance of a metadata class) and each edge is a relation between objects (e.g., a field of one object references another object). We leverage Cypher, the most popular query language for graph databases, as OGO’s query language. Unlike LINQ, which uses collections (e.g., List) as a source of data, OGO views the entire object graph as a single “collection”. OGO is ideal for querying collections (just like LINQ), introspecting the runtime system state (e.g., finding all instances of a given class or accessing fields via reflection), and writing assertions that have access to the entire program state. We prototyped OGO for Java in two ways: (a) by translating an object graph into a Neo4j database on which we run Cypher queries, and (b) by implementing our own in-memory graph query engine that directly queries the object heap. We used OGO to rewrite hundreds of statements in large open-source projects into OGO queries. We report our experience and performance of our prototypes.
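
OGO itself targets Java and Cypher, but the underlying idea of treating the heap as a queryable graph can be approximated in a few lines of Python using the gc module (this is only an analogy, not the paper’s tool):

```python
# Rough analogy: walk the live object graph via gc edges and "query" for all
# instances of a class reachable from a root object.
import gc
from collections import deque

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def reachable_instances(root, cls):
    seen, found, queue = {id(root)}, [], deque([root])
    while queue:
        obj = queue.popleft()
        if isinstance(obj, cls):
            found.append(obj)
        for ref in gc.get_referents(obj):          # outgoing "field" edges
            if id(ref) not in seen:
                seen.add(id(ref))
                queue.append(ref)
    return found

root = Node("a", [Node("b"), Node("c", [Node("d")])])
print(sorted(n.name for n in reachable_instances(root, Node)))   # ['a', 'b', 'c', 'd']
```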

DOI: 10.1145/3597503.3623319


Semantic Analysis of Macro Usage for Portability

Authors: Pappas, Brent and Gazzillo, Paul
Keywords: macros, C, program analysis

Abstract

C is an unsafe language. Researchers have been developing tools to port C to safer languages such as Rust, Checked C, or Go. Existing tools, however, resort to preprocessing the source file first, then porting the resulting code, leaving barely recognizable code that loses macro abstractions. To preserve macro usage, porting tools need analyses that understand macro behavior to port to equivalent constructs. But macro semantics differ from typical functions, precluding simple syntactic transformations to port them. We introduce the first comprehensive framework for analyzing the portability of macro usage. We decompose macro behavior into 26 fine-grained properties and implement a program analysis tool, called Maki, that identifies them in real-world code with 94% accuracy. We apply Maki to 21 programs containing a total of 86,199 macro definitions. We found that real-world macros are much more portable than previously known. More than a third (37%) are easy-to-port, and Maki provides hints for porting more complicated macros. We find, on average, 2x more easy-to-port macros and up to 7x more in the best case compared to prior work. Guided by Maki’s output, we found and hand-ported macros in three real-world programs. We submitted patches to Linux maintainers that transform eleven macros, nine of which have been accepted.

DOI: 10.1145/3597503.3623323


NuzzleBug: Debugging Block-Based Programs in Scratch

Authors: Deiner, Adina and Fraser, Gordon
Keywords: debugging tools, omniscient debugging, interrogative debugging, scratch, computer science education

Abstract

While professional integrated programming environments support developers with advanced debugging functionality, block-based programming environments for young learners often provide no support for debugging at all, thus inhibiting debugging and preventing debugging education. In this paper we introduce NuzzleBug, an extension of the popular block-based programming environment Scratch that provides the missing debugging support. NuzzleBug allows controlling the executions of Scratch programs with classical debugging functionality such as stepping and breakpoints, and it is an omniscient debugger that also allows reverse stepping. To support learners in deriving hypotheses that guide debugging, NuzzleBug is an interrogative debugger that enables learners to ask questions about executions and provides answers explaining the behavior in question. In order to evaluate NuzzleBug, we survey the opinions of teachers, and study the effects on learners in terms of debugging effectiveness and efficiency. We find that teachers consider NuzzleBug to be useful, and children can use it to debug faulty programs effectively. However, systematic debugging requires dedicated training, and even when NuzzleBug can provide correct answers, learners may require further help to comprehend faults and necessary fixes, thus calling for further research on improving debugging techniques and the information they provide.

DOI: 10.1145/3597503.3623331


LogShrink: Effective Log Compression by Leveraging Commonality and Variability of Log Data

Authors: Li, Xiaoyun and Zhang, Hongyu and Le, Van-Hoang and Chen, Pengfei
Keywords: log compression, data compression, log analysis, clustering

Abstract

Log data is a crucial resource for recording system events and states during system execution. However, as systems grow in scale, log data generation has become increasingly explosive, leading to an expensive overhead on log storage, such as several petabytes per day in production. To address this issue, log compression has become a crucial task in reducing disk storage while allowing for further log analysis. Unfortunately, existing general-purpose and log-specific compression methods have been limited in their ability to utilize log data characteristics. To overcome these limitations, we conduct an empirical study and obtain three major observations on the characteristics of log data that can facilitate the log compression task. Based on these observations, we propose LogShrink, a novel and effective log compression method by leveraging commonality and variability of log data. An analyzer based on longest common subsequence and entropy techniques is proposed to identify the latent commonality and variability in log messages. The key idea behind this is that the commonality and variability can be exploited to shrink log data with a shorter representation. Besides, a clustering-based sequence sampler is introduced to accelerate the commonality and variability analyzer. The extensive experimental results demonstrate that LogShrink can exceed baselines in compression ratio by 16% to 356% on average while preserving a reasonable compression speed.
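
The commonality/variability split can be illustrated with a toy LCS-based version (real LogShrink combines LCS with entropy analysis and a clustering-based sampler before compressing the two parts separately):

```python
# Toy illustration: factor messages into a shared token template (commonality)
# and per-message fields (variability).
from difflib import SequenceMatcher

def split_common_variable(messages):
    tokenized = [m.split() for m in messages]
    common = tokenized[0]
    for toks in tokenized[1:]:                       # iteratively shrink to a common subsequence
        sm = SequenceMatcher(None, common, toks)
        common = [t for b in sm.get_matching_blocks() for t in common[b.a:b.a + b.size]]
    variable = [[t for t in toks if t not in common] for toks in tokenized]
    return " ".join(common), variable

logs = ["Received block blk_123 of size 6710 from /10.0.0.1",
        "Received block blk_456 of size 9124 from /10.0.0.2"]
template, fields = split_common_variable(logs)
print(template)   # Received block of size from
print(fields)     # [['blk_123', '6710', '/10.0.0.1'], ['blk_456', '9124', '/10.0.0.2']]
```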

DOI: 10.1145/3597503.3608129


Demystifying Compiler Unstable Feature Usage and Impacts in the Rust Ecosystem

Authors: Li, Chenghao and Wu, Yifei and Shen, Wenbo and Zhao, Zichen and Chang, Rui and Liu, Chengwei and Liu, Yang and Ren, Kui
Keywords: rust ecosystem, rust unstable feature, dependency graph

Abstract

The Rust programming language is rapidly gaining popularity in building reliable and secure systems due to its security guarantees and outstanding performance. To provide extra functionalities, the Rust compiler introduces Rust unstable features (RUF) to extend compiler functionality, syntax, and standard library support. However, these features are unstable and may get removed, introducing compilation failures to dependent packages. Even worse, their impacts propagate through transitive dependencies, causing large-scale failures in the whole ecosystem. Although RUF is widely used in Rust, previous research has primarily concentrated on Rust code safety, with the usage and impacts of RUF from the Rust compiler remaining unexplored. Therefore, we aim to bridge this gap by systematically analyzing the RUF usage and impacts in the Rust ecosystem. We propose novel techniques for extracting RUF precisely, and to assess its impact on the entire ecosystem quantitatively, we accurately resolve package dependencies. We have analyzed the whole Rust ecosystem with 590K package versions and 140M transitive dependencies. Our study shows that the Rust ecosystem uses 1,000 different RUF, and at most 44% of package versions are affected by RUF, causing compilation failures for at most 12% of package versions. To mitigate wide RUF impacts, we further design and implement a RUF-compilation-failure recovery tool that can recover up to 90% of the failures. We believe our techniques, findings, and tools can help stabilize the Rust compiler, ultimately enhancing the security and reliability of the Rust ecosystem.

DOI: 10.1145/3597503.3623352


Resource Usage and Optimization Opportunities in Workflows of GitHub Actions

Authors: Bouzenia, Islem and Pradel, Michael
Keywords: No keywords

Abstract

Continuous integration and continuous delivery (CI/CD) has become a prevalent practice in software development. GitHub Actions is emerging as a popular platform for implementing CI/CD pipelines, called workflows, especially because the platform offers 2,000 minutes of computation for free to public repositories each month. To understand what these resources are used for and whether CI/CD could be more efficient, this paper presents the first comprehensive empirical study of resource usage and optimization opportunities of GitHub Action workflows. Our findings show that CI/CD imposes significant costs, e.g., $504 per year for an average paid-tier repository. The majority of the used resources is consumed by testing and building (91.2%), which is triggered by pull requests (50.7%), pushes (30.9%), and regularly scheduled workflows (15.5%). While existing optimizations, such as caching (adopted by 32.9% of paid-tier repositories), demonstrate a positive impact, they overall remain underutilized. This result underscores the need for enhanced documentation and tools to guide developers toward more resource-efficient workflows. Moreover, we show that relatively simple changes in the platform, such as deactivating scheduled workflows when repositories are inactive, could result in reductions of execution time between 1.1% and 31.6% over the impacted workflows. Overall, we envision our findings to help improve the resource efficiency of CI/CD pipelines.

DOI: 10.1145/3597503.3623303


Revealing Hidden Threats: An Empirical Study of Library Misuse in Smart Contracts

Authors: Huang, Mingyuan and Chen, Jiachi and Jiang, Zigui and Zheng, Zibin
Keywords: blockchain, ethereum, library misuse, empirical study

Abstract

Smart contracts are Turing-complete programs that execute on the blockchain. Developers can implement complex contracts, such as auctions and lending, on Ethereum using the Solidity programming language. As an object-oriented language, Solidity provides libraries within its syntax to facilitate code reusability and reduce development complexity. Library misuse refers to the incorrect writing or usage of libraries, resulting in unexpected results, such as introducing vulnerabilities during library development or incorporating an unsafe library during contract development. Library misuse could lead to contract defects that cause financial losses. Currently, there is a lack of research on library misuse. To fill this gap, we collected more than 500 audit reports from the official websites of five audit companies and 223,336 real-world smart contracts from Etherscan to measure library popularity and library misuse. Then, we defined eight general patterns for library misuse; three of them occurring during library development and five during library utilization, which covers the entire library lifecycle. To validate the practicality of these patterns, we manually analyzed 1,018 real-world smart contracts and publicized our dataset. We identified 905 misuse cases across 456 contracts, indicating that library misuse is a widespread issue. Three patterns of misuse are found in more than 50 contracts, primarily due to developers lacking security awareness or underestimating negative impacts. Additionally, our research revealed that vulnerable libraries on Ethereum continue to be employed even after they have been deprecated or patched. Our findings can assist contract developers in preventing library misuse and ensuring the safe use of libraries.

DOI: 10.1145/3597503.3623335


Fine-SE: Integrating Semantic Features and Expert Features for Software Effort Estimation

Authors: Li, Yue and Ren, Zhong and Wang, Zhiqi and Yang, Lanxin and Dong, Liming and Zhong, Chenxing and Zhang, He
Keywords: effort estimation, AI for SE, deep learning

Abstract

Reliable effort estimation is of paramount importance to software planning and management, especially in industry, which requires effective and on-time delivery. Although various estimation approaches have been proposed (e.g., planning poker and analogy), they may be manual and/or subjective, which makes them difficult to apply to other projects. In recent years, deep learning approaches for effort estimation that rely on learning either expert features or semantic features have been extensively studied and have been found to be promising. Semantic features and expert features describe software tasks from different perspectives; however, in the literature, the best combination of these two feature types has not been explored to enhance effort estimation. Additionally, few studies discuss which expert features are useful for estimating effort in industry. To this end, we investigate 13 potential expert features that can be used to estimate effort by interviewing 26 enterprise employees. Based on that, we propose a novel model, called Fine-SE, that leverages semantic features and expert features for effort estimation. To validate our model, a series of evaluations are conducted on more than 30,000 software tasks from 17 industrial projects of a global ICT enterprise and four open-source software (OSS) projects. The evaluation results indicate that Fine-SE provides higher performance than the baselines on evaluation measures (i.e., mean absolute error, mean magnitude of relative error, and performance indicator), particularly in industrial projects with large numbers of software tasks, which implies a significant improvement in effort estimation. In comparison with expert estimation, Fine-SE improves the performance of evaluation measures by 32.0%-45.2% in within-project estimation. In comparison with the state-of-the-art models, Deep-SE and GPT2SP, it also achieves an improvement of 8.9%-91.4% in industrial projects. The experimental results reveal the value of integrating expert features with semantic features in effort estimation.
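
One straightforward way to fuse the two feature types, shown here purely as an illustration (the architecture and dimensions are assumptions, with 13 matching the number of interviewed expert features), is to concatenate a text embedding of the task with the numeric expert features before a small regression head:

```python
# Sketch: concatenate a semantic embedding with expert features for effort
# regression (architecture invented; the text encoder is elided).
import torch
import torch.nn as nn

class EffortModel(nn.Module):
    def __init__(self, text_dim=768, expert_dim=13, hidden=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + expert_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, text_emb, expert_feats):
        return self.head(torch.cat([text_emb, expert_feats], dim=-1)).squeeze(-1)

pred = EffortModel()(torch.randn(4, 768), torch.randn(4, 13))   # predicted effort per task
print(pred.shape)   # torch.Size([4])
```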

DOI: 10.1145/3597503.3623349


Kind Controllers and Fast Heuristics for Non-Well-Separated GR(1) Specifications

Authors: Gorenstein, Ariel and Maoz, Shahar and Ringert, Jan Oliver
Keywords: No keywords

Abstract

Non-well-separation (NWS) is a known quality issue in specifications for reactive synthesis. The problem of NWS occurs when the synthesized system can avoid satisfying its guarantees by preventing the environment from being able to satisfy its assumptions. In this work we present two contributions to better deal with NWS. First, we show how to synthesize systems that avoid taking advantage of NWS, i.e., do not prevent the satisfaction of any environment assumption, even if possible. Second, we propose a set of heuristics for fast detection of NWS. Evaluation over benchmarks from the literature shows the effectiveness and significance of our work.

DOI: 10.1145/3597503.3608131


It’s Not a Feature, It’s a Bug: Fault-Tolerant Model Mining from Noisy Data

Authors: Wallner, Felix and Aichernig, Bernhard K. and Burghard, Christian
Keywords: automata learning, SAT solving, partial Max-SAT, model inference, non-determinism

Abstract

The mining of models from data finds widespread use in industry. A variety of model inference methods exist for perfectly deterministic behaviour; in practice, however, the provided data often contains noise, due to faults such as message loss or environmental factors, that many of the inference algorithms have problems dealing with. We present a novel model mining approach using Partial Max-SAT solving to infer the best possible automaton from a set of noisy execution traces. This approach enables us to ignore the minimal number of presumably faulty observations to allow the construction of a deterministic automaton. No pre-processing of the data is required. We evaluate the method's performance as well as a number of considerations for practical use, including three industrial use cases for which we inferred the correct models.

DOI: 10.1145/3597503.3623346


Enabling Runtime Verification of Causal Discovery Algorithms with Automated Conditional Independence Reasoning

Authors: Ma, Pingchuan and Ji, Zhenlan and Yao, Peisen and Wang, Shuai and Ren, Kui
Keywords: causal discovery, conditional independence, SMT

Abstract

Causal discovery is a powerful technique for identifying causal relationships among variables in data. It has been widely used in various applications in software engineering. Causal discovery extensively involves conditional independence (CI) tests. Hence, its output quality highly depends on the performance of CI tests, which can often be unreliable in practice. Moreover, privacy concerns arise when excessive CI tests are performed. Despite the distinct nature of unreliable and excessive CI tests, this paper identifies a unified and principled approach to addressing both of them. Generally, CI statements, the outputs of CI tests, adhere to Pearl’s axioms, which are a set of well-established integrity constraints on conditional independence. Hence, we can either detect erroneous CI statements if they violate Pearl’s axioms or prune excessive CI statements if they are logically entailed by Pearl’s axioms. Holistically, both problems boil down to reasoning about the consistency of CI statements under Pearl’s axioms (referred to as the CIR problem). We propose a runtime verification tool called CICheck, designed to harden causal discovery algorithms from reliability and privacy perspectives. CICheck employs a sound and decidable encoding scheme that translates CIR into SMT problems. To solve the CIR problem efficiently, CICheck introduces a four-stage decision procedure with three lightweight optimizations that actively prove or refute consistency, and only resort to costly SMT-based reasoning when necessary. Based on the decision procedure for CIR, CICheck includes two variants: ED-Check and P-Check, which detect erroneous CI tests (to enhance reliability) and prune excessive CI tests (to enhance privacy), respectively. We evaluate CICheck on four real-world datasets and 100 CIR instances, showing its effectiveness in detecting erroneous CI tests and reducing excessive CI tests while retaining practical performance.
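
A toy version of the consistency check can be written directly against Z3’s Python API; the encoding below covers only a single symmetry-axiom instance and is far simpler than CICheck’s actual scheme.

```python
# Toy CIR check: CI statements become Booleans, an axiom instance becomes an
# implication, and UNSAT flags an inconsistent (hence erroneous) set of tests.
from z3 import Bool, Implies, Not, Solver, unsat

ci_xy_z = Bool("CI(X;Y|Z)")
ci_yx_z = Bool("CI(Y;X|Z)")

s = Solver()
s.add(Implies(ci_xy_z, ci_yx_z), Implies(ci_yx_z, ci_xy_z))   # symmetry axiom instance
s.add(ci_xy_z)          # one CI test reported independence ...
s.add(Not(ci_yx_z))     # ... but the symmetric test reported dependence

print("inconsistent" if s.check() == unsat else "consistent")   # inconsistent
```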

DOI: 10.1145/3597503.3623348


Modularizing while Training: A New Paradigm for Modularizing DNN Models

Authors: Qi, Binhang and Sun, Hailong and Zhang, Hongyu and Zhao, Ruobing and Gao, Xiang
Keywords: DNN modularization, model reuse, modular training, convolutional neural network

Abstract

Deep neural network (DNN) models have become increasingly crucial components of intelligent software systems. However, training a DNN model is typically expensive in terms of both time and computational resources. To address this issue, recent research has focused on reusing existing DNN models - borrowing the concept of software reuse in software engineering. However, reusing an entire model could cause extra overhead or inherit the weaknesses from the undesired functionalities. Hence, existing work proposes to decompose an already trained model into modules, i.e., modularizing-after-training, to enable module reuse. Since the trained models are not built for modularization, modularizing-after-training may incur huge overhead and model accuracy loss. In this paper, we propose a novel approach that incorporates modularization into the model training process, i.e., modularizing-while-training (MwT). We train a model to be structurally modular through two loss functions that optimize intra-module cohesion and inter-module coupling. We have implemented the proposed approach for modularizing Convolutional Neural Network (CNN) models. The evaluation results on representative models demonstrate that MwT outperforms the existing state-of-the-art modularizing-after-training approach. Specifically, the accuracy loss caused by MwT is only 1.13 percentage points, which is less than that of the existing approach. The kernel retention rate of the modules generated by MwT is only 14.58%, with a reduction of 74.31% over the existing approach. Furthermore, the total time cost required for training and modularizing is only 108 minutes, which is half the time required by the existing approach. Our work demonstrates that MwT is a new and more effective paradigm for realizing DNN model modularization, offering a fresh perspective on achieving model reuse.
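
A minimal sketch of what cohesion and coupling regularizers could look like during training (the exact loss formulation here is invented, not the paper’s):

```python
# Sketch only: given a (num_classes x num_kernels) relevance matrix, reward
# kernels that are strongly tied to one module (cohesion) and penalize kernel
# sharing across modules (coupling).
import torch

def cohesion_loss(relevance):
    return 1.0 - relevance.max(dim=0).values.mean()        # each kernel should matter somewhere

def coupling_loss(relevance):
    overlap = relevance @ relevance.t()                     # pairwise shared-kernel mass
    return (overlap - torch.diag(torch.diagonal(overlap))).mean()

relevance = torch.rand(10, 64, requires_grad=True)          # e.g. class-wise kernel masks
task_loss = torch.tensor(0.0)                               # stands in for the usual CE loss
loss = task_loss + 0.1 * cohesion_loss(relevance) + 0.1 * coupling_loss(relevance)
loss.backward()
```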

DOI: 10.1145/3597503.3608135


KnowLog: Knowledge Enhanced Pre-trained Language Model for Log Understanding

Authors: Ma, Lipeng and Yang, Weidong and Xu, Bo and Jiang, Sihang and Fei, Ben and Liang, Jiaqing and Zhou, Mingjie and Xiao, Yanghua
Keywords: pre-trained language model, knowledge enhancement, log understanding

Abstract

Logs as semi-structured text are rich in semantic information, making their comprehensive understanding crucial for automated log analysis. With the recent success of pre-trained language models in natural language processing, many studies have leveraged these models to understand logs. Despite their successes, existing pre-trained language models still suffer from three weaknesses. Firstly, these models fail to understand domain-specific terminology, especially abbreviations. Secondly, these models struggle to adequately capture the complete log context information. Thirdly, these models have difficulty in obtaining universal representations of different styles of the same logs. To address these challenges, we introduce KnowLog, a knowledge-enhanced pre-trained language model for log understanding. Specifically, to solve the previous two challenges, we exploit abbreviations and natural language descriptions of logs from public documentation as local and global knowledge, respectively, and leverage this knowledge by designing novel pre-training tasks for enhancing the model. To solve the last challenge, we design a contrastive learning-based pre-training task to obtain universal representations. We evaluate KnowLog by fine-tuning it on six different log understanding tasks. Extensive experiments demonstrate that KnowLog significantly enhances log understanding and achieves state-of-the-art results compared to existing pre-trained language models without knowledge enhancement. Moreover, we conduct additional experiments in transfer learning and low-resource scenarios, showcasing the substantial advantages of KnowLog. Our source code and detailed experimental data are available at https://github.com/LeaperOvO/KnowLog.

DOI: 10.1145/3597503.3623304


FAIR: Flow Type-Aware Pre-Training of Compiler Intermediate Representations

Authors: Niu, Changan and Li, Chuanyi and Ng, Vincent and Lo, David and Luo, Bin
Keywords: No keywords

Abstract

While the majority of existing pre-trained models of code learn source code features such as code tokens and abstract syntax trees, there are some other works that focus on learning from compiler intermediate representations (IRs). Existing IR-based models typically utilize IR features such as instructions, control and data flow graphs (CDFGs), call graphs, etc. However, these methods confuse variable nodes and instruction nodes in a CDFG and fail to distinguish different types of flows, and the neural networks they use fail to capture long-distance dependencies and have over-smoothing and over-squashing problems. To address these weaknesses, we propose FAIR, a Flow type-Aware pre-trained model for IR that involves employing (1) a novel input representation of IR programs; (2) a Graph Transformer to address the over-smoothing, over-squashing, and long-distance dependency problems; and (3) five pre-training tasks that we specifically propose to enable FAIR to learn the semantics of IR tokens, flow type information, and the overall representation of IR. Experimental results show that FAIR can achieve state-of-the-art results on four code-related downstream tasks.

DOI: 10.1145/3597503.3608136


Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study

Authors: Guo, Qi and Cao, Junming and Xie, Xiaofei and Liu, Shangqing and Li, Xiaohong and Chen, Bihuan and Peng, Xin
Keywords: No keywords

Abstract

Code review is an essential activity for ensuring the quality and maintainability of software projects. However, it is a time-consuming and often error-prone task that can significantly impact the development process. Recently, ChatGPT, a cutting-edge language model, has demonstrated impressive performance in various natural language processing tasks, suggesting its potential to automate code review processes. However, it is still unclear how well ChatGPT performs in code review tasks. To fill this gap, in this paper, we conduct the first empirical study to understand the capabilities of ChatGPT in code review tasks, specifically focusing on automated code refinement based on given code reviews. To conduct the study, we select the existing benchmark CodeReview and construct a new high-quality code review dataset. We use CodeReviewer, a state-of-the-art code review tool, as a baseline for comparison with ChatGPT. Our results show that ChatGPT outperforms CodeReviewer in code refinement tasks. Specifically, ChatGPT achieves higher EM and BLEU scores of 22.78 and 76.44, respectively, while the state-of-the-art method achieves only 15.50 and 62.88 on a high-quality code review dataset. We further identify the root causes of ChatGPT’s underperformance and propose several strategies to mitigate these challenges. Our study provides insights into the potential of ChatGPT in automating the code review process, and highlights potential research directions.

DOI: 10.1145/3597503.3623306


Deep Learning or Classical Machine Learning? An Empirical Study on Log-Based Anomaly Detection

Authors: Yu, Boxi and Yao, Jiayi and Fu, Qiuai and Zhong, Zhiqing and Xie, Haotian and Wu, Yaoliang and Ma, Yuchi and He, Pinjia
Keywords: log analysis, anomaly detection, dataset, empirical study

Abstract

While deep learning (DL) has emerged as a powerful technique, its benefits must be carefully considered in relation to computational costs. Specifically, although DL methods have achieved strong performance in log anomaly detection, they often require extended time for log preprocessing, model training, and model inference, hindering their adoption in online distributed cloud systems that require rapid deployment of log anomaly detection services. This paper investigates the superiority of DL methods compared to simpler techniques in log anomaly detection. We evaluate basic algorithms (e.g., KNN, SLFN) and DL approaches (e.g., CNN) on five public log anomaly detection datasets (e.g., HDFS). Our findings demonstrate that simple algorithms outperform DL methods in both time efficiency and accuracy. For instance, on the Thunderbird dataset, the K-nearest neighbor algorithm trains 1,000 times faster than NeuralLog while achieving an F1-Score higher by 0.0625. We also identify three factors contributing to this phenomenon, which are: (1) redundant log preprocessing strategies, (2) dataset simplicity, and (3) the nature of binary classification in log anomaly detection. To assess the necessity of DL, we propose LightAD, an architecture that optimizes training time, inference time, and performance score. With automated hyper-parameter tuning, LightAD allows fair comparisons among log anomaly detection models, enabling engineers to evaluate the suitability of complex DL methods. Our findings serve as a cautionary tale for the log anomaly detection community, highlighting the need to critically analyze datasets and research tasks before adopting DL approaches. Researchers proposing computationally expensive models should benchmark their work against lightweight algorithms to ensure a comprehensive evaluation.
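
For reference, the kind of lightweight baseline the paper argues should always be benchmarked is only a few lines of scikit-learn; the event-count matrix below is synthetic, standing in for parsed log sessions.

```python
# Sketch of a lightweight log anomaly detector: session event-count vectors + KNN.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(1000, 30))          # stand-in for per-session event counts
y = (X[:, 0] + X[:, 5] > 7).astype(int)        # stand-in anomaly labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```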

DOI: 10.1145/3597503.3623308


TRACED: Execution-aware Pre-training for Source Code

作者: Ding, Yangruibo and Steenhoek, Benjamin and Pei, Kexin and Kaiser, Gail and Le, Wei and Ray, Baishakhi
关键词: No keywords

Abstract

Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax tree, dependency graphs, etc.). However, program semantics are not fully exposed until real execution. Without an understanding of program execution, statically pre-trained models fail to comprehensively capture dynamic code properties, such as branch coverage and runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities. To close the gap between the static nature of language models and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code language models with a combination of source code, executable inputs, and corresponding execution traces. Our goal is to teach code models the complicated execution logic during pre-training, enabling the model to statically estimate dynamic code properties without repeatedly executing code during task-specific fine-tuning. To illustrate the effectiveness of our proposed approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED relatively improves the statically pre-trained code models by 12.4% for complete execution path prediction and by 25.2% for runtime variable value predictions. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.

DOI: 10.1145/3597503.3608140


CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

作者: Yu, Hao and Shen, Bo and Ran, Dezhi and Zhang, Jiaxin and Zhang, Qi and Ma, Yuchi and Liang, Guangtai and Li, Ying and Wang, Qianxiang and Xie, Tao
关键词: code generation, large language models, benchmark

Abstract

Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the effectiveness of these models, multiple benchmarks (e.g., HumanEval and AiXBench) have been proposed, but they include only cases of generating a standalone function, i.e., a function that may invoke or access only built-in functions and standard libraries. However, non-standalone functions, which typically are not included in the existing benchmarks, constitute more than 70% of the functions in popular open-source projects, and evaluating models' effectiveness on standalone functions cannot reflect these models' effectiveness in pragmatic code generation scenarios (i.e., code generation for real settings of open source or proprietary code). To help bridge the preceding gap, in this paper, we propose a benchmark named CoderEval, consisting of 230 Python and 230 Java code generation tasks carefully curated from popular real-world open-source projects and a self-contained execution platform to automatically assess the functional correctness of generated code. CoderEval supports code generation tasks from six levels of context dependency, where context refers to code elements such as types, APIs, variables, and constants defined outside the function under generation but within the dependent third-party libraries, current class, file, or project. CoderEval can be used to evaluate the effectiveness of models in generating code beyond only standalone functions. By evaluating three state-of-the-art code generation models (CodeGen, PanGu-Coder, and ChatGPT) on CoderEval and HumanEval, we find that the effectiveness of these models in generating standalone functions is substantially higher than that in generating non-standalone functions. Our analysis highlights the current progress and pinpoints future directions to further improve a model's effectiveness by leveraging contextual information for pragmatic code generation.

DOI: 10.1145/3597503.3623316


Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment

作者: Ahmed, Shibbir and Gao, Hongyang and Rajan, Hridesh
关键词: deep neural networks, weakest precondition, trustworthiness

Abstract

Deep learning models are trained with certain assumptions about the data during the development stage and then used for prediction in the deployment stage. It is important to reason about the trustworthiness of the model's predictions with unseen data during deployment. Existing methods for specifying and verifying traditional software are insufficient for this task, as they cannot handle the complexity of DNN model architecture and expected outcomes. In this work, we propose a novel technique that uses rules derived from neural network computations to infer data preconditions for a DNN model to determine the trustworthiness of its predictions. Our approach, DeepInfer, involves introducing a novel abstraction for a trained DNN model that enables weakest precondition reasoning using Dijkstra's Predicate Transformer Semantics. By deriving rules over the inductive type of the neural network abstract representation, we can overcome the matrix dimensionality issues that arise from the backward non-linear computation from the output layer to the input layer. We apply weakest precondition rules for each kind of activation function to compute layer-wise preconditions from the given postcondition on the final output of a deep neural network. We extensively evaluated DeepInfer on 29 real-world DNN models using four different datasets collected from five different sources and demonstrated the utility, effectiveness, and performance improvement over closely related work. DeepInfer efficiently detects correct and incorrect predictions of high-accuracy models with high recall (0.98) and high F1 score (0.84) and significantly improves over the prior technique, SelfChecker. The average runtime overhead of DeepInfer is low, 0.22 sec for all the unseen datasets. We also compared runtime overhead using the same hardware settings and found that DeepInfer is 3.27 times faster than SelfChecker, the state-of-the-art in this area.
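
For intuition about the backward, layer-wise reasoning, the toy sketch below applies the textbook Dijkstra-style weakest precondition of an affine layer y = Wx + b to a postcondition on one output neuron. It ignores activation functions and DeepInfer's actual rules, and the weights and thresholds are arbitrary.

```python
# Illustrative only: the weakest precondition of the assignment y := W @ x + b with respect
# to the postcondition "y[k] >= t" is obtained by substituting the assignment backward,
# i.e. "(W @ x + b)[k] >= t". Real networks require rules per activation function.
import numpy as np

def wp_affine(W, b, k, t):
    """Return a predicate over the layer input x for the postcondition y[k] >= t."""
    def precondition(x):
        y = W @ np.asarray(x, dtype=float) + b
        return bool(y[k] >= t)
    return precondition

W = np.array([[0.5, -1.0], [2.0, 0.3]])
b = np.array([0.1, -0.2])
pre = wp_affine(W, b, k=0, t=0.0)     # postcondition: first output is non-negative
print(pre([1.0, 0.2]))                # True:  0.5*1.0 - 1.0*0.2 + 0.1 = 0.4 >= 0
print(pre([0.0, 1.0]))                # False: -1.0 + 0.1 = -0.9 < 0
```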

DOI: 10.1145/3597503.3623333


Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning

作者: Geng, Mingyang and Wang, Shangwen and Dong, Dezun and Wang, Haotian and Li, Ge and Jin, Zhi and Mao, Xiaoguang and Liao, Xiangke
关键词: code summarization, large language model, in-context learning

Abstract

Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite long being studied, existing approaches share a bottleneck: given a code snippet, they can only generate one comment, while developers usually need information from diverse perspectives, such as what the functionality of the code snippet is and how to use it. To tackle this limitation, this study empirically investigates the feasibility of utilizing large language models (LLMs) to generate comments that can fulfill developers' diverse intents. Our intuition is based on the facts that (1) code and its paired comment are used during the pre-training process of LLMs to build the semantic connection between natural language and programming language, and (2) comments in real-world projects, which are collected for pre-training, usually reflect different developer intents. We thus postulate that LLMs can already understand code from different perspectives after pre-training. Indeed, experiments on two large-scale datasets support this insight: by adopting the in-context learning paradigm and giving adequate prompts to the LLM (e.g., providing it with ten or more examples), the LLM can significantly outperform a state-of-the-art supervised learning approach in generating comments with multiple intents. Results also show that customized strategies for constructing the prompts and post-processing strategies for reranking the results can both boost the LLM's performance, shedding light on future research directions for using LLMs for comment generation.
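
A minimal sketch of the in-context-learning setup: a handful of code-comment examples sharing the requested intent are prepended to the query snippet. The template and intent label below are illustrative, not the paper's exact prompt.

```python
# Build a few-shot prompt for multi-intent comment generation (illustrative template).
def build_prompt(examples, code, intent="what", k=10):
    shots = [f"Intent: {intent}\nCode:\n{c}\nComment: {s}\n" for c, s in examples[:k]]
    return "\n".join(shots) + f"\nIntent: {intent}\nCode:\n{code}\nComment:"

examples = [("def add(a, b):\n    return a + b", "Adds two numbers and returns the sum.")]
query = "def is_even(n):\n    return n % 2 == 0"
print(build_prompt(examples, query, intent="what", k=1))   # send this string to the LLM
```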

DOI: 10.1145/3597503.3608134


On Using GUI Interaction Data to Improve Text Retrieval-based Bug Localization

作者: Mahmud, Junayed and De Silva, Nadeeshan and Khan, Safwat Ali and Mostafavi, Seyed Hooman and Mansur, S M Hasan and Chaparro, Oscar and Marcus, Andrian (Andi) and Moran, Kevin
关键词: bug localization, GUI, natural language processing, mobile apps

Abstract

One of the most important tasks related to managing bug reports is localizing the fault so that a fix can be applied. As such, prior work has aimed to automate this task of bug localization by formulating it as an information retrieval problem, where potentially buggy files are retrieved and ranked according to their textual similarity with a given bug report. However, there is often a notable semantic gap between the information contained in bug reports and the identifiers or natural language contained within source code files. For user-facing software, there is currently a key source of information that could aid in bug localization but has not been thoroughly investigated: information from the graphical user interface (GUI). In this paper, we investigate the hypothesis that, for end user-facing applications, connecting information in a bug report with information from the GUI, and using this to aid in retrieving potentially buggy files, can improve upon existing techniques for text retrieval-based bug localization. To examine this phenomenon, we conduct a comprehensive empirical study that augments four baseline text-retrieval techniques for bug localization with GUI interaction information from a reproduction scenario to (i) filter out potentially irrelevant files, (ii) boost potentially relevant files, and (iii) reformulate text-retrieval queries. To carry out our study, we source the current largest dataset of fully-localized and reproducible real bugs for Android apps, with corresponding bug reports, consisting of 80 bug reports from 39 popular open-source apps. Our results illustrate that augmenting traditional techniques with GUI information leads to a marked increase in effectiveness across multiple metrics, including a relative increase in Hits@10 of 13–18%. Additionally, through further analysis, we find that our studied augmentations largely complement existing techniques, pushing additional buggy files into the top-10 results while generally preserving top-ranked files from the baseline techniques.
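
As a simplified illustration of the "boost potentially relevant files" augmentation, the snippet below raises the retrieval score of files that mention GUI components from the interaction trace before re-ranking. The file contents, terms, and the 1.5x weight are invented and do not reflect the study's configuration.

```python
# Boost retrieval scores of files that reference GUI components from the reproduction trace.
baseline_scores = {"LoginActivity.java": 0.42, "NetworkUtils.java": 0.55, "CartAdapter.java": 0.31}
gui_terms = {"login_button", "password_field"}          # components touched in the scenario
file_text = {                                           # stand-in for real file contents
    "LoginActivity.java": "findViewById(R.id.login_button) ...",
    "NetworkUtils.java":  "HttpURLConnection conn = ...",
    "CartAdapter.java":   "class CartAdapter extends RecyclerView.Adapter ...",
}

def mentions_gui(path: str) -> bool:
    return any(term in file_text[path] for term in gui_terms)

boosted = {f: s * (1.5 if mentions_gui(f) else 1.0) for f, s in baseline_scores.items()}
print(sorted(boosted, key=boosted.get, reverse=True))   # LoginActivity.java now ranks first
```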

DOI: 10.1145/3597503.3608139


DEMISTIFY: Identifying On-device Machine Learning Models Stealing and Reuse Vulnerabilities in Mobile Apps

作者: Ren, Pengcheng and Zuo, Chaoshun and Liu, Xiaofeng and Diao, Wenrui and Zhao, Qingchuan and Guo, Shanqing
关键词: android app, machine learning, on-device model reuse, program analysis

Abstract

Mobile apps have become popular for providing artificial intelligence (AI) services via on-device machine learning (ML) techniques. Unlike the traditional approach of accomplishing these AI services on remote servers, on-device techniques process the sensitive information required by AI services locally, which can mitigate the severe concerns about sensitive data collection on the remote side. However, these on-device techniques have to push the core of ML expertise (e.g., models) to smartphones, where it is still subject to vulnerabilities similar to those on remote clouds and servers, especially model stealing attacks. To defend against these attacks, developers have taken various protective measures. Unfortunately, we have found that these protections are still insufficient, and on-device ML models in mobile apps can be extracted and reused without limitation. To demonstrate the inadequacy of these protections and the feasibility of this attack, this paper presents DeMistify, which statically locates ML models within an app, slices relevant execution components, and finally generates scripts automatically to instrument mobile apps so that target ML models can be stolen and reused freely. To evaluate DeMistify and demonstrate its applicability, we apply it to 1,511 top mobile apps (selected by install numbers on Google Play) that use on-device ML expertise for several ML services, and DeMistify can successfully execute 1,250 of them (82.73%). In addition, an in-depth study is conducted to understand the on-device ML ecosystem in mobile applications.

DOI: 10.1145/3597503.3623325


How do Developers Talk about GitHub Actions? Evidence from Online Software Development Community

作者: Zhang, Yang and Wu, Yiwen and Chen, Tingting and Wang, Tao and Liu, Hui and Wang, Huaimin
关键词: GitHub actions, empirical study, stack overflow

Abstract

Continuous integration, deployment and delivery (CI/CD) have become cornerstones of DevOps practices. In recent years, GitHub Actions (GHA) has rapidly replaced traditional CI/CD tools on GitHub, providing efficient, automated workflows for developers. Given the widespread use and influence of GHA, it is critical to understand the problems that GHA developers face in practice as well as the potential solutions to these problems. Unfortunately, we currently have relatively little knowledge in this area. To fill this gap, we conduct a large-scale empirical study of 6,590 Stack Overflow (SO) questions and 315 GitHub issues. Our study leads to the first comprehensive taxonomy of problems related to GHA, covering 4 categories and 16 sub-categories. Then, we analyze the popularity and difficulty of problem categories and their correlations. Further, we summarize 56 solution strategies for different GHA problems. We also distill practical implications of our findings from the perspective of different audiences. We believe that our study contributes to the research of emerging GHA practices and guides the future support of tools and technologies.

DOI: 10.1145/3597503.3623327


Block-based Programming for Two-Armed Robots: A Comparative Study

作者: Fronchetti, Felipe and Ritschel, Nico and Schorr, Logan and Barfield, Chandler and Chang, Gabriella and Spinola, Rodrigo and Holmes, Reid and Shepherd, David C.
关键词: two-armed, robots, end-users, block-based, programming

Abstract

Programming industrial robots is difficult and expensive. Although recent work has made substantial progress in making it accessible to a wider range of users, it is often limited to simple programs and its usability remains untested in practice. In this article, we introduce Duplo, a block-based programming environment that allows end-users to program two-armed robots and solve tasks that require coordination. Duplo positions the program for each arm side-by-side, using the spatial relationship between blocks from each program to represent parallelism in a way that end-users can easily understand. This design was proposed by previous work, but not implemented or evaluated in a realistic programming setting. We performed a randomized experiment with 52 participants that evaluated Duplo on a complex programming task that contained several sub-tasks. We compared Duplo with RobotStudio Online YuMi, a commercial solution, and found that Duplo allowed participants to solve the same task faster and with greater success. By analyzing the information collected during our user study, we further identified factors that explain this performance difference, as well as remaining barriers, such as debugging issues and difficulties in interacting with the robot. This work represents another step towards allowing a wider audience of non-professionals to program, which might enable the broader deployment of robotics.

DOI: 10.1145/3597503.3623329


BOMs Away! Inside the Minds of Stakeholders: A Comprehensive Study of Bills of Materials for Software Systems

作者: Stalnaker, Trevor and Wintersgill, Nathan and Chaparro, Oscar and Di Penta, Massimiliano and German, Daniel M and Poshyvanyk, Denys
关键词: software bill of materials, survey, interviews, software supply chain, open source software

Abstract

Software Bills of Materials (SBOMs) have emerged as tools to facilitate the management of software dependencies, vulnerabilities, licenses, and the supply chain. While significant effort has been devoted to increasing SBOM awareness and developing SBOM formats and tools, recent studies have shown that SBOMs are still an early technology not yet adequately adopted in practice. Expanding on previous research, this paper reports a comprehensive study that investigates the current challenges stakeholders encounter when creating and using SBOMs. The study surveyed 138 practitioners belonging to five stakeholder groups (practitioners familiar with SBOMs, members of critical open source projects, AI/ML, cyberphysical systems, and legal practitioners) using differentiated questionnaires, and interviewed 8 survey respondents to gather further insights about their experience. We identified 12 major challenges facing the creation and use of SBOMs, including those related to the SBOM content, deficiencies in SBOM tools, SBOM maintenance and verification, and domain-specific challenges. We propose and discuss 4 actionable solutions to the identified challenges and present the major avenues for future research and development.

DOI: 10.1145/3597503.3623347


EDEFuzz: A Web API Fuzzer for Excessive Data Exposures

作者: Pan, Lianglu and Cohney, Shaanan and Murray, Toby and Pham, Van-Thuan
关键词: No keywords

Abstract

APIs often transmit far more data to client applications than they need, and in the context of web applications, often do so over public channels. This issue, termed Excessive Data Exposure (EDE), was OWASP's third most significant API vulnerability of 2019. However, there are few automated tools—either in research or industry—to effectively find and remediate such issues. This is unsurprising as the problem lacks an explicit test oracle: the vulnerability does not manifest through explicit abnormal behaviours (e.g., program crashes or memory access violations). In this work, we develop a metamorphic relation to tackle that challenge and build the first fuzzing tool—which we call EDEFuzz—to systematically detect EDEs. EDEFuzz can significantly reduce the false negatives that occur with manual inspection and ad-hoc text-matching techniques, currently the most-used approaches. We tested EDEFuzz against the sixty-nine applicable targets from the Alexa Top-200 and found 33,365 potential leaks—illustrating our tool's broad applicability and scalability. In a more tightly controlled experiment on eight popular websites in Australia, EDEFuzz achieved a high true positive rate of 98.65% with minimal configuration, illustrating our tool's accuracy and efficiency.
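
A minimal sketch of the metamorphic idea (not EDEFuzz itself): delete one field at a time from a captured API response, re-render the page, and flag fields whose removal leaves the rendered output unchanged. `render_page` is a hypothetical stand-in for replaying the mutated response to the client.

```python
# Fields whose deletion does not change the rendered page are candidate excessive exposures.
import copy
import hashlib

def render_page(api_response: dict) -> str:
    # hypothetical client rendering; in practice the mutated response is replayed to the browser
    return f"<h1>{api_response.get('name', '')}</h1><p>{api_response.get('bio', '')}</p>"

def candidate_exposures(api_response: dict) -> list:
    baseline = hashlib.sha256(render_page(api_response).encode()).hexdigest()
    flagged = []
    for field_name in list(api_response):
        mutated = copy.deepcopy(api_response)
        del mutated[field_name]
        if hashlib.sha256(render_page(mutated).encode()).hexdigest() == baseline:
            flagged.append(field_name)   # removal was invisible to the client
    return flagged

print(candidate_exposures({"name": "alice", "bio": "hi", "email": "a@x.io", "ssn": "..."}))
```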

DOI: 10.1145/3597503.3608133


Detecting Logic Bugs in Graph Database Management Systems via Injective and Surjective Graph Query Transformation

作者: Jiang, Yuancheng and Liu, Jiahao and Ba, Jinsheng and Yap, Roland H. C. and Liang, Zhenkai and Rigger, Manuel
关键词: graph databases, logic bugs, metamorphic testing

Abstract

Graph Database Management Systems (GDBMSs) store graphs as data. They are used naturally in applications such as social networks, recommendation systems and program analysis. However, they can be affected by logic bugs, which cause the GDBMSs to compute incorrect results and subsequently affect the applications relying on them. In this work, we propose injective and surjective Graph Query Transformation (GQT) to detect logic bugs in GDBMSs. Given a query Q, we derive a mutated query Q', so that their result sets are either (i) semantically equivalent, or (ii) related, depending on the mutation, as a subset or superset of each other. When the expected relationship between the results does not hold, a logic bug in the GDBMS is detected. The key insight for mutating Q is that the graph pattern in graph queries enables systematic query transformations derived from injective and surjective mappings of the directed edge sets between Q and Q'. We implemented injective and surjective Graph Query Transformation (GQT) as a tool called GraphGenie and evaluated it on 6 popular and mature GDBMSs. GraphGenie has found 25 unknown bugs, comprising 16 logic bugs, 3 internal errors, and 6 performance issues. Our results demonstrate the practicality and effectiveness of GraphGenie in detecting logic bugs in GDBMSs, which has the potential to improve the reliability of applications relying on these GDBMSs.
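
To illustrate the oracle side of such transformations, the sketch below pairs a base query with a relaxed variant whose result set should be a superset, and checks the containment relation. The Cypher strings and toy result sets stand in for a hypothetical `run_query(db, q)` helper; the real GQT mutations are derived far more systematically.

```python
# Relaxing a predicate should only enlarge the result set; a violation signals a logic bug.
q_base    = "MATCH (a:Person)-[:KNOWS]->(b:Person) WHERE b.age > 30 RETURN a.name"
q_relaxed = "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name"   # WHERE clause dropped

def superset_relation_holds(base_rows: set, relaxed_rows: set) -> bool:
    return base_rows <= relaxed_rows          # expected containment between result sets

# toy result sets standing in for run_query(db, q_base) / run_query(db, q_relaxed)
print(superset_relation_holds({"alice"}, {"alice", "bob"}))   # True: relation holds
print(superset_relation_holds({"alice", "eve"}, {"alice"}))   # False: report a logic bug
```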

DOI: 10.1145/3597503.3623307


Do Automatic Test Generation Tools Generate Flaky Tests?

作者: Gruber, Martin and Roslan, Muhammad Firhard and Parry, Owain and Scharnböck
关键词: test generation, flaky tests, empirical study

Abstract

Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remain largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6,356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, existing flakiness suppression mechanisms implemented in EvoSuite are effective in alleviating this issue (71.7% fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently. Their non-deterministic behavior is more frequently caused by randomness, rather than by networking and concurrency. With flakiness suppression enabled, the remaining flaky tests differ significantly from any flakiness previously reported; most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, with the accompanying dataset, can help maintainers improve test generation tools, give recommendations for developers using these tools, and serve as a foundation for future research in test flakiness and test generation.
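
The rerun-based flakiness check described above can be approximated with a few lines of scripting. The test identifier and repetition count below are placeholders (the study executed each generated test 200 times), and EvoSuite/Pynguin specifics are omitted.

```python
# Run one test repeatedly and flag it as flaky if the outcomes disagree across runs.
import subprocess

def is_flaky(test_id: str, runs: int = 20) -> bool:
    outcomes = set()
    for _ in range(runs):
        proc = subprocess.run(["python", "-m", "pytest", "-q", test_id],
                              capture_output=True)
        outcomes.add(proc.returncode == 0)       # True = pass, False = fail
        if len(outcomes) > 1:
            return True                          # inconsistent outcomes => flaky
    return False

print(is_flaky("tests/test_generated.py::test_case_0"))   # hypothetical generated test
```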

DOI: 10.1145/3597503.3608138


ECFuzz: Effective Configuration Fuzzing for Large-Scale Systems

作者: Li, Junqiang and Li, Senyi and Li, Keyao and Luo, Falin and Yu, Hongfang and Li, Shanshan and Li, Xiang
关键词: configuration, large-scale systems, testing, fuzzing

Abstract

A large-scale system contains a huge configuration space because of its large number of configuration parameters. This leads to a combinatorial explosion among configuration parameters when exploring the configuration space. Existing configuration testing techniques first use fuzzing to generate different configuration parameters, and then directly inject them into the program under test to find configuration-induced bugs. However, they do not fully consider the complexity of large-scale systems, resulting in low testing effectiveness. In this paper, we propose ECFuzz, an effective configuration fuzzer for large-scale systems. Our core approach consists of (i) a multi-dimensional configuration generation strategy: ECFuzz designs different mutation strategies according to different dependencies and selects multiple configuration parameters from the candidate configuration parameters to effectively generate configurations; and (ii) a unit-testing-oriented configuration validation strategy: ECFuzz introduces unit testing into configuration testing to filter out configuration parameters that are unlikely to yield errors before executing system testing, thereby effectively validating the generated configuration parameters. We have conducted extensive experiments on real-world large-scale systems including HCommon, HDFS, HBase, ZooKeeper and Alluxio. Our evaluation shows that ECFuzz is effective in finding configuration-induced crash bugs. Compared with state-of-the-art configuration testing tools including ConfTest, ConfErr and ConfDiagDetector, ECFuzz finds 60.3–67 more unexpected failures when the same 1,000 test cases are injected into the system, an increase of 1.87x–2.63x. Moreover, ECFuzz has exposed 14 previously unknown bugs, 5 of which have been confirmed.

DOI: 10.1145/3597503.3623315


Improving Testing Behavior by Gamifying IntelliJ

作者: Straubinger, Philipp and Fraser, Gordon
关键词: gamification, IDE, IntelliJ, software testing

Abstract

Testing is an important aspect of software development, but unfortunately, it is often neglected. While test quality analyses such as code coverage or mutation analysis inform developers about the quality of their tests, such reports are viewed only sporadically during continuous integration or code review, if they are considered at all, and their impact on the developers’ testing behavior therefore tends to be negligible. To actually influence developer behavior, it may rather be necessary to motivate developers directly within their programming environment, while they are coding. We introduce IntelliGame, a gamified plugin for the popular IntelliJ Java Integrated Development Environment, which rewards developers for positive testing behavior using a multi-level achievement system: A total of 27 different achievements, each with incremental levels, provide affirming feedback when developers exhibit commendable testing behavior, and provide an incentive to further continue and improve this behavior. A controlled experiment with 49 participants given a Java programming task reveals substantial differences in the testing behavior triggered by IntelliGame: Incentivized developers write more tests, achieve higher coverage and mutation scores, run their tests more often, and achieve functionality earlier.

DOI: 10.1145/3597503.3623339


SCTrans: Constructing a Large Public Scenario Dataset for Simulation Testing of Autonomous Driving Systems

作者: Dai, Jiarun and Gao, Bufan and Luo, Mingyuan and Huang, Zongan and Li, Zhongrui and Zhang, Yuan and Yang, Min
关键词: simulation scenario, autonomous driving, model transformation

Abstract

For the safety assessment of autonomous driving systems (ADS), simulation testing has become an important complementary technique to physical road testing. In essence, simulation testing is a scenario-driven approach, whose effectiveness is highly dependent on the quality of the given simulation scenarios. Moreover, simulation scenarios must be encoded into well-formatted files; otherwise, ADS simulation platforms cannot take them as inputs. Without large public datasets of simulation scenario files, both industrial and academic applications of ADS simulation testing are hindered. To fill this gap, we propose a transformation-based approach, SCTrans, to construct simulation scenario files, utilizing existing traffic scenario datasets (i.e., naturalistic movement of road users recorded on public roads) as data sources. Specifically, we transform existing traffic scenario recording files into simulation scenario files that are compatible with the most advanced ADS simulation platforms, formalizing this task as a Model Transformation Problem. Following this idea, we construct a dataset consisting of over 1,900 diverse simulation scenarios, each of which can be directly used to test state-of-the-art ADSs (i.e., Apollo and Autoware) via high-fidelity simulators (i.e., Carla and LGSVL). To further demonstrate the utility of our dataset, we showcase that it can boost the collision-finding capability of existing simulation-based ADS fuzzers, helping identify about seven times more unique ADS-involved collisions within the same time period. By analyzing these collisions at the code level, we identify nine safety-critical bugs in Apollo and Autoware, each of which can be stably exploited to cause vehicle crashes. To date, four of them have been confirmed.

DOI: 10.1145/3597503.3623350


Co-Creation in Fully Remote Software Teams

作者: Jackson, Victoria and Prikladnicki, Rafael and van der Hoek, Andre
关键词: collaboration, remote software development, developer tools, virtual software teams, software team practices

Abstract

In this paper, we use the lens of co-creation—a concept originally coined and applied in the fields of management and design that denotes how groups of people collaboratively create something of meaning through an orchestration of people, activities, and tools—to study how fully remote software teams co-create digital artifacts that can be considered a form of documentation. We report on the results of a qualitative, interview-based study with 25 software professionals working in remote teams. Our primary findings are the definition of four models of co-creation, examples of sequencing these models into work chains to produce artifacts, factors that influence how developers match tasks to models and chains, and insights into tool support for co-creation. Together, our findings illustrate how co-creation is an intentional activity that plays a significant role in how remote software teams choose to structure their collaborative activities.

DOI: 10.1145/3597503.3623297


A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges

作者: Liang, Jenny T. and Yang, Chenyang and Myers, Brad A.
关键词: AI programming assistants, usability study

Abstract

The software engineering community has recently witnessed the widespread deployment of AI programming assistants, such as GitHub Copilot. However, in practice, developers do not accept AI programming assistants' initial suggestions at a high frequency. This leaves a number of open questions related to the usability of these tools. To understand developers' practices while using these tools and the important usability challenges they face, we administered a survey to a large population of developers and received responses from a diverse set of 410 developers. Through a mix of qualitative and quantitative analyses, we found that developers are most motivated to use AI programming assistants because they help reduce keystrokes, finish programming tasks quickly, and recall syntax, but resonate less with using them to help brainstorm potential solutions. We also found that the most important reasons developers do not use these tools are that the tools do not output code that addresses certain functional or non-functional requirements and that developers have trouble controlling the tools to generate the desired output. Our findings have implications for both creators and users of AI programming assistants, such as designing minimal-cognitive-effort interactions with these tools to reduce distractions for users while they are programming.

DOI: 10.1145/3597503.3608128


How to Support ML End-User Programmers through a Conversational Agent

作者: Arteaga Garcia, Emily Judith and Nicolaci Pimentel, João
关键词: end-user programming, conversational agent, wizard of Oz

Abstract

Machine Learning (ML) is increasingly gaining significance for end-user programmer (EUP) applications. However, machine learning end-user programmers (ML-EUPs) without the right background face a daunting learning curve and a heightened risk of mistakes and flaws in their models. In this work, we designed a conversational agent named “Newton” to serve as an expert that supports ML-EUPs. Newton's design was shaped by a comprehensive review of existing literature, from which we identified six primary challenges faced by ML-EUPs and five strategies to assist them. To evaluate the efficacy of Newton's design, we conducted a Wizard of Oz within-subjects study with 12 ML-EUPs. Our findings indicate that Newton effectively assisted ML-EUPs, addressing the challenges highlighted in the literature. We also propose six design guidelines for future conversational agents, which can benefit other EUP applications and software engineering activities.

DOI: 10.1145/3597503.3608130


Unveiling the Life Cycle of User Feedback: Best Practices from Software Practitioners

作者: Li, Ze Shi and Arony, Nowshin Nawar and Devathasan, Kezia and Sihag, Manish and Ernst, Neil and Damian, Daniela
关键词: user feedback, requirements engineering, social media analysis, product management

Abstract

User feedback has grown in importance for organizations seeking to improve software products. Prior studies focused primarily on feedback collection and reported a high-level overview of the processes, often overlooking how practitioners reason about, and act upon, this feedback through a structured set of activities. In this work, we conducted an exploratory interview study with 40 practitioners from 32 organizations of various sizes and in several domains such as e-commerce, analytics, and gaming. Our findings indicate that organizations leverage many different user feedback sources. Social media emerged as a key category of feedback that is increasingly critical for many organizations. We found that organizations actively engage in a number of non-trivial activities to curate and act on user feedback, depending on its source. We synthesize these activities into a life cycle of managing user feedback. We also report on best practices for managing user feedback, distilled from the responses of practitioners who felt that their organization effectively understood and addressed their users' feedback. We present actionable empirical results that organizations can leverage to increase their understanding of user perception and behavior for better products, thus reducing user attrition.

DOI: 10.1145/3597503.3623309


Novelty Begets Popularity, But Curbs Participation - A Macroscopic View of the Python Open-Source Ecosystem

作者: Fang, Hongbo and Herbsleb, James and Vasilescu, Bogdan
关键词: open-source software, innovation

Abstract

Who creates the most innovative open-source software projects? And what fate do these projects tend to have? Building on a long history of research to understand innovation in business and other domains, as well as recent advances towards modeling innovation in scientific research from the science of science field, in this paper we adopt the analogy of innovation as emerging from the novel recombination of existing bits of knowledge. As such, we consider as innovative the software projects that recombine existing software libraries in novel ways, i.e., those built on top of atypical combinations of packages as extracted from import statements. We then report on a large-scale quantitative study of innovation in the Python open-source software ecosystem. Our results show that higher levels of innovativeness are statistically associated with higher GitHub star counts, i.e., novelty begets popularity. At the same time, we find that controlling for project size, the more innovative projects tend to involve smaller teams of contributors, as well as be at higher risk of becoming abandoned in the long term. We conclude that innovation and open source sustainability are closely related and, to some extent, antagonistic.
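
To make the notion of atypical package combinations a bit more tangible, here is a toy sketch that scores a project's imports by how rarely its package pairs co-occur in a corpus; the corpus, smoothing, and scoring function are illustrative stand-ins, not the paper's actual novelty measure.

```python
# Score a project's imports by the rarity of its package pairs across a (toy) ecosystem.
from collections import Counter
from itertools import combinations

corpus = [{"numpy", "scipy"}, {"numpy", "pandas"}, {"numpy", "scipy"},
          {"flask", "sqlalchemy"}, {"numpy", "flask"}]

pair_counts = Counter()
for project in corpus:
    pair_counts.update(frozenset(p) for p in combinations(sorted(project), 2))

def novelty(imports: set) -> float:
    pairs = [frozenset(p) for p in combinations(sorted(imports), 2)]
    if not pairs:
        return 0.0
    # rarer pairs => higher novelty (add-one smoothing for unseen pairs)
    return sum(1.0 / (pair_counts[p] + 1) for p in pairs) / len(pairs)

print(novelty({"numpy", "scipy"}))        # common combination  => low score (~0.33)
print(novelty({"numpy", "sqlalchemy"}))   # unseen combination  => high score (1.0)
```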

DOI: 10.1145/3597503.3608142


Characterizing Software Maintenance Meetings: Information Shared, Discussion Outcomes, and Information Captured

作者: Soria, Adriana Meza and Lopez, Taylor and Seero, Elizabeth and Mashhadi, Negin and Evans, Emily and Burge, Janet and Van der Hoek, André
关键词: meetings, software maintenance, information, resolution

Abstract

A type of meeting that has been understudied in the software engineering literature to date is what we term the software maintenance meeting: a regularly scheduled team meeting in which emergent issues are addressed that are usually out of scope of the daily stand-up but not necessarily challenging enough to warrant an entirely separate meeting. These meetings tend to discuss a wide variety of topics and are crucial in keeping software development projects going, but little is known about these meetings and how they proceed. In this paper, we report on a single exploratory case study in which we analyzed ten consecutive maintenance meetings from a major healthcare software provider. We analyzed what kind of information is brought into the discussions held in these meetings and how, what outcomes arose from the discussions, and what information was captured for downstream use. Our findings are varied, giving rise to both practical considerations for those conducting these kinds of meetings and new research directions toward further understanding and supporting them.

DOI: 10.1145/3597503.3623330


作者: Jamieson, Jack and Yamashita, Naomi and Foong, Eureka
关键词: human values, turnover, open source, GitHub

Abstract

Discussions about project values are important for engineering software that meets diverse human needs and positively impacts society. Because value-related discussions involve deeply held beliefs, they can lead to conflicts or other outcomes that may affect motivations to continue contributing to open source projects. However, it is unclear what kind of value-related discussions are associated with significant changes in turnover. We address this gap by identifying discussions related to important project values and investigating the extent to which those discussions predict project turnover in the following months. We collected logs of GitHub issues and commits from 52 projects that share similar ethical commitments and were identified as part of the DWeb (Decentralized Web) community. We identify issues related to DWeb's core values of respectfulness, freedom, broadmindedness, opposing centralized social power, equity & equality, and protecting the environment. We then use Granger causality analysis to examine how changes in the proportion of discussions related to those values might predict changes in incoming and outgoing turnover. We found multiple significant relationships between value-related discussions and turnover, including that discussions about respectfulness predicted an increase in contributors leaving and a decrease in new contributors, while discussions about social power predicted better contributor retention. Understanding these antecedents of contributor turnover is important for managing open source projects that incorporate human-centric issues. Based on the results, we discuss implications for open source maintainers and for future research.
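
For readers unfamiliar with the statistical test, the sketch below runs a Granger causality test on two synthetic monthly series standing in for the share of value-related discussion and contributor turnover; the data and lag choice are made up purely for illustration.

```python
# Does the discussion-share series help predict the turnover series? (toy data)
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
discussion_share = rng.random(36)                                        # monthly share of value talk
turnover = 0.5 * np.roll(discussion_share, 1) + 0.1 * rng.random(36)     # built-in lagged dependence

# column 0: series being predicted (turnover); column 1: candidate cause (discussion share)
data = np.column_stack([turnover, discussion_share])
grangercausalitytests(data, maxlag=3)    # prints F-test p-values per lag
```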

DOI: 10.1145/3597503.3623340


On the Helpfulness of Answering Developer Questions on Discord with Similar Conversations and Posts from the Past

作者: Lill, Alexander and Meyer, André
关键词: developer questions, chat community, semantic similarity

Abstract

A big part of software developers' time is spent finding answers to their coding-task-related questions. To answer their questions, developers usually perform web searches, ask questions on Q&A websites, or, more recently, in chat communities. Yet, many of these questions have already been answered in previous chat conversations or other online communities. Automatically identifying and suggesting these previous answers to the askers could thus save time and effort. In an empirical analysis, we first explored the frequency of repeating questions on the Discord chat platform and assessed our approach to identify them automatically. The approach was then evaluated with real-world developers in a field experiment, through which we received 142 ratings on the helpfulness of the suggestions we provided to help answer 277 questions that developers posted in four Discord communities. We further collected qualitative feedback through 53 surveys and 10 follow-up interviews. We found that the suggestions were considered helpful in 40% of the cases, that suggesting Stack Overflow posts is more often considered helpful than suggesting past Discord conversations, and that developers have difficulties describing their problems as search queries and thus prefer describing them as natural language questions in online communities.
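
A bare-bones sketch of the retrieval step, using TF-IDF cosine similarity to surface the most similar past question; the example messages are invented, and the study's actual pipeline for matching conversations and Stack Overflow posts is considerably more involved.

```python
# Retrieve the most similar previously answered question for a new developer question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

past = ["How do I configure CORS in my Flask API?",
        "Why does pip install fail behind a proxy?",
        "How to parse JSON in Python without losing key order?"]
question = "pip install keeps failing when I am behind a corporate proxy"

vec = TfidfVectorizer().fit(past + [question])
sims = cosine_similarity(vec.transform([question]), vec.transform(past))[0]
best = sims.argmax()
print(past[best], round(float(sims[best]), 2))   # suggest the best-matching past thread
```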

DOI: 10.1145/3597503.3623341


Marco: A Stochastic Asynchronous Concolic Explorer

作者: Hu, Jie and Duan, Yue and Yin, Heng
关键词: No keywords

Abstract

Concolic execution is a powerful program analysis technique for code path exploration. Despite recent advances that greatly improved the efficiency of concolic execution engines, path constraint solving remains a major bottleneck of concolic testing. An intelligent scheduler for inputs/branches becomes even more crucial. Our studies show that the previously under-studied branch-flipping policy adopted by state-of-the-art concolic execution engines has several limitations. We propose to assess each branch by its potential for new code coverage from a global view, taking into account the path divergence probability at each branch. To validate this idea, we implemented a prototype, Marco, and evaluated it against the state-of-the-art concolic executor on 30 real-world programs from Google's Fuzzbench, Binutils, and UniBench. The result shows that Marco can outperform the baseline approach and make continuous progress after the baseline approach terminates.
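
As a loose illustration of coverage-driven branch scheduling (not Marco's actual policy or data structures), the sketch below ranks pending branches by an estimated divergence probability multiplied by the amount of uncovered code reachable if the branch flips.

```python
# Rank pending branches globally by expected new coverage (probability * uncovered blocks).
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Branch:
    neg_score: float                      # negated so the heap pops the highest score first
    site: str = field(compare=False)

def score(divergence_prob: float, uncovered_reachable: int) -> float:
    return divergence_prob * uncovered_reachable

queue = []
for site, prob, uncovered in [("b1", 0.9, 3), ("b2", 0.2, 50), ("b3", 0.6, 10)]:
    heapq.heappush(queue, Branch(-score(prob, uncovered), site))

while queue:
    nxt = heapq.heappop(queue)
    print(nxt.site, -nxt.neg_score)       # schedules b2 (10.0), then b3 (6.0), then b1 (2.7)
```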

DOI: 10.1145/3597503.3623301


Smart Contract and DeFi Security Tools: Do They Meet the Needs of Practitioners?

作者: Chaliasos, Stefanos and Charalambous, Marcos Antonios and Zhou, Liyi and Galanopoulou, Rafaila and Gervais, Arthur and Mitropoulos, Dimitris and Livshits, Benjamin
关键词: No keywords

Abstract

The growth of the decentralized finance (DeFi) ecosystem built on blockchain technology and smart contracts has led to an increased demand for secure and reliable smart contract development. However, attacks targeting smart contracts are increasing, causing an estimated $6.45 billion in financial losses. Researchers have proposed various automated security tools to detect vulnerabilities, but their real-world impact remains uncertain. In this paper, we aim to shed light on the effectiveness of automated security tools in identifying vulnerabilities that can lead to high-profile attacks, and on their overall usage within the industry. Our comprehensive study encompasses an evaluation of five state-of-the-art automated security tools, an analysis of 127 high-impact real-world attacks resulting in $2.3 billion in losses, and a survey of 49 developers and auditors working on leading DeFi protocols. Our findings reveal a stark reality: the tools could have prevented a mere 8% of the attacks in our dataset, amounting to $149 million out of the $2.3 billion in losses. Notably, all preventable attacks were related to reentrancy vulnerabilities. Furthermore, practitioners identify logic-related bugs and protocol-layer vulnerabilities as significant threats that are not adequately addressed by existing security tools. Our results emphasize the need to develop specialized tools catering to the distinct demands and expectations of developers and auditors. Further, our study highlights the necessity for continuous advancements in security tools to effectively tackle the ever-evolving challenges confronting the DeFi ecosystem.

DOI: 10.1145/3597503.3623302


DocFlow: Extracting Taint Specifications from Software Documentation

作者: Tileria, Marcos and Blasco, Jorge and Dash, Santanu Kumar
关键词: taint analysis, documentation, android, natural language processing

Abstract

Security practitioners routinely use static analysis to detect security problems and privacy violations in Android apps. The soundness of these analyses depends on how the platform is modelled and the list of sensitive methods. Collecting these methods often becomes impractical given the number of methods available, the pace at which the Android platform is updated, and the proprietary libraries Google releases on each new version. Despite the constant evolution of the Android platform, app developers cope with all these new features thanks to the documentation that comes with each new Android release. In this work, we take advantage of the rich documentation provided by platforms like Android and propose DocFlow, a framework to generate taint specifications for a platform, directly from its documentation. DocFlow models the semantics of API methods using their documentation to detect sensitive methods (sources and sinks) and assigns them semantic labels. Our approach does not require access to source code, enabling the analysis of proprietary libraries for which the code is unavailable. We evaluate DocFlow using Android platform packages and closed-source Google Play Services libraries. Our results show that our framework detects sensitive methods with high precision, adapts to new API versions, and can be easily extended to detect other method types. Our approach provides evidence that Android documentation encodes rich semantic information to categorise sensitive methods, removing the need to analyse source code or perform feature extraction.

DOI: 10.1145/3597503.3623312


Toward Improved Deep Learning-based Vulnerability Detection

作者: Sejfia, Adriana and Das, Satyaki and Shafiq, Saad and Medvidović
关键词: No keywords

Abstract

Deep learning (DL) has been a common thread across several recent techniques for vulnerability detection. The rise of large, publicly available datasets of vulnerabilities has fueled the learning process underpinning these techniques. While these datasets help the DL-based vulnerability detectors, they also constrain these detectors’ predictive abilities. Vulnerabilities in these datasets have to be represented in a certain way, e.g., code lines, functions, or program slices within which the vulnerabilities exist. We refer to this representation as a base unit. The detectors learn how base units can be vulnerable and then predict whether other base units are vulnerable. We have hypothesized that this focus on individual base units harms the ability of the detectors to properly detect those vulnerabilities that span multiple base units (or MBU vulnerabilities). For vulnerabilities such as these, a correct detection occurs when all comprising base units are detected as vulnerable. Verifying how existing techniques perform in detecting all parts of a vulnerability is important to establish their effectiveness for other downstream tasks. To evaluate our hypothesis, we conducted a study focusing on three prominent DL-based detectors: ReVeal, DeepWukong, and LineVul. Our study shows that all three detectors contain MBU vulnerabilities in their respective datasets. Further, we observed significant accuracy drops when detecting these types of vulnerabilities. We present our study and a framework that can be used to help DL-based detectors toward the proper inclusion of MBU vulnerabilities.
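
The evaluation rule for MBU vulnerabilities described above can be made concrete in a few lines; the unit names below are hypothetical, and real detectors operate over code lines, functions, or program slices rather than plain strings.

```python
# An MBU vulnerability counts as detected only if *every* comprising base unit is flagged.
def mbu_detected(vulnerability_units: list, predictions: dict) -> bool:
    return all(predictions.get(unit, False) for unit in vulnerability_units)

preds = {"func_a": True, "func_b": False, "func_c": True}     # per-base-unit model output
print(mbu_detected(["func_a", "func_c"], preds))              # True: all units flagged
print(mbu_detected(["func_a", "func_b", "func_c"], preds))    # False: one unit was missed
```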

DOI: 10.1145/3597503.3608141


Attention! Your Copied Data is Under Monitoring: A Systematic Study of Clipboard Usage in Android Apps

作者: Chen, Yongliang and Tang, Ruoqin and Zuo, Chaoshun and Zhang, Xiaokuan and Xue, Lei and Luo, Xiapu and Zhao, Qingchuan
关键词: program analysis, mobile security

Abstract

Recently, clipboard usage has become prevalent in mobile apps, allowing users to copy and paste text within the same app or across different apps. However, insufficient access control over the clipboard in mobile operating systems exposes the data it contains to high risks: one app can read the data copied in other apps and store it locally or even send it to remote servers. Unfortunately, the literature only has ad-hoc studies in this respect and lacks a comprehensive and systematic study of the entire mobile app ecosystem. To establish the missing links, this paper proposes an automated tool, ClipboardScope, that leverages principled static program analysis to uncover clipboard data usage in mobile apps at scale, defining a usage as a combination of two aspects, i.e., how the clipboard data is validated and where it goes. It defines four primary categories of clipboard data operation, namely spot-on, grand-slam, selective, and cherry-pick, based on the clipboard usage in an app. ClipboardScope is evaluated on the 26,201 apps, out of a total of 2.2 million mobile apps available on Google Play as of June 2022, that access and process clipboard text. It identifies 23,948, 848, 1,075, and 330 apps that fall into the four designated categories, respectively. In addition, we uncovered a prevalent programming habit of using the SharedPreferences object to store historical data, which can become an unnoticeable privacy leakage channel.

DOI: 10.1145/3597503.3623317


PonziGuard: Detecting Ponzi Schemes on Ethereum with Contract Runtime Behavior Graph (CRBG)

作者: Liang, Ruichao and Chen, Jing and He, Kun and Wu, Yueming and Deng, Gelei and Du, Ruiying and Wu, Cong
关键词: smart contract, ponzi scheme, flow analysis, graph neural networks

Abstract

Ponzi schemes, a form of scam, have been discovered in Ethereum smart contracts in recent years, causing massive financial losses. Rule-based detection approaches rely on pre-defined rules with limited capabilities and domain knowledge dependency. Additionally, using static information like opcodes and transactions for machine learning models fails to effectively characterize Ponzi contracts, resulting in poor reliability and interpretability. In this paper, we propose PonziGuard, an efficient Ponzi scheme detection approach based on contract runtime behavior. Inspired by the observation that a contract's runtime behavior is more effective in distinguishing Ponzi contracts from innocent contracts, PonziGuard establishes a comprehensive graph representation called the contract runtime behavior graph (CRBG) to accurately depict the behavior of Ponzi contracts. Furthermore, it formulates the detection process as a graph classification task, enhancing its overall effectiveness. We conducted comparative experiments on a ground-truth dataset and applied PonziGuard to Ethereum Mainnet. The results show that PonziGuard outperforms the current state-of-the-art approaches and is also effective in open environments. Using PonziGuard, we have identified 805 Ponzi contracts on Ethereum Mainnet, which have resulted in an estimated economic loss of 281,700 Ether, or approximately $500 million USD.

DOI: 10.1145/3597503.3623318


FuzzSlice: Pruning False Positives in Static Analysis Warnings through Function-Level Fuzzing

作者: Murali, Aniruddhan and Mathews, Noble and Alfadel, Mahmoud and Nagappan, Meiyappan and Xu, Meng
关键词: fuzzing, static analysis warning, vulnerability

Abstract

Manual confirmation of static analysis reports is a daunting task. This is due to both the large number of warnings and the high density of false positives among them. Fuzzing techniques have been proposed to verify static analysis warnings. However, a major limitation is that fuzzing the whole project to reach all static analysis warnings is not feasible: it can take several days, and achieving a linear increase in code coverage requires exponentially more machine time. Therefore, we propose FuzzSlice, a novel framework that automatically prunes possible false positives among static analysis warnings. Unlike prior work that mostly focuses on confirming true positives among static analysis warnings, which inevitably requires end-to-end fuzzing, FuzzSlice focuses on ruling out potential false positives, which are the majority in static analysis reports. The key insight that we base our work on is that a warning that does not yield a crash when fuzzed at the function level within a given time budget is a possible false positive. To achieve this, FuzzSlice first aims to generate compilable code slices at the function level. Then, FuzzSlice fuzzes these code slices instead of the entire binary to prune possible false positives. FuzzSlice is also unlikely to misclassify a true bug as a false positive, because the crashing input can be reproduced by a fuzzer at the function level as well. We evaluate FuzzSlice on the Juliet synthetic dataset and on real-world complex C projects: openssl, tmux and openssh-portable. Our evaluation shows that the ground truth in the Juliet dataset had 864 false positives, all of which were detected by FuzzSlice. For the open-source repositories, we were able to get the developers from two of these open-source repositories to independently label these warnings. FuzzSlice automatically identifies 33 out of 53 false positives confirmed by developers in these two repositories. This implies that FuzzSlice can reduce the number of false positives by 62.26% in the open-source repositories and by 100% in the Juliet dataset.
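
A rough sketch of the triage rule stated above (no crash within the budget implies a likely false positive), assuming a hypothetical `fuzz_slice` wrapper that fuzzes one compiled function slice and writes crashing inputs to a `crashes/` directory; this is not FuzzSlice's actual interface.

```python
# Classify a warning as a likely false positive if function-level fuzzing finds no crash
# within the time budget. The `./fuzz_slice` wrapper and its output layout are hypothetical.
import pathlib
import subprocess

def likely_false_positive(harness: str, budget_s: int = 600) -> bool:
    subprocess.run(["./fuzz_slice", harness, "--timeout", str(budget_s)], check=False)
    return not any(pathlib.Path("crashes").glob("*"))   # no crashing input => likely FP

print(likely_false_positive("./slice_warning_42_harness"))
```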

DOI: 10.1145/3597503.3623321


LibvDiff: Library Version Difference Guided OSS Version Identification in Binaries

作者: Dong, Chaopeng and Li, Siyuan and Yang, Shouguo and Xiao, Yang and Wang, Yongpan and Li, Hong and Li, Zhi and Sun, Limin
关键词: open-source software, version identification, vulnerability detection, firmware analysis

Abstract

Open-source software (OSS) has been extensively employed to expedite software development, inevitably exposing downstream software to the peril of potential vulnerabilities. Precisely identifying the version of OSS not only facilitates the detection of vulnerabilities associated with it but also enables timely alerts upon the release of 1-day vulnerabilities. However, current methods for identifying OSS versions rely heavily on version strings or constant features, which may not be present in compiled OSS binaries or may not be representative when only function code changes are made. As a result, these methods are often imprecise in identifying the version of the OSS binaries being used. To this end, we propose LibvDiff, a novel approach for identifying open-source software versions. It detects subtle differences through precise symbol information and function-level code changes using binary code similarity detection. LibvDiff introduces a candidate version filter based on a novel version coordinate system to improve efficiency by quantifying gaps between versions and rapidly identifying potential versions. To speed up the code similarity detection process, LibvDiff proposes a function call-based anchor path filter to minimize the number of functions compared in the target binary. We evaluate the performance of LibvDiff through comprehensive experiments under various compilation settings and on two datasets (one with version strings, and the other without), which demonstrate that our approach achieves 94.5% and 78.7% precision on the two datasets, outperforming state-of-the-art works (including both academic methods and industry tools) by an average of 54.2% and 160.3%, respectively. By identifying and analyzing OSS binaries in real-world firmware images, we make several interesting findings, such as that developers differ significantly in how they update different OSS projects, and that different vendors may use identical OSS binaries.

DOI: 10.1145/3597503.3623336


Prompting Is All You Need: Automated Android Bug Replay with Large Language Models

作者: Feng, Sidong and Chen, Chunyang
关键词: automated bug replay, large language model, prompt engineering

Abstract

Bug reports are vital for software maintenance, allowing users to inform developers of the problems encountered while using the software. As such, researchers have committed considerable resources toward automating bug replay to expedite the process of software maintenance. Nonetheless, the success of current automated approaches is largely dictated by the characteristics and quality of bug reports, as these approaches are constrained by the limitations of manually-crafted patterns and pre-defined vocabulary lists. Inspired by the success of Large Language Models (LLMs) in natural language understanding, we propose AdbGPT, a new lightweight approach to automatically reproduce bugs from bug reports through prompt engineering, without any training or hard-coding effort. AdbGPT leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs to accomplish bug replay in a manner similar to a developer. Our evaluations demonstrate the effectiveness and efficiency of AdbGPT, which reproduces 81.3% of bug reports in 253.6 seconds, outperforming the state-of-the-art baselines and ablation studies. We also conduct a small-scale user study to confirm the usefulness of AdbGPT in enhancing developers' bug replay capabilities.
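
A condensed sketch of few-shot prompting with chain-of-thought for extracting steps to reproduce, in the spirit of the approach above; the example report, reasoning text, and template are invented for illustration and are not AdbGPT's actual prompts.

```python
# Build a few-shot, chain-of-thought prompt that turns a bug report into replay steps.
FEW_SHOT = """Bug report: App crashes after tapping "Save" on the profile screen.
Reasoning: the user first opens Profile, edits a field, then taps Save.
Steps: 1. Open "Profile" 2. Edit "Display name" 3. Tap "Save"
"""

def build_prompt(report: str) -> str:
    # end with "Reasoning:" so the model reasons before emitting numbered steps
    return FEW_SHOT + f"Bug report: {report}\nReasoning:"

print(build_prompt("Search results disappear when rotating the screen on the results page."))
```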

DOI: 10.1145/3597503.3608137


Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles

作者: Neelofar, Neelofar and Aleti, Aldeida
关键词: No keywords

Abstract

AI-powered systems have gained widespread popularity in various domains, including Autonomous Vehicles (AVs). However, ensuring their reliability and safety is challenging due to their complex nature. Conventional test adequacy metrics, designed to evaluate the effectiveness of traditional software testing, are often insufficient or impractical for these systems. White-box metrics, which are specifically designed for these systems, leverage neuron coverage information. These coverage metrics necessitate access to the underlying AI model and training data, which may not always be available. Furthermore, the existing adequacy metrics exhibit weak correlations with the ability to detect faults in the generated test suite, creating a gap that we aim to bridge in this study. In this paper, we introduce a set of black-box test adequacy metrics called “Test suite Instance Space Adequacy” (TISA) metrics, which can be used to gauge the effectiveness of a test suite. The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing. Additionally, we introduce a framework that permits testers to visualise the diversity and coverage of the test suite in a two-dimensional space, facilitating the identification of areas that require improvement. We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs. A strong correlation, coupled with the short computation time, indicates their effectiveness and efficiency in estimating the adequacy of testing AVs.

DOI: 10.1145/3597503.3623314


Learning-based Widget Matching for Migrating GUI Test Cases

作者: Zhang, Yakun and Zhang, Wenjie and Ran, Dezhi and Zhu, Qihao and Dou, Chengfeng and Hao, Dan and Xie, Tao and Zhang, Lu
关键词: test migration, GUI testing, deep learning

Abstract

GUI test case migration migrates GUI test cases from a source app to a target app. The key to test case migration is widget matching. Recently, researchers have proposed various approaches by formulating widget matching as a matching task. However, since these approaches represent widgets with static word embeddings that ignore contextual information and rely on manually formulated matching functions, they face major limitations when handling complex matching relations in apps. To address the limitations, we propose the first learning-based widget matching approach named TEMdroid (TEst Migration) for test case migration. Unlike the existing approaches, TEMdroid uses BERT to capture contextual information and learns a matching model to match widgets. Additionally, to address the significant imbalance between positive and negative samples in apps, we design a two-stage training strategy where we first train a hard-negative sample miner to mine hard-negative samples, and further train a matching model using positive samples and mined hard-negative samples. Our evaluation on 34 apps shows that TEMdroid is effective in event matching (i.e., widget matching and target event synthesis) and test case migration. For event matching, TEMdroid's Top1 accuracy is 76%, an improvement of over 17% over the baselines. For test case migration, TEMdroid's F1 score is 89%, a 7% improvement over the baseline approach.

DOI: 10.1145/3597503.3623322


Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries

作者: Deng, Yinlin and Xia, Chunqiu Steven and Yang, Chenyuan and Zhang, Shizhuo Dylan and Yang, Shujing and Zhang, Lingming
关键词: No keywords

Abstract

Bugs in Deep Learning (DL) libraries may affect almost all downstream DL applications, and it is crucial to ensure the quality of such systems. It is challenging to generate valid input programs for fuzzing DL libraries, since the input programs need to satisfy both the syntax/semantics of the supported languages (e.g., Python) and the tensor/operator constraints for constructing valid computational graphs. Recently, the TitanFuzz work demonstrates that modern Large Language Models (LLMs) can be directly leveraged to implicitly learn all the language and DL computation constraints to generate valid programs for fuzzing DL libraries (and beyond). However, LLMs tend to generate ordinary programs following patterns/tokens similar to typical programs seen in their massive pre-training corpora (e.g., GitHub), while fuzzing favors unusual inputs that cover edge cases or are unlikely to be manually produced. To fill this gap, this paper proposes FuzzGPT, the first approach to priming LLMs to synthesize unusual programs for fuzzing. FuzzGPT is mainly built on the well-known hypothesis that historical bug-triggering programs may include rare/valuable code ingredients important for bug finding. Meanwhile, whereas traditional techniques that leverage such historical information require intensive human effort to both design dedicated generators and ensure the syntactic/semantic validity of generated programs, FuzzGPT demonstrates that this process can be fully automated via the intrinsic capabilities of LLMs (including fine-tuning and in-context learning), while being generalizable and applicable to challenging domains. While FuzzGPT can be applied with different LLMs, this paper focuses on the powerful GPT-style models: Codex and CodeGen. Moreover, FuzzGPT also shows the potential of directly leveraging the instruction-following capability of the recent ChatGPT for effective fuzzing. The experimental study on two popular DL libraries (PyTorch and TensorFlow) shows that FuzzGPT can substantially outperform TitanFuzz, detecting 76 bugs, with 49 already confirmed as previously unknown bugs, including 11 high-priority bugs or security vulnerabilities.
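
The in-context priming idea can be pictured with a short sketch like the one below, which embeds historical bug-triggering snippets as demonstrations and asks the model for an unusual program targeting a specific API. The example snippet, the prompt wording, and the generate placeholder are invented for illustration and are not FuzzGPT's actual corpus or prompts.

BUG_TRIGGERING_EXAMPLES = [
    "# A short program that exposed a library bug in the past (invented example)\n"
    "import torch\n"
    "x = torch.empty(0, 3)\n"
    "print(torch.linalg.inv(x))",
]

def build_fuzz_prompt(target_api: str) -> str:
    shots = "\n\n".join(BUG_TRIGGERING_EXAMPLES)
    return (
        "The following programs exposed deep learning library bugs:\n\n"
        f"{shots}\n\n"
        f"Write another short, unusual program that stresses {target_api} "
        "with edge-case inputs (empty, huge, or mismatched shapes)."
    )

def generate(prompt: str) -> str:
    # Placeholder for a Codex/CodeGen-style code generation call.
    return "import torch\nprint(torch.nn.functional.pad(torch.empty(0), (1, -1)))"

if __name__ == "__main__":
    print(generate(build_fuzz_prompt("torch.nn.functional.pad")))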

DOI: 10.1145/3597503.3623343


Deeply Reinforcing Android GUI Testing with Deep Reinforcement Learning

作者: Lan, Yuanhong and Lu, Yifei and Li, Zhong and Pan, Minxue and Yang, Wenhua and Zhang, Tian and Li, Xuandong
关键词: android testing, deep reinforcement learning, graph embedding

Abstract

As the scale and complexity of Android applications continue to grow in response to increasing market and user demands, quality assurance challenges become more significant. While previous studies have demonstrated the superiority of Reinforcement Learning (RL) in Android GUI testing, its effectiveness remains limited, particularly in large, complex apps. This limitation arises from the ineffectiveness of Tabular RL in learning the knowledge within the large state-action space of the App Under Test (AUT) and from the suboptimal utilization of the acquired knowledge when employing more advanced RL techniques. To address such limitations, this paper presents DQT, a novel automated Android GUI testing approach based on deep reinforcement learning. DQT preserves widgets’ structural and semantic information with graph embedding techniques, building a robust foundation for identifying similar states or actions and distinguishing different ones. Moreover, a specially designed Deep Q-Network (DQN) effectively guides curiosity-driven exploration by learning testing knowledge from runtime interactions with the AUT and sharing it across states or actions. Experiments conducted on 30 diverse open-source apps demonstrate that DQT outperforms existing state-of-the-art testing approaches in both code coverage and fault detection, particularly for large, complex apps. The faults detected by DQT have been reproduced and reported to developers; so far, 21 of the reported issues have been explicitly confirmed, and 14 have been fixed.

DOI: 10.1145/3597503.3623344


Unveiling Memorization in Code Models

作者: Yang, Zhou and Zhao, Zhipeng and Wang, Chenyu and Shi, Jieke and Kim, Dongsun and Han, Donggyun and Lo, David
关键词: open-source software, memorization, code generation

Abstract

The availability of large-scale datasets, advanced architectures, and powerful computational resources has led to effective code models that automate diverse software engineering activities. The datasets usually consist of billions of lines of code from both open-source and private repositories. A code model memorizes and produces source code verbatim, which potentially contains vulnerabilities, sensitive information, or code with strict licenses, leading to potential security and privacy issues. This paper investigates an important problem: to what extent do code models memorize their training data? We conduct an empirical study to explore memorization in large pre-trained code models. Our study highlights that simply extracting 20,000 outputs (each having 512 tokens) from a code model can produce over 40,125 code snippets that are memorized from the training data. To provide a better understanding, we build a taxonomy of memorized contents with 3 categories and 14 subcategories. The results show that the prompts sent to the code models affect the distribution of memorized contents. We identify several key factors of memorization. Specifically, given the same architecture, larger models suffer more from the memorization problem. A code model produces more memorization when it is allowed to generate longer outputs. We also find a strong positive correlation between the number of an output's occurrences in the training data and that in the generated outputs, which indicates that a potential way to reduce memorization is to remove duplicates in the training data. We then identify effective metrics that accurately infer whether an output contains memorized content. We also make suggestions for dealing with memorization.
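
A toy version of the kind of memorization check described above can be written as an n-gram overlap test between a generated output and the training corpus. Real analyses work at a much larger scale with token-level matching; the corpus, generated text, and window size here are illustrative only.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorized_spans(output: str, training_corpus: list[str], n: int = 6):
    """Return n-gram spans of the output that appear verbatim in the corpus."""
    corpus_ngrams = set()
    for document in training_corpus:
        corpus_ngrams |= ngrams(document.split(), n)
    return [span for span in ngrams(output.split(), n) if span in corpus_ngrams]

if __name__ == "__main__":
    corpus = ["def read_file ( path ) : with open ( path ) as f : return f . read ( )"]
    generated = "def read_file ( path ) : with open ( path ) as f : return f . read ( )"
    print(len(memorized_spans(generated, corpus)), "memorized 6-gram spans")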

DOI: 10.1145/3597503.3639074


Code Search Is All You Need? Improving Code Suggestions with Code Search

作者: Chen, Junkai and Hu, Xing and Li, Zhenhao and Gao, Cuiyun and Xia, Xin and Lo, David
关键词: code suggestion, code search, language model

Abstract

Modern integrated development environments (IDEs) provide various automated code suggestion techniques (e.g., code completion and code generation) to help developers improve their efficiency. Such techniques may retrieve similar code snippets from the code base or leverage deep learning models to provide code suggestions. However, how to effectively enhance the code suggestions using code retrieval has not been systematically investigated. In this paper, we study and explore a retrieval-augmented framework for code suggestions. Specifically, our framework leverages different retrieval approaches and search strategies to search similar code snippets. Then the retrieved code is used to further enhance the performance of language models on code suggestions. We conduct experiments by integrating different language models into our framework and compare the results with their original models. We find that our framework noticeably improves the performance of both code completion and code generation by up to 53.8% and 130.8% in terms of BLEU-4, respectively. Our study highlights that integrating the retrieval process into code suggestions can improve the performance of code suggestions by a large margin.
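
A minimal sketch of the retrieval-augmented idea described above: retrieve the most similar snippet from a code base (here by simple token overlap, standing in for BM25 or dense retrieval) and prepend it to the prompt given to a code language model. The complete function is a hypothetical stand-in for any such model, and the code base is a toy example.

def similarity(query: str, snippet: str) -> float:
    q, s = set(query.split()), set(snippet.split())
    return len(q & s) / len(q | s) if q | s else 0.0

def retrieve(query: str, code_base: list[str]) -> str:
    # Token-overlap retrieval; a real system would use BM25 or embeddings.
    return max(code_base, key=lambda snippet: similarity(query, snippet))

def build_prompt(context: str, code_base: list[str]) -> str:
    retrieved = retrieve(context, code_base)
    return f"# Similar code:\n{retrieved}\n\n# Complete the following:\n{context}"

def complete(prompt: str) -> str:
    # Placeholder for a code language model.
    return "    return sorted(items, key=len)"

if __name__ == "__main__":
    base = ["def sort_by_length(xs):\n    return sorted(xs, key=len)"]
    print(complete(build_prompt("def sort_names(items):", base)))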

DOI: 10.1145/3597503.3639085


On Extracting Specialized Code Abilities from Large Language Models: A Feasibility Study

作者: Li, Zongjie and Wang, Chaozheng and Ma, Pingchuan and Liu, Chaowei and Wang, Shuai and Wu, Daoyuan and Gao, Cuiyun and Liu, Yang
关键词: large language models, imitation attacks

Abstract

Recent advances in large language models (LLMs) significantly boost their usage in software engineering. However, training a well-performing LLM demands a substantial workforce for data collection and annotation. Moreover, training datasets may be proprietary or partially open, and the process often requires a costly GPU cluster. The intellectual property value of commercial LLMs makes them attractive targets for imitation attacks, but creating an imitation model with comparable parameters still incurs high costs. This motivates us to explore a practical and novel direction: slicing commercial black-box LLMs using medium-sized backbone models. In this paper, we explore the feasibility of launching imitation attacks on LLMs to extract their specialized code abilities, such as “code synthesis” and “code translation.” We systematically investigate the effectiveness of launching code ability extraction attacks under different code-related tasks with multiple query schemes, including zero-shot, in-context, and Chain-of-Thought. We also design response checks to refine the outputs, leading to an effective imitation training process. Our results show promising outcomes, demonstrating that with a reasonable number of queries, attackers can train a medium-sized backbone model to replicate specialized code behaviors similar to the target LLMs. We summarize our findings and insights to help researchers better understand the threats posed by imitation attacks, including revealing a practical attack surface for generating adversarial code examples against LLMs.

DOI: 10.1145/3597503.3639091


When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference

作者: Sun, Zhensu and Du, Xiaoning and Song, Fu and Wang, Shangwen and Li, Li
关键词: No keywords

Abstract

Leveraging recent advancements in large language models, modern neural code completion models have demonstrated the capability to generate highly accurate code suggestions. However, their massive size poses challenges in terms of computational costs and environmental impact, hindering their widespread adoption in practical scenarios. Dynamic inference emerges as a promising solution, as it allocates minimal computation during inference while maintaining the model's performance. In this research, we explore dynamic inference within the context of code completion. Initially, we conducted an empirical investigation on GPT-2, focusing on the inference capabilities of intermediate layers for code completion. We found that 54.4% of tokens can be accurately generated using just the first layer, signifying significant computational savings potential. Moreover, despite using all layers, the model still fails to predict 14.5% of tokens correctly, and the subsequent completions continuing from them are rarely considered helpful, with only a 4.2% Acceptance Rate. These findings motivate our exploration of dynamic inference in code completion and inspire us to enhance it with a decision-making mechanism that stops the generation of incorrect code. We thus propose a novel dynamic inference method specifically tailored for code completion models. This method aims not only to produce correct predictions with largely reduced computation but also to prevent incorrect predictions proactively. Our extensive evaluation shows that it can skip an average of 1.7 of the models' 16 layers, leading to an 11.2% speedup with only a marginal 1.1% reduction in ROUGE-L.
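
The early-exit mechanism behind dynamic inference can be illustrated with the toy sketch below: run the layers one by one, produce an intermediate prediction after each, and stop as soon as the prediction is confident enough. The random "layers", the lm_head, and the confidence threshold are stand-ins for a real code completion model and a tuned exit criterion, not the paper's method.

import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, N_LAYERS = 32, 100, 16
layers = [rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN) for _ in range(N_LAYERS)]
lm_head = rng.normal(size=(HIDDEN, VOCAB)) / np.sqrt(HIDDEN)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def early_exit_predict(hidden, threshold):
    """Run layers one by one and stop once an intermediate prediction is confident enough."""
    for depth, weights in enumerate(layers, start=1):
        hidden = np.tanh(hidden @ weights)      # stand-in for a transformer layer
        probs = softmax(hidden @ lm_head)       # intermediate prediction at this depth
        if probs.max() >= threshold:            # confident enough: skip the remaining layers
            return int(probs.argmax()), depth
    return int(probs.argmax()), N_LAYERS

# The tiny threshold is only so the toy model exits early; real systems tune it on held-out data.
token, used = early_exit_predict(rng.normal(size=HIDDEN), threshold=0.05)
print(f"predicted token {token} using {used}/{N_LAYERS} layers")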

DOI: 10.1145/3597503.3639120


GrammarT5: Grammar-Integrated Pretrained Encoder-Decoder Neural Model for Code

作者: Zhu, Qihao and Liang, Qingyuan and Sun, Zeyu and Xiong, Yingfei and Zhang, Lu and Cheng, Shengyu
关键词: neural networks, pretrained model, text tagging

Abstract

Pretrained models for code have exhibited promising performance across various code-related tasks, such as code summarization, code completion, code translation, and bug detection. However, despite their success, the majority of current models still represent code as a token sequence, which may not adequately capture the essence of the underlying code structure. In this work, we propose GrammarT5, a grammar-integrated encoder-decoder pretrained neural model for code. GrammarT5 employs a novel grammar-integrated representation, Tokenized Grammar Rule Sequence (TGRS), for code. TGRS is constructed based on the grammar rule sequence utilized in syntax-guided code generation and integrates syntax information with code tokens within an appropriate input length. Furthermore, we suggest attaching language flags to help GrammarT5 differentiate between grammar rules of various programming languages. Finally, we introduce two novel pretraining tasks, Edge Prediction (EP) and Sub-Tree Prediction (STP), to learn syntactic information. Experiments were conducted on five code-related tasks using eleven datasets, demonstrating that GrammarT5 achieves state-of-the-art (SOTA) performance on most tasks in comparison to models of the same scale. Additionally, the paper illustrates that the proposed pretraining tasks and language flags can enhance GrammarT5 to better capture the syntax and semantics of code.

DOI: 10.1145/3597503.3639125


On Calibration of Pre-trained Code Models

作者: Zhou, Zhenhao and Sha, Chaofeng and Peng, Xin
关键词: pre-trained code models, model calibration, model reliability

Abstract

Pre-trained code models have achieved notable success in the field of Software Engineering (SE). However, existing studies have predominantly focused on improving model performance, with limited attention given to other critical aspects such as model calibration. Model calibration, which refers to the accurate estimation of predictive uncertainty, is a vital consideration in practical applications. Therefore, in order to advance the understanding of model calibration in SE, we conduct a comprehensive investigation into the calibration of pre-trained code models in this paper. Our investigation focuses on five pre-trained code models and four code understanding tasks, including analyses of calibration in both in-distribution and out-of-distribution settings. Several key insights are uncovered: (1) pre-trained code models may suffer from the issue of over-confidence; (2) temperature scaling and label smoothing are effective in calibrating code models in in-distribution data; (3) the issue of over-confidence in pre-trained code models worsens in different out-of-distribution settings, and the effectiveness of temperature scaling and label smoothing diminishes. All materials used in our experiments are available at https://github.com/queserasera22/Calibration-of-Pretrained-Code-Models.
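
Two standard calibration tools relevant to this line of work are expected calibration error (ECE) as a measurement and temperature scaling as a remedy, the latter being one of the techniques the abstract reports as effective. The sketch below computes ECE on synthetic logits and fits a temperature by grid search over a validation set; it is a generic illustration, not the paper's experimental setup.

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ece(probs, labels, n_bins=10):
    """Expected calibration error: gap between confidence and accuracy per bin."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            total += mask.mean() * abs((predictions[mask] == labels[mask]).mean()
                                       - confidence[mask].mean())
    return total

def fit_temperature(logits, labels, grid=np.linspace(0.5, 8.0, 76)):
    """Pick the temperature minimizing negative log-likelihood on validation data."""
    def nll(t):
        p = softmax(logits, t)[np.arange(len(labels)), labels]
        return -np.log(p + 1e-12).mean()
    return min(grid, key=nll)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 4, size=500)
    # Scaling up the logits makes the toy model over-confident without changing its accuracy.
    logits = 4.0 * (rng.normal(size=(500, 4)) + 1.5 * np.eye(4)[labels])
    t = fit_temperature(logits, labels)
    print("ECE before:", round(ece(softmax(logits), labels), 3),
          "after:", round(ece(softmax(logits, t), labels), 3),
          "fitted T:", round(float(t), 2))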

DOI: 10.1145/3597503.3639126


Traces of Memorisation in Large Language Models for Code

作者: Al-Kaswan, Ali and Izadi, Maliheh and van Deursen, Arie
关键词: large language models, privacy, memorisation, data leakage

Abstract

Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers with data extraction attacks. In this work, we explore memorisation in large language models for code and compare the rate of memorisation with large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models, and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts. From the training data identified as potentially extractable, we were able to extract 47% from a CodeGen-Mono-16B code completion model. We also observe that models memorise more as their parameter count grows, and that their pre-training data are also vulnerable to attack. We also find that data carriers are memorised at a higher rate than regular code or documentation and that different model architectures memorise different samples. Data leakage has severe outcomes, so we urge the research community to further investigate the extent of this phenomenon using a wider range of models and extraction techniques in order to build safeguards to mitigate this issue.

DOI: 10.1145/3597503.3639133


Language Models for Code Completion: A Practical Evaluation

作者: Izadi, Maliheh and Katzy, Jonathan and Van Dam, Tim and Otten, Marc and Popescu, Razvan Mihai and Van Deursen, Arie
关键词: automatic code completion, transformers, language models, IDE, evaluation, open source, InCoder, UniXcoder, CodeGPT

Abstract

Transformer-based language models for automatic code completion have shown great promise so far, yet the evaluation of these models rarely uses real data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. These models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind the poor model performance. A comparative analysis of the models' performance in online and offline settings was also performed, using benchmark synthetic datasets and two masking strategies. Our findings suggest that while developers utilize code completion across various languages, the best results are achieved for mainstream languages such as Python and Java. InCoder outperformed the other models across all programming languages, highlighting the significance of training data and objectives. Our study also revealed that offline evaluations do not accurately reflect real-world scenarios. Upon qualitative analysis of the models' predictions, we found that 66.3% of failures were due to models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote. Given these findings, we propose several strategies to overcome the current limitations. These include refining training objectives, improving resilience to typographical errors, adopting hybrid approaches, and enhancing implementations and usability.

DOI: 10.1145/3597503.3639138


Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models

作者: Gao, Shuzheng and Mao, Wenxin and Gao, Cuiyun and Li, Li and Hu, Xing and Xia, Xin and Lyu, Michael R.
关键词: No keywords

Abstract

Pre-trained code models have recently achieved substantial improvements in many code intelligence tasks. These models are first pre-trained on large-scale unlabeled datasets in a task-agnostic manner using self-supervised learning, and then fine-tuned on labeled datasets in downstream tasks. However, the labeled datasets are usually limited in size (i.e., they require intensive human effort to build), which may hinder the performance of pre-trained code models in specific tasks. To mitigate this, one possible solution is to leverage the large-scale unlabeled data in the tuning stage by pseudo-labeling, i.e., generating pseudo labels for unlabeled data and further training the pre-trained code models with the pseudo-labeled data. However, directly employing the pseudo-labeled data can bring a large amount of noise, i.e., incorrect labels, leading to suboptimal performance. How to effectively leverage the noisy pseudo-labeled data is a challenging yet under-explored problem. In this paper, we propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets by better utilizing the pseudo-labeled data. HINT includes two main modules: Hybrid pseudo-labeled data selection and Noise-tolerant Training. In the hybrid pseudo-labeled data selection module, considering the robustness issue, apart from directly measuring the quality of pseudo labels through training loss, we propose to further employ a retrieval-based method to filter low-quality pseudo-labeled data. The noise-tolerant training module aims to further mitigate the influence of errors in pseudo labels by training the model with a noise-tolerant loss function and by regularizing the consistency of model predictions. We evaluate the effectiveness of HINT on three popular code intelligence tasks, including code summarization, defect detection, and assertion generation. We build our method on top of three popular open-source pre-trained code models. The experimental results show that HINT can better leverage those unlabeled data in a task-specific way and provide complementary benefits for pre-trained models, e.g., improving the best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect detection, and assertion generation, respectively.
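
One ingredient of the selection idea described above, measuring pseudo-label quality through training loss, can be sketched as a simple small-loss filter. The retrieval-based check and the noise-tolerant loss are omitted, and the synthetic losses below merely assume that wrong pseudo labels tend to incur larger losses; this is an illustration, not HINT's implementation.

import numpy as np

def select_by_small_loss(losses, keep_ratio=0.7):
    """Keep the pseudo-labeled samples with the smallest training loss."""
    keep = int(len(losses) * keep_ratio)
    return np.argsort(losses)[:keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean_losses = rng.normal(0.3, 0.1, size=70)   # plausible losses for correct pseudo labels
    noisy_losses = rng.normal(1.2, 0.3, size=30)   # larger losses for incorrect pseudo labels
    losses = np.concatenate([clean_losses, noisy_losses])
    kept = select_by_small_loss(losses)
    print("kept", len(kept), "samples;",
          "fraction of noisy samples kept:", round(float((kept >= 70).mean()), 2))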

DOI: 10.1145/3597503.3639216


Evaluating Large Language Models in Class-Level Code Generation

作者: Du, Xueying and Liu, Mingwei and Wang, Kaixin and Wang, Hanlin and Liu, Junwei and Chen, Yixuan and Feng, Jiayi and Sha, Chaofeng and Peng, Xin and Lou, Yiling
关键词: class-level code generation, large language model, benchmark

Abstract

Recently, many large language models (LLMs) have been proposed, showing advanced proficiency in code generation. Meanwhile, many efforts have been dedicated to evaluating LLMs on code generation benchmarks such as HumanEval. Although very helpful for comparing different LLMs, existing evaluation focuses on a simple code generation scenario (i.e., function-level or statement-level code generation), which mainly asks LLMs to generate one single code unit (e.g., a function or a statement) for the given natural language description. Such evaluation focuses on generating independent and often small-scale code units, thus leaving it unclear how LLMs perform in real-world software development scenarios. To fill this knowledge gap, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e., class-level code generation. Compared with existing code generation benchmarks, it better reflects real-world software development scenarios because it comprises broader contextual dependencies and multiple, interdependent units of code. We first manually construct the first class-level code generation benchmark ClassEval of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on the new benchmark ClassEval, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we find that all LLMs perform much worse on class-level code generation than on method-level code generation. While GPT models still dominate other LLMs on class-level code generation, the performance rankings of other models on method-level code generation no longer hold for class-level code generation. Besides, most models (except the GPT models) perform better when generating the class method by method, and they show limited ability to generate dependent code. Based on our findings, we call for software engineering (SE) researchers' expertise to build more LLM benchmarks based on practical and complicated software development scenarios.

DOI: 10.1145/3597503.3639219


Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code

作者: Pan, Rangeet and Ibrahimzada, Ali Reza and Krishna, Rahul and Sankar, Divya and Wassi, Lambert Pouguem and Merler, Michele and Sobolev, Boris and Pavuluri, Raju and Sinha, Saurabh and Jabbarvand, Reyhaneh
关键词: code translation, bug taxonomy, llm

Abstract

Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. The prerequisite for advancing the state of LLM-based code translation is to understand their promises and limitations over existing techniques. To that end, we present a large-scale empirical study to investigate the ability of general LLMs and code LLMs for code translation across pairs of different languages, including C, C++, Go, Java, and Python. Our study, which involves the translation of 1,700 code samples from three benchmarks and two real-world projects, reveals that LLMs are yet to be reliably used to automate code translation—with correct translations ranging from 2.1% to 47.3% for the studied LLMs. Further manual investigation of unsuccessful translations identifies 15 categories of translation bugs. We also compare LLM-based code translation with traditional non-LLM-based approaches. Our analysis shows that these two classes of techniques have their own strengths and weaknesses. Finally, insights from our study suggest that providing more context to LLMs during translation can help them produce better results. To that end, we propose a prompt-crafting approach based on the symptoms of erroneous translations; this improves the performance of LLM-based code translation by 5.5% on average. Our study is the first of its kind, in terms of scale and breadth, that provides insights into the current limitations of LLMs in code translation and opportunities for improving them. Our dataset—consisting of 1,700 code samples in five PLs with 10K+ tests, 43K+ translated code, 1,748 manually labeled bugs, and 1,365 bug-fix pairs—can help drive research in this area.

DOI: 10.1145/3597503.3639226


Out of Context: How important is Local Context in Neural Program Repair?

作者: Prenner, Julian Aron and Robbes, Romain
关键词: automated program repair, data-driven software engineering

Abstract

Deep learning source code models have been applied very successfully to the problem of automated program repair. One of the outstanding issues is the small input window of current models, which often cannot fully fit the context code required for a bug fix (e.g., method or class declarations of a project). Instead, input is often restricted to the local context, that is, the lines below and above the bug location. In this work, we study the importance of this local context for repair success: How much local context is needed? Is context before or after the bug location more important? How is local context tied to the bug type? To answer these questions we train and evaluate Transformer models in many different local context configurations on three datasets and two programming languages. Our results indicate that overall repair success increases with the size of the local context (albeit not for all bug types) and confirm the common practice that roughly 50–60% of the input window should be used for context leading the bug. Our results are not only relevant for researchers working on Transformer-based APR tools but also for benchmark and dataset creators who must decide what and how much context to include in their datasets.
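
The context-window construction the study varies can be pictured with a small helper that allocates a line budget around the bug location, here with roughly 60% of the window leading the bug, in line with the 50–60% observation above. The budget and split are illustrative parameters, not the paper's exact configuration.

def build_local_context(lines, bug_line, budget_lines=10, leading_ratio=0.6):
    """Split a line budget around the bug location, biased toward leading context."""
    before = int(budget_lines * leading_ratio)
    after = budget_lines - before
    start = max(0, bug_line - before)
    end = min(len(lines), bug_line + 1 + after)
    return lines[start:bug_line], lines[bug_line], lines[bug_line + 1:end]

if __name__ == "__main__":
    source = [f"line {i}" for i in range(40)]
    context_before, buggy, context_after = build_local_context(source, bug_line=20)
    print(len(context_before), "leading lines |", buggy, "|", len(context_after), "trailing lines")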

DOI: 10.1145/3597503.3639086


Automated Program Repair, What Is It Good For? Not Absolutely Nothing!

作者: Eladawy, Hadeel and Le Goues, Claire and Brun, Yuriy
关键词: automated program repair, debugging, human factors, user study

Abstract

Industrial deployments of automated program repair (APR), e.g., at Facebook and Bloomberg, signal a new milestone for this exciting and potentially impactful technology. In these deployments, developers use APR-generated patch suggestions as part of a human-driven debugging process. Unfortunately, little is known about how using patch suggestions affects developers during debugging. This paper conducts a controlled user study with 40 developers with a median of 6 years of experience. The developers engage in debugging tasks on nine naturally-occurring defects in real-world, open-source, Java projects, using Recoder, SimFix, and TBar, three state-of-the-art APR tools. For each debugging task, the developers either have access to the project's tests, or, also, to code suggestions that make all the tests pass. These suggestions are either developer-written or APR-generated, which can be correct or deceptive. Deceptive suggestions, which are a common APR occurrence, make all the available tests pass but fail to generalize to the intended specification. Through a total of 160 debugging sessions, we find that access to a code suggestion significantly increases the odds of submitting a patch. Access to correct APR suggestions increases the odds of debugging success by 14,000% as compared to having access only to tests, but access to deceptive suggestions decreases the odds of success by 65%. Correct suggestions also speed up debugging. Surprisingly, we observe no significant difference in how novice and experienced developers are affected by APR, suggesting that APR may find uses across the experience spectrum. Overall, developers come away with a strong positive impression of APR, suggesting promise for APR-mediated, human-driven debugging, despite existing challenges in APR-generated repair quality.

DOI: 10.1145/3597503.3639095


Rust-lancet: Automated Ownership-Rule-Violation Fixing with Behavior Preservation

作者: Yang, Wenzhang and Song, Linhai and Xue, Yinxing
关键词: rust, program repair, compiler error

Abstract

As a relatively new programming language, Rust is designed to provide both memory safety and runtime performance. To achieve this goal, Rust conducts rigorous static checks against its safety rules during compilation, effectively eliminating memory safety issues that plague C/C++ programs. Although useful, the safety rules pose programming challenges to Rust programmers, since programmers can easily violate safety rules when coding in Rust, leading their code to be rejected by the Rust compiler, a fact underscored by a recent user study. There exists a desire to automate the process of fixing safety-rule violations to enhance Rust's programmability. In this paper, we concentrate on Rust's ownership rules and develop rust-lancet to automatically fix their violations. We devise three strategies for altering code, each intended to modify a Rust program and make it pass Rust's compiler checks. Additionally, we introduce mental semantics to model the behaviors of Rust programs that cannot be compiled due to ownership-rule violations. We design an approach to verify whether modified programs preserve their original behaviors before patches are applied. We apply rust-lancet to 160 safety-rule violations from two sources, successfully fixing 102 violations under the optimal configuration — more than rustc and six LLM-based techniques. Notably, rust-lancet avoids generating any incorrect patches, a distinction from all other baseline techniques. We also verify the effectiveness of each fixing strategy and behavior preservation validation and affirm the rationale behind these components.

DOI: 10.1145/3597503.3639103


Exploring Experiences with Automated Program Repair in Practice

作者: Meem, Fairuz Nawer and Smith, Justin and Johnson, Brittany
关键词: automated program repair, software bugs, software tools

Abstract

Automated program repair, also known as APR, is an approach for automatically repairing software faults. There is a large amount of research on automated program repair, but very little offers in-depth insights into how practitioners think about and employ APR in practice. To learn more about practitioners’ perspectives and experiences with current APR tools and techniques, we administered a survey, which received valid responses from 331 software practitioners. We analyzed survey responses to gain insights regarding factors that correlate with APR awareness, experience, and use. We established a strong correlation between APR awareness and tool use and attributes including job position, company size, total coding experience, and preferred language of software practitioners. We also found that practitioners are using other forms of support, such as co-workers and ChatGPT, more frequently than APR tools when fixing software defects. We learned about the drawbacks that practitioners encounter while utilizing existing APR tools and the impact that each drawback has on their practice. Our findings provide implications for research and practice centered on development, adoption, and use of APR.

DOI: 10.1145/3597503.3639182


PyTy: Repairing Static Type Errors in Python

作者: Chow, Yiu Wai and Di Grazia, Luca and Pradel, Michael
关键词: automatic program repair, type annotation, transfer learning

Abstract

Gradual typing enables developers to annotate types of their own choosing, offering a flexible middle ground between no type annotations and a fully statically typed language. As more and more code bases get type-annotated, static type checkers detect an increasingly large number of type errors. Unfortunately, fixing these errors requires manual effort, hampering the adoption of gradual typing in practice. This paper presents PyTy, an automated program repair approach targeted at statically detectable type errors in Python. The problem of repairing type errors deserves specific attention because it exposes particular repair patterns, offers a warning message with hints about where and how to apply a fix, and because gradual type checking serves as an automatic way to validate fixes. We address this problem through three contributions: (i) an empirical study that investigates how developers fix Python type errors, showing a diverse set of fixing strategies with some recurring patterns; (ii) an approach to automatically extract type error fixes, which enables us to create a dataset of 2,766 error-fix pairs from 176 GitHub repositories, named PyTyDefects; (iii) the first learning-based repair technique for fixing type errors in Python. Motivated by the relative data scarcity of the problem, the neural model at the core of PyTy is trained via cross-lingual transfer learning. Our evaluation shows that PyTy offers fixes for ten frequent categories of type errors, successfully addressing 85.4% of 281 real-world errors. This effectiveness outperforms state-of-the-art large language models asked to repair type errors (by 2.1x) and complements a previous technique aimed at type errors that manifest at runtime. Finally, 20 out of 30 pull requests with PyTy-suggested fixes have been merged by developers, showing the usefulness of PyTy in practice.
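
For readers unfamiliar with statically detectable type errors, below is an invented example of the kind of error and fix pattern PyTy targets; it is not drawn from the PyTyDefects dataset. A gradual type checker such as mypy or pyre would reject the first function because its annotated return type does not cover the None case, and widening the annotation is one common fix.

from typing import Optional

# Before: annotated to return int but may return None, which a static checker reports.
def find_index_buggy(items: list[str], target: str) -> int:
    for index, item in enumerate(items):
        if item == target:
            return index
    return None  # type error: None is not an int

# After: one common fix pattern is to widen the annotation to Optional[int].
def find_index_fixed(items: list[str], target: str) -> Optional[int]:
    for index, item in enumerate(items):
        if item == target:
            return index
    return None

if __name__ == "__main__":
    print(find_index_fixed(["a", "b"], "b"))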

DOI: 10.1145/3597503.3639184


Out of Sight, Out of Mind: Better Automatic Vulnerability Repair by Broadening Input Ranges and Sources

作者: Zhou, Xin and Kim, Kisub and Xu, Bowen and Han, Donggyun and Lo, David
关键词: No keywords

Abstract

The advances of deep learning (DL) have paved the way for automatic software vulnerability repair approaches, which effectively learn the mapping from the vulnerable code to the fixed code. Nevertheless, existing DL-based vulnerability repair methods face notable limitations: 1) they struggle to handle lengthy vulnerable code, 2) they treat code as natural language texts, neglecting its inherent structure, and 3) they do not tap into the valuable expert knowledge present in the expert system. To address this, we propose VulMaster, a Transformer-based neural network model that excels at generating vulnerability repairs by comprehensively understanding the entire vulnerable code, irrespective of its length. This model also integrates diverse information, encompassing vulnerable code structures and expert knowledge from the CWE system. We evaluated VulMaster on a real-world C/C++ vulnerability repair dataset comprising 1,754 projects with 5,800 vulnerable functions. The experimental results demonstrated that VulMaster exhibits substantial improvements compared to the learning-based state-of-the-art vulnerability repair approach. Specifically, VulMaster improves the EM, BLEU, and CodeBLEU scores from 10.2% to 20.0%, 21.3% to 29.3%, and 32.5% to 40.9%, respectively.

DOI: 10.1145/3597503.3639222


Strengthening Supply Chain Security with Fine-grained Safe Patch Identification

作者: Luo, Changhua and Meng, Wei and Wang, Shuai
关键词: supply chain security, fine-grained patch analysis, differential symbolic execution

Abstract

Enhancing supply chain security is crucial, often involving the detection of patches in upstream software. However, current security patch analysis works yield relatively low recall rates (i.e., many security patches are missed). In this work, we offer a new solution to detect safe patches and assist downstream developers in patch propagation. Specifically, we develop SPatch to detect fine-grained safe patches. SPatch leverages fine-grained patch analysis and a new differential symbolic execution technique to analyze the functional impacts of code changes. We evaluated SPatch on various software, including the Linux kernel and OpenSSL, and demonstrated that it outperformed existing methods in detecting safe patches, resulting in observable security benefits. In our case studies, we updated hundreds of functions in modern software using safe patches detected by SPatch without causing any regression issues. Our detected safe security patches have been merged into the latest version of downstream software like ProtonVPN.

DOI: 10.1145/3597503.3639104


Comprehensive Semantic Repair of Obsolete GUI Test Scripts for Mobile Applications

作者: Cao, Shaoheng and Pan, Minxue and Pei, Yu and Yang, Wenhua and Zhang, Tian and Wang, Linzhang and Li, Xuandong
关键词: GUI test script repair, Android testing, regression testing

Abstract

Graphical User Interface (GUI) testing is one of the primary approaches for testing mobile apps. Test scripts serve as the main carrier of GUI testing, yet they are prone to obsolescence when the GUIs change with the apps' evolution. Existing repair approaches based on GUI layouts or images prove effective when the GUI changes between the base and updated versions are minor; however, they may struggle with substantial changes. In this paper, a novel approach named COSER is introduced as a solution to repairing broken scripts, which is capable of addressing larger GUI changes than existing methods. COSER incorporates both external semantic information from the GUI elements and internal semantic information from the source code to provide a unique and comprehensive solution. The efficacy of COSER was demonstrated through experiments conducted on 20 Android apps, resulting in superior performance when compared to the state-of-the-art tools METER and GUIDER. In addition, a tool that implements the COSER approach is available for practical use and future research.

DOI: 10.1145/3597503.3639108


Constraint Based Program Repair for Persistent Memory Bugs

作者: Huang, Zunchen and Wang, Chao
关键词: No keywords

Abstract

We propose a constraint based method for repairing bugs associated with the use of persistent memory (PM) in application software. Our method takes a program execution trace and the violated property as input and returns a suggested repair, which is a combination of inserting new PM instructions and reordering these instructions to eliminate the property violation. Compared with the state-of-the-art approach, our method has three advantages. First, it can repair both durability and crash consistency bugs whereas the state-of-the-art approach can only repair the relatively-simple durability bugs. Second, our method can discover new repair strategies instead of relying on repair strategies hard-coded into the repair tool. Third, our method uses a novel symbolic encoding to model PM semantics, which allows our symbolic analysis to be more efficient than the explicit enumeration of possible scenarios and thus explore a large number of repairs quickly. We have evaluated our method on benchmark programs from the well-known Intel PMDK library as well as real applications such as Memcached, Recipe, and Redis. The results show that our method can repair all of the 41 known bugs in these benchmarks, while the state-of-the-art approach cannot repair any of the crash consistency bugs.

DOI: 10.1145/3597503.3639204


Xpert: Empowering Incident Management with Query Recommendations via Large Language Models

作者: Jiang, Yuxuan and Zhang, Chaoyun and He, Shilin and Yang, Zhihao and Ma, Minghua and Qin, Si and Kang, Yu and Dang, Yingnong and Rajmohan, Saravan and Lin, Qingwei and Zhang, Dongmei
关键词: incident management, query generation, large language model

Abstract

Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user experience. To swiftly resolve such incidents, on-call engineers depend on crafting domain-specific language (DSL) queries to analyze telemetry data. However, writing these queries can be challenging and time-consuming. This paper presents a thorough empirical study on the utilization of queries of KQL, a DSL employed for incident management in a large-scale cloud management system at Microsoft. The findings obtained underscore the importance and viability of KQL query recommendation for enhancing incident management. Building upon these valuable insights, we introduce Xpert, an end-to-end machine learning framework that automates the KQL recommendation process. By leveraging historical incident data and large language models, Xpert generates customized KQL queries tailored to new incidents. Furthermore, Xpert incorporates a novel performance metric called Xcore, enabling a thorough evaluation of query quality from three comprehensive perspectives. We conduct extensive evaluations of Xpert, demonstrating its effectiveness in offline settings. Notably, we deploy Xpert in the real production environment of a large-scale incident management system in Microsoft, validating its efficiency in supporting incident management. To the best of our knowledge, this paper represents the first empirical study of its kind, and Xpert stands as a pioneering DSL query recommendation framework designed for incident management.

DOI: 10.1145/3597503.3639081


Tensor-Aware Energy Accounting

作者: Babakol, Timur and Liu, Yu David
关键词: No keywords

Abstract

With the rapid growth of Artificial Intelligence (AI) applications supported by deep learning (DL), the energy efficiency of these applications has an increasingly large impact on sustainability. We introduce Smaragdine, a new energy accounting system for tensor-based DL programs implemented with TensorFlow. At the heart of Smaragdine is a novel white-box methodology of energy accounting: Smaragdine is aware of the internal structure of the DL program, which we call tensor-aware energy accounting. With Smaragdine, the energy consumption of a DL program can be broken down into units aligned with its logical hierarchical decomposition structure. We apply Smaragdine for understanding the energy behavior of BERT, one of the most widely used language models. Layer-by-layer and tensor-by-tensor, Smaragdine is capable of identifying the highest energy/power-consuming components of BERT. Furthermore, we conduct two case studies on how Smaragdine supports downstream toolchain building, one on the comparative energy impact of hyperparameter tuning of BERT, the other on the energy behavior evolution when BERT evolves to its next generation, ALBERT.

DOI: 10.1145/3597503.3639156


Programming Assistant for Exception Handling with CodeBERT

作者: Cai, Yuchen and Yadavally, Aashish and Mishra, Abhishek and Montejo, Genesis and Nguyen, Tien
关键词: AI4SE, large language models, automated exception handling

Abstract

With practical code reuse, the code fragments from developers' forums often migrate to applications. Owing to the incomplete nature of such fragments, they often lack the details on exception handling. The adaptation of exception handling to the codebase is not trivial, as developers must learn and memorize what API methods could cause exceptions and what exceptions need to be handled. We propose Neurex, an exception handling recommender that learns from complete code, and accepts a given Java code snippet and recommends 1) if a try-catch block is needed, 2) what statements need to be placed in a try block, and 3) what exception types need to be caught in the catch clause. Inspired by the sequence chunking techniques in natural language processing, we design Neurex via a multi-tasking model with the fine-tuning of the large language model CodeBERT for these three exception handling recommendation tasks. Via the large language model, Neurex can learn the surrounding context, leading to better learning of the dependencies among the API elements and of the relations between the statements and the corresponding exception types that need to be handled. Our empirical evaluation shows that Neurex correctly performs all three exception handling recommendation tasks in 71.5% of the cases with an F1-score of 70.2%, which is a relative improvement of 166% over the baseline. It achieves high F1-scores of 98.2% to 99.7% in try-catch block necessity checking (a relative improvement of up to 55.9% over the baselines). It also correctly decides both the need for try-catch block(s) and the statements to be placed in try blocks with F1-scores of 74.7% and 87.1% at the instance and statement levels, an improvement of 129.1% and 44.9% over the baseline, respectively. Our extrinsic evaluation shows that Neurex relatively improves over the baseline by 56.5% in F1-score for detecting exception-related bugs in incomplete Android code snippets.

DOI: 10.1145/3597503.3639188


An Empirical Study on Noisy Label Learning for Program Understanding

作者: Wang, Wenhan and Li, Yanzhou and Li, Anran and Zhang, Jian and Ma, Wei and Liu, Yang
关键词: program understanding, deep learning, noisy label learning

Abstract

Recently, deep learning models have been widely applied in program understanding tasks, and these models achieve state-of-the-art results on many benchmark datasets. A major challenge of deep learning for program understanding is that the effectiveness of these approaches depends on the quality of their datasets, and these datasets often contain noisy data samples. A typical kind of noise in program understanding datasets is label noise, which means that the target outputs for some inputs are incorrect. Researchers have proposed various approaches to alleviate the negative impact of noisy labels, and formed a new research topic: noisy label learning (NLL). In this paper, we conduct an empirical study on the effectiveness of noisy label learning on deep learning for program understanding datasets. We evaluate various NLL approaches and deep learning models on three tasks: program classification, vulnerability detection, and code summarization. From the evaluation results, we come to the following findings: 1) small trained-from-scratch models are prone to label noise in program understanding, while large pre-trained models are highly robust against it. 2) NLL approaches significantly improve the program classification accuracies for small models on noisy training sets, but they only slightly benefit large pre-trained models in classification accuracies. 3) NLL approaches can effectively detect synthetic noise in program understanding, but struggle to detect real-world noise. We believe our findings can provide insights into the abilities of NLL in program understanding, and shed light on future work in tackling noise in software engineering datasets. We have released our code at https://github.com/jacobwwh/noise_SE.
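
As one concrete example of the noise-tolerant losses that NLL research builds on, the sketch below implements the generalized cross-entropy (GCE) loss, which interpolates between cross-entropy and mean absolute error and is less sensitive to mislabeled samples. It is a generic illustration; the exact set of NLL approaches evaluated is described in the study itself.

import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gce_loss(logits, labels, q=0.7):
    """Generalized cross-entropy: (1 - p_y^q) / q, averaged over samples."""
    p_y = softmax(logits)[np.arange(len(labels)), labels]
    return float(((1.0 - p_y ** q) / q).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=8)
    logits = rng.normal(size=(8, 3))
    print("GCE loss:", round(gce_loss(logits, labels), 3))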

DOI: 10.1145/3597503.3639217


An Empirical Study on Low GPU Utilization of Deep Learning Jobs

作者: Gao, Yanjie and He, Yichen and Li, Xinze and Zhao, Bo and Lin, Haoxiang and Liang, Yoyo and Zhong, Jing and Zhang, Hongyu and Wang, Jingzhou and Zeng, Yonghua and Gui, Keli and Tong, Jie and Yang, Mao
关键词: deep learning jobs, GPU utilization, empirical study

Abstract

Deep learning plays a critical role in numerous intelligent software applications. Enterprise developers submit and run deep learning jobs on shared, multi-tenant platforms to efficiently train and test models. These platforms are typically equipped with a large number of graphics processing units (GPUs) to expedite deep learning computations. However, certain jobs exhibit rather low utilization of the allocated GPUs, resulting in substantial resource waste and reduced development productivity. This paper presents a comprehensive empirical study on low GPU utilization of deep learning jobs, based on 400 real jobs (with an average GPU utilization of 50% or less) collected from Microsoft’s internal deep learning platform. We discover 706 low-GPU-utilization issues through meticulous examination of job metadata, execution logs, runtime metrics, scripts, and programs. Furthermore, we identify the common root causes and propose corresponding fixes. Our main findings include: (1) Low GPU utilization of deep learning jobs stems from insufficient GPU computations and interruptions caused by non-GPU tasks; (2) Approximately half (46.03%) of the issues are attributed to data operations; (3) 45.18% of the issues are related to deep learning models and manifest during both model training and evaluation stages; (4) Most (84.99%) low-GPU-utilization issues could be fixed with a small number of code/script modifications. Based on the study results, we propose potential research directions that could help developers utilize GPUs better in cloud-based platforms.

DOI: 10.1145/3597503.3639232


Using an LLM to Help With Code Understanding

作者: Nam, Daye and Macvean, Andrew and Hellendoorn, Vincent and Vasilescu, Bogdan and Myers, Brad
关键词: No keywords

Abstract

Understanding code is challenging, especially when working in new and complex development environments. Code comments and documentation can help, but are typically scarce or hard to navigate. Large language models (LLMs) are revolutionizing the process of writing code. Can they do the same for helping understand it? In this study, we provide a first investigation of an LLM-based conversational UI built directly in the IDE that is geared towards code understanding. Our IDE plugin queries OpenAI’s GPT-3.5-turbo model with four high-level requests without the user having to write explicit prompts: to explain a highlighted section of code, provide details of API calls used in the code, explain key domain-specific terms, and provide usage examples for an API. The plugin also allows for open-ended prompts, which are automatically contextualized to the LLM with the program being edited. We evaluate this system in a user study with 32 participants, which confirms that using our plugin can aid task completion more than web search. We additionally provide a thorough analysis of the ways developers use, and perceive the usefulness of, our system, among others finding that the usage and benefits differ between students and professionals. We conclude that in-IDE prompt-less interaction with LLMs is a promising future direction for tool builders.

DOI: 10.1145/3597503.3639187


Enhancing Exploratory Testing by Large Language Model and Knowledge Graph

作者: Su, Yanqi and Liao, Dianshu and Xing, Zhenchang and Huang, Qing and Xie, Mulong and Lu, Qinghua and Xu, Xiwei
关键词: exploratory testing, knowledge graph, AI chain, prompt engineering

Abstract

Exploratory testing leverages the tester’s knowledge and creativity to design test cases for effectively uncovering system-level bugs from the end user’s perspective. Researchers have worked on test scenario generation to support exploratory testing based on a system knowledge graph, enriched with scenario and oracle knowledge from bug reports. Nevertheless, the adoption of this approach is hindered by difficulties in handling bug reports of inconsistent quality and varied expression styles, along with the infeasibility of the generated test scenarios. To overcome these limitations, we utilize the superior natural language understanding (NLU) capabilities of Large Language Models (LLMs) to construct a System KG of User Tasks and Failures (SysKG-UTF). Leveraging the system and bug knowledge from the KG, along with the logical reasoning capabilities of LLMs, we generate test scenarios with high feasibility and coherence. Particularly, we design chain-of-thought (CoT) reasoning to extract human-like knowledge and logical reasoning from LLMs, simulating a developer’s process of validating test scenario feasibility. Our evaluation shows that our approach significantly enhances the KG construction, particularly for bug reports with low quality. Furthermore, our approach generates test scenarios with high feasibility and coherence. The user study further proves the effectiveness of our generated test scenarios in supporting exploratory testing. Specifically, 8 participants find 36 bugs from 8 seed bugs in two hours using our test scenarios, a significant improvement over the 21 bugs found by the state-of-the-art baseline.

DOI: 10.1145/3597503.3639157


LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing

作者: Ma, Zeyang and Chen, An Ran and Kim, Dong Jae and Chen, Tse-Hsun and Wang, Shaowei
关键词: log parsing, log analysis, large language model

Abstract

Logs record important runtime information in modern software development. Log parsing, which extracts structured information from unstructured log data, is the first step in many log-based analyses. Traditional log parsers face challenges in accurately parsing logs due to the diversity of log formats, which directly impacts the performance of downstream log-analysis tasks. In this paper, we explore the potential of using Large Language Models (LLMs) for log parsing and propose LLMParser, an LLM-based log parser built on generative LLMs and few-shot tuning. We leverage four LLMs, Flan-T5-small, Flan-T5-base, LLaMA-7B, and ChatGLM-6B, in LLMParser. Our evaluation on 16 open-source systems shows that LLMParser achieves statistically significantly higher parsing accuracy than state-of-the-art parsers (a 96% average parsing accuracy). We further conduct a comprehensive empirical analysis of the effect of training size, model size, and pre-training LLM on log parsing accuracy. We find that smaller LLMs may be more effective than more complex LLMs; for instance, Flan-T5-base achieves results comparable to LLaMA-7B with a shorter inference time. We also find that using LLMs pre-trained on logs from other systems does not always improve parsing accuracy. While using pre-trained Flan-T5-base shows an improvement in accuracy, pre-trained LLaMA results in a decrease of almost 55% in group accuracy. In short, our study provides empirical evidence for using LLMs for log parsing and highlights the limitations and future research directions of LLM-based log parsers.
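
For flavour, the snippet below assembles an in-context, few-shot style prompt for log parsing from a handful of labelled (log, template) pairs. This is only a rough analogue of the setup described above (the paper tunes the models on a few examples rather than relying purely on prompting), and the demonstration pairs and prompt wording are invented, not the paper's actual prompts.

```python
# Minimal sketch: assembling a few-shot prompt for log parsing.
# The demonstration pairs and the prompt wording are illustrative only.

FEW_SHOT_EXAMPLES = [
    ("Connection from 10.0.0.5 closed", "Connection from <*> closed"),
    ("Received block blk_3587 of size 6710 from /10.0.0.9",
     "Received block <*> of size <*> from <*>"),
]

def build_prompt(raw_log: str) -> str:
    """Turn labelled (log, template) pairs plus a new log line into one prompt."""
    lines = ["Extract the template of each log line, replacing variables with <*>.", ""]
    for log, template in FEW_SHOT_EXAMPLES:
        lines.append(f"Log: {log}")
        lines.append(f"Template: {template}")
        lines.append("")
    lines.append(f"Log: {raw_log}")
    lines.append("Template:")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_prompt("Connection from 192.168.1.7 closed"))
    # The prompt would then be sent to a generative model (e.g., a tuned Flan-T5 or
    # LLaMA in the paper's setting) to obtain the template for the new log line.
```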

DOI: 10.1145/3597503.3639150


Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions

作者: Liu, Zhe and Chen, Chunyang and Wang, Junjie and Chen, Mengzhuo and Wu, Boyu and Che, Xing and Wang, Dandan and Wang, Qing
关键词: automated GUI testing, large language model

Abstract

Automated Graphical User Interface (GUI) testing plays a crucial role in ensuring app quality, especially as mobile applications have become an integral part of our daily lives. Despite the growing popularity of learning-based techniques in automated GUI testing due to their ability to generate human-like interactions, they still suffer from several limitations, such as low testing coverage, inadequate generalization capabilities, and heavy reliance on training data. Inspired by the success of Large Language Models (LLMs) like ChatGPT in natural language understanding and question answering, we formulate the mobile GUI testing problem as a Q&A task. We propose GPTDroid, which asks the LLM to chat with the mobile app under test: it passes GUI page information to the LLM to elicit testing scripts, executes them, and passes the app feedback back to the LLM, iterating the whole process. Within this framework, we have also introduced a functionality-aware memory prompting mechanism that equips the LLM with the ability to retain testing knowledge of the whole process and conduct long-term, functionality-based reasoning to guide exploration. We evaluate it on 93 apps from Google Play and demonstrate that it outperforms the best baseline by 32% in activity coverage, and detects 31% more bugs at a faster rate. Moreover, GPTDroid identifies 53 new bugs on Google Play, of which 35 have been confirmed and fixed.

DOI: 10.1145/3597503.3639180


RogueOne: Detecting Rogue Updates via Differential Data-flow Analysis Using Trust Domains

作者: Sofaer, Raphael J. and David, Yaniv and Kang, Mingqing and Yu, Jianjia and Cao, Yinzhi and Yang, Junfeng and Nieh, Jason
关键词: JavaScript, malicious updates, malware detection, Node.js, supplychain security

Abstract

Rogue updates, an important type of software supply-chain attack in which attackers conceal malicious code inside updates to benign software, are a growing problem due to their stealth and effectiveness. We design and implement RogueOne, a system for detecting rogue updates to JavaScript packages. RogueOne uses a novel differential data-flow analysis to capture how an update changes a package’s interactions with external APIs. Using an efficient form of abstract interpretation that can exclude unchanged code in a package, it constructs an object data-flow relationship graph (ODRG) that tracks data-flows among objects. RogueOne then maps objects to trust domains, a novel abstraction which summarizes trust relationships in a package. Objects are assigned a trust domain based on whether they originate in the target package, a dependency, or in a system API. RogueOne uses the ODRG to build a set of data-flows across trust domains. It compares data-flow sets across package versions to detect untrustworthy new interactions with external APIs. We evaluated RogueOne on hundreds of npm packages, demonstrating its effectiveness at detecting rogue updates and distinguishing them from benign ones. RogueOne achieves high accuracy and can be more than seven times as effective in detecting rogue updates and avoiding false positives compared to other systems built to detect malicious packages.
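
The differential comparison at the end of the pipeline can be sketched in a few lines: if each package version is reduced to a set of data-flows that cross trust domains, the flows newly introduced by an update are a set difference. The domains, APIs, and flows below are invented placeholders, not RogueOne's actual object data-flow relationship graph.

```python
# Minimal sketch of the differential idea: represent each package version as a set of
# cross-trust-domain data-flows (source domain -> external sink API) and flag the
# flows that only appear in the update. All names here are invented for illustration.

OLD_VERSION_FLOWS = {
    ("target-package", "fs.readFile"),
    ("target-package", "https.request"),
}

NEW_VERSION_FLOWS = {
    ("target-package", "fs.readFile"),
    ("target-package", "https.request"),
    ("dependency:left-pad", "child_process.exec"),  # new, suspicious interaction
}

def rogue_candidates(old_flows, new_flows):
    """Cross-domain flows introduced by the update, i.e., candidates for review."""
    return new_flows - old_flows

if __name__ == "__main__":
    for source_domain, sink_api in sorted(rogue_candidates(OLD_VERSION_FLOWS, NEW_VERSION_FLOWS)):
        print(f"new cross-domain flow: {source_domain} -> {sink_api}")
```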

DOI: 10.1145/3597503.3639199


ACAV: A Framework for Automatic Causality Analysis in Autonomous Vehicle Accident Recordings

作者: Sun, Huijia and Poskitt, Christopher M. and Sun, Yang and Sun, Jun and Chen, Yuqi
关键词: autonomous driving system, test reduction, causality

Abstract

The rapid progress of autonomous vehicles (AVs) has brought the prospect of a driverless future closer than ever. Recent fatalities, however, have emphasized the importance of safety validation through large-scale testing. Multiple approaches achieve this fully automatically using high-fidelity simulators, i.e., by generating diverse driving scenarios and evaluating autonomous driving systems (ADSs) against different test oracles. While effective at finding violations, these approaches do not identify the decisions and actions that caused them—information that is critical for improving the safety of ADSs. To address this challenge, we propose ACAV, an automated framework designed to conduct causality analyses for AV accident recordings in two stages. First, we apply feature extraction schemas based on the messages exchanged between ADS modules, and use a weighted voting method to discard frames of the recording unrelated to the accident. Second, we use safety specifications to identify safety-critical frames and deduce causal events by applying CAT—our causal analysis tool—to a station-time graph. We evaluated ACAV on the Apollo ADS, finding that it can identify five distinct types of causal events in 93.64% of 110 accident recordings generated by an AV testing engine. We further evaluated ACAV on 1206 accident recordings collected from versions of Apollo injected with specific faults, finding that it can correctly identify causal events in 96.44% of the accidents triggered by prediction errors, and 85.73% of the accidents triggered by planning errors.

DOI: 10.1145/3597503.3639175


Efficiently Trimming the Fat: Streamlining Software Dependencies with Java Reflection and Dependency Analysis

作者: Song, Xiaohu and Wang, Ying and Cheng, Xiao and Liang, Guangtai and Wang, Qianxiang and Zhu, Zhiliang
关键词: bloated dependencies, java reflection, dependency management

Abstract

Numerous third-party libraries introduced into client projects are not actually required, resulting in modern software being gradually bloated. Software developers may spend considerable unnecessary effort managing the bloated dependencies: keeping the library versions up-to-date, making sure that heterogeneous licenses are compatible, and resolving dependency conflict or vulnerability issues. However, prior debloating techniques can easily produce false alarms of bloated dependencies since they are less effective in analyzing Java reflections. Besides, the solutions given by the existing approaches for removing bloated dependencies may induce new issues that are not conducive to dependency management. To address the above limitations, in this paper, we developed a technique, Slimming, to remove bloated dependencies from software projects reliably. Slimming statically analyzes the Java reflections that are commonly leveraged by popular frameworks (e.g., Spring Boot) and resolves the reflective targets via parsing configuration files (*.xml, *.yml and *.properties). By modeling string manipulations, Slimming fully resolves the string arguments of the reflection APIs of interest to identify all the required dependencies. More importantly, it helps developers analyze the debloating solutions by weighing the benefits against the costs of dependency management. Our evaluation results show that the static reflection analysis capability of Slimming outperforms all the other existing techniques with 97.0% precision and 98.8% recall. Compared with the prior debloating techniques, Slimming can reliably remove the bloated dependencies with a 100% test passing ratio and improve the rationality of debloating solutions. In our large-scale study in the Maven ecosystem, Slimming reported 484 bloated dependencies to 66 open-source projects. 38 reports (57.6%) have been confirmed by developers.
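
As a minimal sketch of the configuration-driven part of this idea, the snippet below scans *.xml/*.yml/*.properties files for fully qualified class names and maps them back to declared dependencies by package prefix. The regex, the file handling, and the dependency map are illustrative simplifications, not Slimming's reflection analysis (which additionally models string manipulations on reflection API arguments).

```python
# Minimal sketch: resolve class names referenced in configuration files and keep the
# declared dependencies whose packages match; anything left over is a debloating
# candidate. Names and the dependency map are assumptions for illustration.
import re
from pathlib import Path

CLASS_NAME = re.compile(r"\b((?:[a-z_]\w*\.){2,}[A-Z]\w*)\b")

def reflective_targets(config_dir: str) -> set[str]:
    """Collect fully qualified class names mentioned in *.xml, *.yml, *.properties."""
    targets: set[str] = set()
    for path in Path(config_dir).rglob("*"):
        if path.is_file() and path.suffix in {".xml", ".yml", ".properties"}:
            targets |= set(CLASS_NAME.findall(path.read_text(errors="ignore")))
    return targets

def debloating_candidates(declared_deps: dict[str, str], targets: set[str]) -> list[str]:
    """declared_deps maps artifact id -> root package it provides (assumed known)."""
    used = {dep for dep, pkg in declared_deps.items()
            if any(target.startswith(pkg) for target in targets)}
    return sorted(set(declared_deps) - used)

if __name__ == "__main__":
    deps = {"spring-boot-starter-web": "org.springframework", "guava": "com.google.common"}
    targets = {"org.springframework.web.servlet.DispatcherServlet"}  # as if read from *.xml
    print(debloating_candidates(deps, targets))  # ['guava'] is a debloating candidate
```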

DOI: 10.1145/3597503.3639123


Symbol-Specific Sparsification of Interprocedural Distributive Environment Problems

作者: Karakaya, Kadiray and Bodden, Eric
关键词: static analysis, sparse analysis, IFDS, IDE, constant propagation

Abstract

Previous work has shown that one can often greatly speed up static analysis by computing data flows not for every edge in the program’s control-flow graph but instead only along definition-use chains. This yields a so-called sparse static analysis. Recent work on SparseDroid has shown that specifically taint analysis can be “sparsified” with extraordinary effectiveness because the taint state of one variable does not depend on those of others. This allows one to soundly omit more flow-function computations than in the general case. In this work, we now assess whether this result carries over to the more generic setting of so-called Interprocedural Distributive Environment (IDE) problems. In contrast to taint analysis, IDE comprises distributive problems with large or even infinitely broad domains, such as typestate analysis or linear constant propagation. Specifically, this paper presents Sparse IDE, a framework that realizes sparsification for any static analysis that fits the IDE framework. We implement Sparse IDE in SparseHeros, as an extension to the popular Heros IDE solver, and evaluate its performance on real-world Java libraries by comparing it to the baseline IDE algorithm. To this end, we design, implement and evaluate a linear constant propagation analysis client on top of SparseHeros. Our experiments show that, although IDE analyses can only be sparsified with respect to symbols and not (numeric) values, Sparse IDE can nonetheless yield significantly lower runtimes, and often also lower memory consumption, compared to the original IDE.

DOI: 10.1145/3597503.3639092


LibAlchemy: A Two-Layer Persistent Summary Design for Taming Third-Party Libraries in Static Bug-Finding Systems

作者: Wu, Rongxin and He, Yuxuan and Huang, Jiafeng and Wang, Chengpeng and Tang, Wensheng and Shi, Qingkai and Xiao, Xiao and Zhang, Charles
关键词: static bug-finding, function summary, third-party library

Abstract

Despite the benefits of using third-party libraries (TPLs), the misuse of TPL functions raises quality and security concerns. Using traditional static analysis to detect bugs caused by TPL functions is non-trivial. One promising solution would be to automatically generate and persist the summaries of TPL functions offline and then reuse these summaries in compositional static analysis online. However, when dealing with millions of lines of TPL code, the summaries designed by existing studies suffer from an unresolved paradox. That is, a highly precise form of summary leads to an unaffordable space and time overhead, while an imprecise one seriously hurts its precision or recall. To address the paradox, we propose a novel two-layer summary design. The first layer utilizes a line-sized program representation known as the program dependence graph to compactly encode path conditions, while the second layer encodes bug-type-specific properties. We implemented our idea as a tool called LibAlchemy and evaluated it on fifteen mature and extensively checked open-source projects. Experimental results show that LibAlchemy can check over ten million lines of code within ten hours. LibAlchemy has detected 55 true bugs with a high precision of 90.16%, eleven of which have been assigned CVE IDs. Compared to whole-program analysis and the conventional design of path-sensitively precise summaries, LibAlchemy achieves an 18.56x and 12.77x speedup and saves 91.49% and 90.51% of memory usage, respectively.

DOI: 10.1145/3597503.3639132


Is unsafe an Achilles’ Heel? A Comprehensive Study of Safety Requirements in Unsafe Rust Programming

作者: Cui, Mohan and Sun, Shuran and Xu, Hui and Zhou, Yangfan
关键词: unsafe rust, safety property, rustdoc, CVE, user survey, undefined behavior

Abstract

Rust is an emerging, strongly-typed programming language focusing on efficiency and memory safety. As more projects adopt Rust, knowing how to use Unsafe Rust is crucial for Rust security. We observed that the descriptions of safety requirements in Unsafe Rust programming need to be unified. Current unsafe API documents in the standard library exhibit variations, including inconsistencies and insufficiencies. To enhance Rust security, we suggest that unsafe API documents list systematic descriptions of safety requirements for users to follow. In this paper, we conducted the first comprehensive empirical study on safety requirements across unsafe boundaries. We studied unsafe API documents in the standard library and defined 19 safety properties (SP). We then completed the data labeling on 416 unsafe APIs while analyzing their correlation to find interpretable results. To validate the practical usability and SP coverage, we categorized existing Rust CVEs until 2023-07-08 and performed a statistical analysis of std unsafe API usage across the crates.io ecosystem. In addition, we conducted a user survey to gain insights into four aspects from experienced Rust programmers. We finally received 50 valid responses and confirmed our classification with statistical significance.

DOI: 10.1145/3597503.3639136


Generating REST API Specifications through Static Analysis

作者: Huang, Ruikai and Motwani, Manish and Martinez, Idel and Orso, Alessandro
关键词: REST APIs, openapi specifications, documentation, static analysis

Abstract

Web Application Programming Interfaces (APIs) allow services to be accessed over the network. RESTful (or REST) APIs, which use the REpresentation State Transfer (REST) protocol, are a popular type of web API. To use or test REST APIs, developers use specifications written in standards such as OpenAPI. However, creating and maintaining these specifications is time-consuming and error-prone, especially as software evolves, leading to incomplete or inconsistent specifications that negatively affect the use and testing of the APIs. To address this problem, we present Respector (REST API specification generator), the first technique to employ static and symbolic program analysis to generate specifications for REST APIs from their source code. We evaluated Respector on 15 real-world APIs with promising results in terms of precision and recall in inferring endpoint methods, endpoint parameters, method responses, and parameter attributes, including constraints leading to successful HTTP responses or errors. Furthermore, these results could be further improved with additional engineering. Comparing the Respector-generated specifications with the developer-provided ones shows that Respector was able to identify many missing end-point methods, parameters, constraints, and responses, along with some inconsistencies between developer-provided specifications and API implementations. Finally, Respector outperformed several techniques that infer specifications from annotations within API implementations or by invoking the APIs.

DOI: 10.1145/3597503.3639137


A Framework For Inferring Properties of User-Defined Functions

作者: Liu, Xinyu and Arulraj, Joy and Orso, Alessandro
关键词: UDF properties, DBMSs, static analysis

Abstract

User-defined functions (UDFs) are widely used to enhance the capabilities of DBMSs. However, using UDFs comes with a significant performance penalty because DBMSs treat UDFs as black boxes, which hinders their ability to optimize queries that invoke such UDFs. To mitigate this problem, in this paper we present LAMBDA, a technique and framework for improving DBMSs’ performance in the presence of UDFs. The core idea of LAMBDA is to statically infer properties of UDFs that facilitate UDF processing. Taking one such property as an example, if DBMSs know that a UDF is pure, that is, it returns the same result given the same arguments, they can leverage a cache to avoid repetitive UDF invocations that have the same call arguments. We reframe the problem of analyzing UDF properties as a data flow problem. We tackle the data flow problem by building LAMBDA on top of an extensible abstract interpretation framework and developing an analysis model that is tailored for UDFs. Currently, LAMBDA supports inferring four properties from UDFs that are widely used across DBMSs. We evaluate LAMBDA on a benchmark that is derived from production query workloads and UDFs. Our evaluation results show that (1) LAMBDA conservatively and efficiently infers the considered UDF properties, and (2) inferring such properties improves UDF performance, with a time reduction ranging from 10% to 99%. In addition, when applied to 20 production UDFs, LAMBDA caught five instances in which developers provided incorrect UDF property annotations. We qualitatively compare LAMBDA against Froid, a state-of-the-art framework for improving UDF performance, and explain how LAMBDA can optimize UDFs that are not supported by Froid.
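
To make the purity property concrete, here is a toy checker for plain Python functions (the paper targets database UDF dialects, so this is only an analogy): it conservatively reports a function as impure if it touches global state, reads names that are neither parameters nor locals, or calls into an assumed deny-list of nondeterministic modules. The deny-list and the demo functions are assumptions for illustration.

```python
# Toy, deliberately conservative purity check for a Python function, as an analogy to
# the "pure UDF" property: no global/nonlocal access, no reads of non-local names, and
# no calls into an assumed deny-list of nondeterministic modules.
import ast
import builtins
import inspect

NONDETERMINISTIC_MODULES = {"random", "time", "datetime", "os"}  # assumed deny-list

def looks_pure(func) -> bool:
    fdef = ast.parse(inspect.getsource(func)).body[0]
    local_names = {a.arg for a in fdef.args.args}
    for node in ast.walk(fdef):
        if isinstance(node, (ast.Global, ast.Nonlocal)):
            return False
        if isinstance(node, (ast.Assign, ast.AugAssign, ast.AnnAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            local_names |= {t.id for t in targets if isinstance(t, ast.Name)}
        if isinstance(node, ast.Call):
            root = node.func
            while isinstance(root, ast.Attribute):
                root = root.value
            if isinstance(root, ast.Name) and root.id in NONDETERMINISTIC_MODULES:
                return False
    return all(node.id in local_names or hasattr(builtins, node.id)
               for node in ast.walk(fdef)
               if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load))

def scale(x, factor):
    result = x * factor
    return result

def stamped(x):
    import time                      # nondeterministic dependency
    return (x, time.time())

if __name__ == "__main__":
    print(looks_pure(scale))    # True  -> results could be cached per argument tuple
    print(looks_pure(stamped))  # False -> caching would change observable behaviour
```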

DOI: 10.1145/3597503.3639147


Precise Sparse Abstract Execution via Cross-Domain Interaction

作者: Cheng, Xiao and Wang, Jiawei and Sui, Yulei
关键词: abstract execution, sparse analysis, cross-domain interaction

Abstract

Sparse static analysis offers a more scalable solution compared to its non-sparse counterpart. The basic idea is to first conduct a fast pointer analysis that over-approximates the value-flows and propagates the data-flow facts sparsely along only the pre-computed value-flows instead of all control flow points. Current sparse techniques focus on improving the scalability of the main analysis while maintaining its precision. However, their pointer analyses in both the offline and main phases are inherently imprecise because they rely solely on a single memory address domain without considering values from other domains like the interval domain. Consequently, this leads to conservative alias results, like array-insensitivity, which leaves substantial room for precision improvement of the main data-flow analysis. This paper presents CSA, a new Cross-domain Sparse Abstract execution that interweaves correlations between values across multiple abstract domains (e.g., memory address and interval domains). Unlike traditional sparse analysis without cross-domain interaction, CSA performs correlation tracking by establishing implications of values from one domain to another. This correlation tracking enables online bidirectional refinement: CSA refines spurious alias relations using interval domain information and also enhances the precision of interval analysis with refined alias results. This contributes to increasingly improved precision and scalability as the main analysis progresses. To improve the efficiency of correlation tracking, we propose an equivalent correlation tracking approach that groups (virtual) memory addresses with equivalent implication results to minimize redundant value joins and the associated storage. We apply CSA on two common assertion-based checking clients, buffer overflow and null dereference detection. Experimental results show that CSA outperforms five open-source tools (Infer, Cppcheck, IKOS, Sparrow and KLEE) on ten large-scale projects. CSA finds 111 real bugs with 68.51% precision, detecting 46.05% more bugs than Infer and achieving a 12.11% higher precision rate than KLEE. CSA records 96.63% fewer false positives on real-world projects than the version without cross-domain interaction. CSA also exhibits an average speedup of 2.47×.

DOI: 10.1145/3597503.3639220


Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems

作者: Zhang, Chenxi and Dong, Zhen and Peng, Xin and Zhang, Bicheng and Chen, Miao
关键词: microservice, root cause analysis, tracing

Abstract

Modern microservice systems have become increasingly complicated due to their dynamic and complex interactions and runtime environments. This leaves such systems vulnerable to performance issues caused by a variety of factors, such as the runtime environment, communication, coordination, or implementation of services. Traces record the detailed execution process of a request through the system and have been widely used in performance issue diagnosis in microservice systems. By identifying the execution processes and attribute value combinations that are common in anomalous traces but rare in normal traces, engineers may localize the root cause of a performance issue into a smaller scope. However, due to the complex structure of traces and the large number of attribute combinations, it is challenging to find the root cause in the huge search space. In this paper, we propose TraceContrast, a trace-based multi-dimensional root cause localization approach. TraceContrast uses a sequence representation to describe the complex structure of a trace with the attributes of each span. Based on this representation, it combines contrast sequential pattern mining and spectrum analysis to localize multi-dimensional root causes efficiently. Experimental studies on a widely used microservice benchmark show that TraceContrast outperforms existing approaches in both multi-dimensional and instance-dimensional root cause localization with significant accuracy advantages. Moreover, TraceContrast is efficient and its efficiency can be further improved by parallel execution.
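
A tiny sketch of the "common in anomalous traces but rare in normal traces" intuition, assuming each trace is flattened into a sequence of (service, attribute) tokens: the contrast of a candidate pattern is simply the gap between its support in the two trace sets. The traces and the pattern below are invented, and the subsequence check stands in for the paper's contrast sequential pattern mining.

```python
# Contrast support of a candidate pattern: how often it appears as a subsequence in
# anomalous traces versus normal traces. Traces and the pattern are invented examples.

def contains_subsequence(trace, pattern):
    it = iter(trace)
    return all(any(step == wanted for step in it) for wanted in pattern)

def contrast_support(pattern, anomalous, normal):
    sup_a = sum(contains_subsequence(t, pattern) for t in anomalous) / len(anomalous)
    sup_n = sum(contains_subsequence(t, pattern) for t in normal) / len(normal)
    return sup_a, sup_n, sup_a - sup_n

if __name__ == "__main__":
    anomalous = [
        [("gateway", "pod-1"), ("cart", "pod-3"), ("db", "slow-query")],
        [("gateway", "pod-2"), ("cart", "pod-3"), ("db", "slow-query")],
    ]
    normal = [
        [("gateway", "pod-1"), ("cart", "pod-4"), ("db", "ok")],
        [("gateway", "pod-2"), ("cart", "pod-4"), ("db", "ok")],
    ]
    pattern = [("cart", "pod-3"), ("db", "slow-query")]
    print(contrast_support(pattern, anomalous, normal))  # (1.0, 0.0, 1.0)
```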

DOI: 10.1145/3597503.3639088


ReClues: Representing and indexing failures in parallel debugging with program variables

作者: Song, Yi and Zhang, Xihao and Xie, Xiaoyuan and Liu, Quanming and Gao, Ruizhi and Xing, Chenliang
关键词: failure proximity, clustering, failure indexing, parallel debugging, program variable

Abstract

Failures with different root causes can greatly disrupt multi-fault localization; therefore, categorizing failures into distinct groups according to the culprit fault is highly important. In such a failure indexing task, the crux lies in the failure proximity, which comprises two aspects: how to effectively represent failures (e.g., extract the signature of failures) and how to properly measure the distance between those proxies for failures. Existing research has proposed a variety of failure proximities. The majority of them extract signatures of failures from execution coverage or suspiciousness ranking lists, and accordingly employ the Euclidean or Kendall tau distances. However, such strategies may not properly reflect the essential characteristics of failures, thus resulting in unsatisfactory effectiveness. In this paper, we propose a new failure proximity, namely, the program variable-based failure proximity, and further present a novel failure indexing approach, ReClues. Specifically, ReClues utilizes the run-time values of program variables to represent failures, and designs a set of rules to measure the similarity between them. Experimental results demonstrate the competitiveness of ReClues: it can achieve 44.12% and 27.59% improvements in fault number estimation, as well as 47.56% and 26.27% improvements in clustering effectiveness, compared with the state-of-the-art technique in this field, in simulated and real-world environments, respectively.

DOI: 10.1145/3597503.3639098


PyAnalyzer: An Effective and Practical Approach for Dependency Extraction from Python Code

作者: Jin, Wuxia and Xu, Shuo and Chen, Dawei and He, Jiajun and Zhong, Dinghong and Fan, Ming and Chen, Hongxu and Zhang, Huijia and Liu, Ting
关键词: dependency extraction, Python, dynamic features

Abstract

Dependency extraction based on static analysis lays the groundwork for a wide range of applications. However, dynamic language features in Python make code behaviors obscure and nondeterministic; consequently, this poses huge challenges for static analyses to resolve symbol-level dependencies. Although numerous techniques and tools are available, they still lack sufficient capabilities to handle object changes, first-class citizens, varying call sites, and library dependencies. To address this fundamental difficulty for dynamic languages, this work proposes an effective and practical method, PyAnalyzer, for dependency extraction. PyAnalyzer uniformly models functions, classes, and modules into first-class heap objects, propagating the dynamic changes of these objects and class inheritance. This design better simulates dynamic features like duck typing, object changes, and first-class citizens, resulting in high recall without compromising precision. Moreover, PyAnalyzer leverages optional type annotations as a shortcut to express varying call sites and resolve library dependencies on demand. We collected two micro-benchmarks (278 small programs), two macro-benchmarks (59 real-world applications), and 191 real-world projects (10 MSLOC) for comprehensive comparisons with 7 advanced techniques (i.e., Understand, Sourcetrail, Depends, ENRE19, PySonar2, PyCG, and Type4Py). The results demonstrate that PyAnalyzer achieves high recall and hence improves F1 by 24.7% on average, while being at least 1.4x faster without an obvious compromise in memory efficiency. Our work will benefit diverse client applications.
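
For flavour, here is a bare-bones extractor of module-level import and call dependencies using Python's own ast module. It resolves nothing dynamic, which is exactly the gap that heap-object modelling targets, so treat it as a baseline sketch rather than PyAnalyzer's approach; the sample source is invented.

```python
# Bare-bones dependency extraction from Python source: imported modules and the names
# that are called. Dynamic features (monkey-patching, functions passed around as
# values, duck typing) are invisible to this naive pass.
import ast

SOURCE = """
import json
from collections import Counter

def count_keys(blob):
    data = json.loads(blob)
    return Counter(data.keys())
"""

def extract_dependencies(source: str):
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports |= {alias.name for alias in node.names}
        elif isinstance(node, ast.ImportFrom):
            imports.add(node.module or "")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                calls.add(ast.unparse(func))
            elif isinstance(func, ast.Name):
                calls.add(func.id)
    return sorted(imports), sorted(calls)

if __name__ == "__main__":
    print(extract_dependencies(SOURCE))
    # (['collections', 'json'], ['Counter', 'data.keys', 'json.loads'])
```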

DOI: 10.1145/3597503.3640325


Detecting Automatic Software Plagiarism via Token Sequence Normalization

作者: Sağlam, Timur
关键词: software plagiarism detection, plagiarism obfuscation, obfuscation attacks, code normalization, PDG, tokenization

Abstract

While software plagiarism detectors have been used for decades, the assumption that evading detection requires programming proficiency is challenged by the emergence of automated plagiarism generators. These generators enable effortless obfuscation attacks, exploiting vulnerabilities in existing detectors by inserting statements to disrupt the matching of related programs. Thus, we present a novel, language-independent defense mechanism that leverages program dependence graphs, rendering such attacks infeasible. We evaluate our approach with multiple real-world datasets and show that it defeats plagiarism generators by offering resilience against automated obfuscation while maintaining a low rate of false positives.

DOI: 10.1145/3597503.3639192


A First Look at the Inheritance-Induced Redundant Test Execution

作者: Kim, Dong Jae and Yang, Jinqiu and Chen, Tse-Hsun
关键词: software testing, software evolution, software maintenance

Abstract

Inheritance, a fundamental aspect of object-oriented design, has been leveraged to enhance code reuse and facilitate efficient software development. However, alongside its benefits, inheritance can introduce tight coupling and complex relationships between classes, posing challenges for software maintenance. Although there are many studies on inheritance in source code, there are few studies on the use of inheritance in test code. In this paper, we take the first step by studying inheritance in test code, with a focus on redundant test executions caused by inherited test cases. We empirically study the prevalence of test inheritance and its characteristics. We also propose a hybrid approach that combines static and dynamic analysis to identify and locate inheritance-induced redundant test cases. Our findings reveal that (1) inheritance is widely utilized in test code, (2) inheritance-induced redundant test executions are prevalent, accounting for 13% of all executed test cases, (3) bypassing these redundancies can reduce test execution time by 14%, and finally, (4) our study highlights the need for careful refactoring decisions to minimize redundant test cases and identifies the need for further research on test code quality.
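
The phenomenon is easy to reproduce in any xUnit-style framework; the Python unittest snippet below is a language-agnostic analogy, not taken from the study. One test is written in a base class, two subclasses inherit it, and the same test body executes three times even though the subclasses add nothing that affects it.

```python
# Inheritance-induced redundant test execution, illustrated with unittest.
# test_addition is written once but is collected and run in BaseMathTest and in both
# subclasses, although the subclasses only add configuration it never reads.
import unittest

class BaseMathTest(unittest.TestCase):
    backend = "pure-python"

    def test_addition(self):            # runs once per class that inherits it
        self.assertEqual(1 + 1, 2)

class VectorisedMathTest(BaseMathTest):
    backend = "vectorised"              # the inherited test re-runs here

class CachedMathTest(BaseMathTest):
    backend = "cached"                  # ... and here

if __name__ == "__main__":
    # Running this file reports 3 tests, although only one distinct test body exists.
    unittest.main(verbosity=2)
```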

DOI: 10.1145/3597503.3639166


Hypertesting of Programs: Theoretical Foundation and Automated Test Generation

作者: Pasqua, Michele and Ceccato, Mariano and Tonella, Paolo
关键词: search-based testing, hyperproperties, information flows, security testing, code coverage criteria

Abstract

Hyperproperties are used to define correctness requirements that involve relations between multiple program executions. This makes it possible, for instance, to model security and concurrency requirements, which cannot be expressed by means of trace properties. In this paper, we propose a novel systematic approach for automated testing of hyperproperties. Our contribution is both foundational and practical. On the foundational side, we define a hyper-testing framework, which includes a novel hypercoverage adequacy criterion designed to guide the synthesis of test cases for hyperproperties. On the practical side, we instantiate this framework by implementing HyperFuzz and HyperEvo, two test generators targeting the Non-Interference security requirement that rely on fuzzing and search algorithms, respectively. Experimental results show that the proposed hypercoverage adequacy criterion correlates with the capability of a hypertest to expose hyperproperty violations and that both HyperFuzz and HyperEvo achieve high hypercoverage and high vulnerability exposure with no false alarms (by construction). While they both outperform the state-of-the-art dynamic taint analysis tool Phosphor, HyperEvo is more effective than HyperFuzz on some benchmark programs.
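
To make the notion of a hypertest concrete: non-interference is a 2-safety hyperproperty, so a single hypertest pairs two executions that agree on public inputs but differ on secrets and checks that the public output is identical. The tiny program and its leaky variant below are invented; they mirror the kind of oracle HyperFuzz/HyperEvo target rather than their implementation.

```python
# A hypertest for non-interference: two runs with the same public input but different
# secrets must produce the same public output. `billing_summary` is an invented
# program; the leaky variant shows a violating pair of executions.

def billing_summary(public_plan: str, secret_salary: int) -> str:
    # Intended to depend only on the public input.
    return f"plan={public_plan}, tier={'basic' if public_plan == 'free' else 'paid'}"

def leaky_billing_summary(public_plan: str, secret_salary: int) -> str:
    tier = "premium" if secret_salary > 100_000 else "basic"   # leaks the secret
    return f"plan={public_plan}, tier={tier}"

def non_interference_hypertest(program, public_input, secret_a, secret_b) -> bool:
    """One hypertest = one pair of executions; it passes iff the outputs coincide."""
    return program(public_input, secret_a) == program(public_input, secret_b)

if __name__ == "__main__":
    print(non_interference_hypertest(billing_summary, "free", 50_000, 200_000))        # True
    print(non_interference_hypertest(leaky_billing_summary, "free", 50_000, 200_000))  # False
```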

DOI: 10.1145/3597503.3640323


Ripples of a Mutation — An Empirical Study of Propagation Effects in Mutation Testing

作者: Du, Hang and Palepu, Vijay Krishna and Jones, James A.
关键词: software fault infection, error propagation, mutation testing, dynamic analysis, empirical study

Abstract

The mechanics of how a fault reveals itself as a test failure is of keen interest to software researchers and practitioners alike. An improved understanding of how faults translate to failures can guide improvements in broad facets of software testing, ranging from test suite design to automated program repair, which are premised on the understanding that the presence of faults would alter some test executions. In this work, we study such effects by mutations, as applicable in mutation testing. Mutation testing enables the generation of a large corpus of faults; thereby harvesting a large pool of mutated test runs for analysis. Specifically, we analyze more than 1.1 million mutated test runs to study if and how the underlying mutations induce infections that propagate their way to observable failures. We adopt a broad-spectrum approach to analyze such a large pool of mutated runs. For every mutated test run, we are able to determine: (a) if the mutation induced a state infection; (b) if the infection propagated through the end of the test run; and (c) if the test failed in the presence of a propagated infection. By examining such infection-, propagation- and revealability-effects for more than 43,000 mutations executed across 1.1 million test runs we are able to arrive at some surprising findings. Our results find that once state infection is observed, propagation is frequently detected; however, a propagated infection does not always reveal itself as a test failure. We also find that a significant portion of survived mutants in our study could have been killed by observing propagated state infections that were left undetected. Finally, we also find that different mutation operators can demonstrate substantial differences in their specific impacts on the execution-to-failure ripples of the resulting mutations.
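
The infection/propagation/revealability distinction can be illustrated on a toy function: run the original and a mutant side by side, compare an intermediate value (infection), the return value (propagation), and the test verdict (revealability). The function, the mutant, and the assertions below are all invented for illustration, not drawn from the study's subjects.

```python
# Toy illustration of the three stages. The mutant changes an intermediate computation;
# whether the mutation propagates or is revealed depends on the rest of the program
# and on how strong the test's assertion is.

def original(xs):
    total = sum(xs)                      # intermediate state
    return "large" if total > 10 else "small"

def mutant(xs):
    total = sum(xs) - len(xs)            # mutated intermediate computation
    return "large" if total > 10 else "small"

def analyse(xs, assertion):
    infected = sum(xs) != sum(xs) - len(xs)        # state right after the mutation differs
    propagated = original(xs) != mutant(xs)        # the difference reaches the return value
    revealed = assertion(original(xs)) and not assertion(mutant(xs))
    return infected, propagated, revealed

if __name__ == "__main__":
    weak = lambda result: result in ("large", "small")   # sanity check only
    strong = lambda result: result == "large"            # pins down the expected value
    print(analyse([20, 5], strong))  # (True, False, False): infection masked downstream
    print(analyse([6, 5], weak))     # (True, True, False): propagates, yet the weak test passes
    print(analyse([6, 5], strong))   # (True, True, True): propagated infection revealed as a failure
```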

DOI: 10.1145/3597503.3639179


Fast Deterministic Black-box Context-free Grammar Inference

作者: Arefin, Mohammad Rifat and Shetiya, Suraj and Wang, Zili and Csallner, Christoph
关键词: grammar inference, oracle, nested language concepts, bracket-implied nesting structure, deterministic synthesis

Abstract

Black-box context-free grammar inference is a hard problem as in many practical settings it only has access to a limited number of example programs. The state-of-the-art approach Arvada heuristically generalizes grammar rules starting from flat parse trees and is non-deterministic to explore different generalization sequences. We observe that many of Arvada’s generalization steps violate common language concept nesting rules. We thus propose to pre-structure input programs along these nesting rules, apply learnt rules recursively, and make black-box context-free grammar inference deterministic. The resulting TreeVada yielded faster runtime and higher-quality grammars in an empirical comparison. The TreeVada source code, scripts, evaluation parameters, and training data are open-source and publicly available (https://doi.org/10.6084/m9.figshare.23907738).

DOI: 10.1145/3597503.3639214


CIT4DNN: Generating Diverse and Rare Inputs for Neural Networks Using Latent Space Combinatorial Testing

作者: Dola, Swaroopa and McDaniel, Rory and Dwyer, Matthew B. and Soffa, Mary Lou
关键词: deep neural networks, test generation, test coverage, combinatorial interaction testing

Abstract

Deep neural networks (DNN) are being used in a wide range of applications including safety-critical systems. Several DNN test generation approaches have been proposed to generate fault-revealing test inputs. However, the existing test generation approaches do not systematically cover the input data distribution to test DNNs with diverse inputs, and none of the approaches investigate the relationship between rare inputs and faults. We propose cit4dnn, an automated black-box approach to generate DNN test sets that are feature-diverse and that comprise rare inputs. cit4dnn constructs diverse test sets by applying combinatorial interaction testing to the latent space of generative models and formulates constraints over the geometry of the latent space to generate rare and fault-revealing test inputs. Evaluation on a range of datasets and models shows that cit4dnn generated tests are more feature diverse than the state-of-the-art, and can target rare fault-revealing testing inputs more effectively than existing methods.

DOI: 10.1145/3597503.3639106


Knowledge Graph Driven Inference Testing for Question Answering Software

作者: Wang, Jun and Li, Yanhui and Chen, Zhifei and Chen, Lin and Zhang, Xiaofang and Zhou, Yuming
关键词: question answering, software testing, knowledge graph, inference rules

Abstract

In the wake of developments in the field of Natural Language Processing, Question Answering (QA) software has penetrated our daily lives. Due to the data-driven programming paradigm, QA software inevitably contains bugs, i.e., it misbehaves in real-world applications. Current techniques for testing QA software fall into two categories: reference-based testing and metamorphic testing. This paper adopts a different angle to achieve testing for QA software: we observe that answers to questions can have inference relations, i.e., the answers to some questions could be logically inferred from the answers to other questions. If the answers given by QA software do not satisfy these inference relations, an inference bug is detected. To generate questions with inference relations automatically, we propose a novel testing method, Knowledge Graph driven Inference Testing (KGIT), which employs facts in the Knowledge Graph (KG) as the seeds to logically construct test cases containing questions and contexts with inference relations. To evaluate the effectiveness of KGIT, we conduct an extensive empirical study with more than 2.8 million test cases generated from the large-scale KG YAGO4 and three QA models based on the state-of-the-art QA model structure. The experimental results show that our method (a) could detect a considerable number of inference bugs in all three studied QA models and (b) is helpful in retraining QA models to improve their inference ability.

DOI: 10.1145/3597503.3639109


DeepSample: DNN sampling-based testing for operational accuracy assessment

作者: Guerriero, Antonio and Pietrantuono, Roberto and Russo, Stefano
关键词: software testing, deep neural networks, sampling

Abstract

Deep Neural Networks (DNN) are core components for classification and regression tasks of many software systems. Companies incur high costs for testing DNNs with datasets representative of the inputs expected in operation, as these need to be manually labelled. The challenge is to select a representative set of test inputs as small as possible to reduce the labelling cost, while sufficing to yield unbiased high-confidence estimates of the expected DNN accuracy. At the same time, testers are interested in exposing as many DNN mispredictions as possible to improve the DNN, resulting in the need for techniques pursuing a threefold aim: small dataset size, trustworthy estimates, and misprediction exposure. This study presents DeepSample, a family of DNN testing techniques for cost-effective accuracy assessment based on probabilistic sampling. We investigate whether, to what extent, and under which conditions probabilistic sampling can help to tackle the outlined challenge. We implement five new sampling-based testing techniques, and perform a comprehensive comparison of such techniques and of three further state-of-the-art techniques for both DNN classification and regression tasks. Results serve as guidance for the best use of sampling-based testing for faithful and high-confidence estimates of DNN accuracy in operation at low cost.
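
As a minimal sketch of sampling-based operational accuracy assessment (plain simple random sampling here, whereas the paper studies several probabilistic sampling designs), the snippet below estimates a classifier's accuracy from a small labelled sample and attaches a normal-approximation confidence interval. The "model" and the labelling "oracle" are random stand-ins; only the estimation logic is the point.

```python
# Estimate operational accuracy from a small random sample of the operational data.
# The model and ground-truth oracle below are invented stand-ins; in practice each
# sampled input would be labelled manually, which is the cost being reduced.
import math
import random

random.seed(7)

OPERATIONAL_INPUTS = list(range(10_000))
def model(x):   return x % 7 != 0 or x % 3 == 0        # stand-in predictor
def oracle(x):  return x % 7 != 0                      # "manual label" we pay for

def estimate_accuracy(inputs, sample_size=200, z=1.96):
    sample = random.sample(inputs, sample_size)        # only these get labelled
    correct = sum(model(x) == oracle(x) for x in sample)
    p_hat = correct / sample_size
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, (p_hat - half_width, p_hat + half_width)

if __name__ == "__main__":
    estimate, interval = estimate_accuracy(OPERATIONAL_INPUTS)
    true_accuracy = sum(model(x) == oracle(x) for x in OPERATIONAL_INPUTS) / len(OPERATIONAL_INPUTS)
    print(f"estimated accuracy: {estimate:.3f}, 95% CI {interval}")
    print(f"true accuracy (normally unknown): {true_accuracy:.3f}")
```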

DOI: 10.1145/3597503.3639584


作者: Wang, Zhaohui and Zhang, Min and Yang, Jingran and Shao, Bojie and Zhang, Min
关键词: software bias, fairness testing, test case generation, deep neural network

Abstract

Deep neural networks (DNNs) have shown powerful performance in various applications and are increasingly being used in decision-making systems. However, concerns about fairness in DNNs persist. Several efficient white-box testing methods for individual fairness have been proposed. Nevertheless, the development of black-box methods has stagnated, and the performance of existing methods is far behind that of white-box methods. In this paper, we propose a novel black-box individual fairness testing method called Model-Agnostic Fairness Testing (MAFT). By leveraging MAFT, practitioners can effectively identify and address discrimination in DL models, regardless of the specific algorithm or architecture employed. Our approach adopts lightweight procedures such as gradient estimation and attribute perturbation rather than non-trivial procedures like symbolic execution, rendering it significantly more scalable and applicable than existing methods. We demonstrate that MAFT achieves the same effectiveness as state-of-the-art white-box methods whilst improving the applicability to large-scale networks. Compared to existing black-box approaches, our approach demonstrates superior performance in discovering fairness violations w.r.t. effectiveness (~14.69×).
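
To make the black-box ingredients mentioned above concrete, the sketch below estimates a gradient by finite differences and flips a protected attribute to look for an individual discrimination pair, i.e., two inputs that differ only in the protected attribute yet receive different decisions. The scoring function, attribute layout, and thresholds are invented; this illustrates the ingredients, not MAFT itself.

```python
# Black-box ingredients for individual fairness testing: finite-difference gradient
# estimation plus protected-attribute flipping. The decision function is invented.
PROTECTED = 2   # index of the protected attribute in the feature vector (assumed)

def score(x):
    # Invented black-box scorer: intentionally (and unfairly) weights x[PROTECTED].
    return 0.6 * x[0] + 0.3 * x[1] - 0.4 * x[PROTECTED]

def decision(x):
    return 1 if score(x) > 0.5 else 0

def estimated_gradient(f, x, eps=1e-3):
    """Zero-order (finite-difference) gradient estimate of a black-box scorer."""
    grad = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        grad.append((f(bumped) - f(x)) / eps)
    return grad

def discrimination_pair(x):
    """Flip the protected attribute; report the pair if the decision changes."""
    twin = list(x)
    twin[PROTECTED] = 1 - twin[PROTECTED]
    return (x, twin) if decision(x) != decision(twin) else None

if __name__ == "__main__":
    seed = [0.9, 0.4, 0]
    # Large gradient components point at attributes worth perturbing next.
    print("estimated gradient:", [round(g, 3) for g in estimated_gradient(score, seed)])
    print("discriminatory pair:", discrimination_pair(seed))
```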

DOI: 10.1145/3597503.3639181


Concrete Constraint Guided Symbolic Execution

作者: Sun, Yue and Yang, Guowei and Lv, Shichao and Li, Zhi and Sun, Limin
关键词: symbolic execution, data dependency analysis

Abstract

Symbolic execution is a popular program analysis technique. It systematically explores all feasible paths of a program, but its scalability is largely limited by the path explosion problem, which causes the number of paths to proliferate at runtime. A key idea in existing methods to mitigate this problem is to guide the selection of states for path exploration, which primarily relies on features that represent program states. In this paper, we propose concrete constraint guided symbolic execution, which aims to cover more concrete branches and ultimately improve the overall code coverage during symbolic execution. Our key insight is based on the fact that symbolic execution strives to cover all symbolic branches while concrete branches are neglected, and directing symbolic execution toward uncovered concrete branches has great potential to improve the overall code coverage. The experimental results demonstrate that our approach can improve the ability of KLEE to both increase code coverage and find more security violations on 10 open-source C programs.

DOI: 10.1145/3597503.3639078


SpecBCFuzz: Fuzzing LTL Solvers with Boundary Conditions

作者: Carvalho, Luiz and Degiovanni, Renzo and Cordy, Maxime and Aguirre, Nazareno and Le Traon, Yves and Papadakis, Mike
关键词: fuzzing, search-based software engineering, linear-time temporal logic

Abstract

LTL solvers check the satisfiability of Linear-time Temporal Logic (LTL) formulas and are widely used for verifying and testing critical software systems. Thus, potential bugs in the solvers’ implementations can have a significant impact. We present SpecBCFuzz, a fuzzing method for finding bugs in LTL solvers, that is guided by boundary conditions (BCs), corner cases whose (un)satisfiability depends on rare traces. SpecBCFuzz implements a search-based algorithm that fuzzes LTL formulas giving relevance to BCs. It integrates syntactic and semantic similarity metrics to explore the vicinity of the seeded formulas with BCs. We evaluate SpecBCFuzz on 21 different configurations (including the latest and past releases) of four mature and state-of-the-art LTL solvers (NuSMV, Black, Aalta, and PLTL) that implement a diverse set of satisfiability algorithms. SpecBCFuzz produces 368,716 bug-triggering formulas, detecting bugs in 18 out of the 21 solvers’ configurations we study. Overall, SpecBCFuzz reveals: soundness issues (wrong answers given by a solver) in Aalta and PLTL; crashes, e.g., segmentation faults, in NuSMV, Black and Aalta; flaky behaviors (different responses across re-runs of the solver on the same formula) in NuSMV and Aalta; performance bugs (large time performance degradation between successive versions of the solver on the same formula) in Black, Aalta and PLTL; and no bug in NuSMV BDD (all versions), suggesting that the latter is currently the most robust solver.

DOI: 10.1145/3597503.3639087


RPG: Rust Library Fuzzing with Pool-based Fuzz Target Generation and Generic Support

作者: Xu, Zhiwu and Wu, Bohao and Wen, Cheng and Zhang, Bin and Qin, Shengchao and He, Mengda
关键词: No keywords

Abstract

Rust libraries are ubiquitous in Rust-based software development. Guaranteeing their correctness and reliability requires thorough analysis and testing. Fuzzing is a popular bug-finding solution, yet it requires writing fuzz targets for libraries. Recently, some automatic fuzz target generation methods have been proposed. However, two challenges remain: (1) how to generate diverse API sequences that prioritize unsafe code and interactions to reveal bugs in Rust libraries; (2) how to provide support for the generic APIs and verify both syntactic and semantic validity of the fuzz targets to enable more comprehensive testing of Rust libraries. In this paper, we propose RPG, an automatic fuzz target synthesis technique to support Rust library fuzzing. RPG uses a pool-based search to generate diverse and unsafe API sequences, and synthesizes fuzz targets with generic support and validity check. The experimental results demonstrate that RPG enhances both the quality of the generated fuzz targets and the bug-finding ability through pool-based generation and generic support, substantially outperforming the state-of-the-art. Moreover, RPG has discovered 25 previously unknown bugs from 50 well-known Rust libraries available on Crates.io.

DOI: 10.1145/3597503.3639102


Deep Combination of CDCL(T) and Local Search for Satisfiability Modulo Non-Linear Integer Arithmetic Theory

作者: Zhang, Xindi and Li, Bohan and Cai, Shaowei
关键词: SMT(NIA), CDCL(T), local search, hybrid method

Abstract

Satisfiability Modulo Theory (SMT) generalizes the propositional satisfiability problem (SAT) by extending support for various first-order background theories. In this paper, we focus on the SMT problems in Non-Linear Integer Arithmetic (NIA) theory, referred to as SMT(NIA), which has wide applications in software engineering. The dominant paradigm for SMT(NIA) is the CDCL(T) framework, while recently stochastic local search (SLS) has also shown its effectiveness. However, the cooperation between the two methods has not been studied yet. Motivated by the great success of the deep cooperation of CDCL and SLS for SAT, we propose a two-layer hybrid approach for SMT(NIA). The outer layer interleaves between the inner layer and an independent SLS solver. In the inner layer, we take CDCL(T) as the main body, and design a CDCL(T)-guided SLS solver, which is invoked at branches corresponding to skeleton solutions and returns useful information to improve the branching heuristics of CDCL(T). We implement our ideas on top of the CDCL(T) tactic of Z3 with an SLS solver called LocalSMT, resulting in a hybrid solver dubbed HybridSMT. Extensive experiments are carried out on the standard SMT(NIA) benchmarks from SMT-LIB, where most of the instances are from real-world software engineering applications of termination and non-termination analysis. Experimental results show that HybridSMT significantly improves the CDCL(T) solver in Z3. Moreover, our solver can solve 10.36% more instances than the currently best SMT(NIA) solver, and is more efficient for software verification instances.

DOI: 10.1145/3597503.3639105


Fuzz4All: Universal Fuzzing with Large Language Models

作者: Xia, Chunqiu Steven and Paltenghi, Matteo and Le Tian, Jia and Pradel, Michael and Zhang, Lingming
关键词: No keywords

Abstract

Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are well-suited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java, and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 98 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 64 bugs already confirmed by developers as previously unknown.
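
The overall loop can be sketched in a few lines. The `llm` and `run_sut` functions below are hypothetical stand-ins (a real setup would call an actual model and a real system under test), and the prompt-update rule is a guess at the shape of an autoprompting-style generate-run-update cycle rather than Fuzz4All's implementation.

```python
# Skeleton of an LLM-driven fuzzing loop: generate an input from the current prompt,
# run the system under test, and fold interesting inputs back into the prompt.
# `llm` and `run_sut` are hypothetical stand-ins, not Fuzz4All's API.
import random

def llm(prompt: str) -> str:
    # Stand-in "model": returns a small random SMT-LIB-ish snippet.
    ops = ["(assert (> x 0))", "(assert (= (* x x) 4))", "(check-sat)"]
    return "\n".join(random.sample(ops, k=2))

def run_sut(program: str) -> str:
    # Stand-in system under test: pretend inputs mentioning '*' hit new coverage.
    return "new-coverage" if "*" in program else "boring"

def fuzz(initial_prompt: str, iterations: int = 20):
    prompt, interesting = initial_prompt, []
    for _ in range(iterations):
        candidate = llm(prompt)
        if run_sut(candidate) == "new-coverage":
            interesting.append(candidate)
            # Autoprompting-style update: bias the next generations toward the
            # features of inputs that were just rewarded.
            prompt = initial_prompt + "\nHere is an example to vary:\n" + candidate
    return interesting

if __name__ == "__main__":
    random.seed(1)
    found = fuzz("Generate a small SMT-LIB2 formula exercising the solver.")
    print(f"{len(found)} interesting inputs kept")
```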

DOI: 10.1145/3597503.3639121


Are We There Yet? Unraveling the State-of-the-Art Smart Contract Fuzzers

作者: Wu, Shuohan and Li, Zihao and Yan, Luyi and Chen, Weimin and Jiang, Muhui and Wang, Chenxu and Luo, Xiapu and Zhou, Hao
关键词: smart contract, fuzzing, evaluation

Abstract

Given the growing importance of smart contracts in various applications, ensuring their security and reliability is critical. Fuzzing, an effective vulnerability detection technique, has recently been widely applied to smart contracts. Despite numerous studies, a systematic investigation of smart contract fuzzing techniques remains lacking. In this paper, we fill this gap by: 1) providing a comprehensive review of current research in contract fuzzing, and 2) conducting an in-depth empirical study to evaluate state-of-the-art contract fuzzers’ usability. To guarantee a fair evaluation, we employ a carefully-labeled benchmark and introduce a set of pragmatic performance metrics, evaluating fuzzers from five complementary perspectives. Based on our findings, we provide direction for the future research and development of contract fuzzers.

DOI: 10.1145/3597503.3639152


Uncover the Premeditated Attacks: Detecting Exploitable Reentrancy Vulnerabilities by Identifying Attacker Contracts

作者: Yang, Shuo and Chen, Jiachi and Huang, Mingyuan and Zheng, Zibin and Huang, Yuan
关键词: smart contract, dataflow analysis, reentrancy, attacker identification, ethereum

Abstract

Reentrancy, a notorious vulnerability in smart contracts, has led to millions of dollars in financial loss. However, current smart contract vulnerability detection tools suffer from a high false positive rate in identifying contracts with reentrancy vulnerabilities. Moreover, only a small portion of the detected reentrant contracts can actually be exploited by hackers, making these tools less effective in securing the Ethereum ecosystem in practice. In this paper, we propose BlockWatchdog, a tool that focuses on detecting reentrancy vulnerabilities by identifying attacker contracts. These attacker contracts are deployed by hackers to exploit vulnerable contracts automatically. By focusing on attacker contracts, BlockWatchdog effectively detects truly exploitable reentrancy vulnerabilities by identifying reentrant call flow. Additionally, BlockWatchdog is capable of detecting new types of reentrancy vulnerabilities caused by poor designs when using ERC tokens or user-defined interfaces, which cannot be detected by current rule-based tools. We implement BlockWatchdog using cross-contract static dataflow techniques based on attack logic obtained from an empirical study that analyzes attacker contracts from 281 attack incidents. BlockWatchdog is evaluated on 421,889 Ethereum contract bytecodes and identifies 113 attacker contracts that target 159 victim contracts, leading to the theft of Ether and tokens valued at approximately 908.6 million USD. Notably, only 18 of the identified 159 victim contracts can be reported by current reentrancy detection tools.

DOI: 10.1145/3597503.3639153


Crossover in Parametric Fuzzing

作者: Hough, Katherine and Bell, Jonathan
关键词: fuzz testing, test input generation, generator-based fuzzing, parametric fuzzing, dynamic analysis

Abstract

Parametric fuzzing combines evolutionary and generator-based fuzzing to create structured test inputs that exercise unique execution behaviors. Parametric fuzzers internally represent inputs as bit strings referred to as “parameter sequences”. Interesting parameter sequences are saved by the fuzzer and perturbed to create new inputs without the need for type-specific operators. However, existing work on parametric fuzzing only uses mutation operators, which modify a single input; it does not incorporate crossover, an evolutionary operator that blends multiple inputs together. Crossover operators aim to combine advantageous traits from multiple inputs. However, the nature of parametric fuzzing limits the effectiveness of traditional crossover operators. In this paper, we propose linked crossover, an approach for using dynamic execution information to identify and exchange analogous portions of parameter sequences. We created an implementation of linked crossover for Java and evaluated linked crossover’s ability to preserve advantageous traits. We also evaluated linked crossover’s impact on fuzzer performance on seven real-world Java projects and found that linked crossover consistently performed as well as or better than three state-of-the-art parametric fuzzers and two other forms of crossover on both long and short fuzzing campaigns.
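
For intuition, the sketch below contrasts plain one-point crossover over two parameter sequences (byte strings) with a "linked" exchange that, given a mapping from each parent's bytes to the generator decisions they drove, swaps only the analogous region. The region maps here are hard-coded assumptions; in the paper they are derived from dynamic execution information, so this only illustrates the contrast, not the technique's mechanics.

```python
# One-point crossover over parameter sequences versus a "linked" exchange of analogous
# regions. The region slices are hard-coded for illustration; a real implementation
# would track which bytes fed which generator decisions at runtime.
import random

def one_point_crossover(parent_a: bytes, parent_b: bytes) -> bytes:
    cut = random.randrange(1, min(len(parent_a), len(parent_b)))
    return parent_a[:cut] + parent_b[cut:]

def linked_crossover(parent_a: bytes, region_a: slice,
                     parent_b: bytes, region_b: slice) -> bytes:
    """Replace the region of A that drove some generator decision with B's analogous
    region, leaving the rest of A untouched."""
    return parent_a[:region_a.start] + parent_b[region_b] + parent_a[region_a.stop:]

if __name__ == "__main__":
    random.seed(0)
    a = bytes([1, 2, 3, 4, 5, 6])     # suppose bytes 2..3 chose A's "header"
    b = bytes([9, 8, 7, 6, 5])        # suppose bytes 1..2 chose B's "header"
    print(one_point_crossover(a, b))                          # cut point ignores structure
    print(linked_crossover(a, slice(2, 4), b, slice(1, 3)))   # b'\x01\x02\x08\x07\x05\x06'
```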

DOI: 10.1145/3597503.3639160


Practical Non-Intrusive GUI Exploration Testing with Visual-based Robotic Arms

作者: Yu, Shengcheng and Fang, Chunrong and Du, Mingzhe and Ling, Yuchen and Chen, Zhenyu and Su, Zhendong
关键词: GUI testing, non-intrusive testing, GUI understanding, robotic arm

Abstract

Graphical User Interface (GUI) testing has been a significant topic in the software engineering community. Most existing GUI testing frameworks are intrusive and only support specific platforms, which is quite limiting. As application scenarios diversify, many embedded systems and customized operating systems on different devices do not support existing intrusive GUI testing frameworks. Some approaches adopt robotic arms to replace the interface invocation of the mobile apps under test and use computer vision technologies to identify GUI elements. However, some challenges remain unsolved with such approaches. First, existing approaches assume that GUI screens are fixed, so they cannot be adapted to diverse systems with different screen conditions. Second, existing approaches use an XY-plane robotic arm system, which cannot flexibly simulate human testing operations. Third, existing approaches ignore compatibility bugs and only focus on crash bugs. In short, a more practical approach is required for the non-intrusive scenario. To solve the remaining challenges, we propose a practical non-intrusive GUI testing framework with visual-based robotic arms, namely RoboTest. RoboTest integrates a set of novel GUI screen and widget detection algorithms that adapt to screens of different sizes and extract GUI widgets from the detected screens. Then, a complete set of widely-used testing operations is applied with a 4-DOF robotic arm, which can more effectively and flexibly simulate human testing operations. During app exploration, RoboTest applies a specially designed Principle of Proximity-guided (PoP-guided) exploration strategy, which chooses widgets close to the previous operation targets to reduce robotic arm movement overhead and improve exploration efficiency. Moreover, RoboTest can effectively detect compatibility bugs beyond crash bugs by comparing GUIs on different devices under the same test operations. We evaluate RoboTest with 20 real-world mobile apps, together with a case study on a representative industrial embedded system. The results show that RoboTest can effectively, efficiently, and generally explore the app under test to find bugs while reducing the exploration time overhead caused by robotic arm movement.

DOI: 10.1145/3597503.3639161


FuzzInMem: Fuzzing Programs via In-memory Structures

作者: Liu, Xuwei and You, Wei and Ye, Yapeng and Zhang, Zhuo and Huang, Jianjun and Zhang, Xiangyu
关键词: fuzzing, software testing, program synthesis

Abstract

In recent years, coverage-based greybox fuzzing has proven to be an effective and practical technique for discovering software vulnerabilities. The availability of American Fuzzy Lop (AFL) has facilitated numerous advances in overcoming challenges in fuzzing. However, the issue of mutating complex file formats, such as PDF, remains unresolved due to strict constraints. Limited by the bit/byte mutations they perform on input files, existing fuzzers often produce mutants that applications fail to parse. Our observation is that most in-memory representations of file formats are simple, and well-designed applications have built-in printer functions to emit these structures as files. Thus, we propose a new technique that mutates the in-memory structures of inputs and utilizes printer functions to regenerate mutated files. Unlike prior approaches that require complex analysis to learn file format constraints, our technique leverages the printer function to preserve format constraints. We implement a prototype called FuzzInMem and compare it with AFL as well as other state-of-the-art fuzzers, including AFL++, Mopt, Weizz, and FormatFuzzer. The results show that FuzzInMem is scalable and substantially outperforms general-purpose fuzzers in terms of valid seed generation and path coverage. By applying FuzzInMem to real-world applications, we found 29 unique vulnerabilities and were awarded 5 CVEs.
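
The core idea translates naturally to a toy setting: parse an input into its in-memory structure, mutate that structure, then let the format's own printer re-emit a well-formed file. Below, JSON plays the role of the "complex format", json.loads is the parser and json.dumps the printer; the mutation operators are invented and much simpler than what a real structure-level fuzzer would use.

```python
# Mutate the in-memory structure instead of raw bytes, then regenerate the file with
# the format's printer so structural constraints are preserved by construction.
# JSON stands in for a complex input format; the mutations are invented examples.
import copy
import json
import random

SEED_INPUT = '{"pages": 2, "title": "report", "sections": ["intro", "results"]}'

def mutate_structure(doc):
    doc = copy.deepcopy(doc)
    key = random.choice(list(doc))
    value = doc[key]
    if isinstance(value, int):
        doc[key] = value * random.choice([-1, 0, 2**16])      # boundary-ish integers
    elif isinstance(value, str):
        doc[key] = value + "\x00" * 4                         # embedded NUL characters
    elif isinstance(value, list):
        doc[key] = value * 50                                 # blow up repetition counts
    return doc

if __name__ == "__main__":
    random.seed(3)
    parsed = json.loads(SEED_INPUT)               # in-memory representation
    for _ in range(3):
        mutant = mutate_structure(parsed)
        regenerated = json.dumps(mutant)          # the "printer" keeps it well-formed
        print(regenerated[:80])
```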

DOI: 10.1145/3597503.3639172


Extrapolating Coverage Rate in Greybox Fuzzing

作者: Liyanage, Danushka and Lee, Seongmin and Tantithamthavorn, Chakkrit and Böhme, Marcel
关键词: greybox fuzzing, extrapolation, coverage rate, adaptive bias, statistical method

Abstract

A fuzzer can literally run forever. However, as more resources are spent, the coverage rate continuously drops, and the utility of the fuzzer declines. To tackle this coverage-resource tradeoff, we could introduce a policy to stop a campaign whenever the coverage rate drops below a certain threshold value, say 10 new branches covered per 15 minutes. During the campaign, can we predict the coverage rate at some point in the future? If so, how well can we predict the future coverage rate as the prediction horizon or the current campaign length increases? How can we tackle the statistical challenge of adaptive bias, which is inherent in greybox fuzzing (i.e., samples are not independent and identically distributed)? In this paper, we i) evaluate existing statistical techniques to predict the coverage rate U(t0 + k) at any time t0 in the campaign after a period of k units of time in the future and ii) develop a new extrapolation methodology that tackles the adaptive bias. We propose to efficiently simulate a large number of blackbox campaigns from the collected coverage data, estimate the coverage rate for each of these blackbox campaigns, and conduct a simple regression to extrapolate the coverage rate for the greybox campaign. Our empirical evaluation using the Fuzztastic fuzzer benchmark demonstrates that our extrapolation methodology exhibits at least one order of magnitude lower error compared to the existing benchmark for 4 out of 5 experimental subjects we investigated. Notably, compared to the existing extrapolation methodology, our extrapolator excels in making long-term predictions, such as those extending up to three times the length of the current campaign.
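
The following sketch shows the overall shape of such an extrapolation; it is a simplification, not the authors' estimator. The synthetic coverage curve, the sliding-window rate, and the log-time regression model are all assumptions of this sketch:

```python
import numpy as np

# Synthetic cumulative coverage: branches covered after each observed minute.
t = np.arange(1, 121)                         # a two-hour observed campaign
coverage = 500 * np.log1p(t) + np.random.default_rng(0).normal(0, 5, t.size)

# Coverage *rate*: new branches per minute over a sliding window.
window = 15
rate = (coverage[window:] - coverage[:-window]) / window
rate_t = t[window:]

# Fit a simple regression of rate against log(time); greybox coverage
# rates typically decay roughly this way, but the model choice is an
# assumption of this illustration.
a, b = np.polyfit(np.log(rate_t), rate, deg=1)

def predicted_rate(future_minute):
    return a * np.log(future_minute) + b

horizon = 3 * t[-1]                           # extrapolate to 3x campaign length
print(f"predicted rate at minute {horizon}: {predicted_rate(horizon):.2f} branches/min")
```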

DOI: 10.1145/3597503.3639198


CERT: Finding Performance Issues in Database Systems Through the Lens of Cardinality Estimation

作者: Ba, Jinsheng and Rigger, Manuel
关键词: database, performance issue, cardinality estimation

Abstract

Database Management Systems (DBMSs) process a given query by creating a query plan, which is subsequently executed, to compute the query's result. Deriving an efficient query plan is challenging, and both academia and industry have invested decades into researching query optimization. Despite this, DBMSs are prone to performance issues, where a DBMS produces an unexpectedly inefficient query plan that might lead to the slow execution of a query. Finding such issues is a longstanding problem and inherently difficult, because no ground truth information on an expected execution time exists. In this work, we propose Cardinality Estimation Restriction Testing (CERT), a novel technique that finds performance issues through the lens of cardinality estimation. Given a query on a database, CERT derives a more restrictive query (e.g., by replacing a LEFT JOIN with an INNER JOIN), whose estimated number of rows should not exceed the estimated number of rows for the original query. CERT tests cardinality estimation specifically, because it was shown to be the most important component of query optimization; thus, we expect that finding and fixing cardinality-estimation issues might result in the highest performance gains. In addition, we found that other kinds of query optimization issues can be exposed by unexpected estimated cardinalities, which can also be found by CERT. CERT is a black-box technique that does not require access to the source code; DBMSs expose query plans via the EXPLAIN statement. CERT eschews executing queries, which is costly and prone to performance fluctuations. We evaluated CERT on three widely used and mature DBMSs, MySQL, TiDB, and CockroachDB. CERT found 13 unique issues, of which 2 issues were fixed and 9 confirmed by the developers. We expect that this new angle on finding performance bugs will help DBMS developers in improving DBMSs' performance.
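
A minimal sketch of the cardinality-comparison oracle described above: given an original query and a strictly more restrictive rewrite (e.g., a LEFT JOIN replaced by an INNER JOIN), the estimated row count of the restricted query should not exceed that of the original. The `estimated_rows` helper is hypothetical; in practice it would parse the target DBMS's EXPLAIN output, which differs per system:

```python
def estimated_rows(conn, sql):
    """Hypothetical helper: run EXPLAIN on `sql` via `conn` and return the
    optimizer's estimated number of result rows. The EXPLAIN format differs
    across DBMSs (MySQL, TiDB, CockroachDB, ...), so parsing is left abstract."""
    raise NotImplementedError

def check_cardinality_restriction(conn, original_sql, restricted_sql):
    """Oracle: a more restrictive query must not have a *larger* estimated
    cardinality than the original query; a violation hints at a
    cardinality-estimation (and thus potential performance) issue."""
    est_orig = estimated_rows(conn, original_sql)
    est_restr = estimated_rows(conn, restricted_sql)
    if est_restr > est_orig:
        return (f"potential cardinality-estimation issue: restricted query "
                f"estimated {est_restr} rows > original {est_orig} rows")
    return None

# Illustrative query pair over a hypothetical schema:
original = "SELECT * FROM t0 LEFT JOIN t1 ON t0.id = t1.t0_id"
restricted = "SELECT * FROM t0 INNER JOIN t1 ON t0.id = t1.t0_id"
# report = check_cardinality_restriction(conn, original, restricted)
```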

DOI: 10.1145/3597503.3639076


Optimistic Prediction of Synchronization-Reversal Data Races

作者: Shi, Zheng and Mathur, Umang and Pavlogiannis, Andreas
关键词: No keywords

Abstract

Dynamic data race detection has emerged as a key technique for ensuring reliability of concurrent software in practice. However, dynamic approaches can often miss data races owing to non-determinism in the thread scheduler. Predictive race detection techniques cater to this shortcoming by inferring alternate executions that may expose data races without re-executing the underlying program. More formally, the dynamic data race prediction problem asks, given a trace σ of an execution of a concurrent program, can σ be correctly reordered to expose a data race? Existing state-of-the-art techniques for data race prediction either do not scale to executions arising from real-world concurrent software, or only expose a limited class of data races, such as those that can be exposed without reversing the order of synchronization operations. In general, exposing data races by reasoning about synchronization reversals is an intractable problem. In this work, we identify a class of data races, called Optimistic Sync(hronization)-Reversal races, that can be detected in a tractable manner and often include non-trivial data races that cannot be exposed by prior tractable techniques. We also propose a sound algorithm OSR for detecting all optimistic sync-reversal data races in overall quadratic time, and show that the algorithm is optimal by establishing a matching lower bound. Our experiments demonstrate the effectiveness of OSR: on our extensive suite of benchmarks, OSR reports the largest number of data races, and scales well to large execution traces.

DOI: 10.1145/3597503.3639099


Mozi: Discovering DBMS Bugs via Configuration-Based Equivalent Transformation

作者: Liang, Jie and Wu, Zhiyong and Fu, Jingzhou and Wang, Mingzhe and Sun, Chengnian and Jiang, Yu
关键词: DBMS testing, configuration, test oracle

Abstract

Testing database management systems (DBMSs) is a complex task. Traditional approaches, such as metamorphic testing, need a precise comprehension of the SQL specification to create diverse inputs with equivalent semantics. The vagueness and intricacy of the SQL specification make it challenging to accurately model query semantics, thereby posing difficulties in testing the correctness and performance of DBMSs. To address this, we propose Mozi, a framework that finds DBMS bugs via configuration-based equivalent transformation. The key idea behind Mozi is to compare the results of equivalent DBMSs with different configurations, rather than between semantically equivalent queries. The framework involves analyzing the query plan, changing configurations to transform the DBMS to an equivalent one, and re-executing the query to compare the results using various test oracles. For example, detecting differences in query results indicates correctness bugs, while observing faster execution times on the optimization-closed DBMS suggests performance bugs. We demonstrate the effectiveness of Mozi by evaluating it on four widely used DBMSs, namely MySQL, MariaDB, ClickHouse, and PostgreSQL. In continuous testing, Mozi found a total of 101 previously unknown bugs, including 49 correctness and 52 performance bugs in four DBMSs. Among them, 90 bugs are confirmed and 57 bugs have been fixed. In addition, Mozi can be extended to other DBMS fuzzers for testing various types of bugs. With Mozi, testing DBMSs becomes simpler and more effective, potentially saving time and effort that would otherwise be spent on precisely modeling SQL specifications for testing purposes.
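
A rough sketch of the two oracles described in the abstract. The DBMS handle and its `execute` call are hypothetical placeholders, and the timing threshold is an assumption of this sketch; the real framework derives the equivalent configurations from the query plan:

```python
import time

def run_query(dbms, sql):
    """Hypothetical helper: execute `sql` on a DBMS handle and return
    (sorted result rows, elapsed seconds). `dbms.execute` stands in for
    whatever driver call the target system offers."""
    start = time.perf_counter()
    rows = dbms.execute(sql)
    return sorted(rows), time.perf_counter() - start

def configuration_oracle(dbms_default, dbms_transformed, sql):
    """Compare one query on two configuration-equivalent DBMS instances:
    differing result sets indicate a correctness bug; if the instance with
    an optimization turned off is much faster than the default one, the
    default configuration's optimizer is a performance-bug candidate."""
    rows_default, secs_default = run_query(dbms_default, sql)
    rows_transformed, secs_transformed = run_query(dbms_transformed, sql)
    findings = []
    if rows_default != rows_transformed:
        findings.append("correctness bug: result sets differ between configurations")
    if secs_default > 2.0 * secs_transformed:   # 2x threshold is an assumption
        findings.append("performance bug candidate: default configuration much slower")
    return findings

# findings = configuration_oracle(conn_default, conn_optimization_off, "SELECT ...")
```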

DOI: 10.1145/3597503.3639112


FlakeSync: Automatically Repairing Async Flaky Tests

作者: Rahman, Shanto and Shi, August
关键词: flaky test repair, async flaky tests

Abstract

Regression testing is an important part of the development process but suffers from the presence of flaky tests. Flaky tests nondeterministically pass or fail when run on the same code, misleading developers about the correctness of their changes. A common type of flaky tests are async flaky tests that flakily fail due to timing-related issues such as asynchronous waits that do not return in time or different thread interleavings during execution. Developers commonly try to repair async flaky tests by inserting or increasing some wait time, but such repairs are unreliable. We propose FlakeSync, a technique for automatically repairing async flaky tests by introducing synchronization for a specific test execution. FlakeSync works by identifying a critical point, representing some key part of code that must be executed early w.r.t. other concurrently executing code, and a barrier point, representing the part of code that should wait until the critical point has been executed. FlakeSync can modify code to check when the critical point is executed and have the barrier point keep waiting until the critical point has been executed, essentially synchronizing these two parts of code for the specific test execution. Our evaluation of FlakeSync on known flaky tests from prior work shows that FlakeSync can automatically repair 83.75% of async flaky tests, and the resulting changes add a median overhead of only 1.00X the original test runtime. We submitted 10 pull requests with our changes to developers, with 3 already accepted and none rejected.
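
The critical-point/barrier-point idea can be pictured with a tiny Python sketch (illustrative only; FlakeSync itself instruments Java tests): the barrier point waits on a flag that is set when the critical point executes, instead of sleeping for a fixed amount of time:

```python
import threading
import time

critical_reached = threading.Event()    # shared flag inserted by the repair
result = {}

def async_worker():
    time.sleep(0.2)                      # nondeterministic delay in real code
    # --- critical point: the key code that must run before the assertion ---
    result["value"] = 42
    critical_reached.set()               # signal that the critical point ran

def test_async_result():
    threading.Thread(target=async_worker).start()
    # --- barrier point: wait until the critical point has executed ---
    assert critical_reached.wait(timeout=5), "critical point never reached"
    assert result["value"] == 42         # now safe: no flaky timing window

test_async_result()
print("test passed deterministically")
```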

DOI: 10.1145/3597503.3639115


Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model

作者: Liu, Zhe and Chen, Chunyang and Wang, Junjie and Chen, Mengzhuo and Wu, Boyu and Tian, Zhilin and Huang, Yuekai and Hu, Jun and Wang, Qing
关键词: Android GUI testing, large language model, in-context learning

Abstract

Mobile applications have become a ubiquitous part of our daily life, providing users with access to various services and utilities. Text input, as an important interaction channel between users and applications, plays an important role in core functionality such as search queries, authentication, messaging, etc. However, certain special text (e.g., -18 for Font Size) can cause the app to crash, and generating diversified unusual inputs for fully testing the app is highly demanded. Nevertheless, this is also challenging due to the combinatorial explosion dilemma, high context sensitivity, and complex constraint relations. This paper proposes InputBlaster, which leverages the LLM to automatically generate unusual text inputs for mobile app crash detection. It formulates the unusual inputs generation problem as a task of producing a set of test generators, each of which can yield a batch of unusual text inputs under the same mutation rule. In detail, InputBlaster leverages the LLM to produce the test generators together with the mutation rules serving as the reasoning chain, and utilizes the in-context learning schema to demonstrate the LLM with examples for boosting the performance. InputBlaster is evaluated on 36 text input widgets with crash bugs involving 31 popular Android apps, and results show that it achieves a 78% bug detection rate, which is 136% higher than the best baseline. Besides, we integrate it with the automated GUI testing tool and detect 37 unseen crashes in real-world apps.
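
To make the notion of a "test generator plus mutation rule" concrete, here is a hand-written sketch (not produced by an LLM and not from the paper): one generator that, under a single mutation rule ("perturb sign, magnitude, padding, or whitespace of numeric inputs"), yields a batch of unusual text inputs for a numeric widget such as a font-size field:

```python
import random

def numeric_overflow_generator(valid_example="14", batch=8, seed=0):
    """One test generator under one mutation rule: take a valid numeric input
    and emit unusual variants (negative, huge, zero-padded, with stray
    whitespace) that apps often fail to validate."""
    rng = random.Random(seed)
    base = int(valid_example)
    variants = []
    for _ in range(batch):
        choice = rng.choice(["negate", "huge", "pad", "space"])
        if choice == "negate":
            variants.append(str(-abs(base) - rng.randint(1, 30)))     # e.g. "-18"
        elif choice == "huge":
            variants.append(str(base * 10 ** rng.randint(6, 12)))     # overflow-ish
        elif choice == "pad":
            variants.append("0" * rng.randint(5, 20) + str(base))     # leading zeros
        else:
            variants.append(f"  {base}\t")                            # stray whitespace
    return variants

print(numeric_overflow_generator())
```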

DOI: 10.1145/3597503.3639118


Towards Finding Accounting Errors in Smart Contracts

作者: Zhang, Brian
关键词: blockchain, smart contract, accounting error, type checking

Abstract

Bugs in smart contracts may have devastating effects as they tend to cause financial loss. According to a recent study, accounting bugs are the most common kind of bugs in smart contracts that are beyond automated tools during pre-deployment auditing. The reason lies in that these bugs are usually in the core business logic and hence contract-specific. They are analogous to functional bugs in traditional software, which are largely beyond automated bug finding tools whose effectiveness hinges on uniform and machine checkable characteristics of bugs. It was also reported that accounting bugs are the second-most difficult to find through manual auditing, due to the need of understanding underlying business models. We observe that a large part of business logic in smart contracts can be modeled by a few primitive operations like those in a bank, such as deposit, withdraw, loan, and pay-off, or by their combinations. The properties of these operations can be clearly defined and checked by an abstract type system that models high-order information such as token units, scaling factors, and financial types. We hence develop a novel type propagation and checking system with the aim of identifying accounting bugs. Our evaluation on a large set of 57 existing accounting bugs in 29 real-world projects shows that 58% of the accounting bugs are type errors. Our system catches 87.9% of these type errors. In addition, applying our technique to auditing a large project in a very recent auditing contest has yielded the identification of 6 zero-day accounting bugs with 4 leading to direct fund loss.
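
The abstract type idea sketched above can be illustrated with a few lines of Python (a toy re-creation, not the authors' type system, and the token names and scales are hypothetical): each value carries a token unit and a fixed-point scaling factor, and operations whose operand types disagree are flagged as candidate accounting errors:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Typed:
    value: int
    token: str   # e.g. "USDC", "DAI"
    scale: int   # fixed-point scaling factor, e.g. 10**6 vs 10**18

def checked_add(a: Typed, b: Typed) -> Typed:
    """Adding amounts only makes sense for the same token at the same scale;
    anything else is reported as a candidate accounting bug."""
    if a.token != b.token:
        raise TypeError(f"accounting error: adding {a.token} to {b.token}")
    if a.scale != b.scale:
        raise TypeError(f"accounting error: mismatched scales {a.scale} vs {b.scale}")
    return Typed(a.value + b.value, a.token, a.scale)

deposit = Typed(5_000_000, "USDC", 10**6)     # 5 USDC (6 decimals)
loan = Typed(2 * 10**18, "DAI", 10**18)       # 2 DAI (18 decimals)

try:
    checked_add(deposit, loan)                # flagged: different tokens and scales
except TypeError as err:
    print(err)
```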

DOI: 10.1145/3597503.3639128


MultiTest: Physical-Aware Object Insertion for Testing Multi-sensor Fusion Perception Systems

作者: Gao, Xinyu and Wang, Zhijie and Feng, Yang and Ma, Lei and Chen, Zhenyu and Xu, Baowen
关键词: testing, multi-sensor fusion, perception systems

Abstract

Multi-sensor fusion (MSF) stands as a pivotal technique in addressing numerous safety-critical tasks and applications, e.g., self-driving cars and automated robotic arms. With the continuous advancement in data-driven artificial intelligence (AI), MSF's potential for sensing and understanding intricate external environments has been further amplified, bringing a profound impact on intelligent systems and specifically on their perception systems. Similar to traditional software, adequate testing is also required for AI-enabled MSF systems. Yet, existing testing methods primarily concentrate on single-sensor perception systems (e.g., image-based and point cloud-based object detection systems). There remains a lack of emphasis on generating multi-modal test cases for MSF systems. To address these limitations, we design and implement MultiTest, a fitness-guided metamorphic testing method for complex MSF perception systems. MultiTest employs a physical-aware approach to synthesize realistic multi-modal object instances and insert them into critical positions of background images and point clouds. A fitness metric is designed to guide and boost the test generation process. We conduct extensive experiments with five state-of-the-art perception systems to evaluate MultiTest from the perspectives of: (1) generated test cases' realism, (2) fault detection capabilities, and (3) performance improvement. The results show that MultiTest can generate realistic and modality-consistent test data and effectively detect hundreds of diverse faults of an MSF system under test. Moreover, retraining an MSF system on the test cases generated by MultiTest can improve the system's robustness. Our replication package and synthesized testing dataset are publicly available at https://sites.google.com/view/msftest.

DOI: 10.1145/3597503.3639191


JLeaks: A Featured Resource Leak Repository Collected from Hundreds of Open-Source Java Projects

作者: Liu, Tianyang and Ji, Weixing and Dong, Xiaohui and Yao, Wuhuang and Wang, Yizhuo and Liu, Hui and Peng, Haiyang and Wang, Yuxuan
关键词: resource leak, defect repository, open-source projects, java language

Abstract

High-quality defect repositories are vital in defect detection, localization, and repair. However, existing repositories collected from open-source projects are either small-scale or inadequately labeled and packed. This paper systematically summarizes the programming APIs of system resources (i.e., file, socket, and thread) in Java. Additionally, this paper demonstrates the exceptions that may cause resource leaks in the chained and nested streaming operations. A semi-automatic toolchain is built to improve the efficiency of defect extraction, including automatic building for large legacy Java projects. Accordingly, 1,094 resource leaks were collected from 321 open-source projects on GitHub. This repository, named JLeaks, was built by round-by-round filtering and cross-validation, involving the review of approximately 3,185 commits from hundreds of projects. JLeaks is currently the largest resource leak repository, and each defect in JLeaks is well-labeled and packed, including causes, locations, patches, source files, and compiled bytecode files for 254 defects. We have conducted a detailed analysis of JLeaks for defect distribution, root causes, and fix approaches. We compare JLeaks with two well-known resource leak repositories, and the results show that JLeaks is more informative and complete, with high availability, uniqueness, and consistency. Additionally, we show the usability of JLeaks in two application scenarios. Future studies can leverage our repository to encourage better design and implementation of defect-related algorithms and tools.

DOI: 10.1145/3597503.3639162


S3C: Spatial Semantic Scene Coverage for Autonomous Vehicles

作者: Woodlief, Trey and Toledo, Felipe and Elbaum, Sebastian and Dwyer, Matthew B
关键词: coverage, scene graph, autonomous vehicles, perception

Abstract

Autonomous vehicles (AVs) must be able to operate in a wide range of scenarios, including those in the long-tail distribution that includes rare but safety-critical events. The collection of sensor input and expected output datasets from such scenarios is crucial for the development and testing of such systems. Yet, approaches to quantify the extent to which a dataset covers test specifications that capture critical scenarios remain limited in their ability to discriminate between inputs that lead to distinct behaviors, and to render interpretations that are relevant to AV domain experts. To address this challenge, we introduce S3C, a framework that abstracts sensor inputs to coverage domains that account for the spatial semantics of a scene. The approach leverages scene graphs to produce a sensor-independent abstraction of the AV environment that is interpretable and discriminating. We provide an implementation of the approach and a study for camera-based autonomous vehicles operating in simulation. The findings show that S3C outperforms existing techniques in discriminating among classes of inputs that cause failures, and offers spatial interpretations that can explain to what extent a dataset covers a test specification. Further exploration of S3C with open datasets complements the study findings, revealing the potential and shortcomings of deploying the approach in the wild.

DOI: 10.1145/3597503.3639178


FlashSyn: Flash Loan Attack Synthesis via Counter Example Driven Approximation

作者: Chen, Zhiyang and Beillahi, Sidi Mohamed and Long, Fan
关键词: program synthesis, program analysis, blockchain, smart contracts, vulnerability detection, flash loan

Abstract

In decentralized finance (DeFi), lenders can offer flash loans to borrowers, i.e., loans that are only valid within a blockchain transaction and must be repaid with fees by the end of that transaction. Unlike normal loans, flash loans allow borrowers to borrow large assets without upfront collateral deposits. Malicious adversaries use flash loans to gather large assets to exploit vulnerable DeFi protocols. In this paper, we introduce a new framework for automated synthesis of adversarial transactions that exploit DeFi protocols using flash loans. To bypass the complexity of a DeFi protocol, we propose a new technique to approximate the DeFi protocol functional behaviors using numerical methods (polynomial approximation and nearest-neighbor interpolation). We then construct an optimization query using the approximated functions of the DeFi protocol to find an adversarial attack consisting of a sequence of function invocations with optimal parameters that gives the maximum profit. To improve the accuracy of the approximation, we propose a novel counterexample-driven approximation refinement technique. We implement our framework in a tool named FlashSyn. We evaluate FlashSyn on 16 DeFi protocols that fell victim to flash loan attacks and 2 DeFi protocols from Damn Vulnerable DeFi challenges. FlashSyn automatically synthesizes an adversarial attack for 16 of the 18 benchmarks, demonstrating its effectiveness in finding possible flash loan attacks.
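
A drastically simplified sketch of the approximate-then-optimize loop: a synthetic one-dimensional "protocol" profit function, a polynomial approximation fit to a few samples, and a grid search over the approximation. Real attacks involve multi-step action sequences and counterexample-driven refinement; the profit function and parameter range here are purely illustrative:

```python
import numpy as np

# Pretend this is an expensive-to-analyze DeFi protocol function:
# attacker profit as a function of the flash-loan amount x (synthetic).
def protocol_profit(x):
    return -0.000001 * (x - 3_000) ** 2 + 9.0   # peak profit near x = 3000

# 1) Sample the protocol behavior at a handful of points.
xs = np.linspace(0, 10_000, 25)
ys = np.array([protocol_profit(x) for x in xs])

# 2) Fit a cheap polynomial approximation of the profit function.
approx = np.poly1d(np.polyfit(xs, ys, deg=2))

# 3) Optimize over the approximation to pick attack parameters.
candidates = np.linspace(0, 10_000, 10_001)
best_x = candidates[np.argmax(approx(candidates))]

# 4) Validate against the real function; a large gap would trigger
#    counterexample-driven refinement (re-sampling near best_x).
print(f"best loan amount ~ {best_x:.0f}, "
      f"approx profit {approx(best_x):.2f}, real profit {protocol_profit(best_x):.2f}")
```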

DOI: 10.1145/3597503.3639190


Testing Graph Database Systems via Equivalent Query Rewriting

作者: Mang, Qiuyang and Fang, Aoyang and Yu, Boxi and Chen, Hanfei and He, Pinjia
关键词: graph databases, metamorphic testing, query rewriting

Abstract

Graph Database Management Systems (GDBMS), which utilize graph models for data storage and execute queries via graph traversals, have seen ubiquitous usage in real-world scenarios such as recommendation systems, knowledge graphs, and social networks. Much like Relational Database Management Systems (RDBMS), GDBMS are not immune to bugs. These bugs typically manifest as logic errors that yield incorrect results (e.g., omitting a node that should be included), performance bugs (e.g., long execution time caused by redundant graph scanning), and exception issues (e.g., unexpected or missing exceptions). This paper adapts Equivalent Query Rewriting (EQR) to GDBMS testing. EQR rewrites a GDBMS query into equivalent ones that trigger distinct query plans, and checks whether they exhibit discrepancies in system behaviors. To facilitate the realization of EQR, we propose a general concept called Abstract Syntax Graph (ASG). Its core idea is to embed the semantics of a base query into the paths of a graph, which can be utilized to generate new queries with customized properties (e.g., equivalence). Given a base query, an ASG is constructed and then an equivalent query can be generated by finding paths collectively carrying the complete semantics of the base query. To this end, we further design Random Walk Covering (RWC), a simple yet effective path covering algorithm. As a practical implementation of these ideas, we develop a tool GRev, which has successfully detected 22 previously unknown bugs across 5 popular GDBMS, with 15 of them being confirmed. In particular, 14 of the detected bugs are related to improper implementation of graph data retrieval in GDBMS, which is challenging to identify for existing techniques.

DOI: 10.1145/3597503.3639200


ROSInfer: Statically Inferring Behavioral Component Models for ROS-based Robotics Systems

作者: Dürschmid, Tobias
关键词: No keywords

Abstract

Robotics systems are complex, safety-critical systems that can consist of hundreds of software components that interact with each other dynamically during run time. Software components of robotics systems often exhibit reactive, periodic, and state-dependent behavior. Incorrect component composition can lead to unexpected behavior, such as components passively waiting for initiation messages that never arrive. Model-based software analysis is a common technique to identify incorrect behavioral composition by checking desired properties of given behavioral models that are based on component state machines. However, writing state machine models for hundreds of software components manually is a labor-intensive process. This motivates work on automated model inference. In this paper, we present an approach to infer behavioral models for systems based on the Robot Operating System (ROS) using static analysis by exploiting assumptions about the usage of the ROS API and ecosystem. Our approach is based on searching for common behavioral patterns that ROS developers use for implementing reactive, periodic, and state-dependent behavior using the ROS framework API. We evaluate our approach and our tool ROSInfer on five complex real-world ROS systems with a total of 534 components. For this purpose we manually created 155 models of components from the source code to be used as a ground truth and available data set for other researchers. ROSInfer can infer causal triggers for 87% of component architectural behaviors in the 534 components.

DOI: 10.1145/3597503.3639206


Finding XPath Bugs in XML Document Processors via Differential Testing

作者: Li, Shuxin and Rigger, Manuel
关键词: XML processors, XPath generation, differential testing

Abstract

Extensible Markup Language (XML) is a widely used file format for data storage and transmission. Many XML processors support XPath, a query language that enables the extraction of elements from XML documents. These systems can be affected by logic bugs, which are bugs that cause the processor to return incorrect results. In order to tackle such bugs, we propose a new approach, which we realized as a system called XPress. As a test oracle, XPress relies on differential testing, which compares the results of multiple systems on the same test input, and identifies bugs through discrepancies in their outputs. As test inputs, XPress generates both XML documents and XPath queries. Aiming to generate meaningful queries that compute non-empty results, XPress selects a so-called targeted node to guide the XPath expression generation process. Using the targeted node, XPress generates XPath expressions that reference existing context related to the targeted node, such as its tag name and attributes, while also guaranteeing that a predicate evaluates to true before further expanding the query. We tested our approach on six mature XML processors, BaseX, eXist-DB, Saxon, PostgreSQL, libXML2, and a commercial database system. In total, we have found 27 unique bugs in these systems, of which 25 have been verified by the developers, and 20 of which have been fixed. XPress is efficient, as it finds 12 unique bugs in BaseX in 24 hours, which is 2×

DOI: 10.1145/3597503.3639208


Sedar: Obtaining High-Quality Seeds for DBMS Fuzzing via Cross-DBMS SQL Transfer

作者: Fu, Jingzhou and Liang, Jie and Wu, Zhiyong and Jiang, Yu
关键词: DBMS fuzzing, initial seeds, vulnerability detection

Abstract

Effective DBMS fuzzing relies on high-quality initial seeds, which serve as the starting point for mutation. These initial seeds should incorporate various DBMS features to explore the state space thoroughly. While built-in test cases are typically used as initial seeds, many DBMSs lack comprehensive test cases, making it difficult to apply state-of-the-art fuzzing techniques directly. To address this, we propose Sedar, which produces initial seeds for a target DBMS by transferring test cases from other DBMSs. The underlying insight is that many DBMSs share similar functionalities, allowing seeds that cover deep execution paths in one DBMS to be adapted for other DBMSs. The challenge lies in converting these seeds to a format supported by the grammar of the target database. Sedar follows a three-step process to generate seeds. First, it executes existing SQL test cases within the DBMS they were designed for and captures the schema information during execution. Second, it utilizes large language models (LLMs) along with the captured schema information to guide the generation of new test cases based on the responses from the LLM. Lastly, to ensure that the test cases can be properly parsed and mutated by fuzzers, Sedar temporarily comments out unparsable sections for the fuzzers and uncomments them after mutation. We integrate Sedar into the DBMS fuzzers Squirrel and Griffin, targeting DBMSs such as Virtuoso, MonetDB, DuckDB, and ClickHouse. Evaluation results demonstrate significant improvements in both fuzzers. Specifically, compared to Squirrel and Griffin with non-transferred seeds, Sedar enhances code coverage by 72.46%-214.84% and 21.40%-194.46%; compared to Squirrel and Griffin with native test cases of these DBMSs as initial seeds, incorporating the transferred seeds of Sedar results in an improvement in code coverage by 4.90%-16.20% and 9.73%-28.41%. Moreover, Sedar discovered 70 new vulnerabilities, with 60 of them uniquely found by Sedar using transferred seeds, and 19 of them have been assigned CVEs.
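
The last step described above can be pictured with a small sketch. The `can_parse` check and the comment marker are hypothetical stand-ins for the fuzzer's grammar and bookkeeping: statements the fuzzer cannot parse are temporarily turned into comments so mutation can proceed, and are restored afterwards:

```python
MARK = "--SEDAR-KEEP-- "   # marker prefix used to hide statements from the fuzzer

def can_parse(stmt):
    """Hypothetical: returns True if the fuzzer's SQL grammar accepts stmt.
    The rule below is a toy placeholder for illustration only."""
    return "WINDOW" not in stmt.upper()

def comment_out_unparsable(statements):
    return [s if can_parse(s) else MARK + s for s in statements]

def restore_comments(statements):
    return [s[len(MARK):] if s.startswith(MARK) else s for s in statements]

seed = [
    "CREATE TABLE t (a INT, b INT);",
    "SELECT a, SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY a);",
    "INSERT INTO t VALUES (1, 2);",
]
hidden = comment_out_unparsable(seed)     # fuzzer mutates only parsable lines
# ... fuzzer mutation would happen here ...
restored = restore_comments(hidden)       # full SQL handed back to the target DBMS
print("\n".join(restored))
```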

DOI: 10.1145/3597503.3639210


Automatically Detecting Reflow Accessibility Issues in Responsive Web Pages

作者: Chiou, Paul T. and Winn, Robert and Alotaibi, Ali S. and Halfond, William G. J.
关键词: web accessibility, WCAG, software testing, responsive web design, reflow, keyboard accessibility, inclusive design

Abstract

Many web applications today use responsive design to adjust the view of web pages to match the screen size of end users. People with disabilities often use an alternative view either due to zooming on a desktop device to enlarge text or viewing within a smaller viewport when using assistive technologies. When web pages are not implemented to correctly adjust the page’s content across different screen sizes, it can lead to both a loss of content and functionalities between the different versions. Recent studies show that these reflow accessibility issues are among the most prevalent modern web accessibility issues. In this paper, we present a novel automated technique to automatically detect reflow accessibility issues in web pages for keyboard users. The evaluation of our approach on real-world web pages demonstrated its effectiveness in detecting reflow accessibility issues, outperforming state-of-the-art techniques.

DOI: 10.1145/3597503.3639229


Towards More Practical Automation of Vulnerability Assessment

作者: Pan, Shengyi and Bao, Lingfeng and Zhou, Jiayuan and Hu, Xing and Xia, Xin and Li, Shanping
关键词: software security, vulnerability assessment, CVSS

Abstract

It is increasingly suggested to identify emerging software vulnerabilities (SVs) through relevant development activities (e.g., issue reports) to allow early warnings to open source software (OSS) users. However, the support for the subsequent assessment of the detected SVs has not yet been explored. SV assessment characterizes the detected SVs to prioritize limited remediation resources on the critical ones. To fill this gap, we aim to enable early vulnerability assessment based on SV-related issue reports (SIR). Besides, we observe the following concerns of the existing assessment techniques: 1) the assessment output lacks rationale and practical value; 2) the associations between Common Vulnerability Scoring System (CVSS) metrics have been ignored; 3) insufficient evaluation scenarios and metrics. We address these concerns to enhance the practicality of our proposed early vulnerability assessment approach (namely proEVA). Specifically, based on the observation of strong associations between CVSS metrics, we propose a prompt-based model to exploit such relations for CVSS metrics prediction. Moreover, we design a curriculum-learning (CL) schedule to guide the model to better learn such hidden associations during training. Aside from the standard classification metrics adopted in existing works, we propose two severity-aware metrics to provide a more comprehensive evaluation regarding the prioritization of high-severity SVs. Experimental results show that proEVA significantly outperforms the baselines in both types of metrics. We further discuss the transferability of the prediction model regarding the upgrade of the assessment system, an important yet overlooked evaluation scenario in existing works. The results verify that proEVA is more efficient and flexible in migrating to different assessment systems.

DOI: 10.1145/3597503.3639110


VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses

作者: Nong, Yu and Fang, Richard and Yi, Guangbei and Zhao, Kunsong and Luo, Xiapu and Chen, Feng and Cai, Haipeng
关键词: vulnerability dataset, vulnerability injection, data quality, vulnerability analysis, deep learning, program generation

Abstract

Accompanying the successes of learning-based defensive software vulnerability analyses is the lack of large and quality sets of labeled vulnerable program samples, which impedes further advancement of those defenses. Existing automated sample generation approaches have shown potential yet still fall short of practical expectations due to the high noise in the generated samples. This paper proposes VGX, a new technique aimed at large-scale generation of high-quality vulnerability datasets. Given a normal program, VGX identifies the code contexts in which vulnerabilities can be injected, using a customized Transformer featured with a new value-flow-based position encoding and pre-trained against new objectives particularly for learning code structure and context. Then, VGX materializes vulnerability-injection code editing in the identified contexts using patterns of such edits obtained from both historical fixes and human knowledge about real-world vulnerabilities. Compared to four state-of-the-art (SOTA) (i.e., pattern-, Transformer-, GNN-, and pattern+Transformer-based) baselines, VGX achieved 99.09–890.06% higher F1 and 22.45%-328.47% higher label accuracy. For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair. Our results show SOTA techniques for these three application tasks achieved 19.15–330.80% higher F1, 12.86–19.31% higher top-10 accuracy, and 85.02–99.30% higher top-50 accuracy, respectively, by adding those samples to their original training data. These samples also helped a SOTA vulnerability detector discover 13 more real-world vulnerabilities (CVEs) in critical systems (e.g., Linux kernel) that would be missed by the original model.

DOI: 10.1145/3597503.3639116


MalCertain: Enhancing Deep Neural Network Based Android Malware Detection by Tackling Prediction Uncertainty

作者: Li, Haodong and Xu, Guosheng and Wang, Liu and Xiao, Xusheng and Luo, Xiapu and Xu, Guoai and Wang, Haoyu
关键词: Android malware detection, uncertainty, DNN

Abstract

The long-lasting Android malware threat has attracted significant research efforts in malware detection. In particular, by modeling malware detection as a classification problem, machine learning based approaches, especially deep neural network (DNN) based approaches, are increasingly being used for Android malware detection and have achieved significant improvements over other detection approaches such as signature-based approaches. However, as Android malware evolves rapidly and adversarial samples are present, DNN models trained on early constructed samples often yield poor decisions when used to detect newly emerging samples. Fundamentally, this phenomenon can be summarized as the uncertainty in the data (noise or randomness) and the weakness in the training process (insufficient training data). Overlooking these uncertainties poses risks in the model predictions. In this paper, we take the first step to estimate the prediction uncertainty of DNN models in malware detection and leverage these estimates to enhance Android malware detection techniques. Specifically, besides training a DNN model to predict malware, we employ several uncertainty estimation methods to train a Correction Model that determines whether a sample is correctly or incorrectly predicted by the DNN model. We then leverage the estimated uncertainty output by the Correction Model to correct the prediction results, improving the accuracy of the DNN model. Experimental results show that our proposed MalCertain effectively improves the accuracy of the underlying DNN models for Android malware detection by around 21% and significantly improves the detection effectiveness of adversarial Android malware samples by up to 94.38%. Our research sheds light on the promising direction that leverages prediction uncertainty to improve prediction-based software engineering tasks.

DOI: 10.1145/3597503.3639122


Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks

作者: Liu, Zhongxin and Tang, Zhijie and Zhang, Junwei and Xia, Xin and Yang, Xiaohu
关键词: source code pre-training, program dependence analysis, vulnerability detection, vulnerability classification, vulnerability assessment

Abstract

Vulnerability analysis is crucial for software security. Inspired by the success of pre-trained models on software engineering tasks, this work focuses on using pre-training techniques to enhance the understanding of vulnerable code and boost vulnerability analysis. The code understanding ability of a pre-trained model is highly related to its pre-training objectives. The semantic structure, e.g., control and data dependencies, of code is important for vulnerability analysis. However, existing pre-training objectives either ignore such structure or focus on learning to use it. The feasibility and benefits of learning the knowledge of analyzing semantic structure have not been investigated. To this end, this work proposes two novel pre-training objectives, namely Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP), which aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet only based on its source code. During pre-training, CDP and DDP can guide the model to learn the knowledge required for analyzing fine-grained dependencies in code. After pre-training, the pre-trained model can boost the understanding of vulnerable code during fine-tuning and can directly be used to perform dependence analysis for both partial and complete functions. To demonstrate the benefits of our pre-training objectives, we pre-train a Transformer model named PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks, i.e., vulnerability detection, vulnerability classification, and vulnerability assessment, and also evaluate it on program dependence analysis. Experimental results show that PDBERT benefits from CDP and DDP, leading to state-of-the-art performance on the three downstream tasks. Also, PDBERT achieves F1-scores of over 99% and 94% for predicting control and data dependencies, respectively, in partial and complete functions.

DOI: 10.1145/3597503.3639142


Investigating White-Box Attacks for On-Device Models

作者: Zhou, Mingyi and Gao, Xiang and Wu, Jing and Liu, Kui and Sun, Hailong and Li, Li
关键词: No keywords

Abstract

Numerous mobile apps have leveraged deep learning capabilities. However, on-device models are vulnerable to attacks as they can be easily extracted from their corresponding mobile apps. Although the structure and parameter information of these models can be accessed, existing on-device attacking approaches only generate black-box attacks (i.e., indirect white-box attacks), which are less effective and efficient than white-box strategies. This is because mobile deep learning (DL) frameworks like TensorFlow Lite (TFLite) do not support gradient computing (referred to as non-debuggable models), which is necessary for white-box attacking algorithms. Thus, we argue that existing findings may underestimate the harmfulness of on-device attacks. To validate this, we systematically analyze the difficulties of transforming the on-device model to its debuggable version and propose a Reverse Engineering framework for On-device Models (REOM), which automatically reverses the compiled on-device TFLite model to its debuggable version, enabling attackers to launch white-box attacks. Our empirical results show that our approach is effective in achieving automated transformation (i.e., 92.6%) among 244 TFLite models. Compared with previous attacks using surrogate models, REOM enables attackers to achieve higher attack success rates (10.23%→89.03%) with a hundred times smaller attack perturbations (1.0→0.01). Our findings emphasize the need for developers to carefully consider their model deployment strategies, and use white-box methods to evaluate the vulnerability of on-device models. Our artifacts are available.

DOI: 10.1145/3597503.3639144


Towards Causal Deep Learning for Vulnerability Detection

作者: Rahman, Md Mahbubur and Ceka, Ira and Mao, Chengzhi and Chakraborty, Saikat and Ray, Baishakhi and Le, Wei
关键词: vulnerability detection, causality, spurious features

Abstract

Deep learning vulnerability detection has shown promising results in recent years. However, an important challenge that still blocks it from being very useful in practice is that the model is not robust under perturbation and it cannot generalize well over the out-of-distribution (OOD) data, e.g., applying a trained model to unseen projects in the real world. We hypothesize that this is because the model learned non-robust features, e.g., variable names, that have spurious correlations with labels. When the perturbed and OOD datasets no longer have the same spurious features, the model prediction fails. To address the challenge, in this paper, we introduce causality into deep learning vulnerability detection. Our approach CausalVul consists of two phases. First, we designed novel perturbations to discover spurious features that the model may use to make predictions. Second, we applied the causal learning algorithms, specifically, do-calculus, on top of existing deep learning models to systematically remove the use of spurious features and thus promote causal based prediction. Our results show that CausalVul consistently improved the model accuracy, robustness and OOD performance for all the state-of-the-art models and datasets we experimented with. To the best of our knowledge, this is the first work that introduces do-calculus-based causal learning to software engineering models and shows it is indeed useful for improving model accuracy, robustness and generalization. Our replication package is located at https://figshare.com/s/0ffda320dcb96c249ef2.

DOI: 10.1145/3597503.3639170


MetaLog: Generalizable Cross-System Anomaly Detection from Logs with Meta-Learning

作者: Zhang, Chenyangguang and Jia, Tong and Shen, Guopeng and Zhu, Pinyan and Li, Ying
关键词: meta-learning, anomaly detection, system logs

Abstract

Log-based anomaly detection plays a crucial role in ensuring the stability of software. However, current approaches for log-based anomaly detection heavily depend on a vast amount of labeled historical data, which is often unavailable in many real-world systems. To mitigate this problem, we leverage the features of the abundant historical labeled logs of mature systems to help construct anomaly detection models of new systems with very few labels, that is, to generalize the model ability trained from labeled logs of mature systems to achieve anomaly detection on new systems with insufficient data labels. Specifically, we propose MetaLog, a generalizable cross-system anomaly detection approach. MetaLog first incorporates a globally consistent semantic embedding module to obtain log event semantic embedding vectors in a shared global space. Then it leverages the meta-learning paradigm to improve the model's generalization ability. We evaluate MetaLog's performance on four public log datasets (HDFS, BGL, OpenStack, and Thunderbird) from four different systems. Results show that MetaLog reaches over 80% F1-score when using only 1% labeled logs of the target system, showing performance similar to state-of-the-art supervised anomaly detection models trained with 100% labeled data. Besides, it outperforms state-of-the-art transfer-learning-based cross-system anomaly detection models by 20% in the same settings of 1% labeled training logs of the target system.

DOI: 10.1145/3597503.3639205


Coca: Improving and Explaining Graph Neural Network-Based Vulnerability Detection Systems

作者: Cao, Sicong and Sun, Xiaobing and Wu, Xiaoxue and Lo, David and Bo, Lili and Li, Bin and Liu, Wei
关键词: contrastive learning, causal inference, explainability

Abstract

Recently, Graph Neural Network (GNN)-based vulnerability detection systems have achieved remarkable success. However, the lack of explainability poses a critical challenge to deploying black-box models in security-related domains. For this reason, several approaches have been proposed to explain the decision logic of the detection model by providing a set of crucial statements positively contributing to its predictions. Unfortunately, due to weakly robust detection models and suboptimal explanation strategies, they risk revealing spurious correlations and suffer from redundancy issues. In this paper, we propose Coca, a general framework aiming to 1) enhance the robustness of existing GNN-based vulnerability detection models to avoid spurious explanations; and 2) provide both concise and effective explanations to reason about the detected vulnerabilities. Coca consists of two core parts referred to as Trainer and Explainer. The former aims to train a detection model which is robust to random perturbation based on combinatorial contrastive learning, while the latter builds an explainer to derive crucial code statements that are most decisive to the detected vulnerability via dual-view causal inference as explanations. We apply Coca over three typical GNN-based vulnerability detectors. Experimental results show that Coca can effectively mitigate the spurious correlation issue, and provide more useful high-quality explanations.

DOI: 10.1145/3597503.3639168


Improving Smart Contract Security with Contrastive Learning-based Vulnerability Detection

作者: Chen, Yizhou and Sun, Zeyu and Gong, Zhihao and Hao, Dan
关键词: smart contract, vulnerability detection, deep learning, contrastive learning

Abstract

Currently, smart contract vulnerabilities (SCVs) have emerged as a major factor threatening the transaction security of blockchain. Existing state-of-the-art methods rely on deep learning to mitigate this threat. They treat each input contract as an independent entity and feed it into a deep learning model to learn vulnerability patterns by fitting vulnerability labels. Unfortunately, they disregard the correlation between contracts, failing to consider the commonalities between contracts of the same type and the differences among contracts of different types. As a result, the performance of these methods falls short of the desired level. To tackle this problem, we propose a novel Contrastive Learning Enhanced Automated Recognition Approach for Smart Contract Vulnerabilities, named Clear. In particular, Clear employs a contrastive learning (CL) model to capture the fine-grained correlation information among contracts and generates correlation labels based on the relationships between contracts to guide the training process of the CL model. Finally, it combines the correlation and the semantic information of the contract to detect SCVs. We conduct an empirical evaluation on a large-scale real-world dataset of over 40K smart contracts and compare Clear against 13 state-of-the-art baseline methods. We show that Clear achieves (1) optimal performance over all baseline methods; (2) 9.73%-39.99% higher F1-score than existing deep learning methods.

DOI: 10.1145/3597503.3639173


On the Effectiveness of Function-Level Vulnerability Detectors for Inter-Procedural Vulnerabilities

作者: Li, Zhen and Wang, Ning and Zou, Deqing and Li, Yating and Zhang, Ruqian and Xu, Shouhuai and Zhang, Chao and Jin, Hai
关键词: vulnerability detection, inter-procedural vulnerability, vulnerability type, patch

Abstract

Software vulnerabilities are a major cyber threat and it is important to detect them. One important approach to detecting vulnerabilities is to use deep learning while treating a program function as a whole, known as function-level vulnerability detectors. However, the limitation of this approach is not understood. In this paper, we investigate its limitation in detecting one class of vulnerabilities known as inter-procedural vulnerabilities, where the to-be-patched statements and the vulnerability-triggering statements belong to different functions. For this purpose, we create the first Inter-Procedural Vulnerability Dataset (InterPVD) based on C/C++ open-source software, and we propose a tool dubbed VulTrigger for identifying vulnerability-triggering statements across functions. Experimental results show that VulTrigger can effectively identify vulnerability-triggering statements and inter-procedural vulnerabilities. Our findings include: (i) inter-procedural vulnerabilities are prevalent with an average of 2.8 inter-procedural layers; and (ii) function-level vulnerability detectors are much less effective in detecting to-be-patched functions of inter-procedural vulnerabilities than detecting their counterparts of intra-procedural vulnerabilities.

DOI: 10.1145/3597503.3639218


A User-centered Security Evaluation of Copilot

作者: Asare, Owura and Nagappan, Meiyappan and Asokan, N.
关键词: user study, code generation, copilot, security, software engineering

Abstract

Code generation tools driven by artificial intelligence have recently become more popular due to advancements in deep learning and natural language processing that have increased their capabilities. The proliferation of these tools may be a double-edged sword because while they can increase developer productivity by making it easier to write code, research has shown that they can also generate insecure code. In this paper, we perform a user-centered evaluation of GitHub's Copilot to better understand its strengths and weaknesses with respect to code security. We conduct a user study where participants solve programming problems (with and without Copilot assistance) that have potentially vulnerable solutions. The main goal of the user study is to determine how the use of Copilot affects participants' security performance. In our set of participants (n=25), we find that access to Copilot accompanies a more secure solution when tackling harder problems. For the easier problem, we observe no effect of Copilot access on the security of solutions. We also observe no disproportionate impact of Copilot use on particular kinds of vulnerabilities. Our results indicate that there are potential security benefits to using Copilot, but more research is warranted on the effects of the use of code generation tools on technically complex problems with security requirements.

DOI: 10.1145/3597503.3639154


An Empirical Study on Oculus Virtual Reality Applications: Security and Privacy Perspectives

作者: Guo, Hanyang and Dai, Hong-Ning and Luo, Xiapu and Zheng, Zibin and Xu, Gengyang and He, Fengliang
关键词: virtual reality, metaverse, static analysis, security and privacy

Abstract

Although Virtual Reality (VR) has accelerated its prevalent adoption in emerging metaverse applications, it is not a fundamentally new technology. On one hand, most VR operating systems (OS) are based on off-the-shelf mobile OS (e.g., Android). As a result, VR apps also inherit privacy and security deficiencies from conventional mobile apps. On the other hand, in contrast to conventional mobile apps, VR apps can achieve an immersive experience via diverse VR devices, such as head-mounted displays, body sensors, and controllers, though achieving this requires the extensive collection of privacy-sensitive human biometrics (e.g., hand-tracking and face-tracking data). Moreover, VR apps have been typically implemented by 3D gaming engines (e.g., Unity), which also contain intrinsic security vulnerabilities. Inappropriate use of these technologies may incur privacy leaks and security vulnerabilities, although these issues have not received significant attention compared to the proliferation of diverse VR apps. In this paper, we develop a security and privacy assessment tool, namely the VR-SP detector for VR apps. The VR-SP detector has integrated program static analysis tools and privacy-policy analysis methods. Using the VR-SP detector, we conduct a comprehensive empirical study on 500 popular VR apps. We obtain the original apps from the popular Oculus and SideQuest app stores and extract APK files via the Meta Oculus Quest 2 device. We evaluate security vulnerabilities and privacy data leaks of these VR apps by VR app analysis, taint analysis, and privacy-policy analysis. We find that a number of security vulnerabilities and privacy leaks widely exist in VR apps. Moreover, our results also reveal conflicting representations in the privacy policies of these apps and inconsistencies of the actual data collection with the privacy-policy statements of the apps. Based on these findings, we make suggestions for the future development of VR apps.

DOI: 10.1145/3597503.3639082


Fairness Improvement with Multiple Protected Attributes: How Far Are We?

作者: Chen, Zhenpeng and Zhang, Jie M. and Sarro, Federica and Harman, Mark
关键词: fairness improvement, machine learning, protected attributes, intersectional fairness

Abstract

Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on precision and recall when handling multiple protected attributes is about five times and eight times that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.
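
As a concrete illustration of why single-attribute analysis can hide multi-attribute unfairness, the toy sketch below computes a statistical parity difference per protected attribute and for their intersection on hypothetical predictions. The data, attribute names, and metric choice are assumptions of this sketch, not the paper's setup:

```python
# Hypothetical predictions: each record has protected attributes and a
# binary favorable-outcome prediction from some ML model.
records = [
    {"sex": "F", "race": "A", "pred": 1}, {"sex": "F", "race": "A", "pred": 1},
    {"sex": "F", "race": "B", "pred": 0}, {"sex": "F", "race": "B", "pred": 0},
    {"sex": "M", "race": "A", "pred": 1}, {"sex": "M", "race": "A", "pred": 1},
    {"sex": "M", "race": "B", "pred": 1}, {"sex": "M", "race": "B", "pred": 0},
]

def favorable_rate(rows):
    return sum(r["pred"] for r in rows) / len(rows) if rows else 0.0

def parity_difference(rows, group_fn):
    """Max difference in favorable-outcome rate across the groups induced
    by group_fn (0 = perfectly fair under statistical parity)."""
    groups = {}
    for r in rows:
        groups.setdefault(group_fn(r), []).append(r)
    rates = [favorable_rate(g) for g in groups.values()]
    return max(rates) - min(rates)

print("sex only:      ", parity_difference(records, lambda r: r["sex"]))
print("race only:     ", parity_difference(records, lambda r: r["race"]))
print("intersectional:", parity_difference(records, lambda r: (r["sex"], r["race"])))
# The intersectional gap can be much larger than either single-attribute gap.
```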

DOI: 10.1145/3597503.3639083


An Empirical Study of Data Disruption by Ransomware Attacks

作者: Hou, Yiwei and Guo, Lihua and Zhou, Chijin and Xu, Yiwen and Yin, Zijing and Li, Shanshan and Sun, Chengnian and Jiang, Yu
关键词: ransomware, data disruption, runtime behaviors, mitigation strategies, empirical study

Abstract

The threat of ransomware to the software ecosystem has become increasingly alarming in recent years, raising a demand for large-scale and comprehensive ransomware analysis to help develop more effective countermeasures against unknown attacks. In this paper, we first collect a real-world dataset MarauderMap, consisting of 7,796 active ransomware samples, and analyze their behaviors of disrupting data in victim systems. All samples are executed in isolated testbeds to collect all perspectives of six categories of runtime behaviors, such as API calls, I/O accesses, and network traffic. The total logs volume is up to 1.98 TiB. By assessing collected behaviors, we present six critical findings throughout ransomware attacks’ data reconnaissance, data tampering, and data exfiltration phases. Based on our findings, we propose three corresponding mitigation strategies to detect ransomware during each phase. Experimental results show that they can enhance the capability of state-of-the-art anti-ransomware tools. We report a preliminary result of a 41%-69% increase in detection rate with no additional false positives, showing that our insights are helpful.

DOI: 10.1145/3597503.3639090


Identifying Affected Libraries and Their Ecosystems for Open Source Software Vulnerabilities

作者: Wu, Susheng and Song, Wenyan and Huang, Kaifeng and Chen, Bihuan and Peng, Xin
关键词: open source software, vulnerability quality, affected libraries

Abstract

Software composition analysis (SCA) tools have been widely adopted to identify vulnerable libraries used in software applications. Such SCA tools depend on a vulnerability database to know the affected libraries of each vulnerability. However, it is labor-intensive and error-prone for a security team to manually maintain the vulnerability database. While several approaches adopt extreme multi-label learning to predict affected libraries for vulnerabilities, they are practically ineffective due to the limited library labels and the unawareness of ecosystems. To address these problems, we first conduct an empirical study to assess the quality of two fields, i.e., affected libraries and their ecosystems, for four vulnerability databases. Our study reveals notable inconsistency and inaccuracy in these two fields. Then, we propose Holmes to identify affected libraries and their ecosystems for vulnerabilities via a learning-to-rank technique. The key idea of Holmes is to gather various evidence about affected libraries and their ecosystems from multiple sources, and learn to rank a pool of libraries based on their relevance to the evidence. Our extensive experiments have shown the effectiveness, efficiency, and usefulness of Holmes.

DOI: 10.1145/3597503.3639582


Understanding Transaction Bugs in Database Systems

作者: Cui, Ziyu and Dou, Wensheng and Gao, Yu and Wang, Dong and Song, Jiansen and Zheng, Yingying and Wang, Tao and Yang, Rui and Xu, Kang and Hu, Yixin and Wei, Jun and Huang, Tao
关键词: database system, transaction bug, empirical study

Abstract

Transactions are used to guarantee data consistency and integrity in Database Management Systems (DBMSs), and have become an indispensable component of DBMSs. However, faulty designs and implementations of DBMSs' transaction processing mechanisms can introduce transaction bugs and lead to severe consequences, e.g., incorrect database states and DBMS crashes. An in-depth understanding of real-world transaction bugs can significantly inform effective techniques for combating transaction bugs in DBMSs. In this paper, we conduct the first comprehensive study of 140 transaction bugs collected from six widely-used DBMSs, i.e., MySQL, PostgreSQL, SQLite, MariaDB, CockroachDB, and TiDB. We investigate these bugs in terms of their manifestations, root causes, impacts, and fixes. Our study reveals many interesting findings and provides useful guidance for transaction bug detection, testing, and verification.

DOI: 10.1145/3597503.3639207


When Contracts Meets Crypto: Exploring Developers’ Struggles with Ethereum Cryptographic APIs

作者: Zhang, Jiashuo and Chen, Jiachi and Wan, Zhiyuan and Chen, Ting and Gao, Jianbo and Chen, Zhong
关键词: ethereum, smart contracts, empirical study, cryptography, API usability

Abstract

To empower smart contracts with the promising capabilities of cryptography, Ethereum officially introduced a set of cryptographic APIs that facilitate basic cryptographic operations within smart contracts, such as elliptic curve operations. However, since developers are not necessarily cryptography experts, requiring them to directly interact with these basic APIs has caused real-world security issues and potential usability challenges. To guide future research and solutions to these challenges, we conduct the first empirical study on Ethereum cryptographic practices. Through the analysis of 91,484,856 Ethereum transactions, 500 crypto-related contracts, and 483 StackExchange posts, we provide the first in-depth look at cryptographic tasks developers need to accomplish and identify five categories of obstacles they encounter. Furthermore, we conduct an online survey with 78 smart contract practitioners to explore their perspectives on these obstacles and elicit the underlying reasons. We find that more than half of practitioners face more challenges in cryptographic tasks compared to general business logic in smart contracts. Their feedback highlights the gap between low-level cryptographic APIs and high-level tasks they need to accomplish, emphasizing the need for improved cryptographic APIs, task-based templates, and effective assistance tools. Based on these findings, we provide practical implications for further improvements and outline future research directions.

DOI: 10.1145/3597503.3639131


Curiosity-Driven Testing for Sequential Decision-Making Process

作者: He, Junda and Yang, Zhou and Shi, Jieke and Yang, Chengran and Kim, Kisub and Xu, Bowen and Zhou, Xin and Lo, David
关键词: fuzz testing, sequential decision making, deep learning

Abstract

Sequential decision-making processes (SDPs) are fundamental to complex real-world challenges, such as autonomous driving, robotic control, and traffic management. While recent advances in Deep Learning (DL) have led to mature solutions for solving these complex problems, SDPs remain vulnerable to learning unsafe behaviors, posing significant risks in safety-critical applications. However, developing a testing framework for SDPs that can identify a diverse set of crash-triggering scenarios remains an open challenge. To address this, we propose CureFuzz, a novel curiosity-driven black-box fuzz testing approach for SDPs. CureFuzz introduces a curiosity mechanism that allows a fuzzer to effectively explore novel and diverse scenarios, leading to improved detection of crash-triggering scenarios. Additionally, we introduce a multi-objective seed selection technique to balance the exploration of novel scenarios and the generation of crash-triggering scenarios, thereby optimizing the fuzzing process. We evaluate CureFuzz on various SDPs, and experimental results demonstrate that CureFuzz outperforms the state-of-the-art method by a substantial margin in the total number of faults and distinct types of crash-triggering scenarios. We also demonstrate that the crash-triggering scenarios found by CureFuzz can be used to repair SDPs, highlighting CureFuzz as a valuable tool for testing SDPs and optimizing their performance.
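
To make the idea of curiosity-driven seed selection concrete, here is a toy Python sketch, not CureFuzz itself: novelty is approximated by count-based state visitation and the multi-objective balance is reduced to a crude score, both simplifying assumptions of mine.

```python
import random
from collections import defaultdict

class CuriosityFuzzer:
    """Toy curiosity-driven fuzzer: seeds whose executions reach rarely visited
    abstract states score higher (novelty), and crashing seeds get a large bonus."""

    def __init__(self):
        self.visit_counts = defaultdict(int)   # abstract state -> times observed
        self.seed_pool = []                    # list of (seed, score)

    def novelty(self, trace):
        # Count-based curiosity: rare states contribute more to the score.
        if not trace:
            return 0.0
        return sum(1.0 / (1 + self.visit_counts[s]) for s in trace) / len(trace)

    def add_seed(self, seed, trace, crashed):
        # Crude stand-in for a multi-objective balance between novelty and crashes.
        score = 0.01 + self.novelty(trace) + (10.0 if crashed else 0.0)
        for state in trace:
            self.visit_counts[state] += 1
        self.seed_pool.append((seed, score))

    def select_seed(self):
        # Sampling proportional to score keeps exploring while revisiting crashers.
        seeds, scores = zip(*self.seed_pool)
        return random.choices(seeds, weights=scores, k=1)[0]
```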

DOI: 10.1145/3597503.3639149


GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis

作者: Sun, Yuqiang and Wu, Daoyuan and Xue, Yue and Liu, Han and Wang, Haijun and Xu, Zhengzi and Xie, Xiaofei and Liu, Yang
关键词: No keywords

Abstract

Smart contracts are prone to various vulnerabilities, leading to substantial financial losses over time. Current analysis tools mainly target vulnerabilities with fixed control- or data-flow patterns, such as re-entrancy and integer overflow. However, a recent study on Web3 security bugs revealed that about 80% of these bugs cannot be audited by existing tools due to the lack of domain-specific property description and checking. Given recent advances in Large Language Models (LLMs), it is worth exploring how Generative Pre-training Transformer (GPT) could aid in detecting logic vulnerabilities. In this paper, we propose GPTScan, the first tool combining GPT with static analysis for smart contract logic vulnerability detection. Instead of relying solely on GPT to identify vulnerabilities, which can lead to high false positives and is limited by GPT's pre-trained knowledge, we utilize GPT as a versatile code understanding tool. By breaking down each logic vulnerability type into scenarios and properties, GPTScan matches candidate vulnerabilities with GPT. To enhance accuracy, GPTScan further instructs GPT to intelligently recognize key variables and statements, which are then validated by static confirmation. Evaluation on diverse datasets with around 400 contract projects and 3K Solidity files shows that GPTScan achieves high precision (over 90%) for token contracts and acceptable precision (57.14%) for large projects like Web3Bugs. It effectively detects ground-truth logic vulnerabilities with a recall of over 70%, including 9 new vulnerabilities missed by human auditors. GPTScan is fast and cost-effective, taking an average of 14.39 seconds and 0.01 USD to scan one thousand lines of Solidity code. Moreover, static confirmation helps GPTScan reduce two-thirds of false positives.
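
The scenario-and-property matching plus static confirmation pipeline can be sketched roughly as below; the prompt wording, the `llm` callable, and the `statically_confirmed` stub are placeholders of mine, not GPTScan's actual prompts or analysis.

```python
from typing import Callable

# Hypothetical scenario/property pair for one logic-vulnerability type.
SCENARIO = "The function updates a user's share of a pool based on a price."
PROPERTY = "The price used in the update can be manipulated within the same transaction."

def match_candidate(llm: Callable[[str], str], function_source: str) -> bool:
    """Ask the LLM whether the function matches the scenario/property."""
    prompt = (
        "You are auditing a Solidity function.\n"
        f"Scenario: {SCENARIO}\nProperty: {PROPERTY}\n"
        "Does the function match the scenario and property? Answer 'yes' or 'no' "
        f"and name the variables holding the price and the share.\n\nCode:\n{function_source}"
    )
    return llm(prompt).strip().lower().startswith("yes")

def statically_confirmed(function_source: str, key_variable: str) -> bool:
    # Placeholder for real data-flow checking (e.g., is the price read from a
    # spot reserve rather than a time-weighted oracle?).
    return key_variable in function_source

def scan(llm: Callable[[str], str], function_source: str, key_variable: str = "price") -> bool:
    # Report only candidates on which the LLM and the static check agree,
    # which is how this style of pipeline keeps false positives down.
    return match_candidate(llm, function_source) and statically_confirmed(function_source, key_variable)
```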

DOI: 10.1145/3597503.3639117


PS3: Precise Patch Presence Test based on Semantic Symbolic Signature

作者: Zhan, Qi and Hu, Xing and Li, Zhiyang and Xia, Xin and Lo, David and Li, Shanping
关键词: patch presence test, binary analysis, software security

Abstract

During software development, vulnerabilities pose a significant threat to users, and patches are the most effective way to combat them. In a large-scale software system, testing the presence of a security patch in every affected binary is crucial to ensure system security. Identifying whether a binary has been patched for a known vulnerability is challenging, as there may only be small differences between patched and vulnerable versions. Existing approaches mainly focus on detecting patches that are compiled with the same compiler options. However, it is common for developers to compile programs with very different compiler options in different situations, which causes inaccuracy for existing methods. In this paper, we propose a new approach named PS3, which refers to precise patch presence test based on semantic-level symbolic signatures. PS3 exploits symbolic emulation to extract signatures that are stable under different compiler options. PS3 can then precisely test the presence of a patch by comparing the signatures of the reference and the target at the semantic level. To evaluate the effectiveness of our approach, we constructed a dataset consisting of 3,631 (CVE, binary) pairs covering 62 recent CVEs in four C/C++ projects. The experimental results show that PS3 achieves scores of 0.82, 0.97, and 0.89 in terms of precision, recall, and F1 score, respectively. PS3 outperforms the state-of-the-art baselines, improving the F1 score by 33%, and remains stable across different compiler options.

DOI: 10.1145/3597503.3639134


PrettySmart: Detecting Permission Re-delegation Vulnerability for Token Behaviors in Smart Contracts

作者: Zhong, Zhijie and Zheng, Zibin and Dai, Hong-Ning and Xue, Qing and Chen, Junjia and Nan, Yuhong
关键词: smart contract, permission control, vulnerability detection

Abstract

As an essential component of Ethereum and other blockchains, token assets are interacted with by diverse smart contracts. Effective permission policies of smart contracts must prevent token assets from being manipulated by unauthorized adversaries. Recent efforts have studied the accessibility of privileged functions or state variables to unauthorized users. However, little attention has been paid to how publicly accessible functions of smart contracts can be manipulated by adversaries to steal users' digital assets. This attack is mainly caused by the permission re-delegation (PRD) vulnerability. In this work, we propose PrettySmart, a bytecode-level Permission re-delegation vulnerability detector for Smart contracts. We begin with an empirical study on 0.43 million open-source smart contracts, revealing that five types of widely-used permission constraints dominate 98% of the studied contracts. Accordingly, we propose a mechanism to infer these permission constraints, as well as an algorithm to identify constraints that can be bypassed by unauthorized adversaries. Based on the identified permission constraints, we then detect whether adversaries can manipulate the privileged token management functionalities of smart contracts. The experimental results on real-world datasets demonstrate the effectiveness of PrettySmart, which achieves the highest precision score and detects 118 new PRD vulnerabilities.

DOI: 10.1145/3597503.3639140


Combining Structured Static Code Information and Dynamic Symbolic Traces for Software Vulnerability Prediction

作者: Wang, Huanting and Tang, Zhanyong and Tan, Shin Hwei and Wang, Jie and Liu, Yuzhe and Fang, Hejun and Xia, Chunwei and Wang, Zheng
关键词: software vulnerability detection, deep learning, symbolic execution

Abstract

Deep learning (DL) has emerged as a viable means for identifying software bugs and vulnerabilities. The success of DL relies on having a suitable representation of the problem domain. However, existing DL-based solutions for learning program representations have limitations: they either cannot capture deep, precise program semantics or suffer from poor scalability. We present Concoction, the first DL system to learn program representations by combining static source code information and dynamic program execution traces. Concoction employs unsupervised active learning techniques to determine a subset of important paths for which to collect dynamic symbolic execution traces. By implementing a focused symbolic execution solution, Concoction brings together the benefits of static and dynamic code features while reducing the expensive symbolic execution overhead. We integrate Concoction with fuzzing techniques to detect function-level code vulnerabilities in C programs from 20 open-source projects. In 200 hours of automated concurrent test runs, Concoction successfully uncovered vulnerabilities in all tested projects, identifying 54 unique vulnerabilities and yielding 37 new, unique CVE IDs. Concoction also significantly outperforms 16 prior methods by providing higher accuracy and lower false positive rates.

DOI: 10.1145/3597503.3639212


SCVHunter: Smart Contract Vulnerability Detection Based on Heterogeneous Graph Attention Network

作者: Luo, Feng and Luo, Ruijie and Chen, Ting and Qiao, Ao and He, Zheyuan and Song, Shuwei and Jiang, Yu and Li, Sixing
关键词: blockchain, smart contract, vulnerability detection

Abstract

Smart contracts are integral to blockchain's growth, but their vulnerabilities pose a significant threat. Traditional vulnerability detection methods rely heavily on expert-defined complex rules that are labor-intensive and difficult to adapt to the explosive expansion of smart contracts. Recent studies of neural network-based vulnerability detection also leave room for improvement. Therefore, we propose SCVHunter, an extensible framework for smart contract vulnerability detection. Specifically, SCVHunter designs a heterogeneous semantic graph construction phase based on intermediate representations and a vulnerability detection phase based on a heterogeneous graph attention network for smart contracts. In particular, SCVHunter allows users to freely point out more important nodes in the graph, leveraging expert knowledge in a simpler way to aid the automatic capture of more vulnerability-related information. We tested SCVHunter on reentrancy, block info dependency, nested call, and transaction state dependency vulnerabilities. Results show remarkable performance, with accuracies of 93.72%, 91.07%, 85.41%, and 87.37% for these vulnerabilities, surpassing previous methods.

DOI: 10.1145/3597503.3639213


Safeguarding DeFi Smart Contracts against Oracle Deviations

作者: Deng, Xun and Beillahi, Sidi Mohamed and Minwalla, Cyrus and Du, Han and Veneris, Andreas and Long, Fan
关键词: blockchain, decentralized finance, smart contracts, oracle deviation, static program analysis, code summary, parameter optimization

Abstract

This paper presents OVer, a framework designed to automatically analyze the behavior of decentralized finance (DeFi) protocols when subjected to a “skewed” oracle input. OVer first performs symbolic analysis on the given contract and constructs a model of constraints. The framework then leverages an SMT solver to identify parameters that allow its secure operation. Furthermore, guard statements can be generated for smart contracts that use the oracle values, effectively preventing oracle manipulation attacks. Empirical results show that OVer can successfully analyze all 10 collected benchmarks, which encompass a diverse range of DeFi protocols. Additionally, this paper illustrates that the current parameters utilized in the majority of benchmarks are inadequate to ensure safety when confronted with significant oracle deviations. It also shows that existing ad-hoc control mechanisms, such as introducing delays, are often insufficient or even detrimental for protecting DeFi protocols against oracle deviations in the real world.
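
The kind of query OVer delegates to an SMT solver can be illustrated with a deliberately tiny z3 model (my own toy lending example, not OVer's constraint generation): is a collateral factor still safe when the oracle price deviates by a bounded fraction?

```python
from z3 import Real, Solver, And, sat

def safe_under_deviation(collateral_factor: float, dev: float) -> bool:
    """Toy model: can a borrower ever owe more than the true collateral value
    when the protocol trusts a price that deviates by up to `dev`?"""
    true_price, reported = Real("true_price"), Real("reported")
    collateral, debt = Real("collateral"), Real("debt")

    s = Solver()
    s.add(true_price > 0, collateral > 0, debt > 0)
    # Bounded oracle deviation: reported price within (1 +/- dev) of the true price.
    s.add(And(reported >= true_price * (1 - dev), reported <= true_price * (1 + dev)))
    # The protocol lets debt reach collateral_factor of the *reported* collateral value.
    s.add(debt <= collateral_factor * collateral * reported)
    # Unsafe state: debt exceeds the *true* collateral value (bad debt).
    s.add(debt > collateral * true_price)
    return s.check() != sat  # no counterexample found => safe in this toy model

print(safe_under_deviation(0.80, 0.10))  # True:  0.80 * 1.1 <= 1
print(safe_under_deviation(0.95, 0.10))  # False: 0.95 * 1.1 >  1
```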

DOI: 10.1145/3597503.3639225


MalwareTotal: Multi-Faceted and Sequence-Aware Bypass Tactics against Static Malware Detection

作者: He, Shuai and Fu, Cai and Hu, Hong and Chen, Jiahe and Lv, Jianqiang and Jiang, Shuai
关键词: anti-malware software robustness, black-box attacks, binary manipulation

Abstract

Recent methods have demonstrated that machine learning (ML) based static malware detection models are vulnerable to adversarial attacks. However, the generated malware often fails to generalize to production-level anti-malware software (AMS), which usually combines multiple detection methods. This calls for universal solutions to the problem of malware variant generation. In this work, we demonstrate how the proposed method, MalwareTotal, allows malware variants to continue to abound in ML-based, signature-based, and hybrid anti-malware software. Given a malicious binary, we develop sequential bypass tactics that enable malicious behavior to be concealed within multi-faceted manipulations. Through 12 experiments on real-world malware, we demonstrate that an attacker can consistently bypass detection (98.67% and 100% attack success rates against the ML-based methods EMBER and MalConv, respectively; 95.33%, 92.63%, and 98.52% attack success rates against the production-level anti-malware software ClamAV, AMS A, and AMS B, respectively) without modifying the malware functionality. We further demonstrate that our approach outperforms state-of-the-art adversarial malware generation techniques in both attack success rate and query consumption (the number of queries to the target model). Moreover, the samples generated by our method have demonstrated transferability to the real-world integrated malware detector VirusTotal. In addition, we show that common mitigations such as adversarial training on known attacks cannot effectively defend against the proposed attack. Finally, we investigate the value of the generated adversarial examples as a means of hardening victim models through an adversarial training procedure, and demonstrate that the accuracy of the retrained model against generated adversarial examples increases by 88.51 percentage points.

DOI: 10.1145/3597503.3639141


Semantic-Enhanced Static Vulnerability Detection in Baseband Firmware

作者: Liu, Yiming and Zhang, Cen and Li, Feng and Li, Yeting and Zhou, Jianhua and Wang, Jian and Zhan, Lanlan and Liu, Yang and Huo, Wei
关键词: cellular baseband, static taint analysis, vulnerabilities

Abstract

Cellular networks are the infrastructure of mobile communication. Baseband firmware, which implements the cellular network stack, has critical security impact when it contains vulnerabilities. To handle the inherent complexity of cellular communication, cellular protocols are usually implemented as message-centric systems, consisting of a common message processing phase and a message-specific handling phase. Though the latter accounts for most of the code (99.67%) and exposed vulnerabilities (74%), it is rather under-studied: existing detectors either cannot sufficiently analyze it or focus on the former phase. To fill this gap, we propose BVFinder, a novel semantic-enhanced static vulnerability detector focusing on vulnerabilities in the message-specific handling phase. Generally, it identifies a vulnerability by checking whether a predefined sensitive memory operation is tainted by any attacker-controllable input. Specifically, to achieve high automation and precision, it makes two key improvements: semantic-based taint source identification and enhanced taint propagation. The former employs semantic search techniques to identify registers and memory offsets that carry attacker-controllable inputs, by matching the inputs to their corresponding message and data types using textual features and addressing patterns in the assembly. The latter ensures effective taint propagation by employing additional indirect-call resolution algorithms. The evaluation shows that BVFinder outperforms state-of-the-art detectors, detecting three to four times as many vulnerabilities in the dataset. To date, BVFinder has found four zero-day vulnerabilities, with four CVEs assigned and a 12,410 USD bounty awarded. These vulnerabilities can potentially cause remote code execution on phones using the Samsung Shannon baseband, affecting hundreds of millions of end devices.

DOI: 10.1145/3597503.3639158


CSChecker: Revisiting GDPR and CCPA Compliance of Cookie Banners on the Web

作者: Zhang, Mingxue and Meng, Wei and Zhou, You and Ren, Kui
关键词: privacy regulation, compliance analysis, GDPR, CCPA

Abstract

Privacy regulations like GDPR and CCPA have greatly affected online advertising and tracking strategies. To comply with the regulations, websites need to display consent management UIs (i.e., cookie banners) implemented under the corresponding technical frameworks, allowing users to specify consent regarding their personal data processing. Although prior works have investigated cookie banner compliance problems under GDPR, the technical specification has significantly changed, and the compliance status under the latest framework remains unclear. A systematic study of CCPA banner compliance is also lacking. More importantly, most work has focused on detecting regulation violations, whereas little is known about the possible culprits and causes. In this paper, we develop CSChecker, a browser-based tool that monitors and records consent strings on websites. We use CSChecker to analyze GDPR and CCPA cookie banners, and reveal previously unknown compliance problems under both frameworks. We also discover and analyze possible culprits behind the violations, e.g., consent management providers that return wrong consent data. The comparison of the two frameworks inspires several suggestions about the design of cookie banners, the implementation of opt-out mechanisms, and the enforcement of user consent choices.

DOI: 10.1145/3597503.3639159


Raisin: Identifying Rare Sensitive Functions for Bug Detection

作者: Huang, Jianjun and Nie, Jianglei and Gong, Yuanjun and You, Wei and Liang, Bin and Bian, Pan
关键词: rare sensitive function, bug detection, analogical reasoning, embedding

Abstract

Knowledge about bug-prone functions (i.e., sensitive functions) is important for detecting bugs. Some automated techniques have been proposed to identify sensitive functions in large software systems, based on machine learning or natural language processing. However, the existing statistics-based techniques are not directly applicable to a special kind of sensitive function, i.e., rare sensitive functions, which have very few invocations even in large systems. Unfortunately, the rare ones can also introduce bugs, so how to effectively identify such functions is a problem that deserves attention. This study is the first to explore the identification of rare sensitive functions. We propose a context-based analogical reasoning technique to automatically infer rare sensitive functions. A 1+context scheme is devised, in which a function and its context are embedded into a pair of vectors, enabling pair-wise analogical reasoning. Considering that the rarity of the functions may lead to low-quality embedding vectors, we propose a weighted subword embedding method that highlights the semantics of the key subwords to facilitate effective embedding. In addition, frequent sensitive functions are utilized to filter out reasoning candidates. We implement a prototype called Raisin and apply it to identify rare sensitive functions and detect bugs in large open-source code bases. We successfully discover thousands of previously unknown rare sensitive functions and detect 21 bugs confirmed by the developers. Some of the rare sensitive functions cause bugs even with a single invocation in the kernel, demonstrating that identifying them is necessary to enhance software reliability.
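
A toy sketch of the pair-wise analogical-reasoning idea follows; the embedding function is a random placeholder of mine, whereas Raisin uses weighted subword embeddings of real identifiers and contexts.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def placeholder_embed(text: str) -> np.ndarray:
    # Deterministic random vector per string; a real system would use learned
    # (e.g., weighted subword) embeddings instead.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(16)

def analogy_score(known_fn, known_ctx, cand_fn, cand_ctx, embed=placeholder_embed) -> float:
    """If a known sensitive function relates to its calling context the way the
    candidate relates to its own context, the candidate is likely sensitive too."""
    known_pair = np.concatenate([embed(known_fn), embed(known_ctx)])
    cand_pair = np.concatenate([embed(cand_fn), embed(cand_ctx)])
    return cosine(known_pair, cand_pair)

score = analogy_score("kfree", "called on an error-cleanup path",
                      "obscure_release", "called on an error-cleanup path")
print(f"analogy score: {score:.2f}")
```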

DOI: 10.1145/3597503.3639165


REDriver: Runtime Enforcement for Autonomous Vehicles

作者: Sun, Yang and Poskitt, Christopher M. and Zhang, Xiaodong and Sun, Jun
关键词: No keywords

Abstract

Autonomous driving systems (ADSs) integrate sensing, perception, drive control, and several other critical tasks in autonomous vehicles, motivating research into techniques for assessing their safety. While there are several approaches for testing and analysing them in high-fidelity simulators, ADSs may still encounter additional critical scenarios beyond those covered once they are deployed on real roads. An additional level of confidence can be established by monitoring and enforcing critical properties when the ADS is running. Existing work, however, is only able to monitor simple safety properties (e.g., avoidance of collisions) and is limited to blunt enforcement mechanisms such as hitting the emergency brakes. In this work, we propose REDriver, a general and modular approach to runtime enforcement, in which users can specify a broad range of properties (e.g., national traffic laws) in a specification language based on signal temporal logic (STL). REDriver monitors the planned trajectory of the ADS based on a quantitative semantics of STL, and uses a gradient-driven algorithm to repair the trajectory when a violation of the specification is likely. We implemented REDriver for two versions of Apollo (i.e., a popular ADS), and subjected it to a benchmark of violations of Chinese traffic laws. The results show that REDriver significantly improves Apollo’s conformance to the specification with minimal overhead.
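
To illustrate the quantitative STL semantics that underpins this style of runtime enforcement, the sketch below computes the robustness of a simple "always keep distance above a threshold" property over a planned trajectory and nudges the worst point until the property holds; it is a minimal stand-in of mine, not REDriver's specification language or repair algorithm.

```python
def robustness_always_greater(signal, threshold):
    """Quantitative semantics of always(signal > threshold): the worst-case margin.
    A negative value means the property is (predicted to be) violated."""
    return min(v - threshold for v in signal)

def repair(trajectory, threshold, step=0.1, max_iters=100):
    """Nudge the worst point until robustness is non-negative. A real enforcer
    would adjust the planner's output in its own parameter space, guided by gradients."""
    traj = list(trajectory)
    for _ in range(max_iters):
        if robustness_always_greater(traj, threshold) >= 0:
            break
        worst = min(range(len(traj)), key=lambda i: traj[i])
        traj[worst] += step  # move the violating point away from the obstacle
    return traj

planned_distances = [3.0, 2.1, 1.4, 0.8, 1.6]  # hypothetical distances to an obstacle (m)
print(robustness_always_greater(planned_distances, 1.0))  # -0.2: violation predicted
print(repair(planned_distances, 1.0))
```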

DOI: 10.1145/3597503.3639151


Scalable Relational Analysis via Relational Bound Propagation

作者: Stevens, Clay and Bagheri, Hamid
关键词: formal methods, bounded model checking, bound tightening

Abstract

Bounded formal analysis techniques (such as bounded model checking) are incredibly powerful tools for today's software engineers. However, such techniques often suffer from scalability challenges when applied to large-scale, real-world systems. It can be very difficult to ensure the bounds are set properly, which can have a profound impact on the performance and scalability of any bounded formal analysis. In this paper, we propose a novel approach, relational bound propagation, which leverages the semantics of the underlying relational logic formula encoded by the specification to automatically tighten the bounds for any relational specification. Our approach applies two sets of semantic rules to propagate the bounds on the relations via the abstract syntax tree of the formula, first upward to higher-level expressions on those relations and then downward from those higher-level expressions to the relations. Thus, relational bound propagation can reduce the number of variables examined by the analysis and decrease the cost of performing the analysis. This paper presents formal definitions of these rules, all of which have been rigorously proven. We realize our approach in an accompanying tool, Propter, and present experimental results using Propter that test the efficacy of relational bound propagation in decreasing the cost of relational bounded model checking. Our results demonstrate that relational bound propagation reduces the number of primary variables in 63.58% of tested specifications by an average of 30.68% (N=519) and decreases the analysis time for the subject specifications by an average of 49.30%. For large-scale, real-world specifications, Propter was able to reduce total analysis time by an average of 68.14% (N=25) while introducing comparatively little overhead (6.14% of baseline analysis time).

DOI: 10.1145/3597503.3639171


Translation Validation for JIT Compiler in the V8 JavaScript Engine

作者: Kwon, Seungwan and Kwon, Jaeseong and Kang, Wooseok and Lee, Juneyoung and Heo, Kihong
关键词: translation validation, Javascript engine, JIT compiler, IR, semantics, fuzzing

Abstract

We present TurboTV, a translation validator for the JavaScript (JS) just-in-time (JIT) compiler of V8. While JS engines have become a crucial part of various software systems, their emerging adoption of JIT compilation makes it increasingly challenging to ensure their correctness. We tackle this problem with SMT-based translation validation (TV), which checks whether a specific compilation is semantically correct. We formally define the semantics of the IR of TurboFan (the JIT compiler of V8) as an SMT encoding. For efficient validation, we design a staged strategy for JS JIT compilers, which allows us to decompose the whole correctness check into simpler ones. Furthermore, we utilize fuzzing to achieve practical TV: we generate a large number of JS functions using a fuzzer to trigger various optimization passes of TurboFan and validate their compilation using TurboTV. Lastly, we demonstrate that TurboTV can also be used for cross-language TV, showing that it can validate the translation chain from LLVM IR to TurboFan IR in collaboration with an off-the-shelf TV tool for LLVM. We evaluated TurboTV on various sets of JS and LLVM programs. TurboTV effectively validated a large number of TurboFan compilations with a low false positive rate and discovered a new miscompilation in LLVM.
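
The core of SMT-based translation validation can be shown on a single, deliberately simple optimization using z3; validating real TurboFan IR is of course far more involved.

```python
from z3 import BitVec, Solver, sat

x = BitVec("x", 32)
original  = x * 8      # expression before optimization
optimized = x << 3     # what a JIT's strength reduction might emit

s = Solver()
s.add(original != optimized)  # search for an input on which the two disagree
if s.check() == sat:
    print("miscompilation witness:", s.model())
else:
    print("translation validated for all 32-bit inputs")
```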

DOI: 10.1145/3597503.3639189


Verifying Declarative Smart Contracts

作者: Chen, Haoxian and Lu, Lan and Massey, Brendan and Wang, Yuepeng and Loo, Boon Thau
关键词: No keywords

Abstract

Smart contracts manage a large number of digital assets nowadays. Bugs in these contracts have led to significant financial loss. Verifying the correctness of smart contracts is, therefore, an important task. This paper presents an automated safety verification tool, DCV, that targets declarative smart contracts written in DeCon, a logic-based domain-specific language for smart contract implementation and specification. DCV proves safety properties by mathematical induction and can automatically infer inductive invariants using heuristic patterns, without annotations from the developer. Our evaluation on 23 benchmark contracts shows that DCV is effective in verifying smart contracts adapted from public repositories, and can verify contracts not supported by other tools. Furthermore, DCV significantly outperforms baseline tools in verification time.

DOI: 10.1145/3597503.3639203


ChatGPT Incorrectness Detection in Software Reviews

作者: Tanzil, Minaoar Hossain and Khan, Junaed Younus and Uddin, Gias
关键词: large language model, chatGPT, hallucination, testing

Abstract

We conducted a survey of 135 software engineering (SE) practitioners to understand how they use Generative AI-based chatbots like ChatGPT for SE tasks. We find that they want to use ChatGPT for SE tasks like software library selection but often worry about the truthfulness of ChatGPT responses. We developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test and detect incorrectness in ChatGPT responses. CID is based on iterative prompting of ChatGPT, asking it contextually similar but textually divergent questions (using an approach that utilizes metamorphic relationships in texts). The underlying principle in CID is that, for a given question, a response that differs from other responses (across multiple incarnations of the question) is likely an incorrect response. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74–0.75.
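
The divergence principle behind this style of checking can be sketched as follows; the paraphrase set and the LLM callable are stand-ins, and real CID compares responses by textual similarity rather than exact string equality.

```python
from collections import Counter
from typing import Callable, List

def detect_incorrectness(llm: Callable[[str], str], paraphrases: List[str]) -> dict:
    """Ask several reformulations of the same question and measure agreement."""
    answers = [llm(p).strip().lower() for p in paraphrases]
    majority, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return {
        "majority_answer": majority,
        "agreement": agreement,
        "suspect": agreement < 0.5,  # low agreement suggests an unreliable answer
    }

# Hypothetical usage for a library-selection question (call_chatgpt is assumed):
questions = [
    "Which Python library should I use for fast JSON parsing?",
    "For parsing JSON quickly in Python, what library do you recommend?",
    "Name the best-performing JSON parsing package for Python.",
]
# result = detect_incorrectness(call_chatgpt, questions)
```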

DOI: 10.1145/3597503.3639194


ChatGPT-Resistant Screening Instrument for Identifying Non-Programmers

作者: Serafini, Raphael and Otto, Clemens and Horstmann, Stefan Albert and Naiakshina, Alena
关键词: chatgpt, programmer screening, developer study, study protection

Abstract

To ensure the validity of software engineering and IT security studies with professional programmers, it is essential to identify participants without programming skills. Existing screening questions are efficient, robust to cheating, and effectively differentiate programmers from non-programmers. However, the release of ChatGPT raises concerns about their continued effectiveness in identifying non-programmers. In a simulated attack, we showed that ChatGPT can easily solve existing screening questions. Therefore, we designed new ChatGPT-resistant screening questions using visual concepts and code comprehension tasks. We evaluated 28 screening questions in an online study with 121 participants involving programmers and non-programmers. Our results showed that questions using visualizations of well-known programming concepts performed best in differentiating between programmers and non-programmers. Participants prompted to use ChatGPT struggled to solve the tasks; they considered ChatGPT ineffective and changed their strategy after a few screening questions. In total, we present six ChatGPT-resistant screening questions that effectively identify non-programmers. We provide recommendations on setting up a ChatGPT-resistant screening instrument that takes less than three minutes to complete and excludes 99.47% of non-programmers while including 94.83% of programmers.

DOI: 10.1145/3597503.3639075


Uncovering the Causes of Emotions in Software Developer Communication Using Zero-shot LLMs

作者: Imran, Mia Mohammad and Chatterjee, Preetha and Damevski, Kostadin
关键词: No keywords

Abstract

Understanding and identifying the causes behind developers' emotions (e.g., Frustration caused by ‘delays in merging pull requests’) can be crucial towards finding solutions to problems and fostering collaboration in open-source communities. Effectively identifying such information in the high volume of communication across the different project channels, such as chats, emails, and issue comments, requires automated recognition of emotions and their causes. To enable this automation, large-scale software engineering-specific datasets that can be used to train accurate machine learning models are required. However, such datasets are expensive to create given the variety and informal nature of software projects' communication channels. In this paper, we explore zero-shot LLMs that are pre-trained on massive datasets but not fine-tuned specifically for the task of detecting emotion causes in software engineering: ChatGPT, GPT-4, and flan-alpaca. Our evaluation indicates that these recently available models can identify emotion categories when given detailed emotions, although they perform worse than the top-rated models. For emotion cause identification, our results indicate that zero-shot LLMs are effective at recognizing the correct emotion cause, with a BLEU-2 score of 0.598. To highlight the potential use of these techniques, we conduct a case study of the causes of Frustration in the last year of development of a popular open-source project, revealing several interesting insights.
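
For reference, the BLEU-2 metric used above can be computed with nltk; a minimal example on made-up cause spans:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "delays in merging pull requests".split()
candidate = "delays in merging the pull request".split()

score = sentence_bleu(
    [reference], candidate,
    weights=(0.5, 0.5),  # BLEU-2: unigram and bigram precision only
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-2 = {score:.3f}")
```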

DOI: 10.1145/3597503.3639223


Development in times of hype: How freelancers explore Generative AI?

作者: Dolata, Mateusz and Lange, Norbert and Schwabe, Gerhard
关键词: generative AI, AI-based systems, challenges, freelancers, hype, SE for generative AI, SE4GenAI, hype-induced SE, hype-SE, fashion, product, paradigm, novelty, qualitative research

Abstract

The rise of generative AI has led many companies to hire freelancers to harness its potential. However, this technology presents unique challenges to developers who have not previously engaged with it. Freelancers may find these challenges daunting due to the absence of organizational support and their reliance on positive client feedback. In a study involving 52 freelance developers, we identified multiple challenges associated with developing solutions based on generative AI. Freelancers often struggle with aspects they perceive as unique to generative AI such as unpredictability of its output, the occurrence of hallucinations, and the inconsistent effort required due to trial-and-error prompting cycles. Further, the limitations of specific frameworks, such as token limits and long response times, add to the complexity. Hype-related issues, such as inflated client expectations and a rapidly evolving technological ecosystem, further exacerbate the difficulties. To address these issues, we propose Software Engineering for Generative AI (SE4GenAI) and Hype-Induced Software Engineering (HypeSE) as areas where the software engineering community can provide effective guidance. This support is essential for freelancers working with generative AI and other emerging technologies.

DOI: 10.1145/3597503.3639111


How Far Are We? The Triumphs and Trials of Generative AI in Learning Software Engineering

作者: Choudhuri, Rudrajit and Liu, Dylan and Steinmacher, Igor and Gerosa, Marco and Sarma, Anita
关键词: empirical study, software engineering, generative AI, ChatGPT

Abstract

Conversational Generative AI (convo-genAI) is revolutionizing Software Engineering (SE) as engineers and academics embrace this technology in their work. However, there is a gap in understanding the current potential and pitfalls of this technology, specifically in supporting students in SE tasks. In this work, we evaluate through a between-subjects study (N=22) the effectiveness of ChatGPT, a convo-genAI platform, in assisting students in SE tasks. Our study did not find statistical differences in participants’ productivity or self-efficacy when using ChatGPT as compared to traditional resources, but we found significantly increased frustration levels. Our study also revealed 5 distinct faults arising from violations of Human-AI interaction guidelines, which led to 7 different (negative) consequences on participants.

DOI: 10.1145/3597503.3639201


Breaking the Flow: A Study of Interruptions During Software Engineering Activities

作者: Ma, Yimeng and Huang, Yu and Leach, Kevin
关键词: No keywords

Abstract

In software engineering, interruptions during tasks can have significant implications for productivity and well-being. While previous studies have investigated the effect of interruptions on productivity, to the best of our knowledge, no prior work has distinguished the effects of different types of interruptions on software engineering activities. This study explores the impact of interruptions on software engineering tasks, analyzing in-person and on-screen interruptions with different levels of urgency and dominance. Participants completed code writing, code comprehension, and code review tasks while experiencing interruptions. We collected physiological data using the Empatica EmbracePlus wristband and self-perceived evaluations through surveys. Results show that on-screen interruptions with high requester dominance significantly increase the time spent on code comprehension. In-person and on-screen interruptions combined significantly affect the time spent on code review, with varied effects based on specific interruption combinations. Both interruption type and task significantly influence stress measures, with code comprehension and review tasks associated with lower stress measures compared to code writing. Interestingly, in-person interruptions have a positive impact on physiological measures, indicating reduced stress. However, participants' self-perceived stress scores do not align with the physiological data, with higher stress reported during in-person interruptions despite lower physiological stress measures. These findings shed light on and emphasize the importance of considering the complex relationship between interruptions, objective measures, and subjective experiences in software development. We discuss insights that we hope can inform interruption management and implications for stress among software engineers. (ChatGPT was used to revise and shorten paragraphs in this manuscript.)

DOI: 10.1145/3597503.3639079


Supporting Web-Based API Searches in the IDE Using Signatures

作者: Bradley, Nick C. and Fritz, Thomas and Holmes, Reid
关键词: API signatures, code search, controlled experiment

Abstract

Developers frequently use the web to locate API examples that help them solve their programming tasks. While sites like Stack Overflow (SO) contain API examples embedded within their textual descriptions, developers cannot access this API knowledge directly. Instead they need to search for and browse results to select relevant SO posts and then read through individual posts to figure out which answers contain information about the APIs that are relevant to their task. This paper introduces an approach, called Scout, that automatically analyzes search results to extract API signature information. These signatures are used to group and rank examples and allow for a unique API-based presentation that reduces the amount of information the developer needs to consider when looking for API information on the web. This succinct representation enables Scout to be integrated fully within an IDE panel so that developers can search and view API examples without losing context on their development task. Scout also uses this integration to automatically augment queries with contextual information that tailors the developer’s queries, and ranks the results according to the developer’s needs. In an experiment with 40 developers, we found that Scout reduces the number of queries developers need to perform by 19% and allows them to solve almost half their tasks directly from the API-based representation, reducing the number of complete SO posts viewed by approximately 64%.

DOI: 10.1145/3597503.3639089


Property-Based Testing in Practice

作者: Goldstein, Harrison and Cutler, Joseph W. and Dickstein, Daniel and Pierce, Benjamin C. and Head, Andrew
关键词: No keywords

Abstract

Property-based testing (PBT) is a testing methodology where users write executable formal specifications of software components and an automated harness checks these specifications against many automatically generated inputs. From its roots in the QuickCheck library in Haskell, PBT has made significant inroads in mainstream languages and industrial practice at companies such as Amazon, Volvo, and Stripe. As PBT extends its reach, it is important to understand how developers are using it in practice, where they see its strengths and weaknesses, and what innovations are needed to make it more effective. We address these questions using data from 30 in-depth interviews with experienced users of PBT at Jane Street, a financial technology company making heavy and sophisticated use of PBT. These interviews provide empirical evidence that PBT's main strengths lie in testing complex code and in increasing confidence beyond what is available through conventional testing methodologies, and, moreover, that most uses fall into a relatively small number of high-leverage idioms. Its main weaknesses, on the other hand, lie in the relative complexity of writing properties and random data generators and in the difficulty of evaluating their effectiveness. From these observations, we identify a number of potentially high-impact areas for future exploration, including performance improvements, differential testing, additional high-leverage testing scenarios, better techniques for generating random input data, test-case reduction, and methods for evaluating the effectiveness of tests.
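
For readers unfamiliar with PBT, here is a canonical round-trip property written with the Python hypothesis library (a generic illustration, not an example from the interviews):

```python
from hypothesis import given, strategies as st

def run_length_encode(xs):
    out = []
    for x in xs:
        if out and out[-1][0] == x:
            out[-1] = (x, out[-1][1] + 1)
        else:
            out.append((x, 1))
    return out

def run_length_decode(pairs):
    return [x for x, n in pairs for _ in range(n)]

@given(st.lists(st.integers()))
def test_roundtrip(xs):
    # Property: decoding an encoding yields the original list, for *any* input.
    assert run_length_decode(run_length_encode(xs)) == xs
```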

DOI: 10.1145/3597503.3639581


Causal Relationships and Programming Outcomes: A Transcranial Magnetic Stimulation Experiment

作者: Ahmad, Hammad and Endres, Madeline and Newman, Kaia and Santiesteban, Priscila and Shedden, Emma and Weimer, Westley
关键词: neurostimulation, spatial ability, code reading, data structures

Abstract

Understanding the relationship between cognition and programming outcomes is important: it can inform interventions that help novices become experts faster. Neuroimaging techniques can measure brain activity, but prior studies of programming report only correlations. We present the first causal neurological investigation of the cognition of programming using Transcranial Magnetic Stimulation (TMS). TMS permits temporary and noninvasive disruption of specific brain regions. By disrupting brain regions and then measuring programming outcomes, we discover whether a true causal relationship exists. To the best of our knowledge, this is the first use of TMS to study software engineering. Where multiple previous studies reported correlations, we find no direct causal relationships between the implicated brain regions and programming. Using a protocol that follows TMS best practices and mitigates biases, we replicate psychology findings that TMS affects spatial tasks. We then find that neurostimulation can affect programming outcomes: multi-level regression analysis shows that TMS stimulation of different regions significantly accounts for 2.2% of the variance in task completion time. Our results have implications for interventions in education and training as well as research into causal cognitive relationships.

DOI: 10.1145/3597503.3639096


GenderMag Improves Discoverability in the Field, Especially for Women: A Multi-Year Case Study of Suggest Edit, a Code Review Feature

作者: Murphy-Hill, Emerson and Elizondo, Alberto and Murillo, Ambar and Harbach, Marian and Vasilescu, Bogdan and Carlson, Delphine and Dessloch, Florian
关键词: software features, feature discovery, UX design, gender, inclusion

Abstract

Prior research shows that the GenderMag method can help identify and address usability barriers that are more likely to affect women software users than men. However, the evidence for the effectiveness of GenderMag is limited to small lab studies. In this case study, by combining self-reported gender data from tens of thousands of users of an internal code review tool with software log data gathered over a five-year period, we quantitatively show that GenderMag helped a team at Google (a) correctly identify discoverability as a usability barrier more likely to affect women than men, and (b) increase discoverability by 2.4x while also achieving gender parity. That is, compared to men using the original code review tool, women and men using the system redesigned with GenderMag were both 2.4x more likely to discover the “Suggest Edit” feature at any given time. Thus, this paper contributes the first large-scale evidence of the effectiveness of GenderMag in the field.

DOI: 10.1145/3597503.3639097


Unraveling the Drivers of Sense of Belonging in Software Delivery Teams: Insights from a Large-Scale Survey

作者: Trinkenreich, Bianca and Gerosa, Marco Aurelio and Steinmacher, Igor
关键词: diversity and inclusion, software engineering, sense of belonging, psychological safety, work appreciation

Abstract

Feeling part of a group is a basic human need that significantly influences an individual’s behavior, long-term engagement, and job satisfaction. A strong sense of belonging holds particular importance within software delivery teams, which grapple with challenges related to well-being and employee retention. However, the specific factors closely associated with the sense of belonging in the context of software delivery teams remain largely unknown. Without a clear understanding of these factors, organizations’ efforts to promote a sense of belonging and diversity and inclusion more broadly may prove ineffective. Based on existing literature, we identified key factors potentially relevant to the sense of belonging in software delivery teams, such as work appreciation and psychological safety, and investigated the interrelation among these factors. We surveyed members of software delivery teams (n=10,781) of a major software delivery organization and used Partial Least Squares-Structural Equation Modeling (PLS-SEM) to evaluate a theoretical model to understand the factors that might contribute to a sense of belonging to the team. We also conducted a multi-group analysis to evaluate how the associations change based on individuals’ leadership involvement and an importance-performance map analysis to find the most critical indicators of belongingness. Our findings indicate a positive association between psychological safety and work appreciation and belonging to the team. Women feel less belonging than men, especially those not in leadership positions. Authoritativeness is negatively associated with belonging, and tenure is positively associated with belonging regardless of the role. Through this research, we seek to provide insights into the sense of belonging to the team and foster a more inclusive and cohesive work environment.

DOI: 10.1145/3597503.3639119


“My GitHub Sponsors profile is live!” Investigating the Impact of Twitter/X Mentions on GitHub Sponsors

作者: Fan, Youmei and Xiao, Tao and Hata, Hideaki and Treude, Christoph and Matsumoto, Kenichi
关键词: open-source software, sponsorship, social media

Abstract

GitHub Sponsors was launched in 2019, enabling donations to open-source software developers to provide financial support, as per GitHub’s slogan: “Invest in the projects you depend on”. However, a 2022 study on GitHub Sponsors found that only two-fifths of developers who were seeking sponsorship received a donation. The study found that, other than internal actions (such as offering perks to sponsors), developers had advertised their GitHub Sponsors profiles on social media, such as Twitter (also known as X). Therefore, in this work, we investigate the impact of tweets that contain links to GitHub Sponsors profiles on sponsorship, as well as their reception on Twitter/X. We further characterize these tweets to understand their context and find that (1) such tweets have the impact of increasing the number of sponsors acquired, (2) compared to other donation platforms such as Open Collective and Patreon, GitHub Sponsors has significantly fewer interactions but is more visible on Twitter/X, and (3) developers tend to contribute more to open-source software during the week of posting such tweets. Our findings are the first step toward investigating the impact of social media on obtaining funding to sustain open-source software.

DOI: 10.1145/3597503.3639127


A Theory of Scientific Programming Efficacy

作者: Pertseva, Elizaveta and Chang, Melinda and Zaman, Ulia and Coblenz, Michael
关键词: scientific programming, qualitative study of programmers

Abstract

Scientists write and maintain software artifacts to construct, validate, and apply scientific theories. Despite the centrality of software in their work, their practices differ significantly from those of professional software engineers. We sought to understand what makes scientists effective at their work and how software engineering practices and tools can be adapted to fit their workflows. We interviewed 25 scientists and support staff to understand their work. Then, we constructed a theory that relates six factors that contribute to their efficacy in creating and maintaining software systems. We present the theory in the form of a cycle of scientific computing efficacy and identify opportunities for improvement based on the six contributing factors.

DOI: 10.1145/3597503.3639139


High Expectations: An Observational Study of Programming and Cannabis Intoxication

作者: He, Wenxin and Parikh, Manasvi and Weimer, Westley and Endres, Madeline
关键词: programming preferences, cannabis, controlled user study, drug policy, preregistered hypotheses

Abstract

Anecdotal evidence of cannabis use by professional programmers abounds. Recent studies have found that some professionals regularly use cannabis while programming, even for work-related tasks. However, accounts of the impacts of cannabis on programming vary widely and are often contradictory. For example, some programmers claim that it impairs their ability to generate correct solutions, while others claim it enhances creativity and focus. There remains a need for an empirical understanding of the true impacts of cannabis on programming. This paper presents the first controlled observational study of cannabis’s effects on programming ability. Based on a within-subjects design with over 70 participants, we find that, at ecologically valid dosages, cannabis significantly impairs programming performance. Programs implemented while high contain more bugs and take longer to write (p < 0.05) — a small to medium effect (0.22 ≤ d ≤ 0.44). We also did not find any evidence that high programmers generate more divergent solutions. However, programmers can accurately assess differences in their programming performance (r = 0.59), even when under the influence of cannabis. We hope that this research will facilitate evidence-based policies and help developers make informed decisions regarding cannabis use while programming.

DOI: 10.1145/3597503.3639145


Mining Pull Requests to Detect Process Anomalies in Open Source Software Development

作者: Liu, Bohan and Zhang, He and Ma, Weigang and Kuang, Hongyu and Yang, Yi and Xu, Jinwei and Gao, Shan and Gao, Jian
关键词: open source software development, process mining, pull request

Abstract

Trustworthy Open Source Software (OSS) development processes are the basis for securing the long-term trustworthiness of software projects and products. With the aim of investigating the trustworthiness of the Pull Request (PR) process, the common model of collaborative development in the OSS community, we exploit process mining to identify and analyze normal and anomalous patterns of PR processes, and propose an approach to identifying anomalies from both control-flow and semantic aspects and then analyzing and synthesizing the root causes of the identified anomalies. We analyze 17,531 PRs of 18 OSS projects on GitHub, extracting 26 root causes of control-flow anomalies and 19 root causes of semantic anomalies. We find that PRs rarely contain both semantic anomalies and control-flow anomalies, and that projects' internal custom rules may be the key causes of the identified anomalous PRs. We further discover and analyze the patterns of normal PR processes. We find that PRs in the non-fork model (42%) are far more likely than those in the fork model (5%) to bypass the review process, indicating a higher potential risk. In addition, we analyzed nine poisoned projects, whose PR practices were indeed worse. Given the complex and diverse PR processes in the OSS community, the proposed approach can help identify and understand not only anomalous PRs but also normal PRs, offering early risk indications of suspicious incidents (such as poisoning) in the OSS supply chain.

DOI: 10.1145/3597503.3639196


How Are Paid and Volunteer Open Source Developers Different? A Study of the Rust Project

作者: Zhang, Yuxia and Qin, Mian and Stol, Klaas-Jan and Zhou, Minghui and Liu, Hui
关键词: open source software, paid developers, volunteers, sustainability

Abstract

It is now commonplace for organizations to pay developers to work on specific open source software (OSS) projects to pursue their business goals. Such paid developers work alongside voluntary contributors, but given the different motivations of these two groups of developers, conflict may arise, posing a threat to a project's sustainability. This paper presents an empirical study of paid developers and volunteers in Rust, a popular open source programming language project. Rust is a particularly interesting case given considerable concerns about corporate participation. We compare volunteers and paid developers through contribution characteristics and long-term participation, and solicit volunteers' perceptions of paid developers. We find that core paid developers tend to contribute more frequently; commits contributed by one-time paid developers are bigger; peripheral paid developers implement more features; and being paid plays a positive role in becoming a long-term contributor. We also find that volunteers do have some prejudices against paid developers. This study suggests that the dichotomous view of paid vs. volunteer developers is too simplistic and that further subgroups can be identified. Companies should become more sensitive to how they engage with OSS communities, in the ways suggested by this study.

DOI: 10.1145/3597503.3639197


Barriers for Students During Code Change Comprehension

作者: Middleton, Justin and Ore, John-Paul and Stolee, Kathryn T
关键词: No keywords

Abstract

Modern code review (MCR) is a key practice for many software engineering organizations, so undergraduate software engineering courses often teach some form of it to prepare students. However, research on MCR describes how many of its professional implementations can fail, to say nothing of how these barriers manifest in students' particular contexts. To uncover the barriers students face when evaluating code changes during review, we combine interviews and surveys with an observational study. In a junior-level software engineering course, we first interviewed 29 undergraduate students about their experiences with code review. Next, we performed an observational study that presented 44 students from the same course with eight code change comprehension activities. These activities provided students with pull requests of potential refactorings in a familiar code base, collecting feedback on accuracy and challenges, and were followed by a reflection survey. Building on these methods, we combine (1) a qualitative analysis of the interview transcripts, activity comments, and reflection survey with (2) a quantitative assessment of their performance in identifying behavioral changes, in order to outline the barriers that students face during code change comprehension. Our results reveal that students struggle with a number of facets of a program: the context for review, the review tools, the code itself, and the implications of the code changes. These findings, along with our result that student developers tend to overestimate behavioral similarity during code comparison, have implications for future support to help student developers have smoother code review experiences. We motivate a need for several interventions, including sentiment analysis on pull request comments to flag toxicity, scaffolding for code comprehension while reviewing large changes, and behavioral diffing to contrast the evolution of syntax and semantics.

DOI: 10.1145/3597503.3639227


“I tend to view ads almost like a pestilence”: On the Accessibility Implications of Mobile Ads for Blind Users

作者: He, Ziyao and Huq, Syed Fatiul and Malek, Sam
关键词: Android, accessibility, advertisement, screen reader

Abstract

Ads are integral to the contemporary Android ecosystem, generating revenue for free-to-use applications. However, injected as third-party content, ads are displayed on native apps in pervasive ways that hinder easy navigation. Ads can prove more disruptive for blind users, who rely on screen readers for navigating an app. While the literature has looked into either the accessibility of web advertisements or the privacy and security implications of mobile ads, a research gap on the accessibility of mobile ads remains, which we aim to bridge. We conduct an empirical study analyzing 500 ad screens in Android apps to categorize and examine the accessibility issues therein. Additionally, we conduct 15 qualitative user interviews with blind Android users to better understand the impact of those accessibility issues, how users interact with ads, and what their preferences are. Based on our findings, we discuss the design and practical strategies for developing accessible ads.

DOI: 10.1145/3597503.3639228


DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection

作者: Remil, Youcef and Bendimerad, Anes and Mathonat, Romain and Raïssi, Chedy
关键词: crash deduplication, stack trace similarity, approximate nearest neighbors, locality-sensitive hashing, siamese neural networks

Abstract

Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists of grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems are needed to quickly answer the question “What are the most similar bugs to a new one?”, that is, to efficiently find near-duplicates. It is thus natural to consider nearest neighbors search to tackle this problem, and especially the well-known locality-sensitive hashing (LSH) to deal with large datasets, due to its sublinear performance and theoretical guarantees on the similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. It is indeed not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently, we study in this paper how to leverage LSH for this task. To be able to consider the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function, that perfectly approximates the locality-sensitivity property even for Jaccard and Cosine metrics for which exact LSH solutions exist. We support this claim with a series of experiments on an original dataset, which we make available.
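
The paper’s contribution is the learned hash function; as a point of reference, the classical locality-sensitive hashing it builds on can be illustrated with a plain MinHash scheme over stack frames for the Jaccard metric. The sketch below is not DeepLSH itself: the frame sets, signature length, and banding parameters are arbitrary illustrative choices.

```python
import hashlib
from collections import defaultdict

def minhash_signature(frames, num_hashes=64):
    """Classical MinHash signature over a set of stack frames (Jaccard proxy)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{frame}".encode()).hexdigest(), 16)
            for frame in frames
        ))
    return signature

def lsh_buckets(signatures, bands=16, rows=4):
    """Reports whose signatures collide in at least one band become candidates."""
    buckets = defaultdict(set)
    for report_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(report_id)
    return buckets

# Toy crash reports represented by hypothetical stack frames.
reports = {
    "crash-1": {"libfoo::parse", "libfoo::read_header", "libfoo::close", "io::read", "main"},
    "crash-2": {"libfoo::parse", "libfoo::read_header", "libfoo::close", "io::read", "start"},
    "crash-3": {"gui::render", "gui::paint", "main"},
}
signatures = {rid: minhash_signature(frames) for rid, frames in reports.items()}
candidates = {frozenset(ids) for ids in lsh_buckets(signatures).values() if len(ids) > 1}
print(candidates)  # groups of reports that collided in some band (near-duplicate candidates)
```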

DOI: 10.1145/3597503.3639146


DivLog: Log Parsing with Prompt Enhanced In-Context Learning

作者: Xu, Junjielong and Yang, Ruichun and Huo, Yintong and Zhang, Chengyu and He, Pinjia
关键词: log parsing, large language model, in-context learning

Abstract

Log parsing, which involves log template extraction from semi-structured logs to produce structured logs, is the first and the most critical step in automated log analysis. However, current log parsers suffer from limited effectiveness for two reasons. First, traditional data-driven log parsers solely rely on heuristics or handcrafted features designed by domain experts, which may not consistently perform well on logs from diverse systems. Second, existing supervised log parsers require model tuning, which is often limited to fixed training samples and causes sub-optimal performance across the entire log source. To address these limitations, we propose DivLog, an effective log parsing framework based on the in-context learning (ICL) ability of large language models (LLMs). Specifically, before log parsing, DivLog samples a small amount of offline logs as candidates by maximizing their diversity. Then, during log parsing, DivLog selects five appropriate labeled candidates as examples for each target log and constructs them into a prompt. By mining the semantics of examples in the prompt, DivLog generates a target log template in a training-free manner. In addition, we design a straightforward yet effective prompt format to extract the output and enhance the quality of the generated log templates. We conducted experiments on 16 widely-used public datasets. The results show that DivLog achieves (1) 98.1% Parsing Accuracy, (2) 92.1% Precision Template Accuracy, and (3) 92.9% Recall Template Accuracy on average, exhibiting state-of-the-art performance.
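
A minimal sketch of the two ingredients the abstract describes, diversity-oriented candidate sampling and a five-shot prompt, is shown below. The greedy farthest-point selection, the character-level dissimilarity measure, and the prompt wording are all assumptions for illustration; the resulting prompt would still need to be sent to an LLM, which is not shown.

```python
from difflib import SequenceMatcher

def dissimilarity(a: str, b: str) -> float:
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def sample_diverse(logs, k):
    """Greedy farthest-point sampling: keep logs that are maximally unlike each other."""
    chosen = [logs[0]]
    while len(chosen) < k:
        nxt = max((log for log in logs if log not in chosen),
                  key=lambda log: min(dissimilarity(log, c) for c in chosen))
        chosen.append(nxt)
    return chosen

def build_prompt(examples, target_log):
    """Labeled (log, template) pairs followed by the target log."""
    parts = ["Extract the log template; replace variable fields with <*>.\n"]
    for log, template in examples:
        parts.append(f"Log: {log}\nTemplate: {template}\n")
    parts.append(f"Log: {target_log}\nTemplate:")
    return "\n".join(parts)

labeled = [  # hypothetical labeled candidates
    ("Connected to 10.0.0.2:8080", "Connected to <*>:<*>"),
    ("User alice logged in", "User <*> logged in"),
    ("Disk /dev/sda1 is 93% full", "Disk <*> is <*> full"),
    ("Connection to 10.0.0.7 timed out", "Connection to <*> timed out"),
    ("User bob logged out", "User <*> logged out"),
]
diverse = set(sample_diverse([log for log, _ in labeled], k=3))
examples = [(log, tpl) for log, tpl in labeled if log in diverse]
print(build_prompt(examples, "Connected to 192.168.1.5:443"))
```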

DOI: 10.1145/3597503.3639155


Where is it? Tracing the Vulnerability-relevant Files from Vulnerability Reports

作者: Sun, Jiamou and Chen, Jieshan and Xing, Zhenchang and Lu, Qinghua and Xu, Xiwei and Zhu, Liming
关键词: vulnerability-relevant file, security, software supply chain

Abstract

With the widespread use of open-source software, supply-chain-based vulnerability attacks, including SolarWinds and Log4Shell, have posed significant risks to software security. Currently, people rely on vulnerability advisory databases or commercial software bill of materials (SBOM) to defend against potential risks. Unfortunately, these datasets do not provide finer-grained file-level vulnerability information, compromising their effectiveness. Previous works have not adequately addressed this issue, and mainstream vulnerability detection methods have their drawbacks that hinder resolving this gap. Driven by the real needs, we propose a framework that can trace the vulnerability-relevant file for each disclosed vulnerability. Our approach uses NVD descriptions with metadata as the inputs, and employs a series of strategies combining an LLM, a search engine, a heuristic-based text matching method, and a deep learning classifier to recommend the most likely vulnerability-relevant file, effectively enhancing the completeness of existing NVD data. Our experiments confirm the efficiency of the proposed framework, with CodeBERT achieving 0.92 AUC and 0.85 MAP, and our user study shows that our approach can effectively help with vulnerability-relevant file detection. To the best of our knowledge, our work is the first one focusing on tracing vulnerability-relevant files, laying the groundwork for building finer-grained vulnerability-aware software bills of materials.

DOI: 10.1145/3597503.3639202


Demystifying and Detecting Misuses of Deep Learning APIs

作者: Wei, Moshi and Harzevili, Nima Shiri and Huang, Yuekai and Yang, Jinqiu and Wang, Junjie and Wang, Song
关键词: API misuse, deep learning APIs, empirical study, detection

Abstract

Deep Learning (DL) libraries have significantly impacted various domains in computer science over the last decade. However, developers often face challenges when using the DL APIs, as the development paradigm of DL applications differs greatly from traditional software development. Existing studies on API misuse mainly focus on traditional software, leaving a gap in understanding API misuse within DL APIs. To address this gap, we present the first comprehensive study of DL API misuse in TensorFlow and PyTorch. Specifically, we first collected a dataset of 4,224 commits from the top 200 most-starred projects using these two libraries and manually identified 891 API misuses. We then investigated the characteristics of these misuses from three perspectives, i.e., types, root causes, and symptoms. We have also conducted an evaluation to assess the effectiveness of the current state-of-the-art API misuse detector on our 891 confirmed API misuses. Our results confirmed that the state-of-the-art API misuse detector is ineffective in detecting DL API misuses. To address the limitations of existing API misuse detection for DL APIs, we propose LLMAPIDet, which leverages Large Language Models (LLMs) for DL API misuse detection and repair. We build LLMAPIDet by prompt-tuning a chain of ChatGPT prompts on 600 out of the 891 confirmed API misuses and reserve the remaining 291 API misuses as the test set. Our evaluation shows that LLMAPIDet can detect 48 out of the 291 DL API misuses, while none of them can be detected by the existing API misuse detector. We further evaluate LLMAPIDet on the latest versions of 10 GitHub projects. The evaluation shows that LLMAPIDet can identify 119 previously unknown API misuses and successfully fix 46 of them.
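
For readers unfamiliar with what a DL API misuse can look like, the snippet below shows one widely known PyTorch pitfall: forgetting `optimizer.zero_grad()` silently accumulates gradients across steps. This example is illustrative only and is not claimed to be part of the paper’s taxonomy; the model and data are toy placeholders.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 4), torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()   # a classic misuse is omitting this call:
                            # gradients then accumulate across iterations
                            # and the updates drift away from what was intended
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print(float(loss))
```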

DOI: 10.1145/3597503.3639177


Less is More? An Empirical Study on Configuration Issues in Python PyPI Ecosystem

作者: Peng, Yun and Hu, Ruida and Wang, Ruoke and Gao, Cuiyun and Li, Shuqing and Lyu, Michael R.
关键词: No keywords

Abstract

Python is the most popular programming language in the open-source community, largely owing to the extensive support from diverse third-party libraries within the PyPI ecosystem. Nevertheless, the utilization of third-party libraries can potentially lead to conflicts in dependencies, prompting researchers to develop dependency conflict detectors. Moreover, endeavors have been made to automatically infer dependencies. These approaches focus on version-level checks and inference, based on the assumption that configurations of libraries in the PyPI ecosystem are correct. However, our study reveals that this assumption is not universally valid, and relying solely on version-level checks proves inadequate in ensuring compatible run-time environments. In this paper, we conduct an empirical study to comprehensively study the configuration issues in the PyPI ecosystem. Specifically, we propose PyConf, a source-level detector, for detecting potential configuration issues. PyConf employs three distinct checks, targeting the setup, packing, and usage stages of libraries, respectively. To evaluate the effectiveness of the current automatic dependency inference approaches, we build a benchmark called VLibs, comprising library releases that pass all three checks of PyConf. We identify 15 kinds of configuration issues and find that 183,864 library releases suffer from potential configuration issues. Remarkably, 68% of these issues can only be detected via the source-level check. Our experiment results show that the most advanced automatic dependency inference approach, PyEGo, can successfully infer dependencies for only 65% of library releases. The primary failures stem from dependency conflicts and the absence of required libraries in the generated configurations. Based on the empirical results, we derive six findings and draw two implications for open-source developers and future research in automatic dependency inference.
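
A minimal sketch of the kind of source-level check the abstract motivates, comparing the modules a source tree actually imports against its declared dependencies, is given below. The import-name-to-distribution-name mapping is deliberately simplified, `declared` is a hypothetical stand-in for parsed setup metadata, and this is not PyConf’s implementation.

```python
import ast
import pathlib
import sys

def imported_top_level_modules(src_dir):
    """Collect top-level module names imported anywhere in a source tree."""
    modules = set()
    for path in pathlib.Path(src_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules

def undeclared_imports(src_dir, declared):
    """Imports that are neither declared dependencies nor standard-library modules."""
    stdlib = set(sys.stdlib_module_names)  # available since Python 3.10
    return imported_top_level_modules(src_dir) - declared - stdlib

# `declared` would normally come from setup.cfg / pyproject.toml metadata.
declared = {"requests", "numpy"}
print(undeclared_imports("src", declared))
```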

DOI: 10.1145/3597503.3639077


Data-Driven Evidence-Based Syntactic Sugar Design

作者: OBrien, David and Dyer, Robert and Nguyen, Tien and Rajan, Hridesh
关键词: syntactic sugars, data-driven language design, subgraph mining

Abstract

Programming languages are essential tools for developers, and their evolution plays a crucial role in supporting the activities of developers. One instance of programming language evolution is the introduction of syntactic sugars, which are additional syntax elements that provide alternative, more readable code constructs. However, the process of designing and evolving a programming language has traditionally been guided by anecdotal experiences and intuition. Recent advances in tools and methodologies for mining open-source repositories have enabled developers to make data-driven software engineering decisions. In light of this, this paper proposes an approach for motivating data-driven programming language evolution by applying frequent subgraph mining techniques to a large dataset of 166,827,154 open-source Java methods. The dataset is mined by generalizing Java control-flow graphs to capture broad programming language usages and instances of duplication. Frequent subgraphs are then extracted to identify potentially impactful opportunities for new syntactic sugars. Our diverse results demonstrate the benefits of the proposed technique by identifying new syntactic sugars involving a variety of programming constructs that could be implemented in Java, thus simplifying frequent code idioms. This approach can potentially provide valuable insights for Java language designers, and serve as a proof-of-concept for data-driven programming language design and evolution.

DOI: 10.1145/3597503.3639580


Revisiting Android App Categorization

作者: Alecci, Marco and Samhi, Jordan and Bissyande, Tegawende F. and Klein, Jacques
关键词: android security, static analysis, app categorization

Abstract

Numerous tools rely on automatic categorization of Android apps as part of their methodology. However, incorrect categorization can lead to inaccurate outcomes, such as a malware detector wrongly flagging a benign app as malicious. One such example is the SlideIT Free Keyboard app, which has over 500 000 downloads on Google Play. Despite being a “Keyboard” app, it is often wrongly categorized alongside “Language” apps due to the app’s description focusing heavily on language support, resulting in incorrect analysis outcomes, including mislabeling it as a potential malware when it is actually a benign app. Hence, there is a need to improve the categorization of Android apps to benefit all the tools relying on it. In this paper, we present a comprehensive evaluation of existing Android app categorization approaches using our new ground-truth dataset. Our evaluation demonstrates the notable superiority of approaches that utilize app descriptions over those solely relying on data extracted from the APK file, while also leaving space for potential improvement in the former category. Thus, we propose two innovative approaches that effectively outperform existing methods in both the description-based and APK-based methodologies. Finally, by employing our novel description-based approach, we have successfully demonstrated that adopting a higher-performing categorization method can significantly benefit tools reliant on app categorization, leading to an improvement in their overall performance. This highlights the significance of developing advanced and efficient app categorization methodologies for improved results in software engineering tasks.
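
As a rough illustration of description-based categorization, the sketch below trains a TF-IDF plus logistic-regression baseline on toy descriptions; the categories, texts, and model choice are assumptions, and the paper’s approaches are more sophisticated than this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: (app description, ground-truth category).
train = [
    ("Swipe keyboard with word prediction and emoji input", "Keyboard"),
    ("Type faster with gesture typing and customizable key themes", "Keyboard"),
    ("Learn Spanish with daily vocabulary lessons and flashcards", "Language"),
    ("Practice French grammar with interactive listening quizzes", "Language"),
]
texts, labels = zip(*train)

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(texts, labels)

# A keyboard app whose description leans heavily on language support.
description = "Keyboard supporting 80 languages with swipe typing"
print(classifier.predict([description])[0])
```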

DOI: 10.1145/3597503.3639094


Are Your Requests Your True Needs? Checking Excessive Data Collection in VPA App

作者: Xie, Fuman and Yan, Chuan and Meng, Mark Huasong and Teng, Shaoming and Zhang, Yanjun and Bai, Guangdong
关键词: virtual personal assistant, privacy compliance, alexa skills

Abstract

Virtual personal assistant (VPA) services encompass a large number of third-party applications (or apps) to enrich their functionalities. These apps have been well examined to scrutinize their data collection behaviors against their declared privacy policies. Nonetheless, it is often overlooked that most users tend to ignore privacy policies at installation time. Dishonest developers thus can exploit this situation by embedding excessive declarations to cover their data collection behaviors during compliance auditing. In this work, we present Pico, a privacy inconsistency detector, which checks a VPA app’s privacy compliance by analyzing the (in)consistency between the data it requests and the data essential for its functionality. Pico understands the app’s functionality topics from its publicly available textual data, and leverages advanced GPT-based language models to address domain-specific challenges. Based on the counterparts with similar functionality, suspicious data collection can be detected through the lens of anomaly detection. We apply Pico to understand the status quo of data-functionality compliance among all 65,195 skills in the Alexa app store. Our study reveals that 21.7% of the analyzed skills exhibit suspicious data collection, including Top 10 popular Alexa skills that pose threats to 54,116 users. These findings should raise an alert to both developers and users regarding compliance with the purpose limitation principle in data regulations.

DOI: 10.1145/3597503.3639107


MiniMon: Minimizing Android Applications with Intelligent Monitoring-Based Debloating

作者: Liu, Jiakun and Zhang, Zicheng and Hu, Xing and Thung, Ferdian and Maoz, Shahar and Gao, Debin and Toch, Eran and Zhao, Zhipeng and Lo, David
关键词: android, software debloating, log analysis

Abstract

The size of Android applications is getting larger to fulfill the requirements of various users. However, not all the features of an application are needed and desired by a specific user. The unnecessary and non-desired features can increase the attack surface and consume system resources such as storage and memory. To address this issue, we propose a framework, MiniMon, to debloat unnecessary features from an Android app based on the logs of specific users’ interactions with the app. However, rarely used features may not be recorded during the data collection, and users’ preferences may change slightly over time. To address these challenges, we embed several solutions in our framework that can uncover user-desired features by learning and generalizing from the logs of how users interact with an application. MiniMon first collects the application methods that are executed when users interact with it. Then, given the collected executed methods and the call graph of the application, MiniMon applies 10 techniques to generalize from logs. These include three program analysis-based techniques, two graph clustering-based techniques, and five graph embedding-based techniques to identify the additional methods in an app that are similar to the logged executed methods. Finally, MiniMon generates a debloated application by removing methods that are not similar to the executed methods. To evaluate the performance of variants of MiniMon that use different generalization techniques, we create a benchmark for a controlled experiment. The results show that the graph embedding-based generalization technique that considers the information of all nodes in the call graph is the best, and can correctly uncover 75.5% of the unobserved but desired behaviors while still debloating more than half of the app. We also conducted a user study showing that MiniMon’s intelligent generalization boosts the overall user satisfaction rate by 37.6%.
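
The sketch below illustrates the simplest flavor of generalization the abstract mentions: starting from the logged (executed) methods and keeping everything reachable within a small number of hops in the call graph. The call graph, method names, and one-hop expansion are hypothetical; the paper’s graph-clustering and graph-embedding variants are not shown.

```python
import networkx as nx

# Hypothetical call graph of the app: edges point from caller to callee.
call_graph = nx.DiGraph([
    ("MainActivity.onCreate", "PhotoList.load"),
    ("PhotoList.load", "Cache.read"),
    ("PhotoList.load", "Network.fetch"),
    ("SettingsActivity.onCreate", "Prefs.load"),
    ("AdManager.init", "AdSdk.start"),
])

def generalize(executed, graph, hops=1):
    """Keep logged methods plus anything reachable within `hops` calls."""
    keep = set(executed)
    frontier = set(executed)
    for _ in range(hops):
        frontier = {callee
                    for method in frontier if method in graph
                    for callee in graph.successors(method)}
        keep |= frontier
    return keep

executed = {"MainActivity.onCreate", "PhotoList.load"}   # observed in usage logs
keep = generalize(executed, call_graph)
print("keep:", sorted(keep))
print("debloat:", sorted(set(call_graph.nodes) - keep))
```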

DOI: 10.1145/3597503.3639113


Shedding Light on Software Engineering-specific Metaphors and Idioms

作者: Imran, Mia Mohammad and Chatterjee, Preetha and Damevski, Kostadin
关键词: No keywords

Abstract

Use of figurative language, such as metaphors and idioms, is common in our daily-life communications, and it can also be found in Software Engineering (SE) channels, such as comments on GitHub. Automatically interpreting figurative language is a challenging task, even with modern Large Language Models (LLMs), as it often involves subtle nuances. This is particularly true in the SE domain, where figurative language is frequently used to convey technical concepts, often bearing developer affect (e.g., 'spaghetti code'). Surprisingly, there is a lack of studies on how figurative language in SE communications impacts the performance of automatic tools that focus on understanding developer communications, e.g., bug prioritization, incivility detection. Furthermore, it is an open question to what extent state-of-the-art LLMs interpret figurative expressions in domain-specific communication such as software engineering. To address this gap, we study the prevalence and impact of figurative language in SE communication channels. This study contributes to understanding the role of figurative language in SE, the potential of LLMs in interpreting them, and its impact on automated SE communication analysis. Our results demonstrate the effectiveness of fine-tuning LLMs with figurative language in SE and its potential impact on automated tasks that involve affect. We found that, among three state-of-the-art LLMs, the best fine-tuned versions achieve an average improvement of 6.66% on a GitHub emotion classification dataset, 7.07% on a GitHub incivility classification dataset, and 3.71% on a Bugzilla bug report prioritization dataset.

DOI: 10.1145/3597503.3639585


Empirical Study of the Docker Smells Impact on the Image Size

作者: Durieux, Thomas
关键词: No keywords

Abstract

Docker, a widely adopted tool for packaging and deploying applications, leverages Dockerfiles to build images. However, creating an optimal Dockerfile can be challenging, often leading to “Docker smells” or deviations from best practices. This paper presents a study of the impact of 14 Docker smells on the size of Docker images. To assess the size impact of Docker smells, we identified and repaired 16 145 Docker smells from 11 313 open-source Dockerfiles. We observe that the smells result in an average increase of 48.06MB (4.6 %) per smelly image. Depending on the smell type, the size increase can be up to 10 %, and for some specific cases, the smells can represent 89 % of the image size. Interestingly, the most impactful smells are related to package managers, which are commonly encountered and are relatively easy to fix. To collect the perspective of the developers regarding the size impact of the Docker smells, we submitted 34 pull requests that repair the smells and reported their impact on the Docker image to the developers. 26/34 (76.5 %) of the pull requests have been merged and they contribute to a saving of 3.46 GB (16.4 %). The developers’ comments demonstrate a positive interest in addressing those Docker smells even when the pull requests have been rejected.
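
The abstract does not enumerate the 14 smells, so the sketch below checks a generic package-manager-related smell commonly flagged by Dockerfile linters (apt-get install without `--no-install-recommends`, and never cleaning the apt lists); it is meant only to illustrate why such smells inflate image size and is not the paper’s tooling.

```python
import re

def check_apt_smells(dockerfile_text):
    """Flag apt-get usage patterns that typically bloat the resulting image."""
    findings = []
    for lineno, line in enumerate(dockerfile_text.splitlines(), start=1):
        if re.search(r"\bapt-get\s+install\b", line) and "--no-install-recommends" not in line:
            findings.append(f"line {lineno}: apt-get install without --no-install-recommends")
    if "apt-get install" in dockerfile_text and "rm -rf /var/lib/apt/lists" not in dockerfile_text:
        findings.append("apt package lists are never removed (rm -rf /var/lib/apt/lists/*)")
    return findings

dockerfile = """\
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl git
"""
for finding in check_apt_smells(dockerfile):
    print(finding)
```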

DOI: 10.1145/3597503.3639143


MotorEase: Automated Detection of Motor Impairment Accessibility Issues in Mobile App UIs

作者: Krishna Vajjala, Arun and Mansur, S M Hasan and Jose, Justin and Moran, Kevin
关键词: accessibility, mobile apps, screen understanding

Abstract

Recent research has begun to examine the potential of automatically finding and fixing accessibility issues that manifest in software. However, while recent work makes important progress, it has generally been skewed toward identifying issues that affect users with certain disabilities, such as those with visual or hearing impairments. Yet there are other groups of users with different types of disabilities who also need software tooling support to improve their experience. As such, this paper aims to automatically identify accessibility issues that affect users with motor impairments. To move toward this goal, this paper introduces a novel approach, called MotorEase, capable of identifying accessibility issues in mobile app UIs that impact motor-impaired users. Motor-impaired users often have limited ability to interact with touch-based devices, and instead may make use of a switch or other assistive mechanism — hence UIs must be designed to support both limited touch gestures and the use of assistive devices. MotorEase adapts computer vision and text processing techniques to enable a semantic understanding of app UI screens, enabling the detection of violations related to four popular, previously unexplored UI design guidelines that support motor-impaired users, including: (i) visual touch target size, (ii) expanding sections, (iii) persisting elements, and (iv) adjacent icon visual distance. We evaluate MotorEase on a newly derived benchmark, called MotorCheck, that contains 555 manually annotated examples of violations of the above accessibility guidelines, across 1599 screens collected from 70 applications via a mobile app testing tool. Our experiments illustrate that MotorEase is able to identify violations with an average accuracy of ≈90%, and a false positive rate of less than 9%, outperforming baseline techniques.
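
A minimal sketch of the first guideline (visual touch target size) is shown below: clickable elements smaller than a threshold are flagged. The 48x48 dp threshold follows common Android accessibility guidance, and the simplified element records stand in for what MotorEase actually derives from screen understanding.

```python
from dataclasses import dataclass

MIN_DP = 48  # common Android guidance for minimum touch target size

@dataclass
class UiElement:
    element_id: str
    clickable: bool
    width_px: int
    height_px: int

def small_touch_targets(elements, density):
    """Flag clickable elements smaller than MIN_DP x MIN_DP (density = px per dp)."""
    violations = []
    for e in elements:
        w_dp, h_dp = e.width_px / density, e.height_px / density
        if e.clickable and (w_dp < MIN_DP or h_dp < MIN_DP):
            violations.append((e.element_id, round(w_dp), round(h_dp)))
    return violations

screen = [
    UiElement("btn_close", True, 72, 72),     # 24 x 24 dp at 3x density: too small
    UiElement("btn_submit", True, 300, 150),  # 100 x 50 dp: fine
    UiElement("label_title", False, 60, 40),  # not clickable: ignored
]
print(small_touch_targets(screen, density=3.0))
```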

DOI: 10.1145/3597503.3639167


An Exploratory Investigation of Log Anomalies in Unmanned Aerial Vehicles

作者: Wang, Dinghua and Li, Shuqing and Xiao, Guanping and Liu, Yepang and Sui, Yulei and He, Pinjia and Lyu, Michael R.
关键词: UAV anomaly, software bug, crash, code pattern, empirical study

Abstract

Unmanned aerial vehicles (UAVs) are becoming increasingly ubiquitous in our daily lives. However, like many other complex systems, UAVs are susceptible to software bugs that can lead to abnormal system behaviors and undesirable consequences. It is crucial to study such software bug-induced UAV anomalies, which are often manifested in flight logs, to help assure the quality and safety of UAV systems. However, there has been limited research on investigating the code-level patterns of software bug-induced UAV anomalies. This impedes the development of effective tools for diagnosing and localizing bugs within UAV system code. To bridge the research gap and deepen our understanding of UAV anomalies, we carried out an empirical study on this subject. We first collected 178 real-world abnormal logs induced by software bugs in two popular open-source UAV platforms, i.e., PX4 and Ardupilot. We then examined each of these abnormal logs and compiled their common patterns. In particular, we investigated the most severe anomalies that led to UAV crashes, and identified their features. Based on our empirical findings, we further summarized the challenges of localizing bugs in system code by analyzing anomalous UAV flight data, which can offer insights for future research in this field.

DOI: 10.1145/3597503.3639186


ModuleGuard: Understanding and Detecting Module Conflicts in Python Ecosystem

作者: Zhu, Ruofan and Wang, Xingyu and Liu, Chengwei and Xu, Zhengzi and Shen, Wenbo and Chang, Rui and Liu, Yang
关键词: module conflict, pypi ecosystem, dependency graphs, namespace conflict, dependency resolution

Abstract

Python has become one of the most popular programming languages for software development due to its simplicity, readability, and versatility. As the Python ecosystem grows, developers face increasing challenges in avoiding module conflicts, which occur when different packages have the same namespace modules. Unfortunately, existing work has neither investigated module conflicts comprehensively nor provided tools to detect them. Therefore, this paper systematically investigates the module conflict problem and its impact on the Python ecosystem. We propose a novel technique called InstSimulator, which leverages semantics and installation simulation to achieve accurate and efficient module extraction. Based on this, we implement a tool called ModuleGuard to detect module conflicts for the Python ecosystem. For the study, we first collect 97 MC issues, classify the characteristics and causes of these MC issues, summarize three different conflict patterns, and analyze their potential threats. Then, we conducted a large-scale analysis of the whole PyPI ecosystem (4.2 million packages) and GitHub popular projects (3,711 projects) to detect each MC pattern and analyze their potential impact. We discovered that module conflicts still impact numerous TPLs and GitHub projects. This is primarily due to developers’ lack of understanding of the modules within their direct dependencies, not to mention the modules of the transitive dependencies. Our work reveals Python’s shortcomings in handling naming conflicts and provides a tool and guidelines for developers to detect conflicts.
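
The core namespace check can be pictured with the sketch below: if two distributions install the same top-level module, whichever is installed last shadows the other. The package-to-module mapping is a toy stand-in for what InstSimulator extracts by simulating installation.

```python
from collections import defaultdict

# Hypothetical mapping: distribution name -> top-level modules it installs.
package_modules = {
    "demo-utils": {"utils", "demo"},
    "acme-toolkit": {"utils", "acme"},  # also ships a `utils` module
    "numpy": {"numpy"},
}

def module_conflicts(package_modules):
    """Group distributions that install the same top-level module name."""
    owners = defaultdict(set)
    for package, modules in package_modules.items():
        for module in modules:
            owners[module].add(package)
    return {module: sorted(pkgs) for module, pkgs in owners.items() if len(pkgs) > 1}

print(module_conflicts(package_modules))
# installing both packages lets one `utils` silently shadow the other
```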

DOI: 10.1145/3597503.3639221


Empirical Analysis of Vulnerabilities Life Cycle in Golang Ecosystem

作者: Hu, Jinchang and Zhang, Lyuye and Liu, Chengwei and Yang, Sen and Huang, Song and Liu, Yang
关键词: vulnerability life cycle, golang, open-source software

Abstract

Open-source software (OSS) greatly facilitates program development for developers. However, the high number of vulnerabilities in open-source software is a major concern, including in Golang, a relatively new programming language. In contrast to other commonly used OSS package managers, Golang presents a distinctive feature whereby commits are prevalently used as dependency versions prior to their integration into official releases. This attribute can prove advantageous to users, as patch commits can be adopted in a timely manner before the releases. However, Golang employs a decentralized mechanism for managing dependencies, whereby dependencies are upheld and distributed in separate repositories. This approach can result in delays in the dissemination of patches and unresolved vulnerabilities. To tackle the aforementioned concern, a comprehensive investigation was undertaken to examine the life cycle of vulnerabilities in Golang, commencing from their introduction and culminating with their rectification. To this end, a framework was established by gathering data from diverse sources and systematically amalgamating them with an algorithm to compute the lags in vulnerability patching. It turned out that 66.10% of modules in the Golang ecosystem were affected by vulnerabilities. Within the vulnerability life cycle, we found two kinds of lag impeding the propagation of vulnerability fixes. Our analysis of the reasons behind lagged and non-lagged vulnerabilities suggests that timely releasing and indexing of patch versions could significantly enhance ecosystem security.
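
A minimal sketch of how the two lags might be computed from a vulnerability’s life-cycle milestones is shown below; the milestone names and dates are hypothetical and only illustrate the arithmetic, not the paper’s data collection.

```python
from datetime import date

# Hypothetical life-cycle milestones for one Go module vulnerability.
vulnerability = {
    "fix_commit": date(2023, 1, 10),        # patch commit lands upstream
    "patched_release": date(2023, 3, 2),    # first tagged release containing it
    "advisory_indexed": date(2023, 3, 20),  # advisory database lists the patched version
}

releasing_lag = (vulnerability["patched_release"] - vulnerability["fix_commit"]).days
indexing_lag = (vulnerability["advisory_indexed"] - vulnerability["patched_release"]).days
print(f"releasing lag: {releasing_lag} days, indexing lag: {indexing_lag} days")
```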

DOI: 10.1145/3597503.3639230


ReFAIR: Toward a Context-Aware Recommender for Fairness Requirements Engineering

作者: Ferrara, Carmine and Casillo, Francesco and Gravino, Carmine and De Lucia, Andrea and Palomba, Fabio
关键词: software fairness, machine learning, requirements engineering

Abstract

Machine learning (ML) is increasingly being used as a key component of most software systems, yet serious concerns have been raised about the fairness of ML predictions. Researchers have been proposing novel methods to support the development of fair machine learning solutions. Nonetheless, most of them can only be used in late development stages, e.g., during model training, while there is a lack of methods that may provide practitioners with early fairness analytics enabling the treatment of fairness throughout the development lifecycle. This paper proposes ReFair, a novel context-aware requirements engineering framework that classifies sensitive features from User Stories. By exploiting natural language processing and word embedding techniques, our framework first identifies both the use case domain and the machine learning task to be performed in the system being developed; afterward, it recommends the context-specific sensitive features to be considered during implementation. We assess the capabilities of ReFair by evaluating it on a synthetic dataset, built as part of our research, composed of 12,401 User Stories related to 34 application domains. Our findings showcase the high accuracy of ReFair, while also highlighting its current limitations.

DOI: 10.1145/3597503.3639185


Analyzing and Debugging Normative Requirements via Satisfiability Checking

作者: Feng, Nick and Marsso, Lina and Getir Yaman, Sinem and Baatartogtokh, Yesugen and Ayad, Reem and De Mello, Victoria Oldemburgo and Townsend, Beverley and Standen, Isobel and Stefanakos, Ioannis and Imrie, Calum and Rodrigues, Genaina Nunes and Cavalcanti, Ana and Calinescu, Radu and Chechik, Marsha
关键词: No keywords

Abstract

As software systems increasingly interact with humans in application domains such as transportation and healthcare, they raise concerns related to the social, legal, ethical, empathetic, and cultural (SLEEC) norms and values of their stakeholders. Normative non-functional requirements (N-NFRs) are used to capture these concerns by setting SLEEC-relevant boundaries for system behavior. Since N-NFRs need to be specified by multiple stakeholders with widely different, non-technical expertise (ethicists, lawyers, regulators, end users, etc.), N-NFR elicitation is very challenging. To address this difficult task, we introduce N-Check, a novel tool-supported formal approach to N-NFR analysis and debugging. N-Check employs satisfiability checking to identify a broad spectrum of N-NFR well-formedness issues, such as conflicts, redundancy, restrictiveness, and insufficiency, yielding diagnostics that pinpoint their causes in a user-friendly way that enables non-technical stakeholders to understand and fix them. We show the effectiveness and usability of our approach through nine case studies in which teams of ethicists, lawyers, philosophers, psychologists, safety analysts, and engineers used N-Check to analyse and debug 233 N-NFRs, comprising 62 issues, for the software underpinning the operation of systems ranging from assistive-care robots and tree-disease detection drones to manufacturing collaborative robots.
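
To make the satisfiability-checking idea concrete, the sketch below encodes three invented rules with the z3-solver package and shows that they cannot all hold in a particular situation; N-Check’s actual encoding of SLEEC rules and its diagnostics are far richer.

```python
# Requires the z3-solver package: pip install z3-solver
from z3 import And, Bools, Implies, Not, Solver, unsat

user_distressed, notify_caregiver, share_data = Bools(
    "user_distressed notify_caregiver share_data"
)

rules = And(
    Implies(user_distressed, notify_caregiver),  # R1: distress must trigger a caregiver alert
    Implies(notify_caregiver, share_data),       # R2: alerting a caregiver shares personal data
    Not(share_data),                             # R3: personal data must never be shared
)

solver = Solver()
solver.add(rules, user_distressed)  # can every rule hold once the user is distressed?
if solver.check() == unsat:
    print("conflict: the rules cannot all be satisfied in this situation")
else:
    print("consistent, e.g.:", solver.model())
```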

DOI: 10.1145/3597503.3639093


作者: Keim, Jan and Corallo, Sophie and Fuchß, Dominik
关键词: software traceability, software architecture, documentation, transitive links, intermediate artifacts, information retrieval

Abstract

Introduction: Software development involves creating various artifacts at different levels of abstraction, and establishing relationships between them is essential. Traceability link recovery (TLR) automates this process, enhancing software quality by aiding tasks like maintenance and evolution. However, automating TLR is challenging due to semantic gaps resulting from different levels of abstraction. While automated TLR approaches exist for requirements and code, architecture documentation lacks tailored solutions, hindering the preservation of architecture knowledge and design decisions. Methods: This paper presents our approach TransArC for TLR between architecture documentation and code, using component-based architecture models as intermediate artifacts to bridge the semantic gap. We create transitive trace links by combining the existing approach ArDoCo for linking architecture documentation to models with our novel approach ArCoTL for linking architecture models to code. Results: We evaluate our approaches with five open-source projects, comparing our results to baseline approaches. The model-to-code TLR approach achieves an average F1-score of 0.98, while the documentation-to-code TLR approach achieves a promising average F1-score of 0.82, significantly outperforming baselines. Conclusion: Combining two specialized approaches with an intermediate artifact shows promise for bridging the semantic gap. In future research, we will explore further possibilities for such transitive approaches.
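
The transitive idea itself is simple to picture: compose documentation-to-model links with model-to-code links. The sketch below does exactly that on invented link sets; it says nothing about how ArDoCo and ArCoTL recover those links in the first place.

```python
# Hypothetical trace links produced by two specialized recoverers.
doc_to_model = {                 # documentation sentence -> architecture components
    "S1": {"CacheManager"},
    "S2": {"RestController", "AuthService"},
}
model_to_code = {                # architecture component -> code files
    "CacheManager": {"cache/Manager.java"},
    "RestController": {"api/Routes.java", "api/Serializers.java"},
}

def transitive_links(doc_to_model, model_to_code):
    """Compose the two link sets into documentation-to-code trace links."""
    links = {}
    for sentence, components in doc_to_model.items():
        files = set()
        for component in components:
            files |= model_to_code.get(component, set())
        links[sentence] = files
    return links

for sentence, files in transitive_links(doc_to_model, model_to_code).items():
    print(sentence, "->", sorted(files))
```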

DOI: 10.1145/3597503.3639130


作者: Gao, Hui and Kuang, Hongyu and Assunção, Wesley K. G.
关键词: software traceability, information retrieval, transitive links

Abstract

Traceability allows stakeholders to extract and comprehend the trace links among software artifacts introduced across the software life cycle, to provide significant support for software engineering tasks. Despite its proven benefits, software traceability is challenging to recover and maintain manually. Hence, plenty of approaches for automated traceability have been proposed. Most rely on textual similarities among software artifacts, such as those based on Information Retrieval (IR). However, artifacts in different abstraction levels usually have different textual descriptions, which can greatly hinder the performance of IR-based approaches (e.g., a requirement in natural language may have a small textual similarity to a Java class). In this work, we leverage the consensual biterms and transitive relationships (i.e., inner- and outer-transitive links) based on intermediate artifacts to improve IR-based traceability recovery. We first extract and filter biterms from all source, intermediate, and target artifacts. We then use the consensual biterms from the intermediate artifacts to enrich the texts of both source and target artifacts, and finally deduce outer- and inner-transitive links to adjust text similarities between source and target artifacts. We conducted a comprehensive empirical evaluation based on five systems widely used in other literature to show that our approach can outperform four state-of-the-art approaches by over 15% in Average Precision and over 10% in Mean Average Precision on average.

DOI: 10.1145/3597503.3639164


Prism: Decomposing Program Semantics for Code Clone Detection through Compilation

作者: Li, Haoran and Wang, Siqian and Quan, Weihong and Gong, Xiaoli and Su, Huayou and Zhang, Jin
关键词: code clone detection, behavior semantics, CISC and RISC, feature fusion

Abstract

Code clone detection (CCD) is of critical importance in software engineering, and semantic similarity is a key evaluation factor for CCD. The embedding technique, which represents an object using a numerical vector, is utilized to generate code representations, where code snippets with similar semantics (clone pairs) should have similar vectors. However, due to the diversity and flexibility of high-level programming languages, the code representations of clone pairs may be inconsistent. Assembly code provides the program execution trace and can normalize the diversity of high-level languages in terms of program behavior semantics. Revisiting assembly languages, we find that different assembly codes can align with the computational logic and memory access patterns of clone pairs. Therefore, the use of multiple assembly languages can capture the behavior semantics to enhance the understanding of programs. Thus, we propose Prism, a new method for code clone detection that fuses behavior semantics from assembly code of multiple architectures, directly capturing the syntactic and semantic information of multilingual domains. Additionally, we introduce a multi-feature fusion strategy that leverages global information interaction to expand the representation space. This fusion process allows us to capture the complementary information from each feature and leverage the relationships between them to create a more expressive representation of the code. On the OJClone dataset, Prism achieves precision and recall of 0.999 and 0.999, respectively.

DOI: 10.1145/3597503.3639129


Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization

作者: Mastropaolo, Antonio and Ciniselli, Matteo and Di Penta, Massimiliano and Bavota, Gabriele
关键词: code summarization, contrastive learning

Abstract

Several code summarization techniques have been proposed in the literature to automatically document a code snippet or a function. Ideally, software developers should be involved in assessing the quality of the generated summaries. However, in most cases, researchers rely on automatic evaluation metrics such as BLEU, ROUGE, and METEOR. These metrics are all based on the same assumption: The higher the textual similarity between the generated summary and a reference summary written by developers, the higher its quality. However, there are two reasons for which this assumption falls short: (i) reference summaries, e.g., code comments collected by mining software repositories, may be of low quality or even outdated; (ii) generated summaries, while using a different wording than a reference one, could be semantically equivalent to it, thus still being suitable to document the code snippet. In this paper, we perform a thorough empirical investigation on the complementarity of different types of metrics in capturing the quality of a generated summary. Also, we propose to address the limitations of existing metrics by considering a new dimension, capturing the extent to which the generated summary aligns with the semantics of the documented code snippet, independently from the reference summary. To this end, we present a new metric based on contrastive learning to capture said aspect. We empirically show that the inclusion of this novel dimension enables a more effective representation of developers’ evaluations regarding the quality of automatically generated summaries.

DOI: 10.1145/3597503.3639174


Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot

作者: OBrien, David and Biswas, Sumon and Imtiaz, Sayem Mohammad and Abdalkareem, Rabe and Shihab, Emad and Rajan, Hridesh
关键词: technical debt, GitHub copilot, LLM, code generation

Abstract

Code intelligence tools such as GitHub Copilot have begun to bridge the gap between natural language and programming language. A frequent software development task is the management of technical debts, which are suboptimal solutions or unaddressed issues which hinder future software development. Developers have been found to “self-admit” technical debts (SATD) in software artifacts such as source code comments. Thus, is it possible that the information present in these comments can enhance code generative prompts to repay the described SATD? Or, does the inclusion of such comments instead cause code generative tools to reproduce the harmful symptoms of described technical debt? Does the modification of SATD impact this reaction? Despite the heavy maintenance costs caused by technical debt and the recent improvements of code intelligence tools, no prior work has sought to incorporate SATD into prompt engineering. Motivated by this, this paper contributes and analyzes a dataset consisting of 36,381 TODO comments in the latest available revisions of their respective 102,424 repositories, from which we sample and manually generate 1,140 code bodies using GitHub Copilot. Our experiments show that GitHub Copilot can generate code with the symptoms of SATD, both prompted and unprompted. Moreover, we demonstrate the tool’s ability to automatically repay SATD under different circumstances and qualitatively investigate the characteristics of successful and unsuccessful comments. Finally, we discuss gaps in which GitHub Copilot’s successors and future researchers can improve upon code intelligence tasks to facilitate AI-assisted software maintenance.

DOI: 10.1145/3597503.3639176


Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)

作者: Ahmed, Toufique and Pai, Kunal Suresh and Devanbu, Premkumar and Barr, Earl
关键词: LLM, code summarization, program analysis, prompt engineering

Abstract

Large Language Models (LLM) are a new class of computation engines, “programmed” via prompt engineering. Researchers are still learning how to best “program” these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously collect semantic facts from the code while working. Mostly these are shallow, simple facts arising from a quick read. For a function, such facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow, etc. One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them implicitly capable of doing this simple level of “code analysis” and extracting such information, while processing code: but are they, really? If they aren’t, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task, and evaluate whether automatically augmenting an LLM’s prompt with explicit semantic facts actually helps. Prior work shows that LLM performance on code summarization benefits from embedding a few code & summary exemplars in the prompt, before the code to be summarized. While summarization performance has steadily progressed since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization. We find that adding semantic facts to the code in the prompt actually does help! This approach improves performance in several different settings suggested by prior work, including for three different Large Language Models. In most cases, we see improvements, as measured by a range of commonly-used metrics; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU. In addition, we have also found that including semantic facts yields a substantial enhancement in LLMs’ line completion performance.
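
A minimal sketch of the kind of shallow facts the abstract lists (parameter names, local variables, return expressions) is shown below, extracted with Python’s `ast` module and prepended to a summarization prompt. The prompt wording is an assumption, and the paper’s facts are derived by program analysis over several languages, not just Python.

```python
import ast

def shallow_facts(source):
    """Collect parameter names, assigned local names, and return expressions."""
    func = ast.parse(source).body[0]
    params = [a.arg for a in func.args.args]
    assigned = sorted({node.id for node in ast.walk(func)
                       if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)})
    returns = [ast.unparse(node.value) for node in ast.walk(func)
               if isinstance(node, ast.Return) and node.value is not None]
    return {"parameters": params, "locals": assigned, "returns": returns}

code = """
def mean(values, default=0.0):
    if not values:
        return default
    total = sum(values)
    return total / len(values)
"""
facts = shallow_facts(code)
prompt = (
    f"# Parameters: {', '.join(facts['parameters'])}\n"
    f"# Local variables: {', '.join(facts['locals'])}\n"
    f"# Return expressions: {'; '.join(facts['returns'])}\n"
    f"{code}\nSummarize the function above in one sentence."
)
print(prompt)
```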

DOI: 10.1145/3597503.3639183


DSFM: Enhancing Functional Code Clone Detection with Deep Subtree Interactions

作者: Xu, Zhiwei and Qiang, Shaohua and Song, Dinghong and Zhou, Min and Wan, Hai and Zhao, Xibin and Luo, Ping and Zhang, Hongyu
关键词: code clone detection, semantic clone, code similarity, factorization machine

Abstract

Functional code clone detection is important for software maintenance. In recent years, deep learning techniques have been introduced to improve the performance of functional code clone detectors. By representing each code snippet as a vector containing its program semantics, syntactically dissimilar functional clones are detected. However, existing deep learning-based approaches attach too much importance to code feature learning, hoping to project all recognizable knowledge of a code snippet into a single vector. We argue that these deep learning-based approaches can be enhanced by considering the characteristics of syntactic code clone detection, where we need to compare the contents of the source code (e.g., intersection of tokens, similar flow graphs, and similar subtrees) to obtain code clones. In this paper, we propose a novel deep learning-based approach named DSFM, which incorporates comparisons between code snippets for detecting functional code clones. Specifically, we improve the typical deep clone detectors with deep subtree interactions that compare every pair of subtrees extracted from the abstract syntax trees (ASTs) of two code snippets, thereby introducing more fine-grained semantic similarity. By conducting extensive experiments on three widely-used datasets, GCJ, OJClone, and BigCloneBench, we demonstrate the great potential of deep subtree interactions in the code clone detection task. The proposed DSFM outperforms the state-of-the-art approaches, including two traditional approaches, two unsupervised and four supervised deep learning-based baselines.

DOI: 10.1145/3597503.3639215


Machine Learning is All You Need: A Simple Token-based Approach for Effective Code Clone Detection

作者: Feng, Siyue and Suo, Wenqi and Wu, Yueming and Zou, Deqing and Liu, Yang and Jin, Hai
关键词: code clones, machine learning, token

Abstract

As software engineering advances and the demand for code rises, the prevalence of code clones has increased. This phenomenon poses risks like vulnerability propagation, underscoring the growing importance of code clone detection techniques. While numerous code clone detection methods have been proposed, they often fall short in real-world code environments. They either struggle to identify code clones effectively or demand substantial time and computational resources to handle complex clones. This paper introduces a code clone detection method named Toma, which uses tokens and machine learning. Specifically, we extract token type sequences and employ six similarity calculation methods to generate feature vectors. These vectors are then input into a trained machine learning model for classification. To evaluate the effectiveness and scalability of Toma, we conduct experiments on the widely used BigCloneBench dataset. Results show that our tool outperforms token-based code clone detectors and most tree-based clone detectors, demonstrating high effectiveness and significant time savings.
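
A minimal sketch of the pipeline described above: map each snippet to its token-type sequence, compute a few pairwise similarity features, and hand the feature vector to a trained classifier. The tokenizer below uses Python’s `tokenize` module and shows only three similarity measures, both assumptions for illustration.

```python
import io
import tokenize
from difflib import SequenceMatcher

SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def token_types(code):
    """Map a snippet to its sequence of token types (NAME, OP, NUMBER, ...)."""
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    return [tokenize.tok_name[tok.type] for tok in tokens if tok.type not in SKIP]

def similarity_features(a, b):
    ta, tb = token_types(a), token_types(b)
    jaccard = len(set(ta) & set(tb)) / len(set(ta) | set(tb))     # type-set overlap
    sequence = SequenceMatcher(None, ta, tb).ratio()              # ordered similarity
    length = min(len(ta), len(tb)) / max(len(ta), len(tb))        # length ratio
    return [jaccard, sequence, length]

snippet_a = "def add(a, b):\n    return a + b\n"
snippet_b = "def plus(x, y):\n    return x + y\n"
print(similarity_features(snippet_a, snippet_b))
# a trained classifier (e.g. a random forest) would consume such feature vectors
```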

DOI: 10.1145/3597503.3639114


Cross-Inlining Binary Function Similarity Detection

作者: Jia, Ang and Fan, Ming and Xu, Xi and Jin, Wuxia and Wang, Haijun and Liu, Ting
关键词: cross-inlining, binary similarity detection, inlining pattern

Abstract

Binary function similarity detection plays an important role in a wide range of security applications. Existing works usually assume that the query function and target function share equal semantics and compare their full semantics to obtain the similarity. However, we find that the function mapping is more complex, especially when function inlining happens. In this paper, we will systematically investigate cross-inlining binary function similarity detection. We first construct a cross-inlining dataset by compiling 51 projects using 9 compilers, with 4 optimizations, to 6 architectures, with 2 inlining flags, which results in two datasets both with 216 combinations. Then we construct the cross-inlining function mappings by linking the common source functions in these two datasets. Through analysis of this dataset, we find that three cross-inlining patterns widely exist while existing work suffers when detecting cross-inlining binary function similarity. Next, we propose a pattern-based model named CI-Detector for cross-inlining matching. CI-Detector uses the attributed CFG to represent the semantics of binary functions and GNN to embed binary functions into vectors. CI-Detector respectively trains a model for these three cross-inlining patterns. Finally, the testing pairs are input to these three models and all the produced similarities are aggregated to produce the final similarity. We conduct several experiments to evaluate CI-Detector. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.

DOI: 10.1145/3597503.3639080


BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching

作者: Jiang, Ling and An, Junwen and Huang, Huihui and Tang, Qiyi and Nie, Sen and Wu, Shi and Zhang, Yuqun
关键词: software composition analysis, static binary analysis

Abstract

While third-party libraries (TPLs) are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis (SCA), proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering since binary and source code exhibit substantial disparities after compilation. The existing binary-to-source SCA techniques leverage basic syntactic features that suffer from redundancy and lack robustness in the large-scale TPL dataset, leading to inevitable false positives and compromised recall. To mitigate these limitations, we introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features. First, BinaryAI trains a transformer-based model to produce function-level embeddings and obtain similar source functions for each binary function accordingly. Then by applying the link-time locality to facilitate function matching, BinaryAI detects the reused TPLs based on the ratio of matched source functions. Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task. Specifically, our embedding model outperforms the state-of-the-art model CodeCMR, i.e., achieving 22.54% recall@1 and 0.34 MRR compared with 10.75% and 0.17 respectively. Additionally, BinaryAI outperforms all existing binary-to-source SCA tools in TPL detection, increasing the precision from 73.36% to 85.84% and recall from 59.81% to 64.98% compared with the well-recognized commercial SCA product Black Duck. https://www.binaryai.net

DOI: 10.1145/3597503.3639100


PPT4J: Patch Presence Test for Java Binaries

作者: Pan, Zhiyuan and Hu, Xing and Xia, Xin and Zhan, Xian and Lo, David and Yang, Xiaohu
关键词: patch presence test, binary analysis, software security

Abstract

The number of vulnerabilities reported in open source software has increased substantially in recent years. Security patches provide the necessary measures to protect software from attacks and vulnerabilities. In practice, it is difficult to identify whether patches have been integrated into software, especially if we only have binary files. Therefore, the ability to test whether a patch is applied to the target binary, a.k.a. patch presence test, is crucial for practitioners. However, it is challenging to obtain accurate semantic information from patches, which could lead to incorrect results. In this paper, we propose a new patch presence test framework named Ppt4J (Patch Presence Test for Java Binaries). Ppt4J is designed for open-source Java libraries. It takes Java binaries (i.e. bytecode files) as input, extracts semantic information from patches, and uses feature-based techniques to identify patch lines in the binaries. To evaluate the effectiveness of our proposed approach Ppt4J, we construct a dataset with binaries that include 110 vulnerabilities. The results show that Ppt4J achieves an F1 score of 98.5% with reasonable efficiency, improving the baseline by 14.2%. Furthermore, we conduct an in-the-wild evaluation of Ppt4J on JetBrains IntelliJ IDEA. The results suggest that a third-party library included in the software is not patched for two CVEs, and we have reported this potential security problem to the vendor.

DOI: 10.1145/3597503.3639231


Compiler-directed Migrating API Callsite of Client Code

作者: Zhong, Hao and Meng, Na
关键词: code migration, compiler, API library

Abstract

API developers evolve software libraries to fix bugs, add new features, or refactor code, but the evolution can introduce API-breaking changes (e.g., API renaming). To benefit from such evolution, the programmers of client projects have to repetitively upgrade the callsites of libraries, since API-breaking changes introduce many compilation errors. It is tedious and error-prone to resolve such errors, especially when programmers are often unfamiliar with the API usages of newer versions. To migrate client code, the prior approaches either mine API mappings or learn edit scripts, but both research lines have inherent limitations. For example, mappings alone cannot handle complex cases, and there is no sufficient source (e.g., migration commits) for learning edit scripts. In this paper, we propose a new research direction. When a library is replaced with a newer version, each type of API-breaking change introduces a type of compilation error. For example, renaming an API method causes undefined-method errors at its callsites. Based on this observation, we propose to resolve errors that are introduced by migration, according to their locations and types as reported by compilers. In this way, a migration tool can incrementally migrate complex cases, even without any change examples. Towards this direction, we propose the first approach, called LibCatch. It defines 14 migration operators, and in a compiler-directed way, it exploits the combinations of migration operators to generate migration solutions, until its predefined criteria are satisfied. We conducted two evaluations. In the first evaluation, we use LibCatch to handle 123 migration tasks. LibCatch reduced migration-related compilation errors for 92.7% of tasks, and eliminated such errors for 32.4% of tasks. We inspect the tasks whose errors are eliminated, and find that 33.9% of them produce edits identical to manual migration edits. In the second evaluation, we use two tools and LibCatch to migrate 15 real client projects in the wild. LibCatch resolved all compilation errors of 7 projects, and reduced the compilation errors of 6 other projects to no more than two errors. As a comparison, the two compared tools reduced the compilation errors of only 1 project.

DOI: 10.1145/3597503.3639084


Hard to Read and Understand Pythonic Idioms? DeIdiom and Explain Them in Non-Idiomatic Equivalent Code

作者: Zhang, Zejun and Xing, Zhenchang and Zhao, Dehai and Lu, Qinghua and Xu, Xiwei and Zhu, Liming
关键词: pythonic idioms, code transformation, program comprehension

Abstract

The Python community strives to design pythonic idioms so that Python users can achieve their intent in a more concise and efficient way. According to our analysis of 154 questions about the challenges of understanding pythonic idioms on Stack Overflow, we find that Python users face various challenges in comprehending pythonic idioms. The usage of pythonic idioms in 7,577 GitHub projects reveals their prevalence. By using a statistical sampling method, we find that pythonic idioms result in not only lexical conciseness but also the creation of variables and functions, which indicates that it is not straightforward to map them back to non-idiomatic code. Usage of pythonic idioms may even cause potential negative effects such as code redundancy, bugs and performance degradation. To alleviate such readability issues and negative effects, we develop a transforming tool, DeIdiom, to automatically transform idiomatic code into equivalent non-idiomatic code. We test and review over 7,572 idiomatic code instances of nine pythonic idioms (list/set/dict-comprehension, chain-comparison, truth-value-test, loop-else, assign-multi-targets, for-multi-targets, star); the results show the high accuracy of DeIdiom. Our user study with 20 participants demonstrates that explanatory non-idiomatic code generated by DeIdiom is useful for Python users to understand pythonic idioms correctly and efficiently, and leads to a more positive appreciation of pythonic idioms.
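
As a flavor of the transformation direction described above, the sketch below de-idiomizes one idiom only, a list comprehension, by rewriting it into an explicit loop with Python’s `ast` module; it assumes a single simple assignment target and is far less careful than DeIdiom.

```python
import ast

def deidiom_listcomp(source: str) -> str:
    """Rewrite `target = [elt for x in it if cond]` as an explicit loop."""
    tree = ast.parse(source)
    assign = tree.body[0]
    comp = assign.value.generators[0]
    target_name = assign.targets[0].id

    # Body of the loop: target.append(elt), wrapped in the comprehension's filters.
    loop_body = [ast.Expr(ast.Call(
        func=ast.Attribute(ast.Name(target_name, ast.Load()), "append", ast.Load()),
        args=[assign.value.elt], keywords=[]))]
    for test in reversed(comp.ifs):
        loop_body = [ast.If(test=test, body=loop_body, orelse=[])]

    new_body = [
        ast.Assign(targets=[ast.Name(target_name, ast.Store())],
                   value=ast.List(elts=[], ctx=ast.Load())),
        ast.For(target=comp.target, iter=comp.iter, body=loop_body, orelse=[]),
    ]
    module = ast.Module(body=new_body, type_ignores=[])
    return ast.unparse(ast.fix_missing_locations(module))

print(deidiom_listcomp("squares = [x * x for x in nums if x > 0]"))
```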

DOI: 10.1145/3597503.3639101


Exploiting Library Vulnerability via Migration Based Automating Test Generation

作者: Chen, Zirui and Hu, Xing and Xia, Xin and Gao, Yi and Xu, Tongtong and Lo, David and Yang, Xiaohu
关键词: library vulnerabilities, search-based test generation

Abstract

In software development, developers extensively utilize third-party libraries to avoid implementing existing functionalities. When a new third-party library vulnerability is disclosed, project maintainers need to determine whether their projects are affected by the vulnerability, which requires developers to invest substantial effort in assessment. However, existing tools face a series of issues: static analysis tools produce false alarms, dynamic analysis tools require existing tests, and test generation tools have low success rates when facing complex vulnerabilities. Vulnerability exploits, code snippets provided for reproducing vulnerabilities after disclosure, contain a wealth of vulnerability-related information. This study proposes a new method based on vulnerability exploits, called Vesta (Vulnerability Exploit-based Software Testing Auto-Generator), which provides vulnerability exploit tests as the basis for developers to decide whether to update dependencies. Vesta extends search-based test generation methods by adding a migration step that ensures the similarity between the generated test and the vulnerability exploit, which increases the likelihood of detecting potential library vulnerabilities in a project. We perform experiments on 30 vulnerabilities disclosed in the past five years, involving 60 vulnerability-project pairs, and compare the experimental results with the baseline method, Transfer. The success rate of Vesta is 71.7%, a 53.4% improvement over Transfer in verifying exploitable vulnerabilities.

DOI: 10.1145/3597503.3639583


MUT: Human-in-the-Loop Unit Test Migration

作者: Gao, Yi and Hu, Xing and Xu, Tongtong and Xia, Xin and Lo, David and Yang, Xiaohu
关键词: No keywords

Abstract

Test migration, which enables the reuse of test cases crafted with knowledge and creativity by testers across various platforms and programming languages, has exhibited effectiveness in mobile app testing. However, unit test migration at the source code level has not garnered adequate attention and exploration. In this paper, we propose a novel cross-language and cross-platform test migration methodology, named MUT, which consists of four modules: code mapping, test case filtering, test case translation, and test case adaptation. MUT initially calculates code mappings to establish associations between source and target projects, and identifies suitable unit tests for migration from the source project. Then, MUT’s code translation component generates a syntax tree by parsing the code to be migrated and progressively converts each node in the tree, ultimately generating the target tests, which are compiled and executed in the target project. Moreover, we develop a web tool to assist developers in test migration. The effectiveness of our approach has been validated on five prevalent functional domain projects within the open-source community. We migrated a total of 550 unit tests and submitted pull requests to augment test code in the target projects on GitHub. By the time of this paper submission, 253 of these tests had already been merged into the projects (including 197 unit tests in the Luliyucoordinate-LeetCode project and 56 unit tests in the Rangerlee-HtmlParser project). Through running these tests, we identified 5 bugs and 2 functional defects, and submitted corresponding issues to the projects. The evaluation substantiates that MUT’s test migration is both viable and beneficial across programming languages and different projects.

DOI: 10.1145/3597503.3639124


Streamlining Java Programming: Uncovering Well-Formed Idioms with IdioMine

作者: Yang, Yanming and Hu, Xing and Xia, Xin and Lo, David and Yang, Xiaohu
关键词: code idiom mining, code pattern, large language model (LLM), clustering

Abstract

Code idioms are commonly used patterns, techniques, or practices that aid in solving particular problems or specific tasks across multiple software projects. They can improve code quality, performance, and maintainability, and also promote program standardization and reuse across projects. However, identifying code idioms is significantly challenging, and existing approaches still suffer from three main limitations. First, it is difficult to recognize idioms that span non-contiguous code lines. Second, identifying idioms with intricate data flow and code structures can be challenging. Moreover, they only extract dataset-specific idioms, so common idioms or well-established code/design patterns that are rarely found in datasets cannot be identified. To overcome these limitations, we propose a novel approach, named IdioMine, to automatically extract generic and specific idioms from both Java projects and libraries. We perform program analysis on Java functions to transform them into concise program dependence graphs (PDGs) that integrate the data flow and control flow of code fragments. We then develop a novel chain structure, the Data-driven Control Chain (DCC), to extract sub-idioms with contiguous semantic meanings from PDGs. After that, we utilize GraphCodeBERT to generate code embeddings of these sub-idioms and perform density-based clustering to obtain frequent sub-idioms. We use heuristic rules to identify interrelated sub-idioms among the frequent ones. Finally, we employ ChatGPT to synthesize interrelated sub-idioms into potential code idioms and infer real idioms from them. We conduct well-designed experiments and a user study to evaluate IdioMine’s correctness and the practical value of the extracted idioms. Our experimental results show that IdioMine effectively extracts more idioms with better performance on most metrics. Compared with Haggis and ChatGPT, IdioMine outperforms them by 22.8% and 35.5% in Idiom Set Precision (ISP) and by 9.7% and 22.9% in Idiom Coverage (IC) when extracting idioms from libraries. IdioMine also extracts almost twice as many idioms as the baselines, exhibiting its ability to identify complete idioms. Our user study indicates that idioms extracted by IdioMine are well-formed and semantically clear. Moreover, we conduct a qualitative and quantitative analysis to investigate the primary functionalities of the idioms IdioMine extracts from various projects and libraries.
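
The clustering step can be pictured with the toy sketch below, where two-dimensional vectors stand in for GraphCodeBERT embeddings and DBSCAN groups recurring fragments while discarding rare ones; this is an assumed illustration of the general technique, not IdioMine's configuration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-ins for code-fragment embeddings (real embeddings are high-dimensional).
embeddings = np.array([
    [0.10, 0.92], [0.11, 0.90], [0.09, 0.93],  # one frequently recurring fragment
    [0.80, 0.15], [0.82, 0.14],                # another recurring fragment
    [0.50, 0.50],                              # a rare fragment, treated as noise
])

labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(embeddings)
print(labels)  # e.g. [ 0  0  0  1  1 -1]; label -1 marks noise
```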

DOI: 10.1145/3597503.3639135


Fine-grained, accurate and scalable source differencing

作者: Falleri, Jean-Remy and Martinez, Matias
关键词: software evolution, code differencing

Abstract

Understanding code changes is of crucial importance in a wide range of software evolution activities. The traditional approach is to use textual differencing, as done with success since the 1970s with the ubiquitous diff tool. However, textual differencing has the important limitation of not aligning the changes to the syntax of the source code. To overcome these issues, structural (i.e., syntactic) differencing has been proposed in the literature, notably GumTree, which was one of the pioneering approaches. The main drawback of GumTree’s algorithm is the use of an optimal but expensive tree-edit distance algorithm that makes it difficult to diff large ASTs. In this article, we describe a less expensive heuristic that enables GumTree to scale to large ASTs while yielding results of better quality than the original GumTree. We validate this new heuristic against 4 datasets of changes in two different languages, where we generate edit scripts with a median size 50% smaller and achieve a total speedup of the matching time between 50x and 281x.
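
The contrast between textual and structural differencing can be sketched as follows (an assumed Python illustration; GumTree itself works on language-specific ASTs and computes node mappings rather than the crude comparison shown here):

```python
import ast
import difflib

before = "def area(w, h):\n    return w * h\n"
after = "def area(width, height):\n    return width * height\n"

# Textual diff: whole lines are reported as removed and re-added.
print("\n".join(difflib.unified_diff(
    before.splitlines(), after.splitlines(), lineterm="")))

# Structural view: the tree shape is identical; only identifier leaves differ,
# which is the kind of fine-grained edit a syntactic differencer can report.
def node_kinds(source):
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

assert node_kinds(before) == node_kinds(after)
```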

DOI: 10.1145/3597503.3639148


Semantic GUI Scene Learning and Video Alignment for Detecting Duplicate Video-based Bug Reports

作者: Yan, Yanfu and Cooper, Nathan and Chaparro, Oscar and Moran, Kevin and Poshyvanyk, Denys
关键词: bug reporting, GUI learning, duplicate video retrieval

Abstract

Video-based bug reports are increasingly being used to document bugs for programs centered around a graphical user interface (GUI). However, developing automated techniques to manage video-based reports is challenging as it requires identifying and understanding often nuanced visual patterns that capture key information about a reported bug. In this paper, we aim to overcome these challenges by advancing the bug report management task of duplicate detection for video-based reports. To this end, we introduce a new approach, called Janus, that adapts the scene-learning capabilities of vision transformers to capture subtle visual and textual patterns that manifest on app UI screens, which is key to differentiating between similar screens for accurate duplicate report detection. Janus also makes use of a video alignment technique capable of adaptive weighting of video frames to account for typical bug manifestation patterns. In a comprehensive evaluation on a benchmark containing 7,290 duplicate detection tasks derived from 270 video-based bug reports from 90 Android app bugs, the best configuration of our approach achieves an overall mRR/mAP of 89.8%/84.7%, and for the large majority of duplicate detection tasks, outperforms prior work by ≈9% to a statistically significant degree. Finally, we qualitatively illustrate how the scene-learning capabilities provided by Janus benefit its performance.

DOI: 10.1145/3597503.3639163


The Classics Never Go Out of Style: An Empirical Study of Downgrades from the Bazel Build Technology

作者: Alfadel, Mahmoud and McIntosh, Shane
关键词: build systems, downgrades, empirical software engineering

Abstract

Software build systems specify how source code is transformed into deliverables. Keeping build systems in sync with the software artifacts that they build, while retaining their capacity to quickly produce updated deliverables, requires a serious investment of development effort. Enticed by advanced features, several software teams have migrated their build systems to a modern generation of build technologies (e.g., Bazel, Buck), which aim to reduce the maintenance and execution overhead that build systems impose on development. However, not all migrations lead to perceived improvements, ultimately culminating in abandonment of the build technology. While prior work has focused on upward migration towards more advanced technologies, so-called downgrades, i.e., abandonment of a modern build technology in favour of a traditional one, remain largely unexplored. In this paper, we perform an empirical study to better understand the abandonment of Bazel, a modern build technology with native support for multi-language software projects and (local/distributed) artifact caching. Our investigation of 542 projects that adopt Bazel reveals that (1) 61 projects (11.2%) have abandoned Bazel; and (2) abandonment tends to occur after investing in Bazel for a substantial amount of time (a median of 638 days). Thematic analysis reveals seven recurring reasons for abandonment, such as technical challenges, lack of platform integration, team coordination issues, and upstream trends. After abandoning Bazel, the studied projects have adopted a broad set of alternatives, spanning from language-specific tools like Go Build to more traditional build technologies like CMake and even pure Make. These results demonstrate that choosing a build technology involves balancing tradeoffs that are not always optimized by adopting the latest technology. This paper also lays the foundation for future work on balancing the tradeoffs associated with build technology choice (e.g., feature richness vs. maintenance costs) and the development of tools to support migration away from modern technologies.

DOI: 10.1145/3597503.3639169


Scaling Code Pattern Inference with Interactive What-If Analysis

作者: Kang, Hong Jin and Wang, Kevin and Kim, Miryung
关键词: active learning, code search patterns, API misuse, human feedback

Abstract

Programmers often have to search for similar code when detecting and fixing similar bugs. Prior active learning approaches take only instance-level feedback, i.e., positive and negative method instances. This limitation leads to an increased labeling burden when users try to control generality and specificity for a desired code pattern. We present a novel feedback-guided pattern inference approach, called SURF. To reduce users’ labelling effort, it actively guides users in assessing the implication of having a particular feature choice in the constructed pattern, and incorporates direct feature-level feedback. The key insight behind SURF is that users can effectively select appropriate features with the aid of impact analysis. SURF provides hints on the global distribution of how each feature is consistent with already labelled positive and negative instances, and how selection of a new feature can yield additional matching instances. Its what-if analysis contrasts how different feature choices can include (or exclude) more instances in the rest of the population. We performed a user study with 14 participants, designed with a two-treatment factorial crossover. Participants were able to provide 30% more correct answers about different API usages in 20% less time. All participants found that what-if analysis and impact analysis are useful for pattern refinement. 79% of the participants were able to produce the correct, expected pattern with SURF’s feature-level guidance, as opposed to 43% of the participants when using the baseline with instance-level feedback only. SURF is the first approach to incorporate feature-level feedback with automated what-if analysis to empower users to control the generality (or specificity) of a desired code pattern.

DOI: 10.1145/3597503.3639193


Context-Aware Name Recommendation for Field Renaming

作者: Dong, Chunhao and Jiang, Yanjie and Niu, Nan and Zhang, Yuxia and Liu, Hui
关键词: refactoring, rename, recommendation, context-aware

Abstract

Renaming is one of the most popular software refactorings. Although developers may know what the new name should be when they conduct a renaming, it remains valuable for refactoring tools to recommend new names automatically so that developers can simply hit Enter and efficiently accept the recommendation to accomplish the refactoring. Consequently, most IDEs automatically recommend new names for renaming refactorings by default. However, the recommendations made by mainstream IDEs are often incorrect. For example, the precision of IntelliJ IDEA in recommending names for field renamings is as low as 6.3%. To improve the accuracy, in this paper, we propose a context-aware lightweight approach (called CARER) to recommend new names for Java field renamings. Different from mainstream IDEs that rely heavily on the initializers and data types of the to-be-renamed fields, CARER exploits both the dynamic and static contexts of the renamings as well as naming conventions. We evaluate CARER on 1.1K real-world field renamings discovered from open-source applications. Our evaluation results suggest that CARER can significantly improve the state of the practice in recommending new names for field renamings, improving precision from 6.30% to 61.15% and recall from 6.30% to 41.50%. Our evaluation results also suggest that CARER is as efficient as IntelliJ IDEA, making it suitable to be integrated into IDEs.

DOI: 10.1145/3597503.3639195


CNEPS: A Precise Approach for Examining Dependencies among Third-Party C/C++ Open-Source Components

作者: Na, Yoonjong and Woo, Seunghoon and Lee, Joomyeong and Lee, Heejo
关键词: open source software reuse, supply chain security, third-party library dependency, software bill of materials (SBOM)

Abstract

The rise in open-source software (OSS) reuse has led to intricate dependencies among third-party components, increasing the demand for precise dependency analysis. However, owing to the presence of reused files whose originating components are difficult to identify (i.e., indistinguishable files) and duplicated components, precisely identifying component dependencies is becoming challenging. In this paper, we present Cneps, a precise approach for examining dependencies in reused C/C++ OSS components. The key idea of Cneps is to use a novel granularity called a module, which represents a minimum unit (i.e., a set of source files) that can be reused as a library from another project. By examining dependencies based on modules instead of analyzing single reused files, Cneps can precisely identify dependencies in the target projects, even in the presence of indistinguishable files. To differentiate duplicated components, Cneps examines the cloned paths and originating projects of each component, enabling precise identification of the dependencies associated with them. Experimental results on the top 100 C/C++ software projects show that Cneps outperforms a state-of-the-art approach by identifying twice as many dependencies. Cneps could identify 435 dependencies with 89.9% precision and 93.2% recall in less than 10 seconds per application on average, whereas the existing approach achieved only 63.5% precision and 42.5% recall.

DOI: 10.1145/3597503.3639209


A Study on the Pythonic Functional Constructs’ Understandability

作者: Zid, Cyrine and Zampetti, Fiorella and Antoniol, Giuliano and Di Penta, Massimiliano
关键词: functional programming, Python, program comprehension, empirical study

Abstract

The use of functional constructs in programming languages such as Python has been advocated to help write more concise source code, improve parallelization, and reduce side effects. Nevertheless, their usage could lead to understandability issues. This paper reports the results of a controlled experiment conducted with 209 developers to assess the understandability of given Pythonic functional constructs, namely lambdas, comprehensions, and map/reduce/filter functions, compared to their procedural alternatives. To address the study’s goal, we asked developers to modify code using functional constructs or not, to compare the understandability of different implementations, and to provide insights about when and where it is preferable to use such functional constructs. Results of the study indicate that code snippets with lambdas are more straightforward to modify than their procedural alternatives. However, this is not the case for comprehensions. Regarding perceived understandability, code snippets relying on procedural implementations are considered more readable than their functional alternatives. Last but not least, while functional constructs may help write compact code, improving maintainability and performance, they are considered hard to debug. Our results can lead to better education in using functional constructs, prioritizing quality assurance activities, and enhancing tool support for developers.
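
For readers unfamiliar with the construct pairs being compared, the snippet below shows an assumed example of a functional-style computation and its procedural alternative (it is not taken from the study's experimental material):

```python
from functools import reduce

prices = [12.0, 5.5, 30.0, 8.25]

# Functional style: filter + map with lambdas, folded with reduce.
total_functional = reduce(
    lambda acc, p: acc + p,
    map(lambda p: p * 1.2, filter(lambda p: p >= 10.0, prices)),
    0.0,
)

# Procedural alternative: an explicit loop with an accumulator.
total_procedural = 0.0
for p in prices:
    if p >= 10.0:
        total_procedural += p * 1.2

assert abs(total_functional - total_procedural) < 1e-9
```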

DOI: 10.1145/3597503.3639211


GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions

作者: Saavedra, Nuno and Silva, André
关键词: software bugs, bug benchmark, bug database, reproducibility, software testing, program analysis, github actions

Abstract

Bug-fix benchmarks are fundamental in advancing various sub-fields of software engineering such as automatic program repair (APR) and fault localization (FL). A good benchmark must include recent examples that accurately reflect the technologies and development practices of today. To be executable in the long term, a benchmark must feature test suites that do not degrade over time due to, for example, dependencies that are no longer available. Existing benchmarks fail to meet both criteria. For instance, Defects4J, one of the foremost Java benchmarks, last received an update in 2020. Moreover, full reproducibility has been neglected by the majority of existing benchmarks. In this paper, we present GitBug-Actions: a novel tool for building bug-fix benchmarks with modern and fully reproducible bug-fixes. GitBug-Actions relies on the most popular CI platform, GitHub Actions, to detect bug-fixes and to execute the CI pipeline locally in a controlled and reproducible environment. To the best of our knowledge, we are the first to rely on GitHub Actions to collect bug-fixes. To demonstrate our toolchain, we deploy GitBug-Actions to build a proof-of-concept Go bug-fix benchmark containing executable, fully reproducible bug-fixes from different repositories. A video demonstrating GitBug-Actions is available at: https://youtu.be/aBWwa1sJYBs.

DOI: 10.1145/3639478.3640023


DronLomaly: Runtime Log-based Anomaly Detector for DJI Drones

作者: Minn, Wei and Tun, Yan Naing and Shar, Lwin Khin and Jiang, Lingxiao
关键词: drone security, anomaly detection, log analysis, deep learning

Abstract

We present an automated tool for realtime detection of anomalous behaviors while a DJI drone is executing a flight mission. The tool takes sensor data logged by the drone at fixed time intervals and performs anomaly detection using a Bi-LSTM model. The model is trained on baseline flight logs from a successful mission, conducted physically or via a simulator. The tool has two modules: the first module is responsible for sending the log data to the remote controller station, and the second module runs as a service in the remote controller station, powered by a Bi-LSTM model, which receives the log data and produces visual graphs showing the realtime flight anomaly statuses with respect to various sensor readings on a dashboard. We have successfully evaluated the tool on three datasets including industrial test scenarios. DronLomaly is released as an open-source tool on GitHub [10], and the demo video can be found at [17].

DOI: 10.1145/3639478.3640042


JOG: Java JIT Peephole Optimizations and Tests from Patterns

作者: Zang, Zhiqiang and Thimmaiah, Aditya and Gligoric, Milos
关键词: just-in-time compilers, code generation, peephole optimizations

Abstract

We present JOG, a framework for developing peephole optimizations and accompanying tests for Java compilers. JOG allows developers to write a peephole optimization as a pattern in Java itself. Such a pattern contains the code before and after the desired transformation defined by the peephole optimization, with any necessary preconditions, and the pattern can be written in the same way that tests for the optimization are already written in OpenJDK. JOG automatically translates each pattern into C/C++ code as a JIT optimization pass and generates tests for the optimization. JOG also automatically analyzes the shadow relation between a pair of optimizations, where the effect of the shadowed optimization is overridden by the other. We used JOG to write 162 patterns, including many patterns found in OpenJDK and LLVM, as well as some that we proposed. We opened ten pull requests (PRs) for OpenJDK that introduce new optimizations, remove shadowed optimizations, and add generated tests for optimizations; nine of the PRs have already been integrated into the master branch of OpenJDK. The demo video for JOG can be found at https://youtu.be/z2q6dhOiqgw.
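
To convey what a peephole pattern expresses (a before shape, an after shape, and a precondition), here is a hedged sketch on Python ASTs; JOG's actual patterns are written in Java and translated to C/C++ JIT passes, so this is only an analogy.

```python
import ast

class MulByPowerOfTwo(ast.NodeTransformer):
    """Rewrite  x * c  into  x << log2(c)  when c is a positive power of two."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        c = node.right
        if (isinstance(node.op, ast.Mult)
                and isinstance(c, ast.Constant)
                and isinstance(c.value, int)
                and c.value > 0
                and c.value & (c.value - 1) == 0):   # precondition: power of two
            shift = c.value.bit_length() - 1
            return ast.BinOp(left=node.left, op=ast.LShift(),
                             right=ast.Constant(value=shift))
        return node

tree = MulByPowerOfTwo().visit(ast.parse("y = x * 8"))
print(ast.unparse(ast.fix_missing_locations(tree)))  # y = x << 3
```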

DOI: 10.1145/3639478.3640040


nvshare: Practical GPU Sharing without Memory Size Constraints

作者: Alexopoulos, Georgios and Mitropoulos, Dimitris
关键词: graphics processing unit, resource sharing, machine learning

Abstract

GPUs are essential for accelerating Machine Learning (ML) workloads. A common practice is deploying ML jobs as containers managed by an orchestrator such as Kubernetes. Kubernetes schedules GPU workloads by exclusively assigning a device to a single job, which leads to massive GPU underutilization, especially for interactive development jobs with significant idle periods. Current GPU sharing approaches assign a fraction of GPU memory to each co-located job to avoid memory contention and out-of-memory errors. However, this is impractical, as it requires a priori knowledge of memory usage and does not fully address GPU underutilization. We propose nvshare, which transparently enables page faults (i.e., exceptions that are raised when an entity attempts to access a resource) to allow virtual GPU memory oversubscription. In this way we permit each application to utilize the entire physical GPU memory (Video RAM). To prevent thrashing (a situation in which page faults dominate execution time) in a reliable manner, nvshare serializes overlapping GPU bursts from different applications. We compared nvshare with KubeShare, a state-of-the-art GPU sharing solution. Our results indicate that both perform equally well in conventional sharing cases where total GPU memory usage fits into VRAM. For memory oversubscription scenarios, which KubeShare does not support, nvshare outperforms the sequential execution baseline by up to 1.35x. A video of nvshare is available at https://www.youtube.com/watch?v=9n-5sc5AICY

DOI: 10.1145/3639478.3640034


Daedalux: An Extensible Platform for Variability-Aware Model Checking

作者: Lazreg, Sami and Cordy, Maxime and Hansen, Simon Thrane and Legay, Axel
关键词: No keywords

Abstract

This paper presents Daedalux, a new model-checking platform for variability-intensive systems based on Featured Transition System theory developed in C++. Daedalux features a modular, flexible, and extensible architecture, overcoming previous tools’ maintainability limitations. In addition, during verification, it provides visualizations of intermediate models and results. A key added value of Daedalux lies in its software architecture, which allows straightforward extension and integration of new formalisms and verification algorithms. We have implemented two recent FTS-based approaches, i.e., a statistical model-checking algorithm for LTL properties and an exhaustive algorithm for multi-LTL properties. By reducing the entry barrier of understanding variability-aware model checking and facilitating the comprehension and extension of the software tools, we hope to increase the community’s ambitions in developing novel model-checking advances. A video demonstration of Daedalux can be found at https://youtu.be/kirpOAlV-0w.

DOI: 10.1145/3639478.3640043


Verifying and Displaying Move Smart Contract Source Code for the Sui Blockchain

作者: van Tonder, Rijnard
关键词: smart contracts, source code, bytecode, compilers, program comprehension, software development, blockchain

Abstract

Smart contract development presents additional challenges beyond traditional software workflows (e.g., developing locally in IDEs). For smart contract developers to understand and trust code execution, they need to write and use software libraries with a comprehensible code representation, i.e., source code. However, blockchains do not typically store the original source code of smart contracts, but a condensed bytecode representation. Thus, when developers consult smart contract source code, they need to be sure that it corresponds to the same bytecode on the blockchain. Depending on available developer tools, this process can be ad-hoc, cumbersome, or opaque. In this paper we present our design and implementation of a new tool that verifies Move smart contract source code against its bytecode representation on the Sui blockchain. We demonstrate the user-facing shift where developers now benefit from seeing source code in their browser instead of bytecode. We further highlight future features and research directions that verified source availability brings to the smart contract developer experience.

DOI: 10.1145/3639478.3640038


TestSpark: IntelliJ IDEA’s Ultimate Test Generation Companion

作者: Sapozhnikov, Arkadii and Olsthoorn, Mitchell and Panichella, Annibale and Kovalenko, Vladimir and Derakhshanfar, Pouria
关键词: unit test generation, intellij idea plugin, large language models

Abstract

Writing software tests is laborious and time-consuming. To address this, prior studies introduced various automated test-generation techniques. A well-explored research direction in this field is unit test generation, wherein artificial intelligence (AI) techniques create tests for a method/class under test. While many of these techniques have primarily found applications in a research context, existing tools (e.g., EvoSuite, Randoop, and AthenaTest) are not user-friendly and are tailored to a single technique. This paper introduces TestSpark, a plugin for IntelliJ IDEA that enables users to generate unit tests with only a few clicks directly within their Integrated Development Environment (IDE). Furthermore, TestSpark also allows users to easily modify and run each generated test and integrate them into the project workflow. TestSpark leverages the advances of search-based test generation tools, and it introduces a technique to generate unit tests using Large Language Models (LLMs) by creating a feedback cycle between the IDE and the LLM. Since TestSpark is an open-source (https://github.com/JetBrains-Research/TestSpark), extendable, and well-documented tool, new test generation methods can be added to the plugin with minimal effort. This paper also explains our future studies related to TestSpark and our preliminary results. Demo video: https://youtu.be/0F4PrxWfiXo

DOI: 10.1145/3639478.3640024


SpotFlow: Tracking Method Calls and States at Runtime

作者: Hora, Andre
关键词: dynamic analysis, runtime monitoring, software testing, code comprehension, debugging, python

Abstract

Understanding the runtime behavioral aspects of a software system is fundamental for several software engineering tasks, such as testing and code comprehension. For this purpose, typically, one needs to instrument the system and collect data from its execution. Despite the importance of runtime analysis, few tools have been created and made public to support developers in extracting information from software executions. In this paper, we propose SpotFlow, a tool to ease the runtime analysis of Python programs. With SpotFlow, practitioners and researchers can easily extract information about executed methods, run lines, argument values, return values, variable states, and thrown exceptions. Finally, we present tool prototypes built on top of SpotFlow to support software testing and code comprehension, and we detail how SpotFlow runtime data can support novel empirical studies and datasets. SpotFlow is publicly available at https://github.com/andrehora/spotflow. Video: https://youtu.be/jhOv3nKz_u4.
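
As a small taste of the runtime information such a tool collects, here is an assumed illustration using Python's built-in sys.settrace hook (not SpotFlow's API):

```python
import sys

executed_lines = []

def tracer(frame, event, arg):
    # Record every line executed inside divide(); returning the tracer keeps tracing.
    if event == "line" and frame.f_code.co_name == "divide":
        executed_lines.append(frame.f_lineno)
    return tracer

def divide(a, b):
    if b == 0:
        return None
    return a / b

sys.settrace(tracer)
divide(10, 2)
divide(1, 0)
sys.settrace(None)

print(sorted(set(executed_lines)))  # line numbers covered across both calls
```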

DOI: 10.1145/3639478.3640029


Boidae: Your Personal Mining Platform

作者: Sigurdson, Brian and Flint, Samuel W. and Dyer, Robert
关键词: boa, mining software repositories, scalable, open source

Abstract

Mining software repositories is a useful technique for researchers and practitioners to see what software developers actually do when developing software. Tools like Boa provide users with the ability to easily mine these open-source software repositories at a very large scale, with datasets containing hundreds of thousands of projects. The trade-off is that users must use the provided infrastructure, query language, runtime, and datasets, and this might not fit all analysis needs. In this work, we present Boidae: a family of Boa installations controlled and customized by users. Boidae uses automation tools such as Ansible and Docker to facilitate the deployment of a customized Boa installation. In particular, Boidae allows the creation of custom datasets generated from any set of Git repositories, with helper scripts to aid in finding and cloning repositories from GitHub and SourceForge. In this paper, we briefly describe the architecture of Boidae and how researchers can utilize the infrastructure to generate custom datasets. Boidae’s scripts and all the infrastructure it builds upon are open-sourced. A video demonstration of Boidae’s installation and extension is available at https://go.unl.edu/boidae.

DOI: 10.1145/3639478.3640026


Code Mapper: Mapping the Global Contributions of OSS

作者: Le Tourneau, Thomas and Latendresse, Jasmine and Abdellatif, Ahmad and Shihab, Emad
关键词: open source, machine learning, software development

Abstract

Free and Open Source Software (FOSS) has reshaped the software landscape. Software developers from around the world contribute to the development and maintenance of these projects. The geographic diversity within FOSS offers insights into community dynamics, collaboration patterns, and inclusivity. Despite the rich insights that can be gained from this geographic diversity, there remains a scarcity of research in this area. One possible reason for this gap is the lack of tools that can identify and visualize the geographic distribution of contributions in OSS projects. We present Code Mapper, a tool that identifies the location of contributors in GitHub projects. To enable users to explore the global influence of their projects, Code Mapper visually presents the geographic distribution of project contributors. To accelerate future research in this area, we have deployed Code Mapper at https://codemapper.alwaysdata.net and have made our source code publicly available online. A demonstration of Code Mapper can be viewed at https://www.youtube.com/watch?v=AtARvrBJbVM.

DOI: 10.1145/3639478.3640030


TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools

作者: Shivarpatna Venkatesh, Ashwin Prasad and Sabu, Samkutty and Wang, Jiawei and M. Mir, Amir and Li, Li and Bodden, Eric
关键词: No keywords

Abstract

In light of the growing interest in type inference research for Python, both researchers and practitioners require a standardized process to assess the performance of various type inference techniques. This paper introduces TypeEvalPy, a comprehensive micro-benchmarking framework for evaluating type inference tools. TypeEvalPy contains 154 code snippets with 845 type annotations across 18 categories that target various Python features. The framework manages the execution of containerized tools, transforms inferred types into a standardized format, and produces meaningful metrics for assessment. Through our analysis, we compare the performance of six type inference tools, highlighting their strengths and limitations. Our findings provide a foundation for further research and optimization in the domain of Python type inference.
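
To make the benchmark idea concrete, the fragment below sketches what a micro-benchmark item might look like: a tiny snippet plus the ground-truth types a tool should infer. The dictionary format is hypothetical, not TypeEvalPy's actual schema.

```python
def greet(name):
    return "Hello, " + name

message = greet("ICSE")

# Hypothetical ground-truth annotations for this snippet.
expected_types = {
    "greet.name": "str",      # parameter type
    "greet.<return>": "str",  # return type
    "message": "str",         # module-level variable
}
print(expected_types)
```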

DOI: 10.1145/3639478.3640033


Can My Microservice Tolerate an Unreliable Database? Resilience Testing with Fault Injection and Visualization

作者: Assad, Michael and Meiklejohn, Christopher S. and Miller, Heather and Krusche, Stephan
关键词: fault injection, byzantine faults, resilience testing, SFIT, databases

Abstract

In microservice applications, ensuring resilience during database or service disruptions constitutes a significant challenge. While several tools address resilience testing for service failures, there is a notable gap in tools specifically designed for resilience testing of database failures. To bridge this gap, we have developed an extension for fault injection in database clients, which we integrated into Filibuster, an existing tool for fault injection in services within microservice applications. Our tool systematically simulates database disruptions, thereby enabling comprehensive testing and evaluation of application resilience. It is versatile, supporting a range of both SQL and NoSQL database systems, such as Redis, Apache Cassandra, CockroachDB, PostgreSQL, and DynamoDB. A defining feature is its integration during the development phase, complemented by an IntelliJ IDE plugin, which offers developers visual feedback on the types, locations, and impacts of injected faults. A video demonstration of the tool’s capabilities is accessible at https://youtu.be/bvaUVCy1m1s.

DOI: 10.1145/3639478.3640021


CATMA: Conformance Analysis Tool For Microservice Applications

作者: Cao, Clinton and Schneider, Simon and Ferreyra, Nicolas E. Diaz and Verwer, Sicco and Panichella, Annibale and Scandariato, Riccardo
关键词: microservices, static analysis, dynamic analysis, software testing, empirical software engineering

Abstract

The microservice architecture allows developers to divide the core functionality of their software system into multiple smaller services. However, this architectural style also makes it harder for them to debug and assess whether the system’s deployment conforms to its implementation. We present CATMA, an automated tool that detects non-conformances between the system’s deployment and implementation. It automatically visualizes and generates potential interpretations for the detected discrepancies. Our evaluation of CATMA shows promising results in terms of performance and providing useful insights. CATMA is available at https://cyber-analytics.nl/catma.github.io/, and a demonstration video is available at https://youtu.be/WKP1hG-TDKc.

DOI: 10.1145/3639478.3640022


Refinery: Graph Solver as a Service: Refinement-based Generation and Analysis of Consistent Models

作者: Marussy, Kristóf
关键词: model generation, partial modeling, logic solver, cloud service

Abstract

Various software and systems engineering scenarios rely on the systematic construction of consistent graph models. However, automatically generating a diverse set of consistent graph models for complex domain specifications is challenging. First, the graph generation problem must be specified with mathematical precision. Moreover, graph generation is a computationally complex task, which necessitates specialized logic solvers. Refinery is a novel open-source software framework to automatically synthesize a diverse set of consistent domain-specific graph models. The framework offers an expressive high-level specification language using partial models to succinctly formulate a wide range of graph generation challenges. Moreover, it provides a modern cloud-based architecture for a scalable graph solver as a service, which uses logic reasoning rules to efficiently synthesize a diverse set of solutions to graph generation problems by partial model refinement. Applications include system-level architecture synthesis, test generation for modeling tools, and traffic scenario synthesis for autonomous vehicles.
Video demonstration: https://youtu.be/Qy_3udNsWsM
Continuously deployed at: https://refinery.services

DOI: 10.1145/3639478.3640045


(Neo4j)^ Browser: Visualizing Variable-Aware Analysis Results

作者: Toledo, Rafael F. and Atlee, Joanne M. and Xiong, Rui Ming and Liu, Mingyu
关键词: variability-aware visualizer, software product lines, graphical software models, Neo4j database

Abstract

A software product line (SPL) implements a family of related software products. As such, analyzing a software product line produces variable results that apply to some SPL variants and not to others. Typically, such results are annotated with presence conditions, which are logical expressions that represent the product variants to which the results apply. When analyzing large SPLs, these expressions that annotate results can become overwhelmingly large and difficult to reason about. In this paper, we present Neo4j Browser for visualizing and exploring the results of an SPL analysis. Neo4j Browser provides an interactive and customizable interface that allows the user to highlight results according to product variants of interest. Previous evaluations show that the Neo4j Browser improves the correctness and efficiency of the user’s work and reduces the user’s cognitive load in working with variable results. The tool can be downloaded at https://vault.cs.uwaterloo.ca/s/Rqy2f56PeC6s4XD, and a demo video presenting its features is at https://youtu.be/CoweflQQFWU.

DOI: 10.1145/3639478.3640046


SAFE: Safety Analysis and Retraining of DNNs

作者: Attaoui, Mohammed and Pastore, Fabrizio and Briand, Lionel
关键词: DNN explanation, functional safety analysis, DNN debugging

Abstract

We present SAFE, a tool based on a black-box approach to automatically characterize the root causes of Deep Neural Network (DNN) failures. SAFE relies on VGGNet-16, a transfer learning model pre-trained on ImageNet, to extract the features from error-inducing images. After feature extraction, SAFE applies a density-based clustering algorithm to discover arbitrarily shaped clusters of images modeling plausible causes of failures. By relying on the identified clusters, SAFE can select a set of additional images to be used to retrain and improve the DNN efficiently. Empirical results show the potential of SAFE in identifying different root causes of DNN failures based on case studies in the automotive domain. It also yields significant improvements in DNN accuracy after retraining while saving considerable execution time and memory compared to alternatives. A demo video of SAFE is available at https://youtu.be/8QD-PPFTZxs.

DOI: 10.1145/3639478.3640028


MutaBot: A Mutation Testing Approach for Chatbots

作者: Urrico, Michael Ferdinando and Clerissi, Diego and Mariani, Leonardo
关键词: chatbot testing, mutation testing, botium, dialogflow

Abstract

Mutation testing is a technique aimed at assessing the effectiveness of test suites by seeding artificial faults into programs. Although available for many platforms and languages, no mutation testing tool is currently available for conversational chatbots, which represent an increasingly popular solution for designing systems that can interact with users through a natural language interface. Note that since conversations must be explicitly engineered by the developers of conversational chatbots, these systems are exposed to specific types of faults not supported by existing mutation testing tools. In this paper, we present MutaBot, a mutation testing tool for conversational chatbots. MutaBot addresses mutations at multiple levels, including conversational flows, intents, and contexts. We designed the tool to potentially target multiple platforms, while we implemented initial support for Google Dialogflow chatbots. We assessed the tool with three Dialogflow chatbots and test cases generated with Botium, revealing weaknesses in the test suites.

DOI: 10.1145/3639478.3640032


AntiCopyPaster 2.0: Whitebox just-in-time code duplicates extraction

作者: AlOmar, Eman Abdullah and Knobloch, Benjamin and Kain, Thomas and Kalish, Christopher and Mkaouer, Mohamed Wiem and Ouni, Ali
关键词: refactoring, duplicated code, software quality

Abstract

AntiCopyPaster is an IntelliJ IDEA plugin, implemented to detect and refactor duplicate code interactively as soon as a duplicate is introduced. The plugin only recommends the extraction of a duplicate when it is worth it. In contrast to current Extract Method refactoring approaches, our tool seamlessly integrates with the developer’s workflow and actively provides recommendations for refactorings. This work extends our tool to allow developers to customize the detection rules, i.e., metrics, based on their needs and preferences. The plugin and its source code are publicly available on GitHub at https://github.com/refactorings/anti-copy-paster. The demonstration video can be found on YouTube: https://youtu.be/Y1sbfpds2Ms.

DOI: 10.1145/3639478.3640035


GitHubInclusifier: Finding and fixing non-inclusive language in GitHub Repositories

作者: Todd, Liam and Grundy, John and Treude, Christoph
关键词: inclusive language, refactoring, biased language, inappropriate language, software documentation, software maintenance tools

Abstract

Non-inclusive language in software artefacts has been recognised as a serious problem. We describe a tool to find and fix non-inclusive language in a variety of GitHub repository artefacts, including README files, PDFs, code comments, and code. A wide variety of non-inclusive language, including racist, ageist, ableist, and violent terms, is located, and issues are created that tag the artefacts for checking. Suggested fixes can be generated using third-party LLM APIs, and approved changes, including code refactorings, are made to documents and committed to the repository.
The tool and evaluation data are available from: https://github.com/LiamTodd/github-inclusifier
The demo video is available at: https://www.youtube.com/watch?v=1z1QKdQg-nM

DOI: 10.1145/3639478.3640025


OpenSBT: A Modular Framework for Search-based Testing of Automated Driving Systems

作者: Sorokin, Lev and Munaro, Tiziano and Safin, Damir and Liao, Brian Hsuan-Cheng and Molin, Adam
关键词: search-based software testing, metaheuristics, scenario-based testing, autonomous driving, automated driving

Abstract

Search-based software testing (SBST) is an effective and efficient approach for testing automated driving systems (ADS). However, building testing pipelines for ADS is particularly challenging, as it involves integrating complex driving simulation platforms and establishing communication protocols and APIs with the desired search algorithm. This complexity prevents a wide adoption of SBST and thorough empirical comparisons across different simulators and search approaches. We present OpenSBT, an open-source, modular, and extensible framework to facilitate the search-based testing of ADS. With OpenSBT, it is possible to integrate simulators with an embedded system under test, search algorithms, and fitness functions for testing. We describe the architecture and show the usage of our framework by applying different search algorithms for testing Automated Emergency Braking Systems in CARLA as well as in the industrial high-fidelity simulator Prescan, in collaboration with our industrial partner DENSO. OpenSBT is available at https://git.fortiss.org/opensbt. A demo video is provided here: https://www.youtube.com/watch?v=qi_CTTzrk5s.

DOI: 10.1145/3639478.3640027


APICIA: An API Change Impact Analyzer for Android Apps

作者: Mahmud, Tarek and Che, Meiru and Rouijel, Jihan and Khan, Mujahid and Yang, Guowei
关键词: Android, API evolution, change impact analysis, regression testing

Abstract

Android APIs are updated frequently, making it critical to analyze the impact of these updates on Android apps to ensure their reliability. In this paper, we introduce APICIA, a tool that can be used for analyzing the impact of API changes on Android apps. APICIA identifies program elements such as classes, methods, and statements that have been affected, along with affected tests and untested affected code following the target API update. Our evaluation on 31 real-world Android apps shows that APICIA can be cost-effective by pinpointing only 35.31% of tests per app on average that may exhibit different behaviors (e.g. app crashes) due to the API update. Furthermore, since many of the affected statements are not covered by existing tests, APICIA can also aid developers in expanding their test suite to cover these statements. APICIA is publicly available at https://github.com/TSUMahmud/apicia and a screencast presenting the demonstration of APICIA is available at https://tinyurl.com/apicia-tool.

DOI: 10.1145/3639478.3640041


RAT: A Refactoring-Aware Tool for Tracking Code History

作者: Niu, Feifei and Shao, Junqian and Xu, Chaofan and Mayr-Dorn, Christoph and Assuncao, Wesley K. G. and Huang, Liguo and Li, Chuanyi and Ge, Jidong and Luo, Bin and Egyed, Alexander
关键词: code history, refactoring, traceability

Abstract

The history of code elements is essential for software maintenance tasks. However, code refactoring is one of the main causes that make obtaining a consistent view of code evolution difficult, as renaming or moving source code elements breaks such history. To this end, this paper presents RAT, a refactoring-aware tool for keeping track of the evolution of code elements across time, not just in terms of revisions but also in terms of refactorings. This is the first tool that enables fine-grained code element traceability of the whole repository. An empirical evaluation of leveraging our tool in three bug localization techniques that rely on code history shows a significant improvement in localization accuracy. Based on our findings, we believe that many of the state-of-the-art approaches using past source code data would benefit from our tool.
Demo Tool: https://github.com/feifeiniu-se/RAT_Demo
Demo Video: https://youtu.be/VI_xwUaIPp4

DOI: 10.1145/3639478.3640047


Emulation Tool For Android Edge Devices

作者: Naghipour Vijouyeh, Lyla and Bruno, Rodrigo and Ferreira, Paulo
关键词: Android, edge networks, wi-fi direct, emulation, peer-to-peer, bluetooth, edge

Abstract

The number of mobile devices has surpassed the global population, making them the primary means of communication and data sharing. However, existing applications still heavily rely on centralized networks for communicating and sharing data due to the lack of tools for developers to create and test distributed applications in edge environments. To address this issue, we present EdgeEmu, an Android-based distributed emulation tool. EdgeEmu allows a considerable number of Android emulators to participate remotely in the emulation, making it an appropriate tool for testing large networks. Unlike the standard Android SDK, EdgeEmu is not restricted to local emulators. Thus, it eliminates scalability issues present in the current Android testing infrastructure. Evaluations show that EdgeEmu outperforms the standard Android SDK by approximately 59.1% in terms of emulation startup time when ten Android emulators are used. Additionally, it exhibits low latency and negligible overhead during message exchanges between different emulators. A demo video of EdgeEmu is available at https://youtu.be/6jT9KXiUmQM.

DOI: 10.1145/3639478.3640039


TPV: A Tool for Validating Temporal Properties in UML Class Diagrams

作者: Al Lail, Mustafa and Viesca, Antonio and Cardenas, Hector and Zarour, Mohammad and Perez, Alfredo
关键词: temporal properties, OCL, UML, tool, model checking, verification

Abstract

Software scientists and practitioners have criticized Model-driven engineering (MDE) for lacking effective tooling. Although progress has been made, most MDE analysis tools rely on complex, heavyweight mathematical techniques that are not based on UML. Such tools require a steep learning curve and suffer from many accidental complexities. We developed the Temporal Property Validator (TPV) to tackle this issue. TPV allows designers to specify and analyze temporal properties using UML notations, techniques, and tools. We evaluated TPV using the user experience evaluation method and obtained promising results in all aspects of user needs. You can download TPV and view the demo video from https://github.com/mustafalail/TPV-Tool.

DOI: 10.1145/3639478.3640044


CodeGRITS: A Research Toolkit for Developer Behavior and Eye Tracking in IDE

作者: Tang, Ningzhi and An, Junwen and Chen, Meng and Bansal, Aakash and Huang, Yu and McMillan, Collin and Li, Toby Jia-Jun
关键词: IDE extension/plugin, developer behavior analysis, eye tracking

Abstract

Traditional methodologies for exploring programmers’ behaviors have primarily focused on capturing their actions within the Integrated Development Environment (IDE), offering a limited view into their cognitive processes. Recent work has started using eye-tracking techniques in software engineering (SE) research. However, the lack of tools specifically designed for coordinated data collection poses technical barriers and requires significant effort from researchers who wish to combine these two complementary approaches. To address this gap, we present CodeGRITS, a plugin specifically designed for SE researchers. CodeGRITS is built on top of IntelliJ’s SDK, with wide compatibility with the entire family of JetBrains IDEs, to track developers’ IDE interactions and eye gaze data. CodeGRITS also offers various practical features for SE research (e.g., activity labeling) and a real-time API that provides interoperability for integration with other research instruments and developer tools. The demo video is available at https://youtu.be/d-YsJfW2NMI.

DOI: 10.1145/3639478.3640037


ValidGen: A Tool for Automatic Generation of Validation Scripts to Support Rapid Requirements Validation

作者: Pan, Hongyue and Yang, Yilong
关键词: requirements model, software prototype, code generation, requirements validation

Abstract

Rapid prototyping is an effective way to validate requirements in the earliest stages of software development. Our previous work, RM2PT, can automatically generate software prototypes from requirements models to support incremental and rapid requirements validation. This paper proposes a CASE tool named ValidGen, based on RM2PT, which can automatically generate validation scripts to execute the prototype. Thanks to these validation scripts, the stakeholder only needs to monitor the execution process without selecting system operations or typing input parameters, which significantly reduces the time and effort needed for validating requirements. We adopted three case studies to evaluate the tool, and the results show that the tool requires only about 60% of the time for requirements validation compared to traditional methods. Overall, the results were satisfactory. The proposed tool can be further extended and applied for requirements validation in the software industry. The tool can be downloaded at https://rm2pt.com/advs/validgen/, and a demo video is at https://youtu.be/AP9Ymg1ewIA.

DOI: 10.1145/3639478.3640048


FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems

作者: Feng, Wenhan and Pei, Qiugen and Gao, Yu and Wang, Dong and Dou, Wensheng and Wei, Jun and Liang, Zheheng and Long, Zhenyue
关键词: distributed system, fault recovery bug, fault injection

Abstract

Distributed systems are expected to correctly recover from various faults, e.g., node crash / reboot and network disconnection / reconnection. However, faults that occur under special timing can trigger fault recovery bugs that are rooted in incorrect fault recovery protocols and implementations. Existing random and brute-force fault injection approaches are not effective in revealing fault recovery bugs due to the combinatorial explosion of multiple faults in distributed systems. In this paper, we propose FaultFuzz, a coverage-guided fault injection approach that can systematically and effectively test fault recovery behaviors in distributed systems. Based on runtime feedback collected from distributed system testing, e.g., code coverage and I/O information, FaultFuzz generates possible combinations of faults and preferentially selects the combinations that are more likely to trigger new fault recovery behaviors and reveal new fault recovery bugs. We have applied FaultFuzz to three widely-used distributed systems, i.e., Zookeeper, HDFS, and HBase, and found 5 bugs in them. A video demonstration of FaultFuzz is available at https://youtu.be/SMw1ZF1vyXw.

DOI: 10.1145/3639478.3640036


Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist

作者: Khatiri, Sajad and Panichella, Sebastiano and Tonella, Paolo
关键词: unmanned aerial vehicles, test generation, simulation

Abstract

Simulation-based testing is crucial for ensuring the safety and reliability of unmanned aerial vehicles (UAVs), especially as they become more autonomous and are increasingly used in commercial scenarios. The complexity and automated nature of UAVs require sophisticated simulation environments for effectively testing their safety requirements. The primary challenges in setting up these environments pose significant barriers to the practical, widespread adoption of UAVs. We address this issue by introducing Aerialist (unmanned AERIAL vehIcle teST bench), a novel UAV test bench, built on top of the PX4 firmware, that facilitates or automates all the necessary steps of definition, generation, execution, and analysis of system-level UAV test cases in simulation environments. Moreover, it also supports parallel and scalable execution and analysis of test cases on Kubernetes clusters. This makes Aerialist a unique platform for research and development of test generation approaches for UAVs. To evaluate Aerialist’s support for UAV developers in defining, generating, and executing UAV test cases, we implemented a search-based approach for generating realistic simulation-based test cases using real-world UAV flight logs. We confirmed its effectiveness in improving the realism and representativeness of simulation-based UAV tests.
Code Repository: https://github.com/skhatiri/Aerialist
Demo Video: https://youtu.be/k_bqYpWItSg

DOI: 10.1145/3639478.3640031


Accurate Architectural Threat Elicitation From Source Code Through Hybrid Information Flow Analysis

作者: Gruner, Bernd
关键词: architectural threat analysis, information flow, security, fuzzing, software architecture reconstruction, static & dynamic analysis

Abstract

Software processes a vast amount of sensitive data. However, tracing information flows in complex programs and eliciting threats which, for example, could lead to information leaks, pose significant challenges. The problem lies in the absence of suitable approaches to effectively address this issue: symbolic verification is too restrictive for practical use, taint analysis faces challenges due to overapproximation, and fuzzers can only identify crashes and hangs. In my doctoral research, I introduce an approach for reconstructing and refining information flow graphs in order to elicit threats. Using static analysis, I automatically reconstruct an information flow graph. Subsequently, I refine the found information flows using information flow fuzzing and associate threats through a rule-based system. My approach provides a validated information flow graph of the software and a list of elicited threats.

DOI: 10.1145/3639478.3639795


Aiding Developer Understanding of Software Changes via Symbolic Execution-based Semantic Differencing

作者: Glock, Johann
关键词: program comprehension, semantic differencing, equivalence checking, symbolic execution

Abstract

According to a recent observational study, developers spend an average of 48% of their development time on debugging tasks. Approaches such as equivalence checking and fault localization support developers during debugging tasks by providing information that enables developers to more quickly identify and deal with unintended changes in program behavior. The accuracy and runtime performance of these approaches have seen continuous improvements throughout the years. However, the outputs of existing tools are often difficult to understand for developers due to a lack of context information and result explanations. Our goal is to improve upon this issue by developing a new equivalence checking approach that (i) is at least as accurate as existing approaches but (ii) provides more detailed descriptions of identified behavioral / semantic differences and (iii) presents these results in a way that is useful for developers, thus aiding developer understanding of equivalence checking results and corresponding software changes.

DOI: 10.1145/3639478.3639783


Architecture-Based Cross-Component Issue Management and Propagation Analysis

作者: Speth, Sandro
关键词: issue management, issue propagation analysis, component-based software architecture, model-based analysis

Abstract

This paper addresses the challenge of issue management in complex, component-based software architectures. In these systems, issues in one component often propagate across the architecture along the call chains. Yet, traditional issue management systems (IMSs) are limited to the boundaries of a single component and lack mechanisms for managing issues concerning their architectural dependencies. We present Gropius, a novel method that enhances issue management by integrating issues in an architecture graph. Gropius allows semantically linking issues across different components, synchronizes changes with underlying IMSs like GitHub, and allows modeling the architecture ontologically by defining the components’ semantics at runtime. We explore whether combining issue and architecture management improves the development of component-based architectures regarding issue management. We hypothesize that this method will improve the efficiency and effectiveness of identifying and resolving cross-component issues, maintaining a comprehensive view of the application’s state.

DOI: 10.1145/3639478.3639814


A Software Security Evaluation Framework

作者: Kudriavtseva, Arina
关键词: security metrics, assessing security, security evaluation framework

Abstract

This research aims to introduce a comprehensive framework to measure the security of software systems. Because existing security measurement approaches currently focus strongly on security metrics, we plan to enhance and extend them with insights into the mental models of security experts in software development. By intertwining security metrics with humans' perception of security, we strive to overcome the well-known hurdles of software security measurement, which has long been considered an unsolvable problem. Our proposed solution is captured by the software security evaluation framework.

DOI: 10.1145/3639478.3639796


Automated Model Quality Estimation and Change Impact Analysis on Model Histories

作者: Blaschke, Konstantin Rupert
关键词: model-based systems engineering, model quality, model metrics, quality assessment, model review, change-impact analysis

Abstract

Cyber-Physical Systems integrate hardware with software in complex applications. To mitigate the complexity, engineers rely on model-based systems engineering approaches. Updates and function enhancements lead to frequently changing design constraints and objectives. These changes increase the need to rework and extend model artifacts of the system. This can cause quality degradation over time due to modeling errors, knowledge disparities, or a lack of guidelines. To enable efficient collaboration and reduce maintenance costs in model-based systems engineering, the industry needs a cost-efficient, scalable approach to monitor model quality. The work outlines a doctoral thesis investigating the potential of automated data-driven quality assessment strategies using model artifact history and model changes. We will extract metrics and model changes to establish quality feedback for system engineers. We aim to use manual model quality assessments to incorporate domain-specific expert knowledge into the automated strategy. The main goals are to lower the effort of model quality assessments, to provide practitioners with foresight on quality development, and to estimate task effort to improve model quality.

DOI: 10.1145/3639478.3639809


Autonomic Testing: Testing with Scenarios from Production

作者: Qiu, Ketai
关键词: autonomic testing, failure detection, test generation

Abstract

My PhD addresses the problem of detecting field failures with a new approach to test software systems under conditions that emerge only in production. Ex-vivo approaches detect field failures by executing the software system in the testbed with data extracted from the production environment. In-vivo approaches execute the available test suites in the production environment. We will define autonomic testing that detects conditions that emerge only in production scenarios, generates test cases for the new conditions, and executes the generated test cases in the new scenarios, to detect failures before they occur in production.

DOI: 10.1145/3639478.3639802


Beyond Accuracy and Robustness Metrics for Large Language Models for Code

作者: Rodriguez-Cardenas, Daniel
关键词: deep learning, code generation, interpretability, transformers

Abstract

In recent years, Large Language Models for code (LLMc) have transformed the landscape of software engineering (SE), demonstrating significant efficacy in tasks such as code completion, summarization, review, tracing, translation, test case generation, clone detection, and bug fixing. Notably, GitHub Copilot [31] and Google’s CodeBot [21] exemplify how LLMc contributes to substantial time and effort savings in software development. However, despite their widespread use, there is a growing need to thoroughly assess LLMc, as current evaluation processes heavily rely on accuracy and robustness metrics, lacking consensus on additional influential factors in code generation. This gap hinders a holistic understanding of LLMc performance, impacting interpretability, efficiency, bias, fairness, and robustness. The challenges in benchmarking and data maintenance compound this issue, underscoring the necessity for a comprehensive evaluation approach. To address these issues, this dissertation proposes the development of a benchmarking infrastructure, named HolBench, aimed at overcoming gaps in evaluating LLMc quality. The goal is to standardize testing scenarios, facilitate meaningful comparisons across LLMc, and provide multi-metric measurements beyond a sole focus on accuracy. This approach aims to decrease the costs associated with advancing LLMc research, enhancing their reliability for adoption in academia and industry.

DOI: 10.1145/3639478.3639792


Beyond Accuracy: Evaluating Source Code Capabilities in Large Language Models for Software Engineering

作者: Velasco, Alejandro
关键词: large language models, interpretability, DL4SE, category theory, causal inference

Abstract

This dissertation aims to introduce interpretability techniques to comprehensively evaluate the performance of Large Language Models (LLMs) in software engineering tasks, beyond canonical metrics. In software engineering, Deep Learning techniques are widely employed across various domains, automating tasks such as code comprehension, bug fixing, code summarization, machine translation, and code generation. However, the prevalent use of accuracy-based metrics for evaluating Language Models trained on code often leads to an overestimation of their performance. Our work seeks to propose novel and comprehensive interpretability techniques to evaluate source code capabilities and provide a more nuanced understanding of LLMs' performance across downstream tasks.

DOI: 10.1145/3639478.3639815


Building a Framework to Improve the User Experience of Static Analysis Tools

作者: Schlichtig, Michael
关键词: cryptography, benchmark, API misuse, static analysis, explainability, user experience

Abstract

Static analysis tools are an important technique that helps in the development of secure code by analyzing code and reporting potential errors to developers. Beyond the technical challenges of developing sophisticated static analyses, however, research has shown that static analysis tools often do not address usability sufficiently. Such usability issues can inhibit acceptance by developers in practice or even lead to a tool's dismissal. To address this, we aim to help improve the user experience for developers using static analysis tools. We investigated several fundamentals needed to develop and properly evaluate usability interventions, namely the Foundations on the origin of API misuse (F1), the State of the art of static analysis tool usability (F2), and the Evaluation of static analysis tool accuracy (F3). Combining these fundamentals, we propose a theoretical framework to develop usability interventions for static analysis tools and then evaluate them. In this paper we discuss our research contribution to each fundamental and how we believe the resulting framework can be employed to improve the user experience of static analysis tools.

DOI: 10.1145/3639478.3639813


Discovering Explainability Requirements in ML-Based Software

作者: Sporsem, Tor
关键词: requirements engineering, explainability requirements, ML-based software, qualitative sensemaking, user feedback

Abstract

As the demand for Machine Learning (ML)-based software continues to grow across various industries such as healthcare, automotive, energy, and banking, there is an increasing need for explainability requirements. Domain experts such as doctors must have confidence in ML-based software to integrate them into their professional practices. This requires developers to simultaneously develop clear explanations of how these Machine Learning models work as they build the systems. While numerous philosophies and techniques for eliciting user requirements in software systems have been extensively studied within Requirements Engineering (RE), scholars argue that we need new approaches tailored to elicit explainability requirements. This PhD research aims to conduct empirical studies examining emerging methodologies and philosophies for identifying explainability requirements. The objective is to connect theoretical insights and practical approaches adopted by practitioners in this rapidly evolving field.

DOI: 10.1145/3639478.3639807


Enhancing Model-Driven Reverse Engineering Using Machine Learning

作者: Siala, Hanan Abdulwahab
关键词: application programs, model driven reverse engineering (MDRE), unified modeling language (UML), object constraint language (OCL), machine learning, large language models (LLMS), program comprehension

Abstract

Organizations often rely on large applications that are classified as legacy systems due to their dependence on outdated programming languages or platforms. To modernize these systems, it is necessary to understand their architecture, functionality, and business rules. Our research aims to define a novel model-driven reverse engineering (MDRE) approach to extract Unified Modeling Language (UML) and Object Constraint Language (OCL) representations from source code using Large Language Models (LLMs).

DOI: 10.1145/3639478.3639797


Ensuring Critical Properties of Test Oracles for Effective Bug Detection

作者: Hossain, Soneya Binta
关键词: No keywords

Abstract

With software becoming essential in all aspects of our lives, especially in critical areas like medical and avionic systems, the need for robust and reliable software is more critical than ever. Even seemingly insignificant software bugs can compromise system stability and security, as evidenced by a simple copy-paste error that caused Apple devices to accept invalid SSL certificates and a date formatting issue that caused a widespread Twitter outage. These realities underscore the need for effective testing and bug detection mechanisms to ensure software reliability. At the heart of this challenge are test oracles, a fundamental component of testing, which play a crucial role in detecting software bugs. Recognizing the pivotal role of test oracles, my research conducts large-scale studies to understand their impact on bug detection effectiveness and to identify limitations in existing test adequacy metrics and automated oracle generation methods. Based on the findings, my research identifies three key properties of test oracles essential for effective bug detection, referred to as CCS (check, correct, strong). These properties ensure that test oracles thoroughly check the code, are correct with respect to the specification, and are strong for bug detection. To enforce the CCS properties, my research introduces a set of methods, leading to the development of the OracleGuru framework, which significantly enhances the quality of test oracles.
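
For illustration only (the example is not from the paper): the intuition behind oracle strength can be seen by contrasting a weak oracle, which executes code but barely checks it, with a stronger one that asserts the specified behavior.

```python
def normalize(path: str) -> str:
    """Toy function under test: collapse duplicate slashes."""
    while "//" in path:
        path = path.replace("//", "/")
    return path

def test_weak_oracle():
    # Executes the code but only checks that *something* came back:
    # a buggy normalize() that returns "" or the unchanged input still passes.
    assert normalize("a//b///c") is not None

def test_stronger_oracle():
    # Checks the specified output, so it can actually reveal a faulty implementation.
    assert normalize("a//b///c") == "a/b/c"
    assert normalize("/") == "/"          # boundary case stays untouched
```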

DOI: 10.1145/3639478.3639791


Generating User Experience Based on Personas with AI Assistants

作者: Huang, Yutan
关键词: No keywords

Abstract

Traditional UX development methodologies focus on developing "one size fits all" solutions and lack the flexibility to cater to diverse user needs. In response, a growing interest has arisen in developing more dynamic UX frameworks. However, existing approaches often cannot personalise user experiences or adapt to user feedback in real time. Therefore, my research introduces a novel approach that combines Large Language Models and personas to address these limitations. The research is structured around three areas: (1) a critical review of existing adaptive UX practices and the potential for their automation; (2) an investigation into the role and effectiveness of personas in enhancing UX adaptability; and (3) the proposal of a theoretical framework that leverages LLM capabilities to create more dynamic and responsive UX designs and guidelines.

DOI: 10.1145/3639478.3639810


Increasing trust in the open source supply chain with reproducible builds and functional package management

作者: Malka, Julien
关键词: No keywords

Abstract

Functional package managers (FPMs) and reproducible builds (R-B) are technologies and methodologies that are conceptually very different from the traditional software deployment model and that have promising properties for software supply chain security. This thesis aims to evaluate the impact of FPMs and R-B on the security of the software supply chain and to propose improvements to the FPM model to further improve trust in the open source supply chain.

DOI: 10.1145/3639478.3639806


Investigating Cultural Dispersion: on the Role of Cultural Differences in Software Development Teams

作者: Lambiase, Stefano
关键词: global software engineering, cultural dispersion, socio-technical aspects

Abstract

Software development, inherently a social activity, involves individuals across diverse geographical and cultural settings. Despite this nature, the existing Global Software Engineering research body encounters limitations, making the achieved results challenging to use by practitioners. This Ph.D. research project seeks to overcome these constraints by crafting a theoretical framework. The framework systematically captures cultural differences, exploring their impact on various aspects of software development and delving into practitioners’ strategies for managing these influences. Additionally, the project aims to significantly contribute to the professional software development landscape by transferring research findings to practitioners through practical tools. This framework serves as an immediate application for professionals, fostering project success through heightened cultural awareness and adaptability, thereby enhancing developer well-being in inclusive and culturally diverse environments.

DOI: 10.1145/3639478.3639799


Learning Models of Cyber-Physical Systems with Discrete and Continuous Behaviour for Digital Twin Synthesis

作者: Wallner, Felix
关键词: digital twin, hybrid system, automata learning, machine learning, cyber-physical system

Abstract

Digital twins are used to simulate (cyber-physical) systems and offer great benefits for testing and verification. The importance of quickly and efficiently constructing digital twins increases as devices grow in complexity. Furthermore, the more (varied) behaviour a digital twin captures of the simulated device, the more use cases it can serve. In this thesis we investigate methods from automata learning and machine learning to automatically synthesise digital twins from cyber-physical systems, capturing both discrete and continuous behaviour. Our aim is to combine methods from both fields and utilize their respective strengths to build better digital twins of cyber-physical systems in practice. We have already developed an algorithm that learns discrete behavioural models even in the presence of noisy data.

DOI: 10.1145/3639478.3639793


Managing API Evolution in Microservice Architecture

作者: Lercher, Alexander
关键词: microservice architecture, API evolution, web API, REST API

Abstract

Nowadays, many software systems are split into loosely coupled microservices only communicating via Application Programming Interfaces (APIs) to improve maintainability, scalability, and fault tolerance. However, the loose coupling between microservices provides no immediate feedback on breaking API changes, and consuming services break or exhibit unexpected behavior only after the first actual call to the changed API. Hence, development teams must actively identify and communicate all breaking changes to affected teams to stay compatible. This research addresses this problem with three contributions. First, we identified API evolution strategies and open challenges in practice with an explorative study. Based on the study findings, we formulated two open research directions for evolving publicly accessible APIs, i.e., REpresentational State Transfer (REST) APIs. As the second contribution, we will introduce a REST API change extraction approach to improve the change notification accuracy. We plan experiments on open-source projects to evaluate our approach’s accuracy and compare it to openapi-diff for structural changes. Third, we plan to investigate methods for automating communication with affected teams, which will then improve the change notification reliability. Finally, we will evaluate the accuracy and reliability of our notifications with a user study.

DOI: 10.1145/3639478.3639800


MEITREX - Gamified and Adaptive Intelligent Tutoring in Software Engineering Education

作者: Meiß
关键词: software engineering education, student motivation, intelligent tutoring system, learning analytics, gamification, feedback

Abstract

Nowadays, learning management systems (LMSs) are established tools in higher education, especially in the domain of software engineering (SE). However, the potential of such educational technologies has not been fully exploited, as student performance in SE education is still strongly dependent on feedback from time-constrained lecturers and tutors. Moreover, current LMSs are not designed for SE courses, as external SE tools are required to fulfill the requirements of lecturers such as programming and UML modeling features. Evolving these LMSs in the direction of intelligent tutoring could assist students in receiving automatic, individual feedback from the LMSs on their learning performance at any time. Also, gamified learning elements can serve to motivate students to engage with SE materials. Therefore, this paper presents an approach combining learning analytics, feedback, and interactive learning such as gamification in one LMS designed for SE education. The system could thus address diverse students with different backgrounds and motivational aspects and provide appropriate individual support to ensure effective SE education.

DOI: 10.1145/3639478.3639804


On Improving Management of Duplicate Video-Based Bug Reports

作者: Yan, Yanfu
关键词: bug reporting, GUI learning, duplicate video retrieval

Abstract

Video-based bug reports have become a promising alternative to text-based reports for programs centered around a graphical user interface (GUI), as they allow for seamless documentation of software faults by visually capturing buggy behavior on app screens. However, developing automated techniques to manage video-based reports is challenging as it requires identifying and understanding often nuanced visual patterns that capture key information about a reported bug. Therefore, my research endeavors to overcome these challenges by advancing the bug report management task of duplicate detection for video-based reports. The objectives of my research are fourfold: (i) investigate the benefits of tailoring recent advancements in the computer vision domain for learning both visual and textual patterns from video frames depicting GUI screens to detect duplicate reports; (ii) adapt the scene-learning capabilities of vision transformers to capture subtle visual and textual patterns that manifest on app UI screens; (iii) construct a more comprehensive and realistic benchmark which contains video-based bug reports derived from real bugs; (iv) conduct an empirical evaluation to potentially demonstrate state-of-the-art improvements achieved by the proposed approach.

DOI: 10.1145/3639478.3639786


Programming Language Models in Multilingual Settings

作者: Katzy, Jonathan
关键词: large language models, explainable AI, software engineering, code completion, multilingual, programming languages

Abstract

Large language models have become increasingly utilized in programming contexts. However, due to the recent emergence of this trend, some aspects have been overlooked. We propose a research approach that investigates the inner mechanics of transformer networks at the neuron, layer, and output representation level, to understand whether there is a theoretical limitation that prevents large language models from performing optimally in a multilingual setting. We propose to approach the investigation into these theoretical limitations by addressing open problems in machine learning for the software engineering community. This will contribute to a greater understanding of large language models for programming-related tasks, making the findings more approachable to practitioners and simplifying their implementation in future models.

DOI: 10.1145/3639478.3639787


Resolving Goal-Conflicts and Scaling Synthesis through Mode-Based Decomposition

作者: Brizzio, Matí
关键词: search-based software engineering, formal methods, requirements engineering, evolutionary computation, reactive synthesis

Abstract

Reactive synthesis, with its roots in the work of A. Church, presents a transformative approach for the formal methods community. It seeks to translate system behaviors expressed in Linear-Time Temporal Logic (LTL) into correct-by-construction models using synthesis tools. However, this approach faces substantial challenges. Among these challenges is the high computational complexity of LTL synthesis, which constrains its application to large-scale systems. Additionally, unrealizable specifications present a significant obstacle as they act as barriers, impeding the synthesis process. Furthermore, the presence of goal-conflicts within requirements introduces contradictions and ambiguity, further complicating the synthesis process. These issues collectively make synthesis demanding, often resulting in suboptimal or unviable systems. Our research is dedicated to establishing a robust framework that systematically addresses these challenges, effectively bridging the gap between high-level requirements and dependable system realization. We prioritize refining requirement precision and advancing scalable synthesis techniques, offering advanced tools and methodologies to practitioners and researchers.

DOI: 10.1145/3639478.3639801


Selecting and Constraining Metamorphic Relations

作者: Duque-Torres, Alejandra
关键词: test oracle, metamorphic testing, metamorphic relations, test data, pattern mining

Abstract

Software testing is a critical aspect of ensuring the reliability and quality of software systems. However, it often poses challenges, particularly in determining the expected output of a System Under Test (SUT) for a given set of inputs, a problem commonly referred to as the test oracle problem. Metamorphic Testing (MT) offers a promising solution to the test oracle problem by examining the relations between input-output pairs in consecutive executions of the SUT. These relations, referred to as Metamorphic Relations (MRs), define the expected changes in the output when specific changes are made to the input. Our research is focused on developing methods and tools to assist testers in the selection of MRs, the definition of constraints, and the explanation of MR outcomes. The research is divided into three parts. The first part focuses on MR collection and description, entailing the creation of a comprehensive repository of MRs from various sources. A standardised MR representation is devised to promote machine-readability and wide-ranging applicability. The second part introduces MetraTrimmer, a test-data-driven approach for systematically selecting and constraining MRs. This approach acknowledges that MRs may not be universally applicable across the entire test data space. The final part, evaluation and validation, encompasses empirical studies aimed at assessing the effectiveness of the developed methods and validating their suitability for real-world regression testing scenarios. Through this research, we aim to advance the automation of MR generation, enhance the understanding of MR violations, and facilitate their effective application in regression testing.
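
As a concrete, purely illustrative example (not from the paper) of what an MR looks like and why constraining it to suitable test data matters:

```python
import math

def test_mr_sine_symmetry():
    # MR: sin(x) == sin(pi - x). No expected output is needed; only the
    # relation between two executions is checked, sidestepping the oracle problem.
    for x in [0.0, 0.3, 1.2, 2.5]:
        assert math.isclose(math.sin(x), math.sin(math.pi - x), abs_tol=1e-12)

def test_mr_sort_permutation():
    # MR for a sorting routine: permuting the input must not change the output.
    # A data constraint on this MR: inputs must be mutually comparable,
    # so the relation is only applied to homogeneous lists.
    data = [3, 1, 2, 2]
    assert sorted(data) == sorted(list(reversed(data)))
```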

DOI: 10.1145/3639478.3639781


Simulation-based Testing of Automated Driving Systems

作者: Khan, Fauzia
关键词: automated driving systems, safety testing, simulation-based testing

Abstract

Automated Driving Systems (ADS) require extensive safety testing before receiving a road permit. To gain public trust, ADSs must be as safe as a Human Driven Vehicle (HDV) or even safer. Simulation-based safety testing is a cost-effective way to check the safety of ADS. My goal is to compare the safety behavior of ADS with HDV via simulation and to develop a process of selecting testing scenarios that could be useful to build trust and reliability in simulations. Additionally, I aim to translate the performance advantages and disadvantages observed in simulated ADS behavior into real-world safety-critical traffic situations.

DOI: 10.1145/3639478.3639788


Smart Quality Monitoring for Evolving Complex Systems

作者: EL Moussa, Noura
关键词: No keywords

Abstract

Evolving complex systems, such as complex software systems, dynamic cloud systems, and smart ecosystems, arise from the interactions of systems, agents, and people, and they evolve and adjust dynamically over time. Evolving complex systems may fail due to interactions among components, agents, and people that emerge during the evolution of the system. It is essential to adequately monitor evolving software systems to reveal anomalous conditions that emerge during evolution and may lead to catastrophic failures. Current monitoring approaches do not deal with either the dynamic characteristics of evolving complex systems or people as active elements of the system. In this PhD thesis, we define smart monitoring, an approach to monitor evolving complex systems and predict failures. We propose an incrementally trained neural model to capture the evolving characteristics of complex systems and detect anomalies that can later lead to failures. We exploit the state-of-the-art OCEAN model to monitor the impact of the personality traits of the people involved and detect behaviors that may lead to system failures.

DOI: 10.1145/3639478.3639784


Studying and Improving Software License Compliance in Practice

作者: Wintersgill, Nathan
关键词: software licensing, legal practitioners, open source software

Abstract

As the process of software development has matured, the reuse of open-source software has become commonplace. Open-source software licenses can both provide permissions and impose restrictions regarding software’s distribution, modification, and reuse. Modern systems can have many such licensed components, complicating the task of license compliance and compounding the risk associated with reusing open source components. To address these issues, this dissertation seeks to identify weaknesses in current processes and automated tools, such as in handling license conflicts, exceptions, and interpretations, in order to develop new compliance tools and resources grounded in the realities of software compliance as revealed by software developers and legal practitioners.

DOI: 10.1145/3639478.3639785


Sustainable Adaptive Security

作者: Ramkumar, Kushal
关键词: No keywords

Abstract

With software systems permeating our lives, we are entitled to expect that such systems are secure by design and that such security endures throughout the use of these systems and their subsequent evolution. During my PhD, I aim to engineer sustainable adaptive security solutions that reflect such enduring protection in the dynamically changing security theatre of cyber-physical systems. I have chosen the example of a smart home as a cyber-physical system to motivate and illustrate sustainable adaptive security, to discuss challenges for sustainably secure systems, and to present my research plan for engineering them. This research was funded by Science Foundation Ireland grant 13/RC/2094_P2.

DOI: 10.1145/3639478.3639790


Sustainable Software Engineering: Visions and Perspectives beyond Energy Efficiency

作者: K"{o
关键词: sustainable development, sustainability, sustainable software engineering

Abstract

In the face of multiple global crises such as climate change, a transformation towards sustainable development is more urgent than ever. Digitalization, as a fundamental change in society and the economy, offers great opportunities for sustainable development, but also poses its own threats, as evident in the immense resource consumption and growing surveillance tendencies. To leverage digitalization for sustainability transformation without compromising it, software engineering requires a significant shift in practices and structures. However, research in this area is still immature, lacking a deeper understanding of sustainability, its application in practice and solid engineering approaches. To bridge these gaps, this thesis aims to operationalize sustainability by proposing sustainability goals for software engineering, followed by the development of novel assessment methods and appropriate tool support.

DOI: 10.1145/3639478.3639782


Sustaining Scientific Open-Source Software Ecosystems: Challenges, Practices, and Opportunities

作者: Sun, Jiayi
关键词: No keywords

Abstract

Scientific open-source software (scientific OSS) has facilitated scientific research due to its transparent and collaborative nature. The sustainability of such software is becoming crucial given its pivotal role in scientific endeavors. While past research has proposed strategies for sustaining scientific software or general OSS communities in isolation, it remains unclear whether these approaches are directly applicable to developing scientific OSS, where the two scenarios merge. In this research, we propose to investigate the unique challenges in sustaining scientific OSS ecosystems. We first conduct a case study to empirically understand interdisciplinary teams' collaboration in scientific OSS ecosystems and identify the collaboration challenges. Further, to generalize our findings, we plan to conduct a large-scale quantitative study in broader scientific OSS ecosystems to identify cross-project collaboration inefficiencies. Finally, we would like to design and develop interventions to mitigate the problems identified.

DOI: 10.1145/3639478.3639805


Toward Rapid Bug Resolution for Android Apps

作者: Mahmud, Junayed
关键词: bug reporting, bug localization, GUI, mobile apps

Abstract

Bug reports document unexpected behaviors in software, enabling developers to understand, validate, and fix bugs. Unfortunately, a significant portion of bug reports is of low quality, which poses challenges for developers in addressing these issues. Prior research has delved into the information needed for documenting high-quality bug reports and expediting bug report management. Furthermore, researchers have explored the challenges associated with bug report management and proposed various automated techniques. Nevertheless, these techniques exhibit several limitations, including a lexical gap between developers and reporters, difficulties in bug reproduction, and identifying bug locations. Therefore, there is a pressing need for additional efforts to effectively manage bug reports and enhance the quality of both desktop and mobile applications. In this paper, we describe the existing limitations of bug reports and identify potential strategies for addressing them. Our vision encompasses a future where the alleviation of these limitations and successful execution of our proposed new research directions can benefit both reporters and developers, ultimately making the entire software maintenance process faster.

DOI: 10.1145/3639478.3639812


Towards AI-centric Requirements Engineering for Industrial Systems

作者: Bashir, Sarmad
关键词: requirements engineering, industrial automation, language models

Abstract

Engineering large-scale industrial systems mandates an effective Requirements Engineering (RE) process. Such systems necessitate RE process optimization to align with standards, infrastructure specifications, and customer expectations. Recently, artificial intelligence (AI) based solutions have been proposed, aiming to enhance the efficiency of requirements management within the RE process. Despite their advanced capabilities, generic AI solutions exhibit limited adaptability within real-world contexts, mainly because of the complexity and specificity inherent to industrial domains. This limitation notably leads to the continued prevalence of manual practices that not only cause the RE process to be heavily dependent on practitioners' experience, making it prone to errors, but also often contribute to project delays and inefficient resource utilization. To address these challenges, this Ph.D. dissertation focuses on two primary directions: i) conducting a comprehensive focus group study with a large-scale industry partner to determine the requirements evolution process and its inherent challenges, and ii) proposing AI solutions tailored to industrial case studies to automate and streamline their RE process and optimize the development of large-scale systems. We anticipate that our research will significantly contribute to the RE domain by providing empirically validated insights in the industrial context.

DOI: 10.1145/3639478.3639811


Towards Automatic Inference of Behavioral Component Models for ROS-Based Robotics Systems

作者: D"{u
关键词: No keywords

Abstract

Model-based analysis is a common technique to identify incorrect behavioral composition of complex, safety-critical systems, such as robotics systems. However, creating structural and behavioral models for hundreds of software components manually is often a labor-intensive and error-prone process. I propose an approach to infer behavioral models for components of systems based on the Robot Operating System (ROS), the most popular framework for robotics systems, using a combination of static and dynamic analysis by exploiting assumptions about the usage of the ROS framework. This work is a contribution towards making well-proven and powerful but infrequently used methods of model-based analysis more accessible and economical in practice to make robotics systems more reliable and safer.

DOI: 10.1145/3639478.3639808


Towards Combining STPA and Safety-Critical Runtime Monitoring

作者: Zimmermann, Eva
关键词: software engineering, STPA, safety-critical, runtime monitoring

Abstract

The dependence on software in safety-critical systems is growing, and the reliability of these systems becomes ever more critical. Therefore, we need to adapt software engineering concepts like DevOps so that we can react to changes faster. This process also needs to be enriched by aligning an agile safety method with the different phases of the DevOps cycle, which is where my PhD contributes. I plan to enhance the safety analysis method STPA with a concept to identify indicators for runtime monitoring and to provide valuable feedback to safety engineers about violations and trends of identified indicators, as well as suggestions for new ones.

DOI: 10.1145/3639478.3639794


Towards Interpreting the Behavior of Large Language Models on Software Engineering Tasks

作者: Dipongkor, Atish Kumar
关键词: No keywords

Abstract

Large Language Models (LLMs) have ushered in a significant breakthrough within the field of Natural Language Processing. Building upon this achievement, analogous language models have been developed specifically for code-related tasks, commonly referred to as Large Language Models for Code (LLMsC). Notable examples of LLMsC include CodeBERT, UnixCoder, and CoPilot, among others. These models have demonstrated exceptional performance across various Software Engineering (SE) tasks, encompassing code summarization, test case generation, natural language to code conversion, bug triaging, malware detection, program repair, and more. Despite the promising results achieved by LLMsC in SE tasks, fundamental questions remain regarding their decision-making processes. Understanding these models' decision mechanisms is crucial for further enhancing the performance of LLMsC. In pursuit of this objective, my PhD dissertation aims to pioneer novel methodologies for interpreting and comprehending the behavior of LLMsC.

DOI: 10.1145/3639478.3639798


Towards Safe, Secure, and Usable LLMs4Code

作者: Al-Kaswan, Ali
关键词: large language models, privacy, memorisation, data leakage, compression

Abstract

Large Language Models (LLMs) are gaining popularity in the field of Natural Language Processing (NLP) due to their remarkable accuracy in various NLP tasks. LLMs designed for coding are trained on massive datasets, which enables them to learn the structure and syntax of programming languages. These datasets are scraped from the web, and LLMs memorise information contained in them. LLMs for code are also growing in size, making them more challenging to execute and making users increasingly reliant on external infrastructure. We aim to explore the challenges faced by LLMs for code and propose techniques to measure and prevent memorisation. Additionally, we suggest methods to compress models and run them locally on consumer hardware.
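
A hedged sketch of one common way memorisation is probed (prefix-prompting and comparing the continuation to the known suffix). The checkpoint name is a placeholder and this is not the authors' measurement protocol.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from difflib import SequenceMatcher

MODEL = "gpt2"  # placeholder; a code LLM checkpoint would be used in practice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def memorisation_score(prefix: str, true_suffix: str, max_new_tokens: int = 32) -> float:
    """Prompt with a snippet believed to be in the training data and measure how
    closely the greedy continuation matches the real continuation."""
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return SequenceMatcher(None, generated, true_suffix).ratio()   # 1.0 = verbatim reproduction

print(memorisation_score("def quicksort(arr):\n    if len(arr) <= 1:", "\n        return arr"))
```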

DOI: 10.1145/3639478.3639803


Understandable Test Generation Through Capture/Replay and LLMs

作者: Deljouyi, Amirhossein
关键词: automatic test generation, carving and replaying, large language models, readability, understandability, unit testing

Abstract

Automatic unit test generators, particularly search-based software testing (SBST) tools such as EvoSuite, efficiently generate unit test suites with acceptable coverage. Although this removes the burden of writing unit tests from developers, the generated tests often pose comprehension challenges for developers. In my doctoral research, I aim to investigate strategies to address the issue of comprehensibility in generated test cases and improve the test suite in terms of effectiveness. To achieve this, I introduce four projects leveraging Capture/Replay and Large Language Model (LLM) techniques. Capture/Replay carves information from End-to-End (E2E) tests, enabling the generation of unit tests containing meaningful test scenarios and actual test data. Moreover, the growing capabilities of large language models (LLMs) in language analysis and transformation play a significant role in improving readability in general. Our proposed approach involves leveraging E2E test scenario extraction alongside an LLM-guided approach to enhance test case understandability, augment coverage, and establish comprehensive mocks and test oracles. In this research, we endeavor to conduct both a quantitative analysis and a user evaluation of the quality of the generated tests in terms of executability, coverage, and understandability.

DOI: 10.1145/3639478.3639789


Obfuscation-Resilient Software Plagiarism Detection with JPlag

作者: Sağ
关键词: plagiarism detection, obfuscation attacks, CS education

Abstract

The rise of automated obfuscation techniques challenges the widespread assumption that evading a software plagiarism detector requires more effort than completing programming and modeling assignments in computer science education. This threatens plagiarism detectors without comprehensive obfuscation resilience and, ultimately, academic integrity. This paper summarizes recent enhancements of JPlag, a widely-used software plagiarism detector, enabling it to achieve broad resilience against automated obfuscation. The findings demonstrate that JPlag significantly outperforms the state-of-the-art in terms of obfuscation resilience.

DOI: 10.1145/3639478.3643074


Poster: Kotlin Assimilating the Android Ecosystem - An Appraisal of Diffusion and Impact on Maintainability

作者: Coppola, Riccardo and Fulcini, Tommaso and Torchiano, Marco
关键词: software maintainability, android development, kotlin

Abstract

Kotlin is a language alternative to Java, introduced in 2011. It promises to address many of Java's limitations and lead to better application maintainability. In 2017, it became a first-class language for Android development with full tool support. We mined a dataset of 2708 Android applications on which we based our study. Our empirical assessment of the diffusion of Kotlin in Android app development shows that it is now used in around 40% of projects. Kotlin adoption has a significant positive effect on code maintainability metrics and on popularity among end-users and developers. Overall, Kotlin appears to be successfully fulfilling its promise of being a better Java for Android development.

DOI: 10.1145/3639478.3643071


A First Look at the General Data Protection Regulation (GDPR) in Open-Source Software

作者: Franke, Lucas and Liang, Huayu and Brantly, Aaron and Davis, James C. and Brown, Chris
关键词: No keywords

Abstract

This poster describes work on the General Data Protection Regulation (GDPR) in open-source software. Although open-source software is commonly integrated into regulated software, and thus must be engineered or adapted for compliance, we do not know how such laws impact open-source software development. We surveyed open-source developers (N=47) to understand their experiences and perceptions of GDPR. We identified many engineering challenges, primarily regarding the management of users' data and assessments of compliance. We call for improved policy-related resources, especially tools to support data privacy regulation implementation and compliance in open-source software.

DOI: 10.1145/3639478.3643077


KareCoder: A New Knowledge-Enriched Code Generation System

作者: Huang, Tao and Sun, Zhihong and Jin, Zhi and Li, Ge and Lyu, Chen
关键词: No keywords

Abstract

Large Language Models (LLMs) demonstrate proficiency in handling fundamental programming problems but struggle with complex programming problems of new types. This study presents KareCoder, which integrates programming knowledge into code generation. Initial tests reveal KareCoder's significant success on the Pass@1 metric for complex competitive programming problems.

DOI: 10.1145/3639478.3643076


ParSE: Efficient Detection of Smart Contract Vulnerabilities via Parallel and Simplified Symbolic Execution

作者: He, Long and Zhao, Xiangfu and Wang, Yichen
关键词: smart contract, blockchain, symbolic execution, vulnerability detection

Abstract

Symbolic execution is a frequently used method for vulnerability detection in smart contracts. However, existing tools face limitations in constraint solving and may suffer from the "path explosion" problem. This costs too much time and may lead to false negatives (FNs) in the detection results. In this work, we propose ParSE, a novel approach that leverages Parallel and Simplified symbolic Execution to improve both detection efficiency and the number of true positives (TPs). We inject ParSE into two widely used symbolic execution tools, Oyente and Mythril, for detecting vulnerabilities in smart contracts. Experimental results show that ParSE achieves speedups of 9.33x and 5.30x for Oyente and Mythril, respectively. Moreover, tools based on ParSE increase the number of TPs detected.
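
A minimal sketch of the parallelisation idea (splitting path exploration across worker processes). The explore_path helper and the path representation are illustrative assumptions, not ParSE's implementation.

```python
from multiprocessing import Pool

def explore_path(branch_prefix):
    """Hypothetical worker: symbolically execute one branch prefix of the
    contract, solve its (simplified) path constraints, and report findings."""
    # ... run the symbolic executor restricted to `branch_prefix` here ...
    return {"prefix": branch_prefix, "vulns": []}   # placeholder result

def parallel_symbolic_execution(branch_prefixes, workers=8):
    # Each top-level branch of the contract's control flow is explored in its own
    # process, so constraint solving for independent paths overlaps in time.
    with Pool(processes=workers) as pool:
        results = pool.map(explore_path, branch_prefixes)
    return [v for r in results for v in r["vulns"]]

if __name__ == "__main__":
    print(parallel_symbolic_execution([(0,), (1,), (2,)], workers=3))
```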

DOI: 10.1145/3639478.3643066


Endogeneity, Instruments, and Two-Stage Models

作者: Graf-Vlachy, Lorenz and Wagner, Stefan
关键词: regression, endogeneity, confounder, two-stage least squares, 2SLS

Abstract

Background: Studies in software engineering are often particularly useful if they make causal claims because this allows practitioners to identify how they can influence outcomes of interest. Unfortunately, many non-experimental studies suffer from potential endogeneity through omitted confounding variables, which precludes claims of causality. Aims and Method: We introduce instrumental variables and two-stage models as a means to account for endogeneity to the field of empirical software engineering. Results and Conclusions: We define endogeneity, explain its primary cause, and lay out the idea behind instrumental variable approaches and two-stage models.
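
A minimal sketch of the two-stage least squares setup the abstract refers to, in generic notation (not tied to any particular study):

```latex
% Stage 1: regress the endogenous regressor x on the instrument z
% (relevance: z correlates with x; exclusion: z affects y only through x).
x_i = \pi_0 + \pi_1 z_i + v_i
% Stage 2: regress the outcome y on the fitted values \hat{x}_i from stage 1;
% \hat{\beta}_1 is then a consistent estimate of the causal effect of x on y.
y_i = \beta_0 + \beta_1 \hat{x}_i + \varepsilon_i
```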

DOI: 10.1145/3639478.3643064


Prompt-Enhanced Software Vulnerability Detection Using ChatGPT

作者: Zhang, Chenyuan and Liu, Hao and Zeng, Jiutian and Yang, Kejing and Li, Yuhong and Li, Hui
关键词: software vulnerability detection, prompt engineering, large language model, chatgpt

Abstract

With the increase in software vulnerabilities that cause significant economic and social losses, automatic vulnerability detection has become essential in software development and maintenance. Recently, large language models (LLMs) have received considerable attention due to their stunning intelligence, and some studies consider using ChatGPT for vulnerability detection. However, they do not fully consider the characteristics of LLMs, since their questions to ChatGPT are simple and lack a prompt design tailored for vulnerability detection. This paper launches a study on the performance of software vulnerability detection using ChatGPT with different prompt designs. First, we complement previous work by applying various improvements to the basic prompt. Moreover, we incorporate structural and sequential auxiliary information to improve the prompt design. Furthermore, we leverage ChatGPT's ability to memorize multi-round dialogue to design suitable prompts for vulnerability detection. We conduct extensive experiments on two vulnerability datasets to demonstrate the effectiveness of prompt-enhanced vulnerability detection using ChatGPT.
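
A hedged sketch of how such a prompt might be assembled; the wording and the role of the auxiliary information are illustrative assumptions, not the paper's exact prompts.

```python
def build_vuln_detection_prompt(code, api_sequence=None, data_flow_edges=None):
    """Compose a vulnerability-detection prompt from the code plus optional
    structural/sequential auxiliary information, in the spirit of the abstract."""
    parts = [
        "You are a security auditor. Decide whether the following function "
        "contains a vulnerability and name the CWE category if it does.",
        f"Code:\n{code}",
    ]
    if api_sequence:                      # sequential auxiliary information
        parts.append("API call sequence: " + " -> ".join(api_sequence))
    if data_flow_edges:                   # structural auxiliary information
        parts.append("Data-flow edges: " + "; ".join(f"{a} -> {b}" for a, b in data_flow_edges))
    parts.append("Answer with 'VULNERABLE: <CWE>' or 'SAFE', then a one-sentence justification.")
    return "\n\n".join(parts)

# Example usage; the model call itself is omitted (any chat-completion API would take
# this string as the user message, possibly over multiple rounds of dialogue).
print(build_vuln_detection_prompt(
    "char buf[8]; strcpy(buf, user_input);",
    api_sequence=["strcpy"],
    data_flow_edges=[("user_input", "buf")],
))
```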

DOI: 10.1145/3639478.3643065


作者: Corallo, Sophie and Weber, Thomas and Kö
关键词: assumption management, security assumptions

Abstract

Assumptions play a significant role in software engineering. Especially for security, implicit, inconsistent, or invalid assumptions about the system can have a high impact. Even though there are several approaches for managing assumptions in security engineering, most of them are highly specific to their domain and phase in software development. However, for holistic assumption management, a general understanding of security-related assumptions is needed. Based on a Grounded Theory approach, including nine interviews with security researchers and a literature review of 53 scientific publications on assumptions, we propose a first definition of security-related assumptions.

DOI: 10.1145/3639478.3643070


An Empirical Study on Cross-language Clone Bugs

作者: Chen, Honghao and Tang, Ye and Zhong, Hao
关键词: No keywords

Abstract

Many applications have implementations in different languages. Although their languages are different, they can implement similar or even identical functionalities. If an implementation has a bug, the other implementations can have corresponding bugs. In this paper, we call them cross-language clone bugs, or mirror bugs for short. Mirror bugs are important since many applications release implementations in different languages. From mirror bugs, it can be feasible to learn more bug patterns, and thus detect more types of bugs. Although researchers have conducted empirical studies to analyze the bugs in clones, to the best of our knowledge, no study has ever explored mirror bugs. As a result, many research questions are still open. For example, are there any mirror bugs in real projects? Are bug fixes in one language useful for detecting and repairing bugs in other languages? To answer the above questions, in this paper, we conduct the first empirical study on mirror bugs. In this study, we manually analyze 402 bugs collected from four projects, each of which releases a Java implementation and a C# implementation. Our study presents answers to two interesting research questions. According to our results, there is a timely need for a tool that assists in detecting mirror bugs. Indeed, we find that some programmers already manually identify and fix mirror bugs, even without any tool support.

DOI: 10.1145/3639478.3643075


Applying Transformer Models for Automatic Build Errors Classification of Java-Based Open Source Projects

作者: Lee, Jonathan and Li, Mason and Hsu, Kuo-Hsun
关键词: build error, build fixing, gradle, open source, deep learning model

Abstract

In open-source development, encountering build failures is a common challenge. Addressing these issues requires analyzing the causes of errors and developing solutions for fixing them. In this work, we fine-tuned Google's BERT, a well-known language model that excels at transfer learning, to address build issues in Gradle Java projects. Our strategy utilizes this model to classify error logs and identify fixing solutions. This approach extends our previous work, Gradle ACFix, an automated build error fixing system, to explore the potential of using machine learning to classify error types and identify appropriate fixing strategies for software projects. We gathered a dataset of 11,483 open-source Gradle Java projects from GitHub for this research. Evaluation of the model on the error logs of these projects demonstrated an accuracy exceeding 98%.
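
A minimal sketch of BERT-style classification of a build log via Hugging Face Transformers; the label set and checkpoint name are placeholders, not the tool's configuration, and a real setup would add a training loop over the labeled logs.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["missing_dependency", "compilation_error", "test_failure"]  # hypothetical classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

log_excerpt = "Could not resolve all files for configuration ':app:debugRuntimeClasspath'."
inputs = tokenizer(log_excerpt, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():                       # inference only; fine-tuning adds a loss/optimizer loop
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])   # predicted error category (arbitrary until trained)
```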

DOI: 10.1145/3639478.3643068


Micro-scale Concolic Testing Framework for Automated Test Data Generation Based on Path Coverage

作者: Liu, Fangqing and Huang, Han and Xiang, Yi
关键词: micro-scale, concolic testing, test data generation, path coverage

Abstract

Automated test data generation based on path coverage (ATDG-PC) is an essential task in software testing. However, existing concolic testing approaches typically employ single static analyses or search-based algorithms for different paths without utilizing the heuristics between these components. To address ATDG-PC, we propose a micro-scale concolic testing framework (MCTF) that identifies effective subspaces for search-based algorithms based on results of static analyses. Experimental results verify the effectiveness of MCTF in solving ATDG-PC.

DOI: 10.1145/3639478.3643067


Safety Monitoring of Deep Reinforcement Learning Agents

作者: Zolfagharian, Amirhossein and Abdellatif, Manel and Briand, Lionel and S, Ramesh
关键词: No keywords

Abstract

Problem. Deep Reinforcement Learning (DRL) algorithms are increasingly being used in safety-critical systems. Ensuring the safety of DRL agents is a critical concern in such contexts. However, relying solely on testing is not sufficient to ensure safety as it does not offer guarantees. Building safety monitors is one solution to alleviate this challenge. Existing safety monitoring techniques for regular software systems often rely on formal verification to ensure compliance with safety constraints [4]. However, when it comes to DRL policies, formally verifying their behavior to satisfy safety properties becomes an NP-complete problem [6]. Further, monitoring DRL agents in a black-box manner is practically important, as testers and safety engineers often do not have full access to either the internals or the training dataset of the DRL agent [2, 8].

DOI: 10.1145/3639478.3643072


Interpretable Software Maintenance and Support Effort Prediction Using Machine Learning

作者: Haldar, Susmita and Capretz, Luiz Fernando
关键词: maintenance and support effort prediction, explainable machine learning models, model agnostic interpretation

Abstract

Software maintenance and support efforts consume a significant amount of the software project budget to operate the software system at its expected quality. Manually estimating the total hours required for this phase can be very time-consuming and often differs from the actual cost that is incurred. The automation of these estimation processes can be implemented with the aid of machine learning algorithms. Maintenance and support effort prediction models need to be explainable so that project managers can understand which features contributed to the model outcome. This study contributes to the development of maintenance and support effort prediction models using various tree-based regression machine-learning techniques on cross-company project information. The developed models were explained using the state-of-the-art model-agnostic technique SHapley Additive exPlanations (SHAP) to understand the significance of features in the developed models. This study concluded that staff size, application size, and number of defects are major contributors to the maintenance and support effort prediction models.
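
A minimal sketch of the tree-regressor-plus-SHAP pattern described above, with synthetic data and made-up feature names standing in for the study's features.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

# Toy cross-company-style data; columns mirror the kinds of features named in the abstract.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "staff_size": rng.integers(2, 50, 200),
    "application_size_fp": rng.integers(100, 5000, 200),
    "num_defects": rng.integers(0, 300, 200),
})
y = 10 * X["staff_size"] + 0.5 * X["application_size_fp"] + 3 * X["num_defects"] + rng.normal(0, 50, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer yields per-feature contributions for each predicted effort value.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(dict(zip(X.columns, np.abs(shap_values).mean(axis=0))))  # mean |SHAP| as global importance
```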

DOI: 10.1145/3639478.3643069


An Actionable Framework for Understanding and Improving Talent Retention as a Competitive Advantage in IT Organizations

作者: Costa, Luiz Alexandre and Dias, Edson and Ribeiro, Danilo and Fontã
关键词: No keywords

Abstract

In the rapidly evolving global business landscape, the demand for software has intensified competition among organizations, leading to challenges in retaining highly qualified IT members in software organizations. One of the problems faced by IT organizations is the retention of these strategic professionals, also known as talent. This work presents an actionable framework for Talent Retention (TR) used in IT organizations. It is based on our findings from interviews performed with 21 IT managers. The TR Framework is our main research outcome. Our framework encompasses a set of factors, contextual characteristics, barriers, strategies, and coping mechanisms. Our findings indicated that software engineers can be differentiated from other professional groups, and beyond competitive salaries, other elements for retaining talent in IT organizations should be considered, such as psychological safety, work-life balance, a positive work environment, innovative and challenging projects, and flexible work. A better understanding of factors could guide IT managers in improving talent management processes by addressing Software Engineering challenges, identifying important elements, and exploring strategies at the individual, team, and organizational levels.

DOI: 10.1145/3639478.3643073


Towards Leveraging Fine-Grained Dependencies to Check Requirements Traceability Correctness

作者: Preda, Anamaria-Roberta and Mayr-Dorn, Christoph and Mashkoor, Atif and Assunç
关键词: No keywords

Abstract

Efficient software maintenance and evolution rely heavily on effective software traceability, which is crucial for understanding the relationships between code elements and their corresponding requirements. However, ensuring the accuracy of trace links, whether created manually or automatically, is a significant challenge due to the labor-intensive and error-prone nature of traceability tasks. The granularity issue in traceability compounds this challenge, as most existing research focuses on class-level traceability, while fine-grained dependencies (e.g., method-level traces) are more pertinent in daily development practices. Our primary aim is to facilitate the checking of requirement-to-method traces. To this end, we investigate an approach that utilizes the method's calling information and textual embeddings of requirement-to-method traces to identify inaccuracies in trace links. Our preliminary results are promising. By leveraging a Random Forest (RF) classifier, we have achieved notable improvements in both precision (≈10%) and recall (≈30%) compared to existing methods. This advancement highlights the potential of our method for enhancing the accuracy and efficiency of traceability processes in software development.
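
A hedged sketch of the classification setup; the feature construction is an assumption based on the abstract's mention of calling information plus textual embeddings, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def trace_features(req_embedding, method_embedding, call_graph_feats):
    """One candidate requirement-to-method link becomes one feature vector:
    cosine similarity of the text embeddings plus call-graph-derived features
    (e.g., fan-in, fan-out, number of already-traced callers)."""
    cos = float(np.dot(req_embedding, method_embedding) /
                (np.linalg.norm(req_embedding) * np.linalg.norm(method_embedding) + 1e-9))
    return np.array([cos, *call_graph_feats])

# Toy data: 100 candidate links with 3 call-graph features each; label 1 = correct trace.
rng = np.random.default_rng(1)
X = np.vstack([
    trace_features(rng.normal(size=16), rng.normal(size=16), rng.integers(0, 10, 3))
    for _ in range(100)
])
y = rng.integers(0, 2, 100)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())  # real evaluation would report precision/recall
```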

DOI: 10.1145/3639478.3643091


Programmable and Semantic Connector for DNN Component Integration: a Software Engineering Perspective

作者: Xu, Jingwei and Zeng, Zihan
关键词: No keywords

Abstract

As deep learning technology continues to evolve, deep neural network (DNN) models have found their way into numerous modern software applications and systems, serving as crucial components. Despite the widespread adoption of DNN models in software, their development process still largely adheres to a craft production model [1]. This craft production approach leads to the creation of unique, highly specialized DNN models that may excel within their target software but prove difficult to standardize or adapt for compatibility with other software systems. In addition, due to the holistic training of these crafted DNN models, they cannot be easily disassembled or reassembled to accommodate new software requirements. Consequently, the reuse of DNN models remains a significant challenge in software engineering, hindering the potential for greater efficiency and adaptability in the development process.

At present, the primary approach to reusing DNN models involves retraining them, either by fine-tuning or training from scratch, in the target domain. This retraining process necessitates the use of a new dataset and incurs substantial costs associated with training the model. Moreover, acquiring a new dataset entails additional data collection and labeling efforts, even when the target domain differs only marginally from the original domain. In some cases, it becomes essential to devise distinct DNN structures tailored to the data characteristics or specific software requirements. These factors highlight the craft production nature of DNN model development and its lack of scalability and adaptability.

In conventional software engineering, software architecture [2] serves as a blueprint for complex software systems and development projects, as proposed and developed by software engineering researchers. This architectural perspective envisions software as a collection of computational components, connectors, and constraints [3], which dictate the interactions between these components. When incorporating DNN models into software architecture, the primary objective of DNN model integration is to establish components, connectors, and constraints for DNN model design. The architecture of DNN models is naturally constructed through multiple layers, functioning as a DNN component. However, directly stitching together different DNN components presents several challenges: 1) the generated output of each DNN component is difficult to comprehend; 2) directly establishing a connection between DNN components usually requires expensive DNN model retraining or fine-tuning; and 3) the constraints currently present in DNN models are primarily structural, not explicitly semantic. Given that developed DNN components are not easily modified, our primary focus is on establishing connections between DNN components to alter the software’s functionality without the need for retraining the DNN components. By doing this, the deep neural networks, functioning as components, can operate cohesively within the software system. As a result, any changes to software requirements would impact only the connections between components, eliminating the need for developers to retrain the models.

In this paper, we propose a novel method, NeuralNector, to solve the problem of DNN component integration. In NeuralNector, we design a programmable semantic connector. The connector can 1) program a clear semantic output for the DNN component’s raw output and 2) program a logical rule component meeting the semantic constraints to connect DNN components with the programmed outputs. As shown in Figure 1, by developing an easy-to-establish programmable semantic connector, effective and adaptable DNN model integration can be achieved, allowing for seamless integration of DNN models into software systems without the need for DNN model retraining. The proposed NeuralNector significantly enhances the efficiency of software development involving deep learning models as integral components. We design comprehensive experiments to evaluate our DNN component integration approach. The evaluation primarily focuses on the classification of transportation and animals from the PASCAL VOC dataset. The concepts from PASCAL are listed in Table 1.

Programmable semantic connector. The training accuracy of each semantic concept extractor is listed in Table 1. Several selected concepts and their observations are depicted in Figure 2. The accuracy of the logical rule component reaches 97.6%, demonstrating that our concept setting is reasonable and that a logical relationship exists between the concepts and the original label.

DNN component integration. We use three classic DNN structures to construct components MA and MB, and build the corresponding programmable semantic connector for each component pair. The results for each DNN structure are shown in Table 2. The results show that our approach can be used for DNN component integration without an excessive focus on the DNN architecture, indicating the compatibility of our approach.

Data requirements and transferability of concepts. To make the programmable semantic connector practical, we evaluate the performance of concept extractors trained on a smaller dataset (10% of the training data). The results are listed in Table 3. Even with only one-tenth of the original data available, the average accuracy is not significantly affected (93.9% down to 92.0%). We also evaluate the transferability of the concept extractors. The process of this experiment is illustrated in Figure 3. The accuracy of the DNN component MC is 94%. After integrating it with MB, which has a training accuracy of 97.7%, the accuracy for the 12 categories reaches 85.8%. This result indicates that the representations from the DNN component effectively extract common and dataset-independent semantic information from the samples, and the proposed method capitalizes on this advantage to exhibit transferability.

In summary, we presented a novel approach for integrating DNN components via a programmable semantic connector. The extensive evaluation demonstrated the effectiveness and compatibility of our approach across various datasets, DNN architectures, and practical scenarios. The semantic concept extractors can be programmed with limited data and possess strong transferability to other DNN components. In this way, our approach opens new possibilities for efficient model integration and adaptation from a software engineering perspective, pushing the development of DNN components toward a mass production paradigm.
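
To illustrate the connector idea in isolation, the following sketch maps a component's raw scores to named semantic concepts and applies logical rules over them; the concept names, thresholds, and rules are invented for illustration and are not NeuralNector's implementation.

```python
# Sketch of a "programmable semantic connector" between two components:
# component A emits raw scores, the connector turns them into named concepts,
# and logical rules over the concepts decide what component B receives.
# Concept names, thresholds, and rules are illustrative only.
from typing import Dict

def concepts_from_raw_output(raw: Dict[str, float], threshold: float = 0.5) -> Dict[str, bool]:
    """Program a clear semantic output: score -> boolean concept observations."""
    return {concept: score >= threshold for concept, score in raw.items()}

def logical_rule_component(concepts: Dict[str, bool]) -> str:
    """Program logical rules mapping concept observations to a label for the next component."""
    if concepts.get("has_wheels") and concepts.get("has_windows"):
        return "vehicle"
    if concepts.get("has_fur") or concepts.get("has_feathers"):
        return "animal"
    return "unknown"

# Raw scores as they might come out of component A for one input sample.
raw_output_of_component_a = {"has_wheels": 0.91, "has_windows": 0.77, "has_fur": 0.08}
label_for_component_b = logical_rule_component(concepts_from_raw_output(raw_output_of_component_a))
print(label_for_component_b)  # -> "vehicle"
```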

DOI: 10.1145/3639478.3643090


A Study of Backporting Code in Open-Source Software for Characterizing Changesets

作者: Chakroborti, Debasish and Roy, Chanchal and Schneider, Kevin
关键词: porting, backport, pull-request, commit, github

Abstract

The software development process, shaped by stakeholder feedback, encompasses the creation of diverse versions tailored for customization and addressing hardware limitations. Maintaining these versions involves transferring changes between them for reuse. In the context of a pull-based development model, where the development branch remains current, the term “backporting” refers to transferring changes back to sustain stable versions. Stability requirements may necessitate fewer changes, compatible modifications, or security checks. Consequently, we conducted an analysis of 37,460 backports from 223,602 pull requests in open-source GitHub projects, aiming to identify types of incompatibilities encountered in real-life scenarios. We manually pinpointed various reasons why pull requests may lack compatibility with other versions, including contextual differences, varying dependencies, and statement-level alterations. This study constitutes the first comprehensive characterization of changesets during the porting process across different versions with incompatibilities. The acquired insights can serve as a foundation for automated slicing and adaptation of changesets in stable software versions.

DOI: 10.1145/3639478.3643079


Engineering Industry-Ready Anomaly Detection Algorithms

作者: Nguyen, Ngoc-Thanh and Heldal, Rogardt and Pelliccione, Patrizio
关键词: No keywords

Abstract

The practical value of anomaly detection algorithms that are engineered and tested on open data is often low, as their real-world applications are rare. The underlying reason is the lack of consideration for practical needs (i.e., the research context). Additionally, the validity of such algorithms is a concern when no proper research method is followed. This paper reports how we considered the research context and followed the Design Science paradigm to engineer our algorithm. In this way, we can address a real-world application of automatic marine data quality control.

DOI: 10.1145/3639478.3643085


Unleashing the Giants: Enabling Advanced Testing for Infrastructure as Code

作者: Sokolowski, Daniel and Spielmann, David and Salvaneschi, Guido
关键词: property-based testing, fuzzing, infrastructure as code, DevOps

Abstract

Infrastructure as Code (IaC) programs are written in imperative programming languages like Python or TypeScript while declaratively defining the target state of software deployments, which IaC solutions such as Pulumi and AWS CDK then set up. Through a repository mining study and analysis, we noticed that testing IaC programs poses a dilemma: current techniques are either slow and expensive or require prohibitively high development effort. To solve this issue, we introduce Automated Configuration Testing (ACT), enabling efficient testing with low development effort. ACT automates the tedious aspects of unit testing IaC programs and is extensible through a plugin system for test generators and oracles. ACT is already effective with simple type-based plugins, and leveraging existing giants, i.e., advanced test generation and oracle techniques, in new plugins will further boost its effectiveness.
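
As a rough illustration of the plugin idea (generators plus oracles around an IaC-style program), here is a hypothetical Python sketch; none of these names correspond to ACT's real API.

```python
# Hypothetical sketch of a plugin-style generator/oracle setup for testing an
# IaC program's configuration function; none of these names are ACT's real API.
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Plugin:
    generate: Callable[[], Dict]          # produces a candidate configuration input
    check: Callable[[Dict, Dict], bool]   # oracle over (input, declared resource)

def declare_bucket(config: Dict) -> Dict:
    """Stand-in for an IaC program: maps inputs to a declared resource."""
    return {"type": "storage-bucket", "name": config["name"], "public": config.get("public", False)}

def type_based_generator() -> Dict:
    return {"name": f"bucket-{random.randint(0, 999)}", "public": random.choice([True, False])}

def no_public_buckets(config: Dict, resource: Dict) -> bool:
    return not resource["public"]

def run_tests(program: Callable[[Dict], Dict], plugin: Plugin, runs: int = 100) -> List[Dict]:
    failures = []
    for _ in range(runs):
        config = plugin.generate()
        if not plugin.check(config, program(config)):
            failures.append(config)
    return failures

print(run_tests(declare_bucket, Plugin(type_based_generator, no_public_buckets))[:3])
```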

DOI: 10.1145/3639478.3643078


Poirot: Deep Learning for API Misuse Detection

作者: Li, Yi and Nguyen, Tien N. and Wang, Shaohua and Yadavally, Aashish
关键词: AI4SE, API misuse detection, deep learning

Abstract

API misuses refer to incorrect usages that violate the usage constraints of API elements, potentially leading to issues such as runtime errors, exceptions, program crashes, and security vulnerabilities. Existing mining-based approaches for API misuse detection face challenges in accuracy, particularly in distinguishing infrequent from invalid usage. This limitation stems from the necessity to set predefined thresholds for frequent API usage patterns, resulting in potential misclassification of alternative usages. This paper introduces Poirot, a learning-based approach that mitigates the need for predefined thresholds. Leveraging Labeled, Graph-based Convolutional Networks, Poirot learns embeddings for API usages, capturing key features and enhancing API misuse detection. Preliminary evaluation on an API misuse benchmark demonstrates that Poirot achieves a relative improvement of 1.37–10.36X in F-score compared to state-of-the-art API misuse detection techniques.

DOI: 10.1145/3639478.3643080


Behavior Trees with Dataflow: Coordinating Reactive Tasks in Lingua Franca

作者: Schulz-Rosengarten, Alexander and Ahmad, Akash and Clement, Malte and von Hanxleden, Reinhard and Asch, Benjamin and Lohstroh, Marten and Lee, Edward A. and Quiros, Gustavo and Shukla, Ankit
关键词: behavior trees, reactive systems, coordination languages

Abstract

Behavior Trees (BTs) provide a lean set of control flow elements that are easily composable in a modular tree structure. They are well established for modeling the high-level behavior of non-player characters in computer games and have recently gained popularity in other areas such as industrial automation. While BTs nicely express control, data handling aspects so far must be provided separately, e.g., in the form of blackboards. This may hamper reusability and can be a source of nondeterminism. We here propose a dataflow extension to BTs that explicitly models data relations and communication. We realize and validate this approach in the recently introduced polyglot coordination language Lingua Franca (LF).

DOI: 10.1145/3639478.3643093


Graph Neural Networks based Log Anomaly Detection and Explanation

作者: Li, Zhong and Shi, Jiayang and Van Leeuwen, Matthijs
关键词: No keywords

Abstract

Event logs are widely used to record the status of high-tech systems, making log anomaly detection important for monitoring those systems. We propose a graph-based method for unsupervised log anomaly detection, dubbed Logs2Graphs, which first converts event logs into attributed, directed, and weighted graphs, and then leverages graph neural networks to perform graph-level anomaly detection. Specifically, we introduce OCDiGCN, a novel graph neural network model for detecting graph-level anomalies in a collection of attributed, directed, and weighted graphs. By coupling the graph representation and anomaly detection steps, OCDiGCN can learn a representation that is especially suited for anomaly detection, resulting in a high detection accuracy. For each detected anomaly, we provide a subset of nodes that are crucial in OCDiGCN’s predictions, offering useful insights for root cause diagnosis. Experiments on five benchmark datasets show that Logs2Graphs matches or exceeds current top log anomaly detection methods on simple datasets and largely outperforms them on complex ones.
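
A small sketch of the first step, turning one group of log events into an attributed, directed, weighted graph; the event sequence is invented and the GNN-based detection step is omitted.

```python
# Sketch: convert one group of log events into a directed, weighted graph where
# nodes are event templates and edge weights count observed transitions.
# The event sequence below is illustrative.
import networkx as nx

events = ["Open", "Read", "Read", "Write", "Close", "Open", "Read", "Close"]

G = nx.DiGraph()
for src, dst in zip(events, events[1:]):
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += 1
    else:
        G.add_edge(src, dst, weight=1)

# Node attributes (here: how often each event occurs) can serve as features later.
for node in G.nodes:
    G.nodes[node]["count"] = events.count(node)

print(sorted(G.edges(data="weight")))
```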

DOI: 10.1145/3639478.3643084


Going Viral: Case Studies on the Impact of Protestware

作者: Fan, Youmei and Wang, Dong and Wattanakriengkrai, Supatsara and Damrongsiri, Hathaichanok and Treude, Christoph and Hata, Hideaki and Kula, Raula Gaikovina
关键词: protestware, software ecosystems, case studies

Abstract

Maintainers are now self-sabotaging their work in order to take political or economic stances, a practice referred to as “protestware”. In this poster, we present our approach to understand how the discourse about such an attack went viral, how it was received by the community, and whether developers responded to the attack in a timely manner. We study two notable protestware cases, i.e., Colors.js and es5-ext, comparing them with discussions of a typical security vulnerability as a baseline, i.e., Ua-parser, and perform a thematic analysis of more than two thousand protest-related posts to extract the different narratives when discussing protestware.

DOI: 10.1145/3639478.3643086


Understanding the Strategies Used by Employees to Cope with Technostress in the Software Industry

作者: Siitonen, Valtteri and Ritonummi, Saima and Salo, Markus and Pirkkalainen, Henri and Mauno, Saija
关键词: technostress, software engineering, coping, coping strategy

Abstract

Working in the software industry exposes individual employees to the harmful effects of technostress because the work is heavily tied to information technology (IT) use. Because of the ever-increasing IT-related demands and high levels of stress experienced by software industry employees, it is important to understand how employees respond to and cope with these demands. We set out to explore the coping strategies employed by individual employees in the industry by utilizing the coping taxonomy proposed by Skinner et al. [1]. We collected and analyzed the coping responses of 715 employees via a qualitative questionnaire. In total, we identified 29 individual coping strategies categorized into coping families per Skinner et al. [1]. Our findings help in moving towards a more comprehensive understanding of coping with technostress and in supporting the well-being of those working in the industry.

DOI: 10.1145/3639478.3643092


A Transformer-based Model for Assisting Dockerfile Revising

作者: Wu, Yiwen and Zhang, Yang and Wang, Tao and Wang, Huaimin
关键词: docker, dockerfile, deep learning, transformer, pre-training

Abstract

Dockerfile plays an important role in the containerized software development process since it specifies the structure and functionality of the built Docker image. Currently, Dockerfile writing and modification still rely on manual operations which can be time-consuming. Thus, there is a need for automation tools to support the Dockerfile revising process. In this study, we focus on utilizing pre-training techniques for the tasks in the Dockerfile revising scenario. We propose a Transformer-based model and pre-train it with an instruction-aware objective. Furthermore, we fine-tune our model in two downstream tasks, including revision opportunity estimation and revision activity prediction. The experimental results show that our model outperforms the baseline models.

DOI: 10.1145/3639478.3643083


Domain Knowledge is All You Need: A Field Deployment of LLM-Powered Test Case Generation in FinTech Domain

作者: Xue, Zhiyi and Li, Liangguo and Tian, Senyue and Chen, Xiaohong and Li, Pingping and Chen, Liangyu and Jiang, Tingting and Zhang, Min
关键词: No keywords

Abstract

Despite the promise of automation, general-purpose Large Language Models (LLMs) face difficulties in generating complete and accurate test cases from informal software requirements, primarily due to challenges in interpreting unstructured text and producing diverse, relevant scenarios. This paper argues that incorporating domain knowledge significantly improves LLM performance in test case generation. We report on the successful deployment of our LLM-powered tool, LLM4Fin, in the FinTech domain, showcasing the crucial role of domain knowledge in addressing the aforementioned challenges. We demonstrate two methods for integrating domain knowledge: implicit incorporation through model fine-tuning, and explicit incorporation with algorithm design. This combined approach delivers remarkable results, achieving up to 98.18% improvement in test scenario coverage and reducing generation time from 20 minutes to 7 seconds.
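
To illustrate the "explicit incorporation" idea, the sketch below splices retrieved business rules into a test-generation prompt; the rule catalog and the call_llm placeholder are assumptions, not LLM4Fin's implementation.

```python
# Illustrative sketch: explicitly injecting FinTech domain rules into a
# test-generation prompt. `call_llm` and the rule catalog are placeholders.
DOMAIN_RULES = {
    "fund purchase": [
        "Purchases below the product's minimum amount must be rejected.",
        "Purchases outside trading hours are queued for the next trading day.",
    ],
}

def build_prompt(requirement: str) -> str:
    matched = [r for topic, rules in DOMAIN_RULES.items()
               if topic in requirement.lower() for r in rules]
    rule_block = "\n".join(f"- {r}" for r in matched) or "- (no matching rules)"
    return (
        "You are generating test cases for a FinTech system.\n"
        f"Requirement: {requirement}\n"
        "Relevant business rules:\n"
        f"{rule_block}\n"
        "List concrete test scenarios covering every rule above."
    )

def call_llm(prompt: str) -> str:      # placeholder for an actual model call
    return "(model output would appear here)"

print(build_prompt("Users can submit a fund purchase order via the mobile app."))
```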

DOI: 10.1145/3639478.3643087


Neural Exception Handling Recommender

作者: Li, Yi and Nguyen, Tien N. and Cai, Yuchen and Yadavally, Aashish and Mishra, Abhishek and Montejo, Genesis
关键词: AI4SE, exception handling, graph convolutional network

Abstract

Practical code reuse often leads to the incorporation of code fragments from developer forums into applications. However, these fragments, being incomplete, frequently lack details on exception handling. Integrating exception handling into a codebase is not a straightforward task, requiring developers to understand and remember which API methods may trigger exceptions and which exceptions should be handled. To address this, we introduce EHBlock, a learning-based exception handling recommender for Java code snippets. EHBlock analyzes a given code snippet and suggests whether a try-catch block is necessary. It employs a Relational Graph Convolutional Network (R-GCN) to learn exception handling from complete code. R-GCN considers program dependencies in the surrounding context, allowing EHBlock to learn the identities of APIs and their relations with the corresponding exception types that need to be handled. Our empirical evaluation shows that EHBlock achieves a 12.3% improvement in F-score compared to the state-of-the-art approach in determining the need for try-catch blocks.

DOI: 10.1145/3639478.3643082


Unleashing the Power of Clippy in Real-World Rust Projects

作者: Li, Chunmiao and Yu, Yijun and Wu, Haitao and Carlig, Luca and Nie, Shijie and Jiang, Lingxiao
关键词: No keywords

Abstract

The error messages generated by the Rust compiler (rustc) are useful for developers to identify and diagnose suspicious code segments. Complementing the compiler, linters can also play an important role in promoting adherence to certain coding style conventions and best practices. Prominent linters utilized in the Rust ecosystem include Clippy [1] and Rustfmt [2]. Among them, the Rust community particularly emphasizes the importance of heeding the warnings provided by Clippy to mitigate common errors and promote the adoption of idiomatic conventions. Clippy provides a set of more than 600 lints in addition to the built-in rustc lints. These lints are divided into nine distinct categories that address correctness and style aspects. Each category is assigned a default lint level, namely Allow, Warn, or Deny, indicating the severity with which the lints are reported.

DOI: 10.1145/3639478.3643096


GRAIL: Checking Transaction Isolation Violations with Graph Queries

作者: Dumbrava, Stefania and Jin, Zhao and Kulahcioglu Ozkan, Burcu and Qiu, Jingxuan
关键词: distributed databases, transaction isolation, testing, graph queries

Abstract

Distributed databases are surging in popularity with the growing need for performance and fault tolerance. However, implementing transaction isolation models on distributed databases is more challenging due to their sharding and replication. As a result, they can produce executions that violate their claimed isolation guarantees. In this work, we propose a novel isolation model-agnostic approach that utilizes graph databases to efficiently detect isolation violations expressed as anti-patterns in transactional dependency graphs. To illustrate our approach, we introduce the GRAIL framework, implemented on top of the popular ArangoDB and Neo4j graph databases. GRAIL combines soundness guarantees and high performance with understandable, detailed counter-examples.
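
GRAIL expresses anti-patterns as queries over a graph database; as a language-neutral stand-in, the sketch below builds a small transactional dependency graph in networkx and flags dependency cycles, the classic serializability anti-pattern. The transactions and edges are invented.

```python
# Stand-in for the graph-query idea: build a transactional dependency graph and
# look for cycles, which signal serializability anti-patterns. The transactions
# and dependency edges are made up for illustration; GRAIL itself expresses such
# anti-patterns as queries over a graph database.
import networkx as nx

deps = nx.DiGraph()
# Edge T1 -> T2 means "T2 depends on T1" (read-write, write-read, or write-write).
deps.add_edge("T1", "T2", kind="rw")   # T2 overwrote a value T1 read
deps.add_edge("T2", "T3", kind="wr")   # T3 read a value T2 wrote
deps.add_edge("T3", "T1", kind="ww")   # T1 overwrote a value T3 wrote, closing a cycle

cycles = list(nx.simple_cycles(deps))
if cycles:
    print("Potential isolation violation, dependency cycle(s):", cycles)
else:
    print("No dependency cycles found.")
```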

DOI: 10.1145/3639478.3643094


Exploring the Computational Complexity of SAT Counting and Uniform Sampling with Phase Transitions

作者: Zeyen, Olivier and Cordy, Maxime and Perrouin, Gilles and Acher, Mathieu
关键词: No keywords

Abstract

Uniform Random Sampling (URS) is the problem of selecting solutions (models) from a Boolean formula such that each solution gets the same probability of being selected. URS has many applications. In large configurable software systems, one wants an unbiased sample of configurations to look for bugs at an affordable cost [12, 13]. Other applications of URS include deep learning verification (to sample inputs from unknown distributions) [2] and evolutionary algorithms (to initialize the input population) [4].
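
For intuition, URS can be stated as follows: enumerate the models of a formula and pick one uniformly. The toy sketch below does exactly that; scalable samplers must avoid this exhaustive enumeration, which is where the complexity questions studied here arise.

```python
# Naive uniform random sampling over the models of a tiny Boolean formula.
# Exhaustive enumeration only works for toy formulas; scalable URS tools avoid it.
import itertools
import random

VARS = ["a", "b", "c"]

def formula(a: bool, b: bool, c: bool) -> bool:
    # (a or b) and (not a or c)
    return (a or b) and ((not a) or c)

models = [dict(zip(VARS, bits))
          for bits in itertools.product([False, True], repeat=len(VARS))
          if formula(*bits)]

sample = random.choice(models)   # every satisfying assignment is equally likely
print(len(models), "models; sampled:", sample)
```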

DOI: 10.1145/3639478.3643097


Hunting DeFi Vulnerabilities via Context-Sensitive Concolic Verification

作者: Ding, Yepeng and Gervais, Arthur and Wattenhofer, Roger and Sato, Hiroyuki
关键词: vulnerability finding, smart contracts, decentralized finance, program analysis, concolic verification

Abstract

Decentralized finance (DeFi) is revolutionizing the traditional centralized finance paradigm with its attractive features such as high availability, transparency, and tamper-proofing. However, attacks targeting DeFi services have severely damaged the DeFi market, as evidenced by our investigation of 80 real-world DeFi incidents from 2017 to 2022. Existing methods, based on symbolic execution, model checking, semantic analysis, and fuzzing, fall short in identifying most DeFi vulnerability types. To address this deficiency, we propose Context-Sensitive Concolic Verification (CSCV), a method for automating DeFi vulnerability finding based on user-defined properties formulated in temporal logic. CSCV builds and optimizes contexts to guide verification processes that dynamically construct context-carrying transition systems in tandem with concolic executions. Furthermore, we demonstrate the effectiveness of CSCV through experiments on real-world DeFi services and a qualitative comparison. The experiment results show that our CSCV prototype successfully detects 76.25% of the vulnerabilities from the investigated incidents with an average time of 253.06 seconds.

DOI: 10.1145/3639478.3643105


Blocks? Graphs? Why Not Both? Designing and Evaluating a Hybrid Programming Environment for End-users

作者: Ritschel, Nico and Fronchetti, Felipe and Holmes, Reid and Garcia, Ronald and Shepherd, David C.
关键词: No keywords

Abstract

Many modern end-user development environments support one of two visual modalities: block-based programming or data-flow programming. In this work, we investigate the trade-offs between the two modalities in the context of robotics tasks. These often contain both aspects that are better solved with blocks and others that best fit data-flow programming. To address this style of task, we present and discuss two novel programming environment prototypes, one purely block-based and one a hybrid of blocks and data-flow programming. We compare the designs through a controlled experiment with 113 end-user participants, in which we asked them to solve programming and program comprehension tasks using one of the two environments. We find that participants preferred the hybrid environment in direct comparison, but performed better across all tasks with the block-based environment and also gave it higher usability ratings.

DOI: 10.1145/3639478.3643101


SLIM: a Scalable Light-weight Root Cause Analysis for Imbalanced Data in Microservice

作者: Ren, Rui and Yang, Jingbang and Yang, Linxiao and Gu, Xinyue and Sun, Liang
关键词: No keywords

Abstract

A newly deployed service, one kind of change service, can lead to a new type of minority fault. Existing state-of-the-art methods for fault localization rarely consider the imbalanced fault classification in change services. This paper proposes a novel method that utilizes decision rule sets to deal with highly imbalanced data by optimizing the F1 score subject to cardinality constraints. The proposed method greedily generates the rule with maximal marginal gain and uses an efficient minorize-maximization (MM) approach to select rules iteratively, maximizing a non-monotone submodular lower bound. Compared with existing fault localization algorithms, our algorithm can adapt to the imbalanced fault scenario of change services and provide interpretable fault causes that are easy to understand and verify. Our method can also be deployed in the online training setting, with only about 15% training overhead compared to the current SOTA methods. Empirical studies showcase that our algorithm outperforms existing fault localization algorithms in both accuracy and model interpretability.
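
A simplified sketch of greedy rule selection under a cardinality budget, adding at each step the rule with the largest marginal F1 gain of the OR-combined rule set; SLIM's rule generation and minorize-maximization step are more involved, and the data here is synthetic.

```python
# Simplified sketch of greedy rule-set selection under a cardinality budget:
# at each step, add the candidate rule whose inclusion yields the largest
# marginal gain in F1 of the combined (OR-ed) rule set.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 300)                               # ground-truth fault labels
candidate_rules = [rng.integers(0, 2, 300) for _ in range(8)]  # each rule = binary predictions

def rule_set_f1(selected: list) -> float:
    if not selected:
        return 0.0
    combined = np.maximum.reduce([candidate_rules[i] for i in selected])  # OR of selected rules
    return f1_score(y_true, combined)

selected, budget = [], 3
while len(selected) < budget:
    gains = {i: rule_set_f1(selected + [i]) - rule_set_f1(selected)
             for i in range(len(candidate_rules)) if i not in selected}
    best, best_gain = max(gains.items(), key=lambda kv: kv[1])
    if best_gain <= 0:
        break
    selected.append(best)

print("Selected rules:", selected, "F1:", round(rule_set_f1(selected), 3))
```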

DOI: 10.1145/3639478.3643098


Designing Digital Twins for Enhanced Reusability

作者: Ratushniak, Olga and Cabrero-Daniel, Beatriz
关键词: digital twins, software architecture and design, requirements engineering, software reusability, adaptability, interoperability, environments and software development tools, release engineering and DevOps

Abstract

Digital Twins (DTs) are dynamic virtual models that mirror the behavior and characteristics of physical systems. They are emerging as a crucial tool in digital transformation, adaptable for various applications. DTs are used to simulate, analyze, and optimize physical and virtual assets [10, 11]. However, their complexity and resource-intensive nature make them challenging to integrate into real settings. Therefore, we propose that efficient architectural design, flexibility, adaptability, and interoperability are key to achieving these objectives. Furthermore, by enhancing these aspects of DTs, we also contribute to their sustainability across technological, environmental, and economic dimensions.

DOI: 10.1145/3639478.3643102


Analyzing the Impact of Context Representation and Scope in Code Infilling

作者: Heo, Jinseok and Lee, Eunseok
关键词: No keywords

Abstract

Existing studies solve software engineering tasks using code infilling through LLMC. They utilize context information, which refers to data near the target code of infilling, as input prompts. Although prompts are essential for infilling the target code, current studies use them without analyzing the impact of the representation and scope of context on code infilling. In this study, we analyzed how context representation and scope affect the performance of code infilling. We used XLCost, which contains code, comments, and a function comment for various programming languages. The combination of code and a function comment for context representation yielded the best code infilling performance. Furthermore, we found that the context scope is proportional to performance. Our analysis results can be applied in various tasks that involve code infilling in the future.

DOI: 10.1145/3639478.3643107


eAIEDF: Extended AI Error Diagnosis Flowchart for Automatically Identifying Misprediction Causes in Production Models

作者: Sakuma, Keita and Matsuno, Ryuta and Kameda, Yoshio
关键词: MLOps, machine learning, error analysis, explainability

Abstract

MLOps, addressing operational issues in machine learning, has gained attention for enhancing the performance of production models. A core challenge is efficiently understanding the causes of mispredictions, as current methods often require labor-intensive manual analysis. To address this, we propose the Extended AI Error Diagnosis Flowchart (eAIEDF) as an extension of the AIEDF, an automated method for identifying root causes of mispredictions during model operation, in order to make it adaptable to both classification and regression models, ensuring applicability in various use cases. Compared to AIEDF, eAIEDF features a more comprehensive flowchart structure for improved cause identification. Through numerical experiments, we confirm that eAIEDF provides valuable insights for enhancing model performance.

DOI: 10.1145/3639478.3643104


The Impact of a Live Refactoring Environment on Software Development

作者: Fernandes, Sara and Aguiar, Ademar and Restivo, André
关键词: code smells, refactoring, code quality metrics, software visualization, live programming

Abstract

Reading, adapting, and maintaining complex software can be a daunting task. We might need to refactor it to streamline the process and make the code cleaner and self-explanatory. Traditional refactoring tools guide developers to achieve better-quality code. However, the feedback and assistance they provide can take considerable time. To tackle this issue, we explored the concept of Live Refactoring, an approach focused on delivering real-time, visually-driven refactoring suggestions. To that end, we prototyped a Live Refactoring Environment that visually identifies, recommends, and applies several refactorings in real time. To validate its effectiveness, we conducted a set of experiments, which showed that our approach significantly improved various code quality metrics and outperformed the results obtained from manually refactoring the code.

DOI: 10.1145/3639478.3643100


Fault Localization on Verification Witnesses (Poster Paper)

作者: Beyer, Dirk and Kettl, Matthias and Lemberger, Thomas
关键词: software verification, program analysis, model checking, result validation, fault localization, violation witnesses, error paths

Abstract

Verifiers export violation witnesses, which help independent validators to confirm a reported specification violation. It is assumed that violation witnesses are helpful if they are very precise: ideally, they should describe a single program path for the validator to check. But we claim that this leads verifiers to produce large, detailed witnesses that include a lot of unnecessary information that actually hinders validation. We reduce violation witnesses with automated fault localization to only that information which fault localization suspects as fault. We performed a large experimental evaluation on the witnesses produced in the International Competition on Software Verification (SV-COMP 2023) to explore the effect of our reduction. Our experiments show that the witnesses reduced using our approach shrink considerably and can be confirmed better.

DOI: 10.1145/3639478.3643099


Tracking assets in source code with Security Annotations

作者: Haak, Daniel and Mayr, Raphael and Steghö
关键词: No keywords

Abstract

Small and medium enterprises (SMEs) that build individualized software require lightweight solutions to trace cybersecurity concerns across the codebase. This includes tracking where potentially vulnerable assets are handled in the codebase. The solution that provides this tracking should be fully integrated into the developers’ workflow and should be usable by developers who are not cybersecurity experts. To address this need, we propose Security Annotations, which can be added to any codebase regardless of programming language and allow linking blocks of code, functions, or single statements with assets. In order to use the main functionality of the Security Annotations, an asset catalog of sufficient quality is needed. These assets can either be identified upfront or while annotating. We conducted a preliminary evaluation in which four pairs of developers created an asset catalog for a legacy software system and then annotated the code using Security Annotations. All groups successfully identified assets in a code base largely unknown to them. We also found that the annotation patterns differed between pairs but that there were significant overlaps. The workload of identifying assets and performing annotations was demanding, but feasible.

DOI: 10.1145/3639478.3643095


Lightweight Semantic Conflict Detection with Static Analysis

作者: De Jesus, Galileu Santos and Borba, Paulo and Bonifá
关键词: merge conflicts, configuration management, software evolution, static analysis

Abstract

Version control system tools empower developers to independently work on their development tasks. These tools also facilitate the integration of changes through merging operations, and report textual conflicts. However, during the integration of changes, developers might encounter other types of conflicts that are not detected by current merge tools. In this paper, we focus on dynamic semantic conflicts, which occur when merging reports no textual conflicts but results in undesired interference—causing unexpected program behavior at runtime. To address this issue, we propose a technique that explores the use of static analysis to detect interference when merging contributions from two developers. We evaluate our technique using a dataset of 99 experimental units extracted from merge scenarios. The results provide evidence that our technique presents significant interference detection capability (F1 Score of 0.50 and Accuracy of 0.60).

DOI: 10.1145/3639478.3643118


How Does Pre-trained Language Model Perform on Deep Learning Framework Bug Prediction?

作者: Du, Xiaoting and Li, Chenglong and Ma, Xiangyue and Zheng, Zheng
关键词: No keywords

Abstract

Understanding and predicting bugs is crucial for developers seeking to enhance testing efficiency and mitigate issues in software releases. Bug reports, though semi-structured texts, contain a wealth of semantic information, rendering their comprehension a critical aspect of bug prediction. In light of the recent success of pre-trained language models (PLMs) in the domain of natural language processing, numerous studies have leveraged these models to grasp various forms of textual information. However, the capability of PLMs to understand bug reports remains uncertain. To tackle this challenge, we introduce KnowBug, a framework with a bug report knowledge-enhanced PLM. In this framework, bug reports obtained from open-source deep learning frameworks are used as input, prompts are designed, and the PLM is fine-tuned to evaluate KnowBug’s ability to comprehend bug reports and predict bug types.

DOI: 10.1145/3639478.3643113


Multi-requirement Parametric Falsification

作者: Camilli, Matteo and Mirandola, Raffaela
关键词: falsification, multi-requirement, probabilistic requirements

Abstract

Falsification is a popular simulation-based testing method for Cyber-Physical Systems to find inputs that violate a formal requirement. However, detecting violations considering multiple probabilistic requirements simultaneously with a dense space of changing factors in the execution scenario is an open problem. We address this problem by proposing a novel approach that combines parametric model checking and many-objective optimization. Results of a preliminary empirical evaluation show the effectiveness of the approach compared to selected baseline methods.

DOI: 10.1145/3639478.3643120


GDPR indications in commit messages in GitHub repositories

作者: Kapitsaki, Georgia and Papoutsoglou, Maria
关键词: GitHub, open source software, data privacy, GDPR

Abstract

A vast amount of software data is available in GitHub repositories and is being reused by practitioners worldwide. The latest advances in privacy legislation, such as the EU General Data Protection Regulation (GDPR), have forced the software community to pay special attention to users’ data privacy. In this work, we examine commits of Open Source Software repositories that reference the GDPR in their messages. We have collected commits from GitHub repositories for this purpose. We examine how many such commits appear over the years and show the main terms appearing in the commit messages. This study can serve as a first step towards understanding how privacy legislation is present in the Open Source Software community.

DOI: 10.1145/3639478.3643109


Towards Data Augmentation for Supervised Code Translation

作者: Chen, Binger and Golebiowski, Jacek and Abedjan, Ziawasch
关键词: No keywords

Abstract

Supervised learning is a robust strategy for data-driven program translation. This work addresses the challenge of insufficient parallel training data in code translation by exploring two innovative data augmentation methods: a rule-based approach specifically designed for code translation datasets and a retrieval-based method leveraging unorganized code repositories.

DOI: 10.1145/3639478.3643115


High-precision Online Log Parsing with Large Language Models

作者: Chen, Xiaolei and Shi, Jie and Chen, Jia and Wang, Peng and Wang, Wei
关键词: log parsing, large language model, prefix tree

Abstract

System logs are vital for diagnosing system failures, with log parsing converting unstructured logs into structured data. Existing methods fall into two categories: non-deep-learning approaches cluster logs based on statistical features but often miss semantic information, resulting in poor performance; deep-learning approaches excel at identifying variables and constants but often lack generalizability beyond training data and also suffer from low efficiency. This paper proposes a novel LLM-based log parsing approach, named Hooglle, to address these challenges. Leveraging a large language model, Hooglle extracts templates for precise and generalized parsing. To overcome the efficiency issue, we propose a prefix-tree-based full-matching strategy which significantly improves parsing efficiency. Extensive evaluation across real-world datasets showcases Hooglle’s superior performance on 16 public benchmark datasets.
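
A minimal sketch of a prefix-tree full-matching step, with tokens as trie edges and "<*>" as a single-token wildcard; Hooglle's actual matching and LLM-based template extraction are richer.

```python
# Minimal prefix-tree (trie) matcher for log templates: tokens are trie edges,
# "<*>" matches any single token. This sketches only the full-matching step;
# template extraction itself would be delegated to the LLM.
WILDCARD = "<*>"

def insert(trie: dict, template: str) -> None:
    node = trie
    for tok in template.split():
        node = node.setdefault(tok, {})
    node["$end"] = template

def match(trie: dict, line: str):
    node = trie
    for tok in line.split():
        if tok in node:
            node = node[tok]
        elif WILDCARD in node:
            node = node[WILDCARD]
        else:
            return None
    return node.get("$end")

trie = {}
insert(trie, "Connection from <*> closed")
insert(trie, "Failed login for user <*>")

print(match(trie, "Connection from 10.0.0.7 closed"))   # -> matching template
print(match(trie, "Disk quota exceeded on /dev/sda1"))  # -> None, needs a new template
```

Matching a line costs time proportional to its token count, independent of how many templates are stored, which is the efficiency argument behind prefix-tree matching.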

DOI: 10.1145/3639478.3643112


Multi-step Automated Generation of Parameter Docstrings in Python: An Exploratory Study

作者: Venkatkrishna, Vatsal and Nagabushanam, Durga Shree and Simon, Emmanuel Iko-Ojo and Vidoni, Melina
关键词: docstrings, pre-trained models, code summarisation, scientific software, documentation debt

Abstract

Documentation debt hinders the effective utilisation of open-source software. Although code summarisation tools have been helpful for developers, most would prefer a detailed account of each parameter in a function rather than a high-level summary. However, generating such a summary is too intricate for a single generative model to produce reliably due to the lack of high-quality training data. Thus, we propose a multi-step approach that combines multiple task-specific models, each adept at producing a specific section of a docstring. The combination of these models ensures the inclusion of each section in the final docstring. We compared the results from our approach with existing generative models using both automatic metrics and a human-centred evaluation with 17 participating developers, which proves the superiority of our approach over existing methods.

DOI: 10.1145/3639478.3643110


Energy Consumption of Automated Program Repair

作者: Martinez, Matias and Martí
关键词: No keywords

Abstract

In the last decade, following current societal needs, software sustainability has emerged as a research field [2]. In this paper, we particularly focus on environmental sustainability, defined as “how software product development, maintenance, and use affect energy consumption and the consumption of other natural resources. […] This dimension is also known as Green Software” [2].

DOI: 10.1145/3639478.3643114


Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models

作者: Plein, Laura and Oué
关键词: No keywords

Abstract

Test suites are a key ingredient in various software automation tasks. Recently, various studies [4] have demonstrated that they are paramount in the adoption of the latest innovations in software engineering, such as automated program repair (APR) [3]. Test suites are unfortunately often too scarce in software development projects. Generally, they are provided for regression testing, while new bugs are discovered by users who then describe them informally in bug reports. In recent literature, a new trend of research in APR has attempted to leverage bug reports in generate-and-validate pipelines for program repair. Even in such cases, when an APR tool generates a patch candidate, if test cases are unavailable, developers must manually validate the patch, leading to a threat to validity.

DOI: 10.1145/3639478.3643119


ReviewRanker: A Semi-Supervised Learning Based Approach for Code Review Quality Estimation

作者: Mahbub, Saifullah and Arafat, Md. Easin and Rahman, Chowdhury Rafeed and Ferdows, Zannatul and Hasan, Masum
关键词: No keywords

Abstract

Inspection of code review process effectiveness and continuous improvement can boost development productivity. Such inspection is a time-consuming and human-bias-prone task. We propose a semi-supervised learning based system, ReviewRanker, which is aimed at assigning each code review a confidence score that is expected to reflect the quality of the review. Our proposed method is trained based on simple and well-defined labels provided by developers. The labeling task requires little to no effort from the developers. ReviewRanker has the potential of minimizing the back-and-forth cycle existing in the development and review process. Related code and data can be found at: https://github.com/saifarnab/code_review

DOI: 10.1145/3639478.3643111


LogPrompt: Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis

作者: Liu, Yilun and Tao, Shimin and Meng, Weibin and Yao, Feiyu and Zhao, Xiaofeng and Yang, Hao
关键词: No keywords

Abstract

Automated log analysis plays a crucial role in software maintenance as it allows for efficient identification and resolution of issues. However, traditional methods employed in log analysis heavily rely on extensive historical data for training purposes and lack rationales for their predictions. The performance of these traditional methods significantly deteriorates when in-domain logs for training are limited and unseen log data are the majority, particularly in rapidly changing online environments. Additionally, the lack of rationales hampers the interpretability of analysis results and impacts analysts’ subsequent decision-making processes. To address these challenges, we propose LogPrompt, a novel approach that leverages large language models (LLMs) and advanced prompting techniques to achieve performance improvements in zero-shot scenarios (i.e., no in-domain training). Moreover, LogPrompt has garnered positive evaluations from experienced practitioners in its log interpretation ability. Code available at https://github.com/lunyiliu/LogPrompt.

DOI: 10.1145/3639478.3643108


Data vs. Model Machine Learning Fairness Testing: An Empirical Study

作者: Shome, Arumoy and Cruz, Luí
关键词: SE4ML, ML fairness testing, empirical software engineering, data-centric AI

Abstract

Although several fairness definitions and bias mitigation techniques exist in the literature, all existing solutions evaluate fairness of Machine Learning (ML) systems after the training stage. In this paper, we take the first steps towards evaluating a more holistic approach by testing for fairness both before and after model training. We evaluate the effectiveness of the proposed approach and position it within the ML development lifecycle, using an empirical analysis of the relationship between model-dependent and model-independent fairness metrics. The study uses 2 fairness metrics, 4 ML algorithms, 5 real-world datasets and 1600 fairness evaluation cycles. We find a linear relationship between data and model fairness metrics when the distribution and the size of the training data change. Our results indicate that testing for fairness prior to training can be a “cheap” and effective means of catching a biased data collection process early, detecting data drifts in production systems, and minimising the execution of full training cycles, thus reducing development time and costs.
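
As a concrete (and simplified) example of a model-independent versus a model-dependent check, the sketch below computes statistical parity difference on the raw labels and on a trained model's predictions; the data, protected attribute, and metric choice are illustrative, not the study's setup.

```python
# Sketch: one model-independent and one model-dependent fairness check on the
# same data. The synthetic data, the protected attribute, and the choice of
# statistical parity difference are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 1000
protected = rng.integers(0, 2, n)                       # 0 / 1 group membership
feature = rng.normal(0, 1, n) + 0.5 * protected
y = (feature + rng.normal(0, 1, n) > 0.5).astype(int)   # labels correlated with the group
X = np.column_stack([feature, protected])

def statistical_parity_difference(outcome, group):
    return outcome[group == 1].mean() - outcome[group == 0].mean()

# Before training: fairness of the data collection itself.
print("data SPD: ", round(statistical_parity_difference(y, protected), 3))

# After training: fairness of the model's predictions.
model = LogisticRegression().fit(X, y)
pred = model.predict(X)
print("model SPD:", round(statistical_parity_difference(pred, protected), 3))
```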

DOI: 10.1145/3639478.3643121


On the Effects of Program Slicing for Vulnerability Detection during Code Inspection: Extended Abstract

作者: Papotti, Aurora and Massacci, Fabio and Tuma, Katja
关键词: vulnerabilities, slicing, code inspection, program comprehension, controlled experiment

Abstract

[Background]: Slicing was first introduced to support debugging as a fault localization technique. Yet, program slicing as support for identifying vulnerabilities during code inspection has received limited attention. [Aims]: Evaluate the effectiveness of slicing as a general concept to support code inspectors in detecting vulnerabilities in source code. [Method]: We designed a controlled experiment whose goal is to identify the vulnerable lines in original or sliced Java files from Apache Tomcat. The designed treatments differ in the pair (Vulnerability, Original/Sliced file), with a balanced design with four vulnerabilities from the OWASP Top 10. The participants are MSc students attending security courses (n = 236). [Observations]: Using a notion of neighborhood based on the context size of the command git diff, we observed that slicing helps in ‘finding something’ as opposed to ‘finding nothing’. However, once some correct lines have been found, analyzing a slice and analyzing the original file are statistically equivalent.

DOI: 10.1145/3639478.3643117


xNose: A Test Smell Detector for C#

作者: Paul, Partha Protim and Akanda, Md Tonoy and Ullah, Mohammed Raihan and Mondal, Dipto and Chowdhury, Nazia Sultana and Tawsif, Fazle Mohammed
关键词: test smell, code smell, empirical studies, C#, abstract syntax tree (AST), Roslyn, static analysis

Abstract

Test smells, similar to code smells, can negatively impact both the test code and the production code being tested. Despite extensive research on test smells in languages like Java, Scala, and Python, automated tools for detecting test smells in C# are lacking. This paper aims to bridge this gap by extending the study of test smells to C#, and developing a tool (xNose) to identify test smells in this language and analyze their distribution across projects. We identified 16 test smells from prior studies that were language-independent and had equivalent features in C# and evaluated xNose, achieving a precision score of 96.97% and a recall score of 96.03%. In addition, we conducted an empirical study to determine the prevalence of test smells in xUnit-based C# projects. This analysis sheds light on the frequency and distribution of test smells, deepening our understanding of their impact on C# projects and test suites. The development of xNose and our analysis of test smells in C# code aim to assist developers in maintaining code quality by addressing potential issues early in the development process.

DOI: 10.1145/3639478.3643116


On the Need for Empirically Investigating Fast-Growing Programming Languages

作者: Kumar, Jahnavi and Chimalakonda, Sridhar
关键词: developer emotion analysis, programming languages

Abstract

The research community has extensively explored various aspects of programming languages (PLs) such as developer sentiment analysis, ecosystem dynamics, pull request trends, energy consumption, and more. Existing research has predominantly focused on “Popular PLs” with the largest unique user bases, which considers users since the language’s inception. However, our approach shifts to examining “Fast Growing Languages” (FGLs): those experiencing significant growth in their user base in recent years. We see this perspective as an essential exploration for understanding the current developer interest in both established and emerging languages. In this short note, we discuss (i) an empirical study on understanding how developers perceive these FGLs, analyzing their emotions through 1.86M GitHub comments, and (ii) potential future research on fast-growing PLs at the cross-section of programming languages and software engineering.

DOI: 10.1145/3639478.3643524


Improving the Condensing of Reverse Engineered Class Diagrams using Weighted Network Metrics

作者: Pan, Weifeng and Wu, Wei and Ming, Hua and Kim, Dae-Kyoo and Yang, Jinkai and Liu, Ruochen
关键词: class diagrams, key classes, network metrics, class dependency networks, program comprehension

Abstract

Reverse engineered class diagrams (REDs) are helpful to ease the comprehension of complex software. However, the original REDs might contain many details and thus provide little benefit. Condensing REDs by identifying the most important classes (aka key classes) and discarding unimportant ones has been regarded as a promising way. In the last decade, many key class prediction (KCP) approaches have been proposed. However, the unweighted network metrics used in these studies fail to capture the dependency strength between classes and thus cannot precisely measure the actual complexity of classes. In this paper, we propose an approach, KEEPER, for condensing REDs, which introduces a set of weighted network metrics to characterize the complexity of classes and to build KCP models. Empirical results show that KEEPER performs better than all the baseline approaches and has a good scalability.
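
To illustrate why dependency strength matters, the sketch below ranks classes in a small weighted class dependency network with weighted versus unweighted PageRank; KEEPER relies on its own set of weighted network metrics, so this is only an analogy.

```python
# Sketch: rank classes in a weighted class dependency network, where edge
# weights approximate dependency strength (e.g., number of call sites).
# Weighted PageRank here just illustrates why weights matter; it is not
# KEEPER's metric set.
import networkx as nx

cdn = nx.DiGraph()
cdn.add_edge("OrderController", "OrderService", weight=12)
cdn.add_edge("OrderService", "OrderRepository", weight=9)
cdn.add_edge("OrderService", "Logger", weight=1)
cdn.add_edge("UserController", "Logger", weight=1)
cdn.add_edge("OrderController", "Logger", weight=1)

unweighted = nx.pagerank(cdn)
weighted = nx.pagerank(cdn, weight="weight")

for cls in sorted(cdn.nodes, key=weighted.get, reverse=True):
    print(f"{cls:16s} unweighted={unweighted[cls]:.3f} weighted={weighted[cls]:.3f}")
```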

DOI: 10.1145/3639478.3643520


作者: Nath, Sristy Sumana and Roy, Banani
关键词: No keywords

Abstract

Inadequate traceability links between software artifacts can create challenges for developers in tracking the origin of bugs or issues and their corresponding code changes, leading to longer resolution times and the potential introduction of new bugs [5]. When changes are made without proper traceability links, inconsistencies and conflicts may arise between different artifacts [4], such as requirements, design documents, and code, resulting in software development that fails to meet user expectations or exhibits unexpected behavior. The lack of proper traceability links also poses challenges in maintaining software over time, making it difficult to upgrade, manage dependencies, and make changes to the software [3]. Additionally, the lack of traceability links can make it challenging to understand the software’s evolution and developers’ decision-making process, reducing transparency and hindering collaboration.

Release notes are documents that include details about new features, bug fixes, improvements, and known issues. They help users and developers understand the changes made to the software and their impact on workflows [1]. Traceability links of issues, pull requests (PRs), and commits are important in release notes as they provide context and understanding of changes made in a release [6]. In our dataset, 33% of release notes are not linked with the corresponding artifacts, highlighting the need for automated traceability link recovery in release notes. Additionally, limited traceability links can lead to duplicate bug reports, confusion, and wasted time and effort [7]. Traceability links are crucial for version control and back-porting to give a clear understanding of the dependencies within different versions. Without these links, managing releases and documenting changes accurately becomes challenging, potentially damaging the reputation of the software and causing confusion among stakeholders and customers [2].

Our study begins by creating a benchmark to propose an automated traceability technique between releases and related artifacts. To collect data, we use the GitHub API to gather information from 10 popular repositories, including release notes, pull-request titles, and commit messages. We analyze the textual data and create a benchmark for recovering traceability links between releases and related artifacts such as commits, pull requests, and issues. Next, we investigate the feasibility of automated traceability approaches for software release notes in GitHub using rule-based and information retrieval (IR)-based classifiers. To the best of our knowledge, our techniques are the first to automatically recover traceability links between software releases and related artifacts (i.e., commits, pull requests, and issues). This approach can help keep track of changes and improve the quality of release notes by collecting all useful information automatically in real time. By using this approach, we aim to identify areas of improvement and refine our proposed technique to make it more usable and effective in practice.
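
A minimal sketch of an IR-based recovery step, linking each release-note entry to the most textually similar artifact title via TF-IDF and cosine similarity; the texts are invented.

```python
# Sketch of an IR-based traceability step: link each release-note entry to the
# most textually similar commit/PR title using TF-IDF and cosine similarity.
# The example texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

release_notes = [
    "Fix crash when uploading large attachments",
    "Add dark mode to the settings page",
]
artifacts = [
    "PR #812: settings: introduce dark mode toggle",
    "commit 4f2a1c: guard against OOM on large attachment upload",
    "issue #790: typo in README",
]

vectorizer = TfidfVectorizer().fit(release_notes + artifacts)
similarity = cosine_similarity(vectorizer.transform(release_notes),
                               vectorizer.transform(artifacts))

for note, scores in zip(release_notes, similarity):
    best = scores.argmax()
    print(f"{note!r} -> {artifacts[best]!r} (score {scores[best]:.2f})")
```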

DOI: 10.1145/3639478.3643126


Bringing Structure to Naturalness: On the Naturalness of ASTs

作者: Pâ
关键词: naturalness, structure, AST, self-cross-entropy

Abstract

Source code comes in different shapes and forms. Previous research has already shown code to be more predictable than natural language at the token level: source code can be natural. More recently, the structure of code — either as graphs or trees — has been successfully used to improve the state-of-the-art on numerous tasks: code suggestion, code summarisation, method naming etc. This body of work implicitly assumes that structured representations of code are similarly statistically predictable, i.e. natural. We consider that this view should be made explicit and propose directly studying the Structured Naturalness Hypothesis. Beyond just naming existing research that assumes this hypothesis and formulating it, we also provide evidence for tree representations: TreeLSTM models over ASTs for some languages, such as Ruby, are competitive with n-gram models while handling the syntax token issue highlighted by previous research ‘for free’. For other languages, such as Java or Python, we find tree models to perform worse, suggesting that downstream task improvement is uncorrelated to the language modelling task. Further, we show how one may use naturalness signals for near state-of-the-art results on just-in-time defect prediction without manual feature engineering work.

DOI: 10.1145/3639478.3643517


Assessing AI-Based Code Assistants in Method Generation Tasks

作者: Corso, Vincenzo and Mariani, Leonardo and Micucci, Daniela and Riganelli, Oliviero
关键词: AI-based code assistants, code completion, empirical study

Abstract

AI-based code assistants are increasingly popular as a means to enhance productivity and improve code quality. This study compares four AI-based code assistants, GitHub Copilot, Tabnine, ChatGPT, and Google Bard, in method generation tasks, assessing their ability to produce accurate, correct, and efficient code. Results show that code assistants are useful, with complementary capabilities, although they rarely generate ready-to-use correct code.

DOI: 10.1145/3639478.3643122


Exploring the Impact of Inheritance on Test Code Maintainability

作者: Kim, Dong Jae and Chen, Tse-Hsun
关键词: software evolution, software test maintainability

Abstract

Since the advent of object-oriented programming languages, inheritance has been a fundamental concept in software design. It is used to achieve polymorphism, facilitate code reuse, and ease the extension of software programs. Despite its benefits, inheritance may introduce tight coupling between classes and over time can degrade the maintainability of software systems. In this work, we take a first step by studying inheritance and interfaces, with a focus on their impact on test code maintainability and design decisions. We have developed a tool capable of identifying inheritance and interface changes in modified test classes within the software evolution commit history. Our empirical study spans 12 open-source Java systems, covering their entire development history up to 2021. We have mined 4,662 instances of inheritance and interface changes in test code and compiled a comprehensive catalog of motivations driving these changes. This catalog offers insights into how inheritance impacts test maintainability, providing valuable guidance for developers navigating the use of inheritance and interfaces in test code.

DOI: 10.1145/3639478.3643522


Improving Program Debloating with 1-DU Chain Minimality

作者: Kim, Myeongsoo and Pande, Santosh and Orso, Alessandro
关键词: No keywords

Abstract

Modern software often struggles with bloat, leading to increased memory consumption and security vulnerabilities from unused code. In response, various program debloating techniques have been developed, typically utilizing test cases that represent functionalities users want to retain. These methods range from aggressive approaches, which prioritize maximal code reduction but may overfit to test cases and potentially reintroduce past security issues, to conservative strategies that aim to preserve all influenced code, often at the expense of less effective bloat reduction and security improvement. In this research, we present RLDebloatDU, an innovative debloating technique that employs 1-DU chain minimality within abstract syntax trees. Our approach maintains essential program data dependencies, striking a balance between aggressive code reduction and the preservation of program semantics. We evaluated RLDebloatDU on ten Linux kernel programs, comparing its performance with two leading debloating techniques: Chisel, known for its aggressive debloating approach, and Razor, recognized for its conservative strategy. RLDebloatDU significantly lowers the incidence of Common Vulnerabilities and Exposures (CVEs) and improves soundness compared to both, highlighting its efficacy in reducing security issues without reintroducing resolved security issues.
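As a rough illustration of the def-use reasoning behind DU-chain minimality (a simplified sketch for straight-line code, not RLDebloatDU itself; it ignores control flow, scopes, and re-definitions), the snippet below extracts def-use pairs from a Python AST. A debloater in this spirit would avoid deleting statements whose removal breaks such chains for retained functionality.

```python
# Simplified def-use (DU) chain extraction over an AST for straight-line code.
import ast

source = """
a = read_input()
b = a + 1
c = unused_helper()
print(b)
"""

tree = ast.parse(source)
last_def = {}   # variable name -> line of its most recent definition
du_chains = []  # (variable, def line, use line)

for stmt in tree.body:  # statements in source order
    # Uses in this statement refer to definitions from earlier statements.
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            if node.id in last_def:
                du_chains.append((node.id, last_def[node.id], node.lineno))
    # Then record definitions made by this statement.
    if isinstance(stmt, ast.Assign):
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                last_def[target.id] = stmt.lineno

for var, d, u in du_chains:
    print(f"{var}: defined at line {d}, used at line {u}")
```

In this toy example, the assignment to `c` participates in no DU chain reaching the retained output, so an aggressive debloater could remove it, while the chains through `a` and `b` must be preserved.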

DOI: 10.1145/3639478.3643518


Exploring Data Cleanness in Defects4J and Its Influence on Fault Localization Efficiency

作者: Rafi, Md Nakhla and Chen, An Ran and Chen, Tse-Hsun (Peter) and Wang, Shaohua
关键词: No keywords

Abstract

Defects4J stands out as the most popular benchmark in software testing research, known for its comprehensive collection of real bugs from open-source systems. This paper presents an in-depth study of Defects4J’s fault-triggering tests, particularly examining the influence of developer modifications post-bug reports on spectrum-based fault localization (SBFL) techniques. Our findings reveal that 55% of these tests were newly added and 22% modified with developer knowledge, impacting the accuracy of SBFL. Notably, SBFL techniques’ performance drops significantly (up to -415% in Mean First Rank) when developer knowledge is absent in the tests. We provide a curated dataset of bugs without this knowledge, facilitating more realistic evaluations of SBFL techniques using Defects4J. This research offers insights for the development of future bug benchmarks.
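For readers unfamiliar with spectrum-based fault localization, the sketch below (my own illustration with toy data, not part of the paper) ranks statements with the Ochiai formula from per-test coverage and pass/fail outcomes; the study's point is that this ranking degrades when fault-triggering tests written with developer knowledge are absent.

```python
# Ochiai spectrum-based fault localization on a toy coverage matrix.
# coverage[t] is the set of statements executed by test t; `failed` marks
# which tests fail.  Data is illustrative only.
import math

coverage = {
    "t1": {"s1", "s2", "s3"},
    "t2": {"s1", "s3"},
    "t3": {"s1", "s2", "s4"},
}
failed = {"t1": True, "t2": False, "t3": True}

total_failed = sum(failed.values())
statements = set().union(*coverage.values())

def ochiai(stmt):
    ef = sum(1 for t, stmts in coverage.items() if stmt in stmts and failed[t])
    ep = sum(1 for t, stmts in coverage.items() if stmt in stmts and not failed[t])
    nf = total_failed - ef
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Rank statements by suspiciousness (most suspicious first).
for stmt in sorted(statements, key=ochiai, reverse=True):
    print(f"{stmt}: {ochiai(stmt):.3f}")
```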

DOI: 10.1145/3639478.3643125


Towards Precise Observations of Neural Model Robustness in Classification

作者: Mu, Wenchuan and Lim, Kwan Hui
关键词: No keywords

Abstract

In deep learning applications, robustness measures the ability of neural models to handle slight changes in input data, changes that could otherwise lead to safety hazards, especially in safety-critical applications. Pre-deployment assessment of model robustness is essential, but existing methods often suffer from either high costs or imprecise results. To enhance safety in real-world scenarios, metrics that effectively capture a model's robustness are needed. To address this issue, we compare the rigour and usage conditions of various assessment methods based on different definitions. We then propose a straightforward and practical metric utilizing hypothesis testing for probabilistic robustness and have integrated it into the TorchAttacks library. Through a comparative analysis of diverse robustness assessment methods, our approach contributes to a deeper understanding of model robustness in safety-critical applications.
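A hypothesis-testing view of probabilistic robustness can be sketched as follows (a generic illustration, not the paper's metric or the TorchAttacks integration; the model, epsilon, theta, and sample size are all assumptions): sample random perturbations inside an L-infinity ball, count prediction flips, and test whether the misclassification rate is below a tolerance.

```python
# Sketch of probabilistic robustness assessment via hypothesis testing:
# H0: flip rate >= theta  vs  H1: flip rate < theta, using an exact binomial test.
import math
import torch

def predict(model, x):
    return model(x).argmax(dim=1)

def robustness_test(model, x, epsilon=0.03, n=200, theta=0.05, alpha=0.01):
    base = predict(model, x)
    flips = 0
    for _ in range(n):
        noise = (torch.rand_like(x) * 2 - 1) * epsilon      # uniform in [-eps, eps]
        if not torch.equal(predict(model, (x + noise).clamp(0, 1)), base):
            flips += 1
    # One-sided exact binomial p-value: P[X <= flips] under flip rate theta.
    p_value = sum(math.comb(n, k) * theta**k * (1 - theta)**(n - k)
                  for k in range(flips + 1))
    return flips / n, p_value, p_value < alpha  # reject H0 -> "probably robust"

# Example with a toy, untrained model; a real study would use a trained classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.rand(1, 1, 28, 28)
rate, p, robust = robustness_test(model, x)
print(f"flip rate={rate:.3f}, p-value={p:.4f}, probably robust: {robust}")
```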

DOI: 10.1145/3639478.3643519


Exploring the Effectiveness of LLM based Test-driven Interactive Code Generation: User Study and Empirical Evaluation

作者: Fakhoury, Sarah and Naik, Aaditya and Sakkas, Georgios and Chakraborty, Saikat and Musuvathi, Madan and Lahiri, Shuvendu
关键词: No keywords

Abstract

We introduce a novel workflow, TiCoder, designed to enhance the trust and accuracy of LLM-based code generation through interactive and guided intent formalization. TiCoder partially formalizes ambiguous intent in natural language prompts by generating a set of tests to distinguish common divergent behaviours in generated code suggestions. We evaluate the code generation accuracy improvements provided by TiCoder at scale across four competitive LLMs, and evaluate the cost-benefit trade-off of examining the tests surfaced by TiCoder through a user study with 15 participants.

DOI: 10.1145/3639478.3643525


Decoding Log Parsing Challenges: A Comprehensive Taxonomy for Actionable Solutions

作者: Sedki, Issam and Hamou-Lhadj, Abdelwahab and Ait-Mohamed, Otmane and Ezzati-Jivan, Naser and Shehab, Mohammed
关键词: No keywords

Abstract

Logging is a common practice in software engineering that is used by developers to understand the runtime aspects of a system. Log files, however, tend to vary in their structures, making it challenging to analyze their content. In this paper, we present a preliminary taxonomy of log event characteristics that commonly lead to log parsing errors. We achieve this through the analysis of 16 log datasets using eight different parsing tools. We believe that this taxonomy can be used to guide the design of better log parsing tools that can adapt to various log file structures. It can also pave the way to the development of logging guidelines and best practices.

DOI: 10.1145/3639478.3643523


GoSpeechLess: Interoperable Serverless ML-based Cloud Services

作者: Ristov, Sashko and Gritsch, Philipp and Meyer, David and Felderer, Michael
关键词: No keywords

Abstract

Recently, Backend-as-a-Service (BaaS)-enabled serverless functions have been rapidly gaining traction. However, the dependence on specific provider features and configurations still leads to challenges in terms of portability, underlying platform heterogeneity, and vendor lock-in. To bridge this gap, this paper introduces GoSpeechLess, a GoLang library that allows developers to code serverless functions with interoperable BaaS services in a uniform manner. GoSpeechLess is thereby able to reduce development effort, improving the maintainability index by up to 23.53% and reducing LOC by up to 59.4%. The trade-off is up to 9.21% higher runtime overhead.

DOI: 10.1145/3639478.3643123


Learning to Represent Patches

作者: Tang, Xunzhu and Tian, Haoye and Chen, Zhenghan and Pian, Weiguo and Ezzini, Saad and Kabore, Abdoul Kader and Habib, Andrew and Klein, Jacques and Bissyande, Tegawende F.
关键词: No keywords

Abstract

We propose Patcherizer, a novel patch representation methodology that combines context and structure intention features to capture the semantic changes in Abstract Syntax Trees (ASTs) and surrounding context of code changes. Utilizing graph convolutional neural networks and transformers, Patcherizer effectively captures the underlying intentions of patches, outperforming state-of-the-art representations with significant improvements in BLEU, ROUGE-L, and METEOR metrics for generating patch descriptions.

DOI: 10.1145/3639478.3643521


Automated Code Editing with Search-Generate-Modify

作者: Liu, Changshu and Cetin, Pelin and Patodia, Yogesh and Ray, Baishakhi and Chakraborty, Saikat and Ding, Yangruibo
关键词: bug fixing, automated program repair, edit-based neural network

Abstract

Code editing is essential in evolving software development. In the literature, several automated code editing tools have been proposed, which leverage information retrieval-based techniques and machine learning-based code generation and code editing models. A patch obtained by search and retrieval, even if incorrect, can provide helpful guidance to a code generation model. However, a retrieval-guided patch produced by a code generation model can still be a few tokens off from the intended patch. Such generated patches can be slightly modified to create the intended patches. We propose SarGaM, which mimics a developer's behavior: search for related patches, generate or write code, and then modify it to fit the right context. Our evaluation of SarGaM on edit generation shows superior performance w.r.t. the current state-of-the-art techniques. SarGaM also shows its effectiveness on automated program repair tasks.

DOI: 10.1145/3639478.3643124


How are Contracts Used in Android Mobile Applications?

作者: Ferreira, David R. and Mendes, Alexandra and Ferreira, Joao F.
关键词: design by contract, Android, assertions, kotlin, Java

Abstract

Formal contracts and assertions are effective methods to enhance software quality by enforcing preconditions, postconditions, and invariants. However, the adoption and impact of contracts in the context of mobile application development, particularly of Android applications, remain unexplored. We present the first large-scale empirical study on the presence and use of contracts in Android applications, written in Java or Kotlin. We consider 2,390 applications and five categories of contract elements: conditional runtime exceptions, APIs, annotations, assertions, and other. We show that most contracts are annotation-based and are concentrated in a small number of applications.

DOI: 10.1145/3639478.3643536


Causal Graph Fuzzing for Fair ML Software Development

作者: Monjezi, Verya and Kumar, Ashish and Tan, Gang and Trivedi, Ashutosh and Tizpaz-Niari, Saeid
关键词: No keywords

Abstract

Machine learning (ML) is increasingly used in high-stakes areas like autonomous driving, finance, and criminal justice. However, it often unintentionally perpetuates biases against marginalized groups. To address this, the software engineering community has developed fairness testing and debugging methods, establishing best practices for fair ML software. These practices focus on training model design, including the selection of sensitive and non-sensitive attributes and hyperparameter configuration. However, applying these practices across different socio-economic and cultural contexts is challenging, as societal constraints vary. Our study proposes a search-based software engineering approach to evaluate the robustness of these fairness practices. We formulate the practices as first-order logic properties and search for two neighboring datasets such that a practice holds on one dataset but fails on the other. Our key observation is that these practices should be general and robust to various uncertainties such as noise, faulty labeling, and demographic shifts. To generate datasets, we turn to causal graph representations of the datasets and apply perturbations over the causal graphs to produce neighboring datasets. In this short paper, we demonstrate our methodology using an example of predicting risk in a car insurance application.

DOI: 10.1145/3639478.3643530


Path Complexity Analysis for Interprocedural Code

作者: Kaniyur, Mira Bhagirathi and Cavalcante-Studart, Ana and Yang, Yihan and Park, Sangeon and Chen, David and Lam, Duy and Bang, Lucas
关键词: No keywords

Abstract

Symbolic execution’s path explosion is a critical issue in software testing, quantified by Asymptotic Path Complexity (APC) [3]. APC, more precise than cyclomatic [6] or NPATH [7] complexities, measures the effort to cover paths in code analysis [1]. It’s vital for testing, setting limits on path growth for tools like Klee [4], focusing previously on intraprocedural code [2, 8]. Our advancement, APC-IP, extends APC to interprocedural analysis, enhancing scalability and encompassing earlier models.

DOI: 10.1145/3639478.3643527


Extracting Relevant Test Inputs from Bug Reports for Automatic Test Case Generation

作者: Ou'{e
关键词: No keywords

Abstract

The pursuit of automating software test case generation, particularly for unit tests, has become increasingly important due to the labor-intensive nature of manual test generation [6]. However, a significant challenge in this domain is the inability of automated approaches to generate relevant inputs, which compromises the efficacy of the tests [6].

DOI: 10.1145/3639478.3643537


作者: Kumar, Abhishek and Das, Partha Pratim and Chakrabarti, Partha Pratim
关键词: LIME, SHAP, BERT, issue report, adversarial training, commit report, neural network

Abstract

In the field of software engineering, effectively managing software systems is essential. A key aspect of this management is the issue-commit link, which connects reported problems or enhancement requests (issues) with the actual code changes implemented in the software (commits). However, the robustness of various automated link recovery techniques, including the leading ML-based model Hybrid Linker, remains a subject of discussion. In this study, we investigate the Hybrid Linker model using interpretability tools like LIME and SHAP to understand its decision-making, especially its reliance on specific features. We assess its robustness against adversarial attacks, revealing its sensitivity to non-textual features like issue and commit dates. To address this, we introduce ICLNet (Issue Commit Link Network), which leverages BERT embeddings in a custom neural network. Our extensive adversarial tests show that ICLNet outperforms Hybrid Linker in adversarial settings, demonstrating greater resilience. ICLNet achieves a remarkable average F-score of 88.39% in adversarial scenarios, significantly surpassing Hybrid Linker's 62.11%. This confirms ICLNet's superiority in diverse conditions, highlighting its accuracy and robustness.

DOI: 10.1145/3639478.3643532


NL2Fix: Generating Functionally Correct Code Edits from Bug Descriptions

作者: Fakhoury, Sarah and Chakraborty, Saikat and Musuvathi, Madanlal and Lahiri, Shuvendu K.
关键词: No keywords

Abstract

Despite the notable advancement of Large Language Models for Code Generation, there is a distinct gap in benchmark datasets and evaluation of LLMs’ proficiency in generating functionally correct code edits based on natural language descriptions of intended changes. We address this void by presenting the challenge of translating natural language descriptions of code changes, particularly bug fixes outlined in Issue reports within repositories, into accurate code fixes. To tackle this issue, we introduce Defects4J-Nl2fix, a dataset comprising 283 Java programs from the widely-used Defects4J dataset, augmented with high-level descriptions of bug fixes. Subsequently, we empirically evaluate three state-of-the-art LLMs on this task, exploring the impact of different prompting strategies on their ability to generate functionally correct edits. Results show varied ability across models on this novel task. Collectively, the studied LLMs are able to produce plausible fixes for 64.6% of the bugs.

DOI: 10.1145/3639478.3643526


Automated Security Repair for Helm Charts

作者: Minna, Francesco and Blaise, Agathe and Massacci, Fabio and Tuma, Katja
关键词: helm charts, automated security repair, kubernetes, misconfigurations

Abstract

We aim to evaluate and compare open-source static analyzers for Helm Charts, a package manager used to deploy applications on Kubernetes (K8s). Specifically, we developed a pipeline to measure which misconfigurations are found by each tool, to provide automatic misconfiguration repair, and to check whether the latter breaks application functionality. To evaluate our approach, we analyzed the 60 most common Helm Charts available on Artifact Hub and seven open-source Helm Chart analyzers, and generated functionality profiles for each chart application. We found several bugs and inconsistency issues with the tools, which we reported on the respective tool repositories, and concluded that tools intended to provide automatic security repair still require significant manual intervention.

DOI: 10.1145/3639478.3643534


Multi-source Anomaly Detection For Microservice Systems

作者: Li, Zhengxin and Zhao, Junfeng and Kang, Jia
关键词: microservices, anomaly detection, multi-source data, attention network, feature extraction

Abstract

Microservices architecture has advantages such as independent development, independent deployment, scalability, and reusability. However, faults are inevitable during the operation of microservice systems. This paper introduces an anomaly detection approach based on multi-source data, which combines multi-level attention mechanisms and multi-scale convolutional neural networks. It designs feature extraction modules for different data sources, effectively capturing features of log data and KPI (Key Performance Indicator) data. The extracted features from the different data sources are fed in parallel to an attention network, where they are weighted and fused. Finally, the fused features are input into the anomaly detection model for detection. We deployed an open-source benchmark microservices system, TrainTicket [1], and injected various typical faults to validate our approach. Experimental results indicate that, compared to existing approaches, our approach identifies anomalies more accurately.
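The attention-weighted fusion of per-source features described above can be sketched as follows (a generic illustration, not the paper's architecture; dimensions, projections, and the detection head are assumptions): project the log-feature and KPI-feature vectors into a shared space, score each source with a learned attention layer, and fuse before classification.

```python
# Generic attention-based fusion of features from two sources (logs and KPIs)
# followed by a binary anomaly-detection head.  Dimensions are illustrative.
import torch
import torch.nn as nn

class MultiSourceFusion(nn.Module):
    def __init__(self, log_dim=64, kpi_dim=32, hidden=64):
        super().__init__()
        self.log_proj = nn.Linear(log_dim, hidden)   # stand-ins for the per-source
        self.kpi_proj = nn.Linear(kpi_dim, hidden)   # CNN/attention extractors
        self.attn = nn.Linear(hidden, 1)             # scores each source representation
        self.head = nn.Linear(hidden, 2)             # normal vs. anomalous

    def forward(self, log_feat, kpi_feat):
        sources = torch.stack(
            [torch.relu(self.log_proj(log_feat)), torch.relu(self.kpi_proj(kpi_feat))],
            dim=1)                                          # (batch, 2, hidden)
        weights = torch.softmax(self.attn(sources), dim=1)  # (batch, 2, 1)
        fused = (weights * sources).sum(dim=1)              # weighted fusion
        return self.head(fused)

model = MultiSourceFusion()
logits = model(torch.randn(8, 64), torch.randn(8, 32))
print(logits.shape)  # torch.Size([8, 2])
```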

DOI: 10.1145/3639478.3643535


F-CodeLLM: A Federated Learning Framework for Adapting Large Language Models to Practical Software Development

作者: Cai, Zeju and Chen, Jianguo and Chen, Wenqing and Wang, Weicheng and Zhu, Xiangyuan and Ouyang, Aijia
关键词: code intelligence, federated fine-tuning, large language model, software development

Abstract

Large Language Models (LLMs) have revolutionized code intelligence tasks, but their performance in specific software development tasks often requires fine-tuning with task-specific data. However, acquiring such data is challenging due to privacy concerns. We introduce F-CodeLLM, a novel federated learning framework for adapting LLMs to software development tasks while preserving code data privacy. Leveraging federated learning and LoRA-based efficient fine-tuning, F-CodeLLM allows organizations to collaboratively improve LLMs without sharing sensitive data. Our experiments demonstrate that F-CodeLLM achieves comparable results to centralized fine-tuning methods and excels in multi-language environments, marking a significant advancement in the application of LLMs for software engineering.
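As a rough sketch of the federated LoRA idea (not F-CodeLLM's actual protocol; client counts, adapter shapes, and size-based weighting are assumptions), the snippet below averages only the low-rank adapter matrices returned by clients, weighted by their local dataset sizes; the base model weights are never exchanged during training rounds.

```python
# FedAvg over LoRA adapter parameters only: clients fine-tune small A/B
# matrices locally and send just those back; base weights stay put.
import torch

def fedavg_lora(client_adapters, client_sizes):
    """Average adapter state dicts, weighted by each client's dataset size."""
    total = sum(client_sizes)
    averaged = {}
    for name in client_adapters[0]:
        averaged[name] = sum(
            adapters[name] * (size / total)
            for adapters, size in zip(client_adapters, client_sizes))
    return averaged

# Two clients with rank-8 adapters for a 512x512 projection layer.
def random_adapter():
    return {"lora_A": torch.randn(8, 512), "lora_B": torch.randn(512, 8)}

global_adapter = fedavg_lora([random_adapter(), random_adapter()], [1200, 300])
print({k: tuple(v.shape) for k, v in global_adapter.items()})
```

Because only the adapters travel over the network, each round exchanges a tiny fraction of the full model's parameters, which is what makes the federated setting practical for LLMs.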

DOI: 10.1145/3639478.3643533


NomNom: Explanatory Function Names for Program Synthesizers

作者: Nazari, Amirmohammad and Chattopadhyay, Souti and Swayamdipta, Swabha and Raghothaman, Mukund
关键词: No keywords

Abstract

Despite great advances in program synthesis techniques, they remain algorithmic black boxes. Although they guarantee that when synthesis is successful, the implementation satisfies the specification, they provide no additional information regarding how the implementation works or the manner in which the specification is realized. One possibility to answer these questions is to use large language models to construct human-readable explanations. Unfortunately, experiments reveal that LLMs frequently produce nonsensical or misleading explanations when applied to the unidiomatic code produced by program synthesizers. In this paper, we develop an approach to reliably augment the implementation with explanatory names. Experiments and user studies indicate that these names help users in understanding synthesized implementations.

DOI: 10.1145/3639478.3643529


Improving Fairness in Machine Learning Software via Counterfactual Fairness Thinking

作者: Yin, Zhipeng and Wang, Zichong and Zhang, Wenbin
关键词: No keywords

Abstract

Machine Learning (ML) software is increasingly influencing decisions that impact individuals’ lives. However, some of these decisions show discrimination and thus introduce algorithmic biases against certain social subgroups defined by sensitive attributes (e.g., gender or race). This has elevated software fairness bugs to an increasingly significant concern for software engineering (SE). However, most existing bias mitigation works enhance software fairness, a non-functional software property, at the cost of software performance. To this end, we proposed a novel framework, namely Group Equality Counterfactual Fairness (GECF), which aims to mitigate sensitive attribute bias and labeling bias using counterfactual fairness while reducing the resulting performance loss based on ensemble learning. Experimental results on 6 real-world datasets show the superiority of our proposed framework from different aspects.

DOI: 10.1145/3639478.3643531


Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models

作者: Lian, Xiaoli and Wang, Shuaisong and Ma, Jieping and Tan, Xin and Liu, Fang and Shi, Lin and Gao, Cuiyun and Zhang, Li
关键词: No keywords

Abstract

The task of code generation has received significant attention in recent years, especially when the pre-trained large language models (LLMs) for code have consistently achieved state-of-the-art performance. However, there is currently a lack of a comprehensive weakness taxonomy in the field, uncovering weaknesses in automatic code generation by LLMs. This may lead the community to invest excessive efforts into well-known hotspots while neglecting many crucial yet unrecognized issues that deserve more attention. To bridge this gap, we conduct a systematic study on analyzing the weaknesses based on three state-of-the-art LLMs across three widely-used code generation datasets. Our study identifies eight types of weaknesses and assesses their prevalence across each LLM and dataset, aiming to inform and shape the trajectory of future research in the domain.

DOI: 10.1145/3639478.3643081


Analyzing Software Energy Consumption

作者: Noureddine, Adel
关键词: No keywords

Abstract

Analyzing the energy consumption of applications is a crucial step in building energy-efficient software. In this technical briefing, we detail software energy measurement, starting from hardware components and going down to measuring source code. In particular, we showcase how practitioners can diagnose the energy consumption of individual methods and execution branches at runtime. We show how this diagnosis helps in identifying energy hotspots and guiding practitioners in optimizing software energy.
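A minimal sketch of method-level energy accounting, assuming a Linux machine with an Intel CPU exposing RAPL counters via the powercap sysfs interface (this is my own illustration, not the briefing's tooling; the path, permissions, and counter-wraparound handling are assumptions):

```python
# Wrap a function and report the package energy consumed while it runs,
# by reading the RAPL counter before and after.  Counter wraparound and
# concurrent system activity are ignored for simplicity.
import time
from functools import wraps

RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package-0 energy, microjoules

def read_energy_uj():
    with open(RAPL_FILE) as f:
        return int(f.read())

def measure_energy(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        e0, t0 = read_energy_uj(), time.time()
        result = func(*args, **kwargs)
        e1, t1 = read_energy_uj(), time.time()
        print(f"{func.__name__}: {(e1 - e0) / 1e6:.3f} J over {t1 - t0:.3f} s")
        return result
    return wrapper

@measure_energy
def hot_loop(n=5_000_000):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    hot_loop()
```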

DOI: 10.1145/3639478.3643058


Quantum Software Testing 101

作者: Ali, Shaukat
关键词: quantum computing, quantum programs, quantum software testing

Abstract

Quantum software testing (QST) is an emerging research area within quantum software engineering (QSE) to ensure quantum software functional and non-functional correctness and dependability. Since quantum computers perform computations significantly differently than classical computing, testing quantum software running on these quantum computers also differs due to quantum computing’s unique characteristics, e.g., entanglement and superposition. Due to the rising interest of the software engineering community in QSE, we will provide an introduction to QST. We will introduce quantum computing and its various principles, quantum software development as quantum circuits, and current QST literature, including a key set of techniques with examples. Finally, a set of future research challenges related to QST will be presented.

DOI: 10.1145/3639478.3643059


Technical Briefing on Deep Neural Network Repair

作者: Arcaini, Paolo and Ishikawa, Fuyuki and Ma, Lei and Maezawa, Yuta and Yoshioka, Nobukazu and Zhang, Fuyuan
关键词: deep neural networks, DNN repair, fault localisation

Abstract

Deep Neural Networks (DNNs) are used for different tasks in many domains, some of them safety-critical, such as autonomous driving. When in operation, a DNN may misbehave on inputs unseen during training. DNN repair is a new, emerging technique that tries to improve the DNN to fix these misbehaviours without affecting the correct behaviours. The technical briefing will give an overview of the different DNN repair techniques that have been proposed in the literature, which differ in the faults they target, the way they localise them, and the technical approach they employ to modify the network. Moreover, the technical briefing will also demonstrate the usage of some of these DNN repair techniques.

DOI: 10.1145/3639478.3643063


Technical Brief on Software Engineering for FMware

作者: Lin, Dayi and Cogo, Filipe Roseiro and Rajbahadur, Gopi Krishnan and Hassan, Ahmed E.
关键词: foundation model, FMware, software engineering for FMware

Abstract

Foundation Models (FM) like GPT-4 have given rise to FMware, FM-powered applications, which represent a new generation of software developed with new roles, assets, and paradigms. FMware has been widely adopted in both software engineering (SE) research (e.g., test generation) and industrial products (e.g., GitHub Copilot), despite the numerous challenges introduced by the stochastic nature of FMs. Such challenges jeopardize the quality and trustworthiness of FMware. In our technical brief, we will present the latest research and industrial practices in engineering FMware, and discuss the SE challenges and opportunities facing both researchers and practitioners in the FMware era. The brief is unique in that it is presented from an SE point of view, not an AI point of view, ensuring that attendees are not bogged down in complex mathematical and AI details unless they are essential for contextualizing the SE challenges and opportunities.

DOI: 10.1145/3639478.3643062


Technical Briefing on Parameter Efficient Fine-Tuning of (Large) Language Models for Code-Intelligence

作者: H. Fard, Fatemeh
关键词: parameter efficient fine tuning, code language models, large language models

Abstract

Large Language Models (LLMs) have gained much attention in the Software Engineering (SE) community, specifically for code-related tasks. Though a common approach is to fine-tune these models fully, it is a computationally heavy and time-consuming process that is not accessible to all. More importantly, with billions of parameters in the models, fully fine-tuning them for new tasks or domains is infeasible or inefficient. This technical briefing covers the alternative approach, Parameter-Efficient Fine-Tuning (PEFT), discussing the state-of-the-art techniques and reflecting on the few studies of using PEFT in software engineering and how changing the current PEFT architectures from natural language processing could enhance performance for code-related tasks.
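As a concrete example of one widely used PEFT technique, LoRA, the sketch below (a generic illustration, not tied to any specific PEFT library) wraps a frozen linear layer with a trainable low-rank update, so only the small A and B matrices are trained.

```python
# Generic LoRA wrapper: the base weight is frozen and only the low-rank
# matrices A and B are trained, shrinking the number of trainable parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # only A and B are trainable
```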

DOI: 10.1145/3639478.3643060


Technical Briefing on Socio-Technical Grounded Theory for Qualitative Data Analysis

作者: Hoda, Rashina
关键词: qualitative data analysis, socio-technical grounded theory, STGT

Abstract

This technical briefing will focus on imparting a good understanding of the nature of qualitative data, relevant collection techniques, and the application of robust and systematic qualitative data analysis using socio-technical grounded theory (STGT). It will include hands-on exercises and examples from published studies that have applied STGT for qualitative data analysis in qualitative as well as mixed methods research studies.

DOI: 10.1145/3639478.3643061


An Ensemble Method for Bug Triaging using Large Language Models

作者: Kumar Dipongkor, Atish
关键词: No keywords

Abstract

This study delves into the automation of bug triaging — the process of assigning bug reports to appropriate developers and components in software development. At the core of our investigation are six transformer-based Large Language Models (LLMs), which we fine-tuned using a sequence classification method tailored for bug triaging tasks. Our results demonstrate a noteworthy performance of the DeBERTa model, which significantly outperforms its counterparts CodeBERT, DistilBERT, RoBERTa, ALBERT, and BERT in terms of effectiveness. However, it is crucial to note that despite the varying performance of each model, each model exhibits a unique degree of orthogonality, indicating distinct strengths in their bug triaging capabilities. Leveraging these orthogonal characteristics, we propose an ensemble method combining these LLMs through voting and stacking techniques. Remarkably, our findings reveal that the voting-based ensemble method surpasses all individual baselines in performance.
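The voting-based ensemble described above can be illustrated with a minimal sketch (not the study's pipeline; model names are real, but their predictions here are mocked): each fine-tuned classifier predicts an assignee for a bug report, and the majority label wins.

```python
# Minimal hard-voting ensemble over per-model developer predictions for one
# bug report; the predicted assignees are mocked for illustration.
from collections import Counter

predictions = {
    "DeBERTa":    "alice",
    "CodeBERT":   "bob",
    "DistilBERT": "alice",
    "RoBERTa":    "alice",
    "ALBERT":     "carol",
    "BERT":       "bob",
}

votes = Counter(predictions.values())
assignee, count = votes.most_common(1)[0]
print(f"assign to {assignee} ({count}/{len(predictions)} votes)")
```

A stacking variant would instead feed each model's class probabilities into a small meta-classifier trained to produce the final assignment.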

DOI: 10.1145/3639478.3641228


Flakiness Repair in the Era of Large Language Models

作者: Chen, Yang
关键词: software testing, test flakiness, large language models

Abstract

Flaky tests can non-deterministically pass or fail regardless of any change to the code, which negatively impacts the effectiveness of the regression testing. Prior repair techniques for flaky tests mainly leverage program analysis techniques to mitigate test flakiness, which only focus on Order-Dependent (OD) and Implementation-Dependent (ID) flakiness with known flakiness patterns and root causes. In this paper, we propose an approach to repair flaky tests with the power of Large Language Models (LLMs). Our approach successfully repaired 79% of OD tests and 58% of ID tests in an extensive evaluation using 666 flaky tests from 222 projects. We submitted pull requests to fix 61 flaky tests; at the time of submission, 19 tests have already been accepted. However, we observed that currently LLMs are ineffective in adequately repairing Non-Order-Dependent (NOD) flaky tests by analyzing 118 of such tests from 11 projects.

DOI: 10.1145/3639478.3641227


Vulnerability Root Cause Function Locating For Java Vulnerabilities

作者: Zhang, Lyuye
关键词: No keywords

Abstract

Software Composition Analysis has emerged as an essential solution for mitigating vulnerabilities within the dependencies of software projects. Reachability analysis has been increasingly leveraged to streamline vulnerability remediation procedures by prioritizing reachable vulnerabilities; such analysis requires the code-level root cause of a vulnerability. Notwithstanding, pinpointing the root cause leading to exploitation is laborious and resource-intensive, given the requisite manual oversight from specialists. To this end, we introduce root cause function Finder (RCFer), a solution capable of autonomously identifying root cause functions by utilizing semantic analysis of enriched vulnerability descriptions and source code. The top-10 outcomes successfully pinpoint root cause functions for 73.81% of the assessed vulnerabilities.

DOI: 10.1145/3639478.3641225


IntTracer: Sanitization-aware IO2BO Vulnerability Detection across Codebases

作者: Chen, Xiang
关键词: integer overflow, taint analysis, recurring vulnerability, interval analysis

Abstract

Integer Overflow to Buffer Overflow (IO2BO) vulnerabilities represent a common vulnerability pattern in system software and can be detected by various program analysis methods. Mainstream static approaches apply taint analysis to find source-sink pairs and then submit those suspicious bug traces to dynamic instrumentation or static encoding. However, previous works utilizing those methods either fail to handle sanitization code well or cannot generalize across codebases. In this paper, we present IntTracer, which is enhanced with an interval domain to model the effect of sanitization code in IO2BO bug traces and can find recurring vulnerabilities across different codebases. IntTracer prevents false positives in 8 cases while keeping an overhead of 6.3% compared to the previous work Tracer.

DOI: 10.1145/3639478.3641223


Classifying Source Code: How Far Can Compressor-based Classifiers Go?

作者: Yang, Zhou
关键词: defect software prediction, robustness, efficient learning

Abstract

Pre-trained language models of code, which are built upon large-scale datasets, millions of trainable parameters, and high computational resources cost, have achieved phenomenal success. Recently, researchers have proposed a compressor-based classifier (Cbc); it trains no parameters but is found to outperform BERT. We conduct the first empirical study to explore whether this lightweight alternative can accurately classify source code. Our study is more than applying Cbc to code-related tasks. We first identify an issue that the original implementation overestimates Cbc. After correction, Cbc’s performance on defect prediction drops from 80.7% to 63.0%, which is still comparable to CodeBERT (63.7%). We find that hyperparameter settings affect the performance. Besides, results show that Cbc can outperform CodeBERT when the training data is small, making it a good alternative in low-resource settings.
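For context, a compressor-based classifier of the kind studied above can be written in a few lines: the sketch below (a generic illustration of the idea with toy data, not the paper's corrected implementation) uses gzip and normalized compression distance with a 1-nearest-neighbour rule to label a code snippet.

```python
# Compressor-based classification: gzip + normalized compression distance (NCD)
# with 1-nearest-neighbour.  Training snippets and labels are toy data.
import gzip

def clen(s: str) -> int:
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

train = [
    ("for (int i = 0; i < n; i++) sum += a[i];", "loop"),
    ("while (node != null) { node = node.next; }", "loop"),
    ("if (x == null) throw new IllegalArgumentException();", "check"),
    ("assert value >= 0 : \"negative\";", "check"),
]

query = "for (int j = 0; j < m; j++) total += b[j];"
label = min(train, key=lambda item: ncd(query, item[0]))[1]
print(label)  # expected: "loop"
```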

DOI: 10.1145/3639478.3641229


Program Decomposition and Translation with Static Analysis

作者: Ibrahimzada, Ali Reza
关键词: No keywords

Abstract

The rising popularity of Large Language Models (LLMs) has motivated exploring their use in code-related tasks. Code LLMs with more than millions of parameters are trained on a massive amount of code in different Programming Languages (PLs). Such models are used for automating various Software Engineering (SE) tasks using prompt engineering. However, given the very large size of industry-scale project files, a major issue of these LLMs is their limited context window size, motivating the question: "Can these LLMs process very large files, and can we effectively perform prompt engineering?" Code translation aims to convert source code from one PL to another. In this work, we assess the effect of method-level program decomposition on the context window of LLMs and investigate how this approach can enable the translation of very large files that originally could not be handled due to out-of-context issues. Our observations from 20 well-known Java projects and approximately 60K methods suggest that method-level program decomposition significantly alleviates the limited context window problem of LLMs, by 99.5%. Furthermore, our empirical analysis indicates that with method-level decomposition, each input fragment on average consumes only 5% of the context window, leaving more context space for prompt engineering and the output. Finally, we investigate the effectiveness of a Call Graph (CG) approach for translating very large files when doing method-level program decomposition.

DOI: 10.1145/3639478.3641226


Refining Abstract Specifications into Dangerous Traffic Scenarios

作者: Babikian, Aren A.
关键词: No keywords

Abstract

Safety assurance of autonomous vehicles (AVs) is particularly challenging when considering the infinite number of scenarios an AV may encounter. As such, existing scenario generation approaches optimize search to derive dangerous refinements of a single abstract scenario given as input. In this paper, we propose a scenario generation approach that derives dangerous (collision-inducing) concrete scenarios from arbitrary abstract scenarios (under reasonable assumptions). As an added novelty, our approach makes it possible to compare the level of danger posed by different abstract scenarios. We evaluate the collision avoidance capacity of the Transfuser AV controller by generating, then simulating, collision-inducing 2-actor scenarios at a road junction. Results show that distinctions at higher abstraction levels yield measurable differences in simulation.

DOI: 10.1145/3639478.3641224


When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?

作者: Chen, Yuxiao and Wu, Jingzheng and Ling, Xiang and Li, Changjiang and Rui, Zhiqing and Luo, Tianyue and Wu, Yanjun
关键词: large language models, automatic program repair, repository-level bugs, context, static analysis

Abstract

In recent years, large language models (LLMs) have demonstrated substantial potential in addressing automatic program repair (APR) tasks. However, the current evaluation of these models for APR tasks focuses solely on the limited context of the single function or file where the bug is located, overlooking the valuable information in the repository-level context. This paper investigates the performance of popular LLMs in handling repository-level repair tasks. We introduce RepoBugs, a new benchmark comprising 124 typical repository-level bugs from open-source repositories. Preliminary experiments using GPT3.5 based on the function where the error is located, reveal that the repair rate on RepoBugs is only 22.58%, significantly diverging from the performance of GPT3.5 on function-level bugs in related studies. This underscores the importance of providing repository-level context when addressing bugs at this level. However, the repository-level context offered by the preliminary method often proves redundant and imprecise and easily exceeds the prompt length limit of LLMs. To solve the problem, we propose a simple and universal repository-level context extraction method (RLCE) designed to provide more precise context for repository-level code repair tasks. Evaluations of three mainstream LLMs show that RLCE significantly enhances the ability to repair repository-level bugs. The improvement reaches a maximum of 160% compared to the preliminary method. Additionally, we conduct a comprehensive analysis of the effectiveness and limitations of RLCE, along with the capacity of LLMs to address repository-level bugs, offering valuable insights for future research.

DOI: 10.1145/3639478.3647633


ReposVul: A Repository-Level High-Quality Vulnerability Dataset

作者: Wang, Xinchen and Hu, Ruida and Gao, Cuiyun and Wen, Xin-Cheng and Chen, Yujia and Liao, Qing
关键词: open-source software, software vulnerability datasets, data quality

Abstract

Open-Source Software (OSS) vulnerabilities bring great challenges to software security and pose potential risks to society. Enormous efforts have been devoted to automated vulnerability detection, among which deep learning (DL)-based approaches have proven to be the most effective. However, the performance of DL-based approaches generally relies on the quantity and quality of labeled data, and current labeled data present the following limitations: (1) Tangled patches: developers may submit code changes unrelated to vulnerability fixes within patches, leading to tangled patches. (2) Lacking inter-procedural vulnerabilities: existing vulnerability datasets typically contain function-level and file-level vulnerabilities, ignoring the relations between functions, thus rendering the approaches unable to detect inter-procedural vulnerabilities. (3) Outdated patches: existing datasets usually contain outdated patches, which may bias the model during training.

To address the above limitations, in this paper we propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset, named ReposVul. The proposed framework contains three main modules: (1) a vulnerability untangling module, which distinguishes vulnerability-fixing code changes from tangled patches by jointly employing Large Language Models (LLMs) and static analysis tools; (2) a multi-granularity dependency extraction module, which captures the inter-procedural call relationships of vulnerabilities by constructing multi-granularity information for each vulnerability patch, covering the repository, file, function, and line levels; and (3) a trace-based filtering module, which filters outdated patches by leveraging a file-path-trace-based filter and a commit-time-trace-based filter to construct an up-to-date dataset.

The constructed repository-level ReposVul encompasses 6,134 CVE entries representing 236 CWE types across 1,491 projects and four programming languages. Thorough data analysis and manual checking demonstrate that ReposVul is high in quality and alleviates the problems of tangled and outdated patches in previous vulnerability datasets.

DOI: 10.1145/3639478.3647634


MissConf: LLM-Enhanced Reproduction of Configuration-Triggered Bugs

作者: Fu, Ying and Wang, Teng and Li, Shanshan and Ding, Jinyan and Zhou, Shulin and Jia, Zhouyang and Li, Wang and Jiang, Yu and Liao, Xiangke
关键词: bug reproduction, software configuration, software maintenance

Abstract

Bug reproduction stands as a pivotal phase in software development, but the absence of configuration information emerges as the main obstacle to effective bug reproduction. Since configuration options generally control critical branches of the software, many bugs can only be triggered under specific configuration settings. We refer to these bugs as configuration-triggered bugs, or CTBugs for short. The reproduction of CTBugs consumes considerable time and manual effort due to the challenges in deducing the missing configuration options within the vast search space of configurations. This complexity contributes to a form of technical debt in software development.

To address these challenges, we first conducted an empirical study on 120 CTBugs from 4 widely used systems to understand the root causes and factors influencing the reproduction of CTBugs. Based on our study, we designed and implemented MissConf, the first LLM-enhanced automated tool for CTBug reproduction. MissConf first leverages the LLM to infer whether crucial configuration options are missing in the bug report. Once a suspect CTBug is found, MissConf employs configuration taint analysis and dynamic monitoring to filter the set of suspicious configuration options. Furthermore, it adopts a heuristic strategy for identifying crucial configuration options and their corresponding values. We evaluated MissConf on 5 real-world software systems. The experimental results demonstrate that MissConf successfully infers the missing configuration options for 84% (41/49) of the CTBugs and reproduces 65% (32/49) of them. In the reproduction phase, MissConf eliminates up to 76% of irrelevant configurations, offering significant time savings for developers.

DOI: 10.1145/3639478.3647635



目录
  1. Challenges and Opportunities in Model Checking Large-scale Distributed Systems
    1. Abstract
  • Software Engineering Research in a World with Generative Artificial Intelligence
    1. Abstract
  • Trustworthy by Design
    1. Abstract
  • Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Errors
    1. Abstract
  • Practical Program Repair via Preference-based Ensemble Strategy
    1. Abstract
  • Learning and Repair of Deep Reinforcement Learning Policies from Fuzz-Testing Data
    1. Abstract
  • BinAug: Enhancing Binary Similarity Analysis with Low-Cost Input Repairing
    1. Abstract
  • VeRe: Verification Guided Synthesis for Repairing Deep Neural Networks
    1. Abstract
  • RUNNER: Responsible UNfair NEuron Repair for Enhancing Deep Neural Network Fairness
    1. Abstract
  • ITER: Iterative Neural Repair for Multi-Location Patches
    1. Abstract
  • EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal Learning
    1. Abstract
  • A Comprehensive Study of Learning-based Android Malware Detectors under Challenging Environments
    1. Abstract
  • Toward Automatically Completing GitHub Workflows
    1. Abstract
  • UniLog: Automatic Logging via LLM and In-Context Learning
    1. Abstract
  • Predicting Performance and Accuracy of Mixed-Precision Programs for Precision Tuning
    1. Abstract
  • Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection
    1. Abstract
  • Large Language Models for Test-Free Fault Localization
    1. Abstract
  • CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace
    1. Abstract
  • Reorder Pointer Flow in Sound Concurrency Bug Prediction
    1. Abstract
  • Object Graph Programming
    1. Abstract
  • Semantic Analysis of Macro Usage for Portability
    1. Abstract
  • NuzzleBug: Debugging Block-Based Programs in Scratch
    1. Abstract
  • LogShrink: Effective Log Compression by Leveraging Commonality and Variability of Log Data
    1. Abstract
  • Demystifying Compiler Unstable Feature Usage and Impacts in the Rust Ecosystem
    1. Abstract
  • Resource Usage and Optimization Opportunities in Workflows of GitHub Actions
    1. Abstract
  • Revealing Hidden Threats: An Empirical Study of Library Misuse in Smart Contracts
    1. Abstract
  • Fine-SE: Integrating Semantic Features and Expert Features for Software Effort Estimation
    1. Abstract
  • Kind Controllers and Fast Heuristics for Non-Well-Separated GR(1) Specifications
    1. Abstract
  • It’s Not a Feature, It’s a Bug: Fault-Tolerant Model Mining from Noisy Data
    1. Abstract
  • Enabling Runtime Verification of Causal Discovery Algorithms with Automated Conditional Independence Reasoning
    1. Abstract
  • Modularizing while Training: A New Paradigm for Modularizing DNN Models
    1. Abstract
  • KnowLog: Knowledge Enhanced Pre-trained Language Model for Log Understanding
    1. Abstract
  • FAIR: Flow Type-Aware Pre-Training of Compiler Intermediate Representations
    1. Abstract
  • Exploring the Potential of ChatGPT in Automated Code Refinement: An Empirical Study
    1. Abstract
  • Deep Learning or Classical Machine Learning? An Empirical Study on Log-Based Anomaly Detection
    1. Abstract
  • TRACED: Execution-aware Pre-training for Source Code
    1. Abstract
  • CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
    1. Abstract
  • Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment
    1. Abstract
  • Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning
    1. Abstract
  • On Using GUI Interaction Data to Improve Text Retrieval-based Bug Localization
    1. Abstract
  • DEMISTIFY: Identifying On-device Machine Learning Models Stealing and Reuse Vulnerabilities in Mobile Apps
    1. Abstract
  • How do Developers Talk about GitHub Actions? Evidence from Online Software Development Community
    1. Abstract
  • Block-based Programming for Two-Armed Robots: A Comparative Study
    1. Abstract
  • BOMs Away! Inside the Minds of Stakeholders: A Comprehensive Study of Bills of Materials for Software Systems
    1. Abstract
  • EDEFuzz: A Web API Fuzzer for Excessive Data Exposures
    1. Abstract
  • Detecting Logic Bugs in Graph Database Management Systems via Injective and Surjective Graph Query Transformation
    1. Abstract
  • Do Automatic Test Generation Tools Generate Flaky Tests?
    1. Abstract
  • ECFuzz: Effective Configuration Fuzzing for Large-Scale Systems
    1. Abstract
  • Improving Testing Behavior by Gamifying IntelliJ
    1. Abstract
  • SCTrans: Constructing a Large Public Scenario Dataset for Simulation Testing of Autonomous Driving Systems
    1. Abstract
  • Co-Creation in Fully Remote Software Teams
    1. Abstract
  • A Large-Scale Survey on the Usability of AI Programming Assistants: Successes and Challenges
    1. Abstract
  • How to Support ML End-User Programmers through a Conversational Agent
    1. Abstract
  • Unveiling the Life Cycle of User Feedback: Best Practices from Software Practitioners
    1. Abstract
  • Novelty Begets Popularity, But Curbs Participation - A Macroscopic View of the Python Open-Source Ecosystem
    1. Abstract
  • Characterizing Software Maintenance Meetings: Information Shared, Discussion Outcomes, and Information Captured
    1. Abstract
  • Predicting open source contributor turnover from value-related discussions: An analysis of GitHub issues
    1. Abstract
  • On the Helpfulness of Answering Developer Questions on Discord with Similar Conversations and Posts from the Past
    1. Abstract
  • Marco: A Stochastic Asynchronous Concolic Explorer
    1. Abstract
  • Smart Contract and DeFi Security Tools: Do They Meet the Needs of Practitioners?
    1. Abstract
  • DocFlow: Extracting Taint Specifications from Software Documentation
    1. Abstract
  • Toward Improved Deep Learning-based Vulnerability Detection
    1. Abstract
  • Attention! Your Copied Data is Under Monitoring: A Systematic Study of Clipboard Usage in Android Apps
    1. Abstract
  • PonziGuard: Detecting Ponzi Schemes on Ethereum with Contract Runtime Behavior Graph (CRBG)
    1. Abstract
  • FuzzSlice: Pruning False Positives in Static Analysis Warnings through Function-Level Fuzzing
    1. Abstract
  • LibvDiff: Library Version Difference Guided OSS Version Identification in Binaries
    1. Abstract
  • Prompting Is All You Need: Automated Android Bug Replay with Large Language Models
    1. Abstract
  • Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles
    1. Abstract
  • Learning-based Widget Matching for Migrating GUI Test Cases
    1. Abstract
  • Large Language Models are Edge-Case Generators: Crafting Unusual Programs for Fuzzing Deep Learning Libraries
    1. Abstract
  • Deeply Reinforcing Android GUI Testing with Deep Reinforcement Learning
    1. Abstract
  • Unveiling Memorization in Code Models
    1. Abstract
  • Code Search is All You Need? Improving Code Suggestions with Code Search
    1. Abstract
  • On Extracting Specialized Code Abilities from Large Language Models: A Feasibility Study
    1. Abstract
  • When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference
    1. Abstract
  • GrammarT5: Grammar-Integrated Pretrained Encoder-Decoder Neural Model for Code
    1. Abstract
  • On Calibration of Pre-trained Code Models
    1. Abstract
  • Traces of Memorisation in Large Language Models for Code
    1. Abstract
  • Language Models for Code Completion: A Practical Evaluation
    1. Abstract
  • Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models
    1. Abstract
  • Evaluating Large Language Models in Class-Level Code Generation
    1. Abstract
  • Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code
    1. Abstract
  • Out of Context: How important is Local Context in Neural Program Repair?
    1. Abstract
  • Automated Program Repair, What Is It Good For? Not Absolutely Nothing!
    1. Abstract
  • Rust-lancet: Automated Ownership-Rule-Violation Fixing with Behavior Preservation
    1. Abstract
  • Exploring Experiences with Automated Program Repair in Practice
    1. Abstract
  • PyTy: Repairing Static Type Errors in Python
    1. Abstract
  • Out of Sight, Out of Mind: Better Automatic Vulnerability Repair by Broadening Input Ranges and Sources
    1. Abstract
  • Strengthening Supply Chain Security with Fine-grained Safe Patch Identification
    1. Abstract
  • Comprehensive Semantic Repair of Obsolete GUI Test Scripts for Mobile Applications
    1. Abstract
  • Constraint Based Program Repair for Persistent Memory Bugs
    1. Abstract
  • Xpert: Empowering Incident Management with Query Recommendations via Large Language Models
    1. Abstract
  • Tensor-Aware Energy Accounting
    1. Abstract
  • Programming Assistant for Exception Handling with CodeBERT
    1. Abstract
  • An Empirical Study on Noisy Label Learning for Program Understanding
    1. Abstract
  • An Empirical Study on Low GPU Utilization of Deep Learning Jobs
    1. Abstract
  • Using an LLM to Help With Code Understanding
    1. Abstract
  • Enhancing Exploratory Testing by Large Language Model and Knowledge Graph
    1. Abstract
  • LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing
    1. Abstract
  • Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions
    1. Abstract
  • RogueOne: Detecting Rogue Updates via Differential Data-flow Analysis Using Trust Domains
    1. Abstract
  • ACAV: A Framework for Automatic Causality Analysis in Autonomous Vehicle Accident Recordings
    1. Abstract
  • Efficiently Trimming the Fat: Streamlining Software Dependencies with Java Reflection and Dependency Analysis
    1. Abstract
  • Symbol-Specific Sparsification of Interprocedural Distributive Environment Problems
    1. Abstract
  • LibAlchemy: A Two-Layer Persistent Summary Design for Taming Third-Party Libraries in Static Bug-Finding Systems
    1. Abstract
  • Is unsafe an Achilles’ Heel? A Comprehensive Study of Safety Requirements in Unsafe Rust Programming
    1. Abstract
  • Generating REST API Specifications through Static Analysis
    1. Abstract
  • A Framework For Inferring Properties of User-Defined Functions
    1. Abstract
  • Precise Sparse Abstract Execution via Cross-Domain Interaction
    1. Abstract
  • Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems
    1. Abstract
  • ReClues: Representing and indexing failures in parallel debugging with program variables
    1. Abstract
  • PyAnalyzer: An Effective and Practical Approach for Dependency Extraction from Python Code
    1. Abstract
  • Detecting Automatic Software Plagiarism via Token Sequence Normalization
    1. Abstract
  • A First Look at the Inheritance-Induced Redundant Test Execution
    1. Abstract
  • Hypertesting of Programs: Theoretical Foundation and Automated Test Generation
    1. Abstract
  • Ripples of a Mutation — An Empirical Study of Propagation Effects in Mutation Testing
  • Fast Deterministic Black-box Context-free Grammar Inference
  • CIT4DNN: Generating Diverse and Rare Inputs for Neural Networks Using Latent Space Combinatorial Testing
  • Knowledge Graph Driven Inference Testing for Question Answering Software
  • DeepSample: DNN sampling-based testing for operational accuracy assessment
  • MAFT: Efficient Model-Agnostic Fairness Testing for Deep Neural Networks via Zero-Order Gradient Search
  • Concrete Constraint Guided Symbolic Execution
  • SpecBCFuzz: Fuzzing LTL Solvers with Boundary Conditions
  • RPG: Rust Library Fuzzing with Pool-based Fuzz Target Generation and Generic Support
  • Deep Combination of CDCL(T) and Local Search for Satisfiability Modulo Non-Linear Integer Arithmetic Theory
  • Fuzz4All: Universal Fuzzing with Large Language Models
  • Are We There Yet? Unraveling the State-of-the-Art Smart Contract Fuzzers
  • Uncover the Premeditated Attacks: Detecting Exploitable Reentrancy Vulnerabilities by Identifying Attacker Contracts
  • Crossover in Parametric Fuzzing
  • Practical Non-Intrusive GUI Exploration Testing with Visual-based Robotic Arms
  • FuzzInMem: Fuzzing Programs via In-memory Structures
  • Extrapolating Coverage Rate in Greybox Fuzzing
  • CERT: Finding Performance Issues in Database Systems Through the Lens of Cardinality Estimation
  • Optimistic Prediction of Synchronization-Reversal Data Races
  • Mozi: Discovering DBMS Bugs via Configuration-Based Equivalent Transformation
  • FlakeSync: Automatically Repairing Async Flaky Tests
  • Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model
  • Towards Finding Accounting Errors in Smart Contracts
  • MultiTest: Physical-Aware Object Insertion for Testing Multi-sensor Fusion Perception Systems
  • JLeaks: A Featured Resource Leak Repository Collected From Hundreds of Open-Source Java Projects
  • S3C: Spatial Semantic Scene Coverage for Autonomous Vehicles
  • FlashSyn: Flash Loan Attack Synthesis via Counter Example Driven Approximation
  • Testing Graph Database Systems via Equivalent Query Rewriting
  • ROSInfer: Statically Inferring Behavioral Component Models for ROS-based Robotics Systems
  • Finding XPath Bugs in XML Document Processors via Differential Testing
  • Sedar: Obtaining High-Quality Seeds for DBMS Fuzzing via Cross-DBMS SQL Transfer
  • Automatically Detecting Reflow Accessibility Issues in Responsive Web Pages
  • Towards More Practical Automation of Vulnerability Assessment
  • VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses
  • MalCertain: Enhancing Deep Neural Network Based Android Malware Detection by Tackling Prediction Uncertainty
  • Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks
  • Investigating White-Box Attacks for On-Device Models
  • Towards Causal Deep Learning for Vulnerability Detection
  • MetaLog: Generalizable Cross-System Anomaly Detection from Logs with Meta-Learning
  • Coca: Improving and Explaining Graph Neural Network-Based Vulnerability Detection Systems
  • Improving Smart Contract Security with Contrastive Learning-based Vulnerability Detection
  • On the Effectiveness of Function-Level Vulnerability Detectors for Inter-Procedural Vulnerabilities
  • A User-centered Security Evaluation of Copilot
  • An Empirical Study on Oculus Virtual Reality Applications: Security and Privacy Perspectives
  • Fairness Improvement with Multiple Protected Attributes: How Far Are We?
  • An Empirical Study of Data Disruption by Ransomware Attacks
  • Identifying Affected Libraries and Their Ecosystems for Open Source Software Vulnerabilities
  • Understanding Transaction Bugs in Database Systems
  • When Contracts Meets Crypto: Exploring Developers’ Struggles with Ethereum Cryptographic APIs
  • Curiosity-Driven Testing for Sequential Decision-Making Process
  • GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis
  • PS3: Precise Patch Presence Test based on Semantic Symbolic Signature
  • PrettySmart: Detecting Permission Re-delegation Vulnerability for Token Behaviors in Smart Contracts
  • Combining Structured Static Code Information and Dynamic Symbolic Traces for Software Vulnerability Prediction
  • SCVHunter: Smart Contract Vulnerability Detection Based on Heterogeneous Graph Attention Network
  • Safeguarding DeFi Smart Contracts against Oracle Deviations
  • MalwareTotal: Multi-Faceted and Sequence-Aware Bypass Tactics against Static Malware Detection
  • Semantic-Enhanced Static Vulnerability Detection in Baseband Firmware
  • CSChecker: Revisiting GDPR and CCPA Compliance of Cookie Banners on the Web
  • Raisin: Identifying Rare Sensitive Functions for Bug Detection
  • REDriver: Runtime Enforcement for Autonomous Vehicles
  • Scalable Relational Analysis via Relational Bound Propagation
  • Translation Validation for JIT Compiler in the V8 JavaScript Engine
  • Verifying Declarative Smart Contracts
  • ChatGPT Incorrectness Detection in Software Reviews
  • ChatGPT-Resistant Screening Instrument for Identifying Non-Programmers
  • Uncovering the Causes of Emotions in Software Developer Communication Using Zero-shot LLMs
  • Development in times of hype: How freelancers explore Generative AI?
  • How Far Are We? The Triumphs and Trials of Generative AI in Learning Software Engineering
  • Breaking the Flow: A Study of Interruptions During Software Engineering Activities
  • Supporting Web-Based API Searches in the IDE Using Signatures
  • Property-Based Testing in Practice
  • Causal Relationships and Programming Outcomes: A Transcranial Magnetic Stimulation Experiment
  • GenderMag Improves Discoverability in the Field, Especially for Women: An Multi-Year Case Study of Suggest Edit, a Code Review Feature
  • Unraveling the Drivers of Sense of Belonging in Software Delivery Teams: Insights from a Large-Scale Survey
  • “My GitHub Sponsors profile is live!” Investigating the Impact of Twitter/X Mentions on GitHub Sponsors
  • A Theory of Scientific Programming Efficacy
  • High Expectations: An Observational Study of Programming and Cannabis Intoxication
  • Mining Pull Requests to Detect Process Anomalies in Open Source Software Development
  • How Are Paid and Volunteer Open Source Developers Different? A Study of the Rust Project
  • Barriers for Students During Code Change Comprehension
  • “I tend to view ads almost like a pestilence”: On the Accessibility Implications of Mobile Ads for Blind Users
  • DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection
  • DivLog: Log Parsing with Prompt Enhanced In-Context Learning
  • Where is it? Tracing the Vulnerability-relevant Files from Vulnerability Reports
  • Demystifying and Detecting Misuses of Deep Learning APIs
  • Less is More? An Empirical Study on Configuration Issues in Python PyPI Ecosystem
  • Data-Driven Evidence-Based Syntactic Sugar Design
  • Revisiting Android App Categorization
  • Are Your Requests Your True Needs? Checking Excessive Data Collection in VPA App
  • MiniMon: Minimizing Android Applications with Intelligent Monitoring-Based Debloating
  • Shedding Light on Software Engineering-specific Metaphors and Idioms
  • Empirical Study of the Docker Smells Impact on the Image Size
  • MotorEase: Automated Detection of Motor Impairment Accessibility Issues in Mobile App UIs
  • An Exploratory Investigation of Log Anomalies in Unmanned Aerial Vehicles
  • ModuleGuard: Understanding and Detecting Module Conflicts in Python Ecosystem
  • Empirical Analysis of Vulnerabilities Life Cycle in Golang Ecosystem
  • ReFAIR: Toward a Context-Aware Recommender for Fairness Requirements Engineering
  • Analyzing and Debugging Normative Requirements via Satisfiability Checking
  • Recovering Trace Links Between Software Documentation And Code
  • TRIAD: Automated Traceability Recovery based on Biterm-enhanced Deduction of Transitive Links among Artifacts
  • Prism: Decomposing Program Semantics for Code Clone Detection through Compilation
  • Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization
  • Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot
  • Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)
  • DSFM: Enhancing Functional Code Clone Detection with Deep Subtree Interactions
  • Machine Learning is All You Need: A Simple Token-based Approach for Effective Code Clone Detection
  • Cross-Inlining Binary Function Similarity Detection
  • BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching
  • PPT4J: Patch Presence Test for Java Binaries
  • Compiler-directed Migrating API Callsite of Client Code
  • Hard to Read and Understand Pythonic Idioms? DeIdiom and Explain Them in Non-Idiomatic Equivalent Code
  • Exploiting Library Vulnerability via Migration Based Automating Test Generation
  • MUT: Human-in-the-Loop Unit Test Migration
  • Streamlining Java Programming: Uncovering Well-Formed Idioms with IdioMine
  • Fine-grained, accurate and scalable source differencing
  • Semantic GUI Scene Learning and Video Alignment for Detecting Duplicate Video-based Bug Reports
  • The Classics Never Go Out of Style: An Empirical Study of Downgrades from the Bazel Build Technology
  • Scaling Code Pattern Inference with Interactive What-If Analysis
  • Context-Aware Name Recommendation for Field Renaming
  • CNEPS: A Precise Approach for Examining Dependencies among Third-Party C/C++ Open-Source Components
  • A Study on the Pythonic Functional Constructs’ Understandability
  • GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions
  • DronLomaly: Runtime Log-based Anomaly Detector for DJI Drones
  • JOG: Java JIT Peephole Optimizations and Tests from Patterns
  • nvshare: Practical GPU Sharing without Memory Size Constraints
  • Daedalux: An Extensible Platform for Variability-Aware Model Checking
  • Verifying and Displaying Move Smart Contract Source Code for the Sui Blockchain
  • TestSpark: IntelliJ IDEA’s Ultimate Test Generation Companion
  • SpotFlow: Tracking Method Calls and States at Runtime
  • Boidae: Your Personal Mining Platform
  • Code Mapper: Mapping the Global Contributions of OSS
  • TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools
  • Can My Microservice Tolerate an Unreliable Database? Resilience Testing with Fault Injection and Visualization
  • CATMA: Conformance Analysis Tool For Microservice Applications
  • Refinery: Graph Solver as a Service: Refinement-based Generation and Analysis of Consistent Models
  • (Neo4j)^ Browser: Visualizing Variable-Aware Analysis Results
  • SAFE: Safety Analysis and Retraining of DNNs
  • MutaBot: A Mutation Testing Approach for Chatbots
  • AntiCopyPaster 2.0: Whitebox just-in-time code duplicates extraction
  • GitHubInclusifier: Finding and fixing non-inclusive language in GitHub Repositories
  • OpenSBT: A Modular Framework for Search-based Testing of Automated Driving Systems
  • APICIA: An API Change Impact Analyzer for Android Apps
  • RAT: A Refactoring-Aware Tool for Tracking Code History
  • Emulation Tool For Android Edge Devices
  • TPV: A Tool for Validating Temporal Properties in UML Class Diagrams
  • CodeGRITS: A Research Toolkit for Developer Behavior and Eye Tracking in IDE
  • ValidGen: A Tool for Automatic Generation of Validation Scripts to Support Rapid Requirements Validation
  • FaultFuzz: A Coverage Guided Fault Injection Tool for Distributed Systems
  • Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
  • Accurate Architectural Threat Elicitation From Source Code Through Hybrid Information Flow Analysis
  • Aiding Developer Understanding of Software Changes via Symbolic Execution-based Semantic Differencing
  • Architecture-Based Cross-Component Issue Management and Propagation Analysis
  • A Software Security Evaluation Framework
  • Automated Model Quality Estimation and Change Impact Analysis on Model Histories
  • Autonomic Testing: Testing with Scenarios from Production
  • Beyond Accuracy and Robustness Metrics for Large Language Models for Code
  • Beyond Accuracy: Evaluating Source Code Capabilities in Large Language Models for Software Engineering
  • Building a Framework to Improve the User Experience of Static Analysis Tools
  • Discovering Explainability Requirements in ML-Based Software
  • Enhancing Model-Driven Reverse Engineering Using Machine Learning
  • Ensuring Critical Properties of Test Oracles for Effective Bug Detection
  • Generating User Experience Based on Personas with AI Assistants
  • Increasing trust in the open source supply chain with reproducible builds and functional package management
  • Investigating Cultural Dispersion: on the Role of Cultural Differences in Software Development Teams
  • Learning Models of Cyber-Physical Systems with Discrete and Continuous Behaviour for Digital Twin Synthesis
  • Managing API Evolution in Microservice Architecture
  • MEITREX - Gamified and Adaptive Intelligent Tutoring in Software Engineering Education
  • On Improving Management of Duplicate Video-Based Bug Reports
  • Programming Language Models in Multilingual Settings
  • Resolving Goal-Conflicts and Scaling Synthesis through Mode-Based Decomposition
  • Selecting and Constraining Metamorphic Relations
  • Simulation-based Testing of Automated Driving Systems
  • Smart Quality Monitoring for Evolving Complex Systems
  • Studying and Improving Software License Compliance in Practice
  • Sustainable Adaptive Security
  • Sustainable Software Engineering: Visions and Perspectives beyond Energy Efficiency
  • Sustaining Scientific Open-Source Software Ecosystems: Challenges, Practices, and Opportunities
  • Toward Rapid Bug Resolution for Android Apps
  • Towards AI-centric Requirements Engineering for Industrial Systems
  • Towards Automatic Inference of Behavioral Component Models for ROS-Based Robotics Systems
  • Towards Combining STPA and Safety-Critical Runtime Monitoring
  • Towards Interpreting the Behavior of Large Language Models on Software Engineering Tasks
  • Towards Safe, Secure, and Usable LLMs4Code
  • Understandable Test Generation Through Capture/Replay and LLMs
  • Obfuscation-Resilient Software Plagiarism Detection with JPlag
  • Poster: Kotlin Assimilating the Android Ecosystem - An Appraisal of Diffusion and Impact on Maintainability
  • A First Look at the General Data Protection Regulation (GDPR) in Open-Source Software
  • KareCoder: A New Knowledge-Enriched Code Generation System
  • ParSE: Efficient Detection of Smart Contract Vulnerabilities via Parallel and Simplified Symbolic Execution
  • Endogeneity, Instruments, and Two-Stage Models
  • Prompt-Enhanced Software Vulnerability Detection Using ChatGPT
  • What do you assume? A Theory of Security-Related Assumptions
  • An Empirical Study on Cross-language Clone Bugs
  • Applying Transformer Models for Automatic Build Errors Classification of Java-Based Open Source Projects
  • Micro-scale Concolic Testing Framework for Automated Test Data Generation Based on Path Coverage
  • Safety Monitoring of Deep Reinforcement Learning Agents
  • Interpretable Software Maintenance and Support Effort Prediction Using Machine Learning
  • An Actionable Framework for Understanding and Improving Talent Retention as a Competitive Advantage in IT Organizations
  • Towards Leveraging Fine-Grained Dependencies to Check Requirements Traceability Correctness
  • Programmable and Semantic Connector for DNN Component Integration: a Software Engineering Perspective
  • A Study of Backporting Code in Open-Source Software for Characterizing Changesets
  • Engineering Industry-Ready Anomaly Detection Algorithms
  • Unleashing the Giants: Enabling Advanced Testing for Infrastructure as Code
  • Poirot: Deep Learning for API Misuse Detection
  • Behavior Trees with Dataflow: Coordinating Reactive Tasks in Lingua Franca
  • Graph Neural Networks based Log Anomaly Detection and Explanation
  • Going Viral: Case Studies on the Impact of Protestware
  • Understanding the Strategies Used by Employees to Cope with Technostress in the Software Industry
  • A Transformer-based Model for Assisting Dockerfile Revising
  • Domain Knowledge is All You Need: A Field Deployment of LLM-Powered Test Case Generation in FinTech Domain
  • Neural Exception Handling Recommender
  • Unleashing the Power of Clippy in Real-World Rust Projects
  • GRAIL: Checking Transaction Isolation Violations with Graph Queries
  • Exploring the Computational Complexity of SAT Counting and Uniform Sampling with Phase Transitions
  • Hunting DeFi Vulnerabilities via Context-Sensitive Concolic Verification
  • Blocks? Graphs? Why Not Both? Designing and Evaluating a Hybrid Programming Environment for End-users
  • SLIM: a Scalable Light-weight Root Cause Analysis for Imbalanced Data in Microservice
  • Designing Digital Twins for Enhanced Reusability
  • Analyzing the Impact of Context Representation and Scope in Code Infilling
  • eAIEDF: Extended AI Error Diagnosis Flowchart for Automatically Identifying Misprediction Causes in Production Models
  • The Impact of a Live Refactoring Environment on Software Development
  • Fault Localization on Verification Witnesses (Poster Paper)
  • Tracking assets in source code with Security Annotations
  • Lightweight Semantic Conflict Detection with Static Analysis
  • How Does Pre-trained Language Model Perform on Deep Learning Framework Bug Prediction?
  • Multi-requirement Parametric Falsification
  • GDPR indications in commits messages in GitHub repositories
  • Towards Data Augmentation for Supervised Code Translation
  • High-precision Online Log Parsing with Large Language Models
  • Multi-step Automated Generation of Parameter Docstrings in Python: An Exploratory Study
  • Energy Consumption of Automated Program Repair
  • Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models
  • ReviewRanker: A Semi-Supervised Learning Based Approach for Code Review Quality Estimation
  • LogPrompt: Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis
  • Data vs. Model Machine Learning Fairness Testing: An Empirical Study
  • On the Effects of Program Slicing for Vulnerability Detection during Code Inspection: Extended Abstract
  • xNose: A Test Smell Detector for C#
  • On the Need for Empirically Investigating Fast-Growing Programming Languages
  • Improving the Condensing of Reverse Engineered Class Diagrams using Weighted Network Metrics
  • Recovering Traceability Links between Release Notes and Related Software Artifacts
  • Bringing Structure to Naturalness: On the Naturalness of ASTs
  • Assessing AI-Based Code Assistants in Method Generation Tasks
  • Exploring the Impact of Inheritance on Test Code Maintainability
  • Improving Program Debloating with 1-DU Chain Minimality
  • Exploring Data Cleanness in Defects4J and Its Influence on Fault Localization Efficiency
  • Towards Precise Observations of Neural Model Robustness in Classification
  • Exploring the Effectiveness of LLM based Test-driven Interactive Code Generation: User Study and Empirical Evaluation
  • Decoding Log Parsing Challenges: A Comprehensive Taxonomy for Actionable Solutions
  • GoSpeechLess: Interoperable Serverless ML-based Cloud Services
  • Learning to Represent Patches
  • Automated Code Editing with Search-Generate-Modify
  • How are Contracts Used in Android Mobile Applications?
  • Causal Graph Fuzzing for Fair ML Sofware Development
  • Path Complexity Analysis for Interprocedural Code
  • Extracting Relevant Test Inputs from Bug Reports for Automatic Test Case Generation
  • ICLNet: Stepping Beyond Dates for Robust Issue-Commit Link Recovery
  • NL2Fix: Generating Functionally Correct Code Edits from Bug Descriptions
  • Automated Security Repair for Helm Charts
  • Multi-source Anomaly Detection For Microservice Systems
  • F-CodeLLM: A Federated Learning Framework for Adapting Large Language Models to Practical Software Development
  • NomNom: Explanatory Function Names for Program Synthesizers
  • Improving Fairness in Machine Learning Software via Counterfactual Fairness Thinking
  • Imperfect Code Generation: Uncovering Weaknesses in Automatic Code Generation by Large Language Models
  • Analyzing Software Energy Consumption
  • Quantum Software Testing 101
  • Technical Briefing on Deep Neural Network Repair
  • Technical Brief on Software Engineering for FMware
  • Technical Briefing on Parameter Efficient Fine-Tuning of (Large) Language Models for Code-Intelligence
  • Technical Briefing on Socio-Technical Grounded Theory for Qualitative Data Analysis
  • An Ensemble Method for Bug Triaging using Large Language Models
  • Flakiness Repair in the Era of Large Language Models
  • Vulnerability Root Cause Function Locating For Java Vulnerabilities
  • IntTracer: Sanitization-aware IO2BO Vulnerability Detection across Codebases
  • Classifying Source Code: How Far Can Compressor-based Classifiers Go?
  • Program Decomposition and Translation with Static Analysis
  • Refining Abstract Specifications into Dangerous Traffic Scenarios
  • When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done?
  • ReposVul: A Repository-Level High-Quality Vulnerability Dataset
  • MissConf: LLM-Enhanced Reproduction of Configuration-Triggered Bugs