ISSTA 2022 | 逸翎清晗🌈

jTrans: jump-aware transformer for binary code similarity detection

Authors: Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao
Keywords: Similarity Detection, Neural Networks, Datasets, Binary Analysis

Abstract

Binary code similarity detection (BCSD) has important applications in various fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFGs) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control-flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly-designed pre-training task. Additionally, we release to the community a newly-created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5 percentage points (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2x higher than existing SOTA baselines.
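
To make the jump-aware representation concrete, here is a minimal sketch (our own illustration based on the abstract, not the authors' implementation; the instruction encoding and the JUMP_<i> token naming are assumptions): concrete jump-target addresses are rewritten into tokens tied to the target instruction's position, so the Transformer can relate a jump to its destination.

```python
def tokenize_with_jumps(instructions):
    """instructions: list of (address, mnemonic, operand) triples; operands
    that are addresses of other instructions become JUMP_<target_index>."""
    addr_to_index = {addr: i for i, (addr, _, _) in enumerate(instructions)}
    tokens = []
    for _, mnemonic, operand in instructions:
        tokens.append(mnemonic)
        if operand in addr_to_index:      # operand targets another instruction
            tokens.append(f"JUMP_{addr_to_index[operand]}")
        elif operand is not None:
            tokens.append(str(operand))
    return tokens

insns = [
    (0x00, "cmp", "eax"),
    (0x02, "jz", 0x06),   # conditional jump to the instruction at 0x06
    (0x04, "mov", "ebx"),
    (0x06, "ret", None),
]
print(tokenize_with_jumps(insns))
# -> ['cmp', 'eax', 'jz', 'JUMP_3', 'mov', 'ebx', 'ret']
```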

DOI: 10.1145/3533767.3534367


Replication Package for Article: “FDG: A Precise Measurement of Fault Diagnosability Gain of Test Cases”

Authors: An, Gabin and Yoo, Shin
Keywords: fault diagnosability, fault localisation, test augmentation, test generation

Abstract

This artifact contains a replication package for the paper “FDG: A Precise Measurement of Fault Diagnosability Gain of Test Cases”. It provides the scripts and documents to replicate the experiment described in the paper. The detailed guide to replication is provided in the artifact’s README.md file.

DOI: 10.1145/3533767.3534370


TeLL: log level suggestions via modeling multi-level code block information

Authors: Liu, Jiahao and Zeng, Jun and Wang, Xiang and Ji, Kaihang and Liang, Zhenkai
Keywords: Multi-level Code Block Information, Log Level Suggestion, Graph Neural Network

Abstract

Developers insert logging statements into source code to monitor system execution, which forms the basis for software debugging and maintenance. To distinguish diverse runtime information, each software log is assigned a separate verbosity level (e.g., trace and error). However, choosing an appropriate verbosity level is a challenging and error-prone task due to the lack of specifications for log level usage. Prior solutions aim to suggest log levels based on the code block in which a logging statement resides (i.e., intra-block features). Such suggestions, however, do not consider information from surrounding blocks (i.e., inter-block features), which also plays an important role in revealing logging characteristics.

To address this issue, we combine multiple levels of code block information (i.e., intra-block and inter-block features) into a joint graph structure called Flow of Abstract Syntax Tree (FAST). To explicitly exploit multi-level block features, we design a new neural architecture, Hierarchical Block Graph Network (HBGN), on the FAST. In particular, it leverages graph neural networks to encode both the intra-block and inter-block features into code block representations and guide log level suggestions. We implement a prototype system, TeLL, and evaluate its effectiveness on nine large-scale software systems. Experimental results showcase TeLL's advantage in predicting log levels over the state-of-the-art approaches.

DOI: 10.1145/3533767.3534379


An extensive study on pre-trained models for program understanding and generation

Authors: Zeng, Zhengran and Tan, Hanzhuo and Zhang, Haotian and Li, Jing and Zhang, Yuqun and Zhang, Lingming
Keywords: Pre-Trained Language Models, Deep Learning, Code Representation, Adversarial Attack

Abstract

Automatic program understanding and generation techniques could significantly advance the productivity of programmers and have been widely studied by academia and industry. Recently, the advent of the pre-training paradigm has enlightened researchers to develop general-purpose pre-trained models that can be applied to a broad range of program understanding and generation tasks. Such pre-trained models, derived by self-supervised objectives on large unlabelled corpora, can be fine-tuned on downstream tasks (such as code search and code generation) with minimal adaptations. Although these pre-trained models claim superiority over prior techniques, they seldom follow equivalent evaluation protocols, e.g., they are hardly evaluated on identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive study of the pre-trained models on their effectiveness and versatility, as well as their limitations, to provide implications and guidance for future development in this area. To this end, we first perform an extensive study of eight open-access pre-trained models over a large benchmark on seven representative code tasks to assess their reproducibility. We further compare the pre-trained models with domain-specific state-of-the-art techniques to validate pre-training effectiveness. Finally, we investigate the robustness of the pre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we can in general replicate the original performance of the pre-trained models on their evaluated tasks and adopted benchmarks, subtle performance fluctuations can refute the findings in their original papers. Moreover, none of the existing pre-trained models dominates all the others. We also find that the pre-trained models can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform the first study of natural language-programming language pre-trained model robustness via adversarial attacks and find that a simple random attack approach can easily fool the state-of-the-art pre-trained models and thus incur security issues. Lastly, we also provide multiple practical guidelines for advancing future research on pre-trained models for program understanding and generation.

DOI: 10.1145/3533767.3534390


Metamorphic relations via relaxations: an approach to obtain oracles for action-policy testing

Authors: Eniser, Hasan Ferit and Gros, Timo P. and Wüstholz, Valentin and Hoffmann, Jörg and Christakis, Maria
Keywords: metamorphic testing, fuzzing, action policies

Abstract

Testing is a promising way to gain trust in a learned action policy π, in particular if π is a neural network. A “bug” in this context constitutes undesirable or fatal policy behavior, e.g., satisfying a failure condition. But how do we distinguish whether such behavior is due to bad policy decisions, or whether it is actually unavoidable under the given circumstances? This requires knowledge about optimal solutions, which defeats the scalability of testing. Related problems occur in software testing when the correct program output is not known. Metamorphic testing addresses this issue through metamorphic relations, specifying how a given change to the input should affect the output, thus providing an oracle for the correct output. Yet, how do we obtain such metamorphic relations for action policies? Here, we show that the well explored concept of relaxations in the Artificial Intelligence community can serve this purpose. In particular, if state s′ is a relaxation of state s, i.e., s′ is easier to solve than s, and π fails on easier s′ but does not fail on harder s, then we know that π contains a bug manifested on s′. We contribute the first exploration of this idea in the context of failure testing of neural network policies π learned by reinforcement learning in simulated environments. We design fuzzing strategies for test-case generation as well as metamorphic oracles leveraging simple, manually designed relaxations. In experiments on three single-agent games, our technology is able to effectively identify true bugs, i.e., avoidable failures of π, which has not been possible until now.
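
The oracle just described reduces to a simple check. Below is a minimal sketch (our own illustration based on the abstract; the `relax` operator and the toy policy are hypothetical stand-ins, not the paper's environments):

```python
def metamorphic_bug_check(policy_fails, state, relax):
    """Report a bug when the policy fails on the easier (relaxed) state s'
    although it solves the harder original state s."""
    relaxed = relax(state)                 # s' is, by construction, easier
    return policy_fails(relaxed) and not policy_fails(state)

# Toy example: relaxation removes one obstacle from a gridworld state.
state = {"obstacles": {(1, 1), (2, 2)}, "goal": (3, 3)}
relax = lambda s: {**s, "obstacles": set(list(s["obstacles"])[:-1])}
buggy_policy_fails = lambda s: len(s["obstacles"]) < 2   # fails on easy states
print(metamorphic_bug_check(buggy_policy_fails, state, relax))   # True -> bug
```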

DOI: 10.1145/3533767.3534392


Hunting bugs with accelerated optimal graph vertex matching

Authors: Zhang, Xiaohui and Gong, Yuanjun and Liang, Bin and Huang, Jianjun and You, Wei and Shi, Wenchang and Zhang, Jian
Keywords: optimal vertex matching, graph convolutional neural network, code similarity, bug detection

Abstract

Various techniques based on code similarity measurement have been proposed to detect bugs. Essentially, a code fragment can be regarded as a kind of graph, so performing code graph similarity comparison to identify potential bugs is a natural choice. However, the logic of a bug often involves only a few statements in the code fragment, while the others are bug-irrelevant. They can be considered a kind of noise and can heavily interfere with code similarity measurement. In theory, performing optimal vertex matching can address the problem well, but the task is NP-complete and cannot be applied to a large-scale code base. In this paper, we propose a two-phase strategy to accelerate code graph vertex matching for detecting bugs. In the first phase, a vertex matching embedding model is trained and used to rapidly filter, from the target code base, a limited number of candidate code graphs that are likely to have a high vertex matching degree with the seed, i.e., the known buggy code. As a result, the number of code graphs that need to be further analyzed is dramatically reduced. In the second phase, a high-order similarity embedding model based on a graph convolutional neural network is built to efficiently obtain the approximately optimal vertex matching between the seed and the candidates. On this basis, the code graph similarity is calculated to identify potentially buggy code. The proposed method is applied to five open-source projects. In total, 31 unknown bugs were successfully detected and confirmed by developers. Comparative experiments demonstrate that our method can effectively mitigate the noise problem, and the detection efficiency can be improved dozens of times with the two-phase strategy.
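
The first phase amounts to a cheap nearest-neighbor filter over graph embeddings. A minimal sketch (illustrative only; random vectors stand in for the learned graph embeddings):

```python
import numpy as np

def cheap_filter(seed_vec, corpus_vecs, k=10):
    """Phase 1: rank code graphs by cosine similarity of their embeddings and
    keep only the top-k candidates for the expensive matching phase."""
    corpus = np.asarray(corpus_vecs, dtype=float)
    seed = np.asarray(seed_vec, dtype=float)
    sims = corpus @ seed / (np.linalg.norm(corpus, axis=1)
                            * np.linalg.norm(seed) + 1e-9)
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))           # stand-in learned graph embeddings
seed = corpus[42] + 0.01 * rng.normal(size=64) # a near-duplicate of graph 42
print(cheap_filter(seed, corpus, k=5))         # graph 42 should rank first
# Phase 2 (not shown) would run GCN-based vertex matching on these few only.
```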

DOI: 10.1145/3533767.3534393


Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper)

Authors: Zhang, Jialu and Mytkowicz, Todd and Kaufman, Mike and Piskac, Ruzica and Lahiri, Shuvendu K.
Keywords: language model, k-shot learning, Resolving merge conflicts, GPT-3

Abstract

Program merging is standard practice when developers integrate their individual changes to a common code base. When the merge algorithm fails, this is called a merge conflict. The conflict either manifests as a textual merge conflict where the merge fails to produce code, or as a semantic merge conflict where the merged code results in compiler errors or broken tests. Resolving these conflicts for large code projects is expensive because it requires developers to manually identify the sources of conflicts and correct them.
In this paper, we explore the feasibility of automatically repairing merge conflicts (both textual and semantic) using k-shot learning with pre-trained large neural language models (LMs) such as GPT-3. One of the challenges in leveraging such language models is fitting the examples and the queries within a small prompt (2048 tokens). We evaluate LMs and k-shot learning for both textual and semantic merge conflicts for Microsoft Edge. Our results are mixed: on the one hand, LMs provide state-of-the-art (SOTA) performance on semantic merge conflict resolution for Edge compared to earlier symbolic approaches; on the other hand, LMs do not yet obviate the benefits of special-purpose domain-specific languages (DSLs) for restricted patterns for program synthesis.
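
The prompt-budget challenge mentioned above can be illustrated with a small packing loop. This is a hedged sketch under our own assumptions (the 4-characters-per-token estimate and the Conflict/Resolution layout are illustrative, not the paper's exact setup):

```python
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)          # rough heuristic, not a tokenizer

def build_kshot_prompt(examples, query, budget=2048):
    """examples: list of (conflict, resolution) strings; pack as many shots
    as fit, always reserving room for the query itself."""
    prompt, used = "", approx_tokens(query)
    for conflict, resolution in examples:
        shot = f"Conflict:\n{conflict}\nResolution:\n{resolution}\n\n"
        if used + approx_tokens(shot) > budget:
            break                          # budget exhausted: stop adding shots
        prompt += shot
        used += approx_tokens(shot)
    return prompt + f"Conflict:\n{query}\nResolution:\n"

shots = [("<<< a\n=== b >>>", "b"), ("<<< x\n=== y >>>", "x")]
print(build_kshot_prompt(shots, "<<< p\n=== q >>>"))
```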

DOI: 10.1145/3533767.3534396


Combining solution reuse and bound tightening for efficient analysis of evolving systems

Authors: Stevens, Clay and Bagheri, Hamid
Keywords: speculative analysis, formal analysis, bounded verification

Abstract

Software engineers have long employed formal verification to ensure the safety and validity of their system designs. As the system changes—often via predictable, domain-specific operations—their models must also change, requiring system designers to repeatedly execute the same formal verification on similar system models. State-of-the-art formal verification techniques can be expensive at scale, the cost of which is multiplied by repeated analysis. This paper presents a novel analysis technique—implemented in a tool called SoRBoT—which can automatically determine domain-specific optimizations that can dramatically reduce the cost of repeatedly analyzing evolving systems. Different from all prior approaches, which focus on either tightening the bounds for analysis or reusing all or part of prior solutions, SoRBoT’s automated derivation of domain-specific optimizations combines the benefits of both solution reuse and bound tightening while avoiding the main pitfalls of each. We experimentally evaluate SoRBoT against state-of-the-art techniques for verifying evolving specifications, demonstrating that SoRBoT substantially exceeds the run-time performance of those state-of-the-art techniques while introducing only a negligible overhead, in contrast to the expensive additional computations required by the state-of-the-art verification techniques.

DOI: 10.1145/3533767.3534399


On the use of evaluation measures for defect prediction studies

Authors: Moussa, Rebecca and Sarro, Federica
Keywords: Software Defect Prediction, Evaluation Measures

Abstract

Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we further stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially given that most defect prediction tasks suffer from data imbalance.

Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluation measure, which alone cannot express all the characteristics of model performance in the presence of imbalanced data, or a set of binary measures which are prone to bias when used to assess models, especially when trained with imbalanced data. We also unveil the magnitude of the impact of assessing popular defect prediction models with several evaluation measures based, for the first time, on both statistical significance tests and effect size analyses.
Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied, according to the Wilcoxon statistical significance test and the Vargha-Delaney Â12 effect size, respectively.
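
As a worked illustration of the two analyses named above (toy scores, not the paper's data; Â12 here is the standard Vargha-Delaney statistic):

```python
from scipy.stats import wilcoxon

def vargha_delaney_a12(xs, ys):
    """A12 = P(X > Y) + 0.5 * P(X == Y): the probability that a randomly
    chosen score of model X beats a randomly chosen score of model Y."""
    gt = sum(1 for x in xs for y in ys if x > y)
    eq = sum(1 for x in xs for y in ys if x == y)
    return (gt + 0.5 * eq) / (len(xs) * len(ys))

# MCC of two defect prediction models across six projects (toy numbers):
model_a = [0.41, 0.38, 0.45, 0.52, 0.36, 0.49]
model_b = [0.33, 0.35, 0.40, 0.47, 0.37, 0.41]

stat, p = wilcoxon(model_a, model_b)           # paired, non-parametric test
a12 = vargha_delaney_a12(model_a, model_b)
print(f"Wilcoxon p = {p:.3f}, A12 = {a12:.2f}")
```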

DOI: 10.1145/3533767.3534405


Evolution-aware detection of order-dependent flaky tests

Authors: Li, Chengpeng and Shi, August
Keywords: order-dependent flaky test, flaky test detection, evolution-aware analysis

Abstract

Regression testing is an important part of the software development process but suffers from the presence of flaky tests. Flaky tests are tests that can nondeterministically pass or fail regardless of code changes. Order-dependent flaky tests are a prominent kind of flaky tests whose outcome depends on the test order in which they are run. Prior work has focused on detecting order-dependent flaky tests through rerunning all tests in different test orders on a single version of code. As code is constantly changing, rerunning all tests in different test orders after every change is costly.
In this work, we propose IncIDFlakies, a technique that analyzes code changes to detect newly-introduced order-dependent flaky tests due to those changes. Building upon existing work in iDFlakies that reruns tests in different test orders, IncIDFlakies analyzes and selects to run only the tests that (1) are affected by the change, and (2) can potentially result in a test-order dependency among each other due to potential shared state. Running IncIDFlakies on 67 order-dependent flaky tests across changes in code in their respective projects, including the changes where they became flaky, we find that IncIDFlakies can select to run on average 65.4% of all the tests, resulting in running 68.4% of the time that baseline iDFlakies would use when running the same number of test orders with the full test suite. Furthermore, we find that IncIDFlakies can still ensure that the test orders it runs can potentially detect the order-dependent flaky tests.

DOI: 10.1145/3533767.3534404


Reproduction Package for Article “ε-Weakened Robustness of Deep Neural Networks”

Authors: Huang, Pei and Yang, Yuting and Liu, Minghao and Jia, Fuqi and Ma, Feifei and Zhang, Jian
Keywords: adversarial attack, neural networks, robustness, testing

Abstract

The artifact is used for reproducing the experimental results in the article “ε-Weakened Robustness of Deep Neural Networks”, including the attack algorithm PGD, the robustness enhancement algorithm FPP, and the ε-weakened robustness evaluation and decision algorithms (EWRE and EWRD) for several neural networks (resnet18, densenet121, dpn92, regnetx_200, etc.).

DOI: 10.1145/3533767.3534373


Reproduction Package for Paper Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study)

Authors: Weiss, Michael and Tonella, Paolo
Keywords: neural networks, Test prioritization, uncertainty quantification

Abstract

This is the reproduction package of the paper Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning by M. Weiss and P. Tonella, published at ISSTA 2022.

For uses other than reproduction, we also extracted three standalone, general-purpose artifacts:
- fashion-mnist-c dataset (GitHub: https://github.com/testingautomated-usi/fashion-mnist-c)
- text corruptor (GitHub: https://github.com/testingautomated-usi/corrupted-text)
- tip implementations (GitHub: https://github.com/testingautomated-usi/dnn-tip)

DOI: 10.1145/3533767.3534375


XBA

Authors: Kim, Geunwoo and Hong, Sanghyun and Franz, Michael and Song, Dokyung
Keywords: Binary analysis, Cross-platform, Graph alignment

Abstract

XBA is a deep learning tool for generating platform-agnostic binary code embeddings. XBA applies a Graph Convolutional Network (GCN) to the graph representation of a binary, which we call a Binary Disassembly Graph (BDG). XBA can learn semantic matchings of binary code compiled for different platforms that are not included in the training dataset. It outperformed prior works in aligning binary code blocks across different platforms, which shows that the embeddings generated by XBA are indeed useful in cross-platform binary analysis. XBA is implemented with Python v3.8 and Tensorflow v2.7.0.

DOI: 10.1145/3533767.3534383


BET: black-box efficient testing for convolutional neural networks

Authors: Wang, Jialai and Qiu, Han and Rong, Yi and Ye, Hengkai and Li, Qi and Li, Zongpeng and Zhang, Chao
Keywords: Convolutional Neural Networks, Black-box Testing

Abstract

It is important to test convolutional neural networks (CNNs) to identify defects (e.g., error-inducing inputs) before deploying them in security-sensitive scenarios. Although existing white-box testing methods can effectively test CNN models with high neuron coverage, they are not applicable to privacy-sensitive scenarios where full knowledge of target CNN models is lacking. In this work, we propose a novel Black-box Efficient Testing (BET) method for CNN models. The core insight of BET is that CNNs are generally prone to be affected by continuous perturbations. Thus, by generating such continuous perturbations in a black-box manner, we design a tunable objective function to guide our testing process for thoroughly exploring defects in different decision boundaries of the target CNN models. We further design an efficiency-centric policy to find more error-inducing inputs within a fixed query budget. We conduct extensive evaluations with three well-known datasets and five popular CNN structures. The results show that BET significantly outperforms existing white-box and black-box testing methods in terms of the effective error-inducing inputs found within a fixed query/inference budget. We further show that the error-inducing inputs found by BET can be used to fine-tune the target model, improving its accuracy by up to 3%.

DOI: 10.1145/3533767.3534386


DocTer: documentation-guided fuzzing for testing deep learning API functions

Authors: Xie, Danning and Li, Yitong and Kim, Mijung and Pham, Hung Viet and Tan, Lin and Zhang, Xiangyu and Godfrey, Michael W.
Keywords: text analytics, testing, test generation, deep learning

Abstract

Input constraints are useful for many software development tasks. For example, input constraints of a function enable the generation of valid inputs, i.e., inputs that follow these constraints, to test the function more deeply. API functions of deep learning (DL) libraries have DL-specific input constraints, which are described informally in the free-form API documentation. Existing constraint-extraction techniques are ineffective for extracting DL-specific input constraints.

To fill this gap, we design and implement a new technique—DocTer—to analyze API documentation to extract DL-specific input constraints for DL API functions. DocTer features a novel algorithm that automatically constructs rules to extract API parameter constraints from syntactic patterns in the form of dependency parse trees of API descriptions. These rules are then applied to a large volume of API documents in popular DL libraries to extract their input parameter constraints. To demonstrate the effectiveness of the extracted constraints, DocTer uses the constraints to enable the automatic generation of valid and invalid inputs to test DL API functions.

Our evaluation on three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that DocTer’s precision in extracting input constraints is 85.4%. DocTer detects 94 bugs from 174 API functions, including one previously unknown security vulnerability that is now documented in the CVE database, while a baseline technique without input constraints detects only 59 bugs. Most (63) of the 94 bugs are previously unknown, 54 of which have been fixed or confirmed by developers after we report them. In addition, DocTer detects 43 inconsistencies in documents, 39 of which are fixed or confirmed.

DOI: 10.1145/3533767.3534220


ASRTest: automated testing for deep-neural-network-driven speech recognition systems

Authors: Ji, Pin and Feng, Yang and Liu, Jia and Zhao, Zhihong and Chen, Zhenyu
Keywords: Metamorphic Testing, Deep Neural Networks, Automatic Speech Recognition, Automated Testing

Abstract

With the rapid development of deep neural networks and end-to-end learning techniques, automatic speech recognition (ASR) systems have been deployed into our daily lives and assist in various tasks. However, despite their tremendous progress, ASR systems could also suffer from software defects and exhibit incorrect behaviors. While the nature of DNNs makes conventional software testing techniques inapplicable to ASR systems, the lack of diverse tests and oracle information further hinders their testing. In this paper, we propose and implement a testing approach, namely ASRTest, specifically for DNN-driven ASR systems. ASRTest is built upon the theory of metamorphic testing. We first design the metamorphic relation for ASR systems and then implement three families of transformation operators that can simulate practical application scenarios to generate speeches. Furthermore, we adopt Gini impurity to guide the generation process and improve the testing efficiency. To validate the effectiveness of ASRTest, we apply ASRTest to four ASR models with four widely-used datasets. The results show that ASRTest can detect erroneous behaviors under different realistic application conditions efficiently and improve recognition performance by 19.1% on average via retraining with the generated data. Also, we conduct a case study on an industrial ASR system to investigate the performance of ASRTest under a real usage scenario. The study shows that ASRTest can detect errors and improve the performance of DNN-driven ASR systems effectively.
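
As a small illustration of the Gini-impurity guidance mentioned above (a sketch under our own assumptions about the model output, not the authors' code): inputs whose output distributions are closer to uniform score higher and would be favored during generation.

```python
import numpy as np

def gini_impurity(probs):
    """1 - sum(p_i^2); maximal when the distribution is uniform,
    i.e., when the model is least certain about its output."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2)

# Per-frame label distributions from a hypothetical ASR model:
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.30, 0.30]
print(gini_impurity(confident), gini_impurity(uncertain))  # 0.185 < 0.66
```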

DOI: 10.1145/3533767.3534391


AEON: a method for automatic evaluation of NLP test cases

Authors: Huang, Jen-tse and Zhang, Jianping and Wang, Wenxuan and He, Pinjia and Su, Yuxin and Lyu, Michael R.
Keywords: test case quality, NLP software testing

Abstract

Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and thus, the same label. However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., grammar errors), which leads to a high false alarm rate and unnatural test cases. Our evaluation study finds that 44% of the test cases generated by the state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade NLP software when utilized in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, it outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%. In addition, AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON leads to models that are more accurate and robust, demonstrating AEON's potential in improving NLP software.
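
The two scores can be pictured with crude stand-ins. The sketch below is our own illustration, not AEON's models: a bag-of-words cosine replaces the learned sentence encoder, and a unigram log-probability replaces the real naturalness model.

```python
import math
from collections import Counter

def semantic_similarity(s1, s2):
    """Cosine similarity of bag-of-words vectors (stand-in for a sentence encoder)."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(v1) * norm(v2)) if v1 and v2 else 0.0

def naturalness(sentence, unigram_probs, floor=1e-6):
    """Mean log-probability under a unigram LM (stand-in for a real LM score)."""
    words = sentence.lower().split()
    return sum(math.log(unigram_probs.get(w, floor)) for w in words) / len(words)

unigram = {"the": 0.07, "movie": 0.01, "was": 0.03, "great": 0.005}
orig, mutant = "the movie was great", "the movie was watermelon"
print(semantic_similarity(orig, mutant))                          # 0.75
print(naturalness(orig, unigram), naturalness(mutant, unigram))   # mutant lower
```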

DOI: 10.1145/3533767.3534394


Grammar2Fix: Human-in-the-Loop Oracle Learning for Semantic Bugs in String Processing Programs

Authors: Kapugama, Charaka Geethal and Pham, Van-Thuan and Aleti, Aldeida and Böhme, Marcel
Keywords: Automated Test Oracle, Grammar Inference, Software Debugging

Abstract

GRAMMAR2FIX is an active oracle learning technique for programs processing string inputs. Given a single failing input of a bug, it learns a grammar describing the pattern of all the failing inputs of the bug by systematically interacting with a bug oracle. GRAMMAR2FIX returns this grammar as a collection of Deterministic Finite Automata (DFA), and the grammar can serve as an automated test oracle for the bug. GRAMMAR2FIX also produces a test suite during grammar learning, which can be used as a repair test suite in Automated Program Repair.

DOI: 10.1145/3533767.3534406


Reproduction package for article “HybridRepair: Towards Annotation-Efficient Repair for Deep Learning Models”

Authors: Li, Yu and Chen, Muxi and Xu, Qiang
Keywords: deep neural networks, model repairing, model testing, semi-supervised learning

Abstract

This is an implementation of our paper “HybridRepair: Towards Annotation-Efficient Repair for Deep Learning Models”. HybridRepair is a holistic approach that combines the use of a small amount of labelled data and a large amount of unlabeled data for model repair, based on the observation that model repair requires sufficient local training data density. This artifact contains all the experiment code in the paper.

DOI: 10.1145/3533767.3534408


Reproduction Package for Cross-Lingual Transfer Learning for Statistical Type Inference

Authors: Li, Zhiming and Xie, Xiaofei and Li, Haoliang and Xu, Zhengzi and Li, Yi and Liu, Yang
Keywords: Deep Learning, Transfer Learning, Type Inference

Abstract

The artifact contains data and code for the reproduction of Cross-Lingual Transfer Learning for Statistical Type Inference.

DOI: 10.1145/3533767.3534411


Unicorn: detect runtime errors in time-series databases with hybrid input synthesis

Authors: Wu, Zhiyong and Liang, Jie and Wang, Mingzhe and Zhou, Chijin and Jiang, Yu
Keywords: Time-series Databases, Runtime Error, Hybrid Input Synthesis

Abstract

The ubiquitous use of time-series databases in the safety-critical Internet of Things domain demands strict security and correctness. One successful approach in database bug detection is fuzzing, where hundreds of bugs have been detected automatically in relational databases. However, it cannot be easily applied to time-series databases: the bulk of time-series logic is unreachable because of mismatched query specifications, and serious bugs are undetectable because of implicitly handled exceptions.

In this paper, we propose Unicorn to secure time-series databases with automated fuzzing. First, we design hybrid input synthesis to generate high-quality queries which not only cover time-series features but also ensure grammar correctness. Then, Unicorn uses proactive exception detection to discover minuscule-symptom bugs which hide behind implicit exception handling. With the specialized design oriented to time-series databases, Unicorn outperforms the state-of-the-art database fuzzers in terms of coverage and bugs. Specifically, Unicorn outperforms SQLsmith and SQLancer on the widely used time-series databases IoTDB, KairosDB, TimescaleDB, TDEngine, QuestDB, and GridDB in the number of basic blocks by 21%-199% and 34%-693%, respectively. More importantly, Unicorn has discovered 42 previously unknown bugs.

DOI: 10.1145/3533767.3534364


Artifact for “On the Use of Mutation Analysis For Evaluating Student Test Suite Quality”

Authors: Perretta, James and DeOrio, Andrew and Guha, Arjun and Bell, Jonathan
Keywords: mutation analysis, testing, testing education

Abstract

Contains data and analysis scripts accompanying the paper “On the Use of Mutation Analysis For Evaluating Student Test Suite Quality.”

DOI: 10.1145/3533767.3534217


Test mimicry to assess the exploitability of library vulnerabilities

Authors: Kang, Hong Jin and Nguyen, Truong Giang and Le, Bach and Păsăreanu, Corina S. and Lo, David
Keywords: Search-based Test Generation, Library Vulnerabilities

Abstract

Modern software engineering projects often depend on open-source software libraries, rendering them vulnerable to potential security issues in these libraries. Developers of client projects have to stay alert to security threats in their software dependencies. While there are existing tools that allow developers to assess whether a library vulnerability is reachable from a project, they face limitations. Call-graph-only approaches may produce false alarms, as the client project may not use the vulnerable code in a way that triggers the vulnerability, while test generation-based approaches face difficulties in overcoming the intrinsic complexity of exploiting a vulnerability, where extensive domain knowledge may be required to produce a vulnerability-triggering input.

In this work, we propose a new framework named Test Mimicry that constructs a test case for a client project that exploits a vulnerability in its library dependencies. Given a test case in a software library that reveals a vulnerability, our approach captures the program state associated with the vulnerability. Then, it guides test generation to construct a test case for the client program that invokes the library such that it reaches the same program state as the library's test case. Our framework is implemented in a tool, TRANSFER, which uses search-based test generation. Based on the library's test case, we produce search goals that represent the program state triggering the vulnerability. Our empirical evaluation on 22 real library vulnerabilities and 64 client programs shows that TRANSFER outperforms an existing approach, SIEGE, generating 4x more test cases that demonstrate the exploitability of vulnerabilities from client projects.

DOI: 10.1145/3533767.3534398


Reproduction Package for Article Automated Test Generation for REST APIs: No Time to Rest Yet

Authors: Kim, Myeongsoo and Xin, Qi and Sinha, Saurabh and Orso, Alessandro
Keywords: Automated software testing, RESTful APIs

Abstract

This artifact is for reproducing the results of the article Automated Test Generation for REST APIs: No Time to Rest Yet. It has an automated script to run the ten state-of-the-art REST API testing tools on 20 RESTful services. Users can analyze the results using the provided script.

DOI: 10.1145/3533767.3534401


ISSTA 22 Artifact for “Finding Bugs in Gremlin-Based Graph Database Systems via Randomized Differential Testing”

Authors: Zheng, Yingying and Dou, Wensheng and Wang, Yicheng and Qin, Zheng and Tang, Lei and Gao, Yu and Wang, Dong and Wang, Wei and Wei, Jun
Keywords: differential testing, Graph database systems, Gremlin

Abstract

Grand is a tool for finding logic bugs in Gremlin-based Graph Database Systems (GDBs). We refer to logic bugs as those bugs where a GDB returns an unexpected result (e.g., an incorrect query result or an unexpected error) for a given query without crashing.

Grand operates in the following three phases:
1. Graph database generation: the goal of this phase is to generate a populated graph database for each target GDB. Specifically, Grand first randomly generates a graph schema to define the types of vertices and edges. Then, the concrete vertices and edges are randomly generated according to the generated graph schema. Finally, the generated database is written into the target GDBs.
2. Gremlin query generation: this phase aims to generate syntactically correct and valid Gremlin queries. We first construct a traversal model for Gremlin APIs, and then generate Gremlin queries based on the constructed traversal model.
3. Differential testing: Grand executes the generated Gremlin queries and validates the query results by differential testing. A minimal sketch of this phase follows the list.
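
Illustrative sketch of the differential-testing phase (our own toy code; the per-GDB `run_query` interface and the `FakeGDB` client are hypothetical stand-ins, not Grand's actual API):

```python
def differential_test(gdbs, query):
    """Run the same Gremlin query on every GDB and flag disagreements,
    which indicate a likely logic bug in at least one system."""
    results = {}
    for name, gdb in gdbs.items():
        try:
            results[name] = frozenset(gdb.run_query(query))
        except Exception as exc:            # unexpected errors are also signals
            results[name] = ("error", type(exc).__name__)
    if len(set(results.values())) > 1:      # not all backends agree
        return ("disagreement", query, results)
    return None

class FakeGDB:
    def __init__(self, rows): self.rows = rows
    def run_query(self, query): return self.rows

report = differential_test(
    {"gdb_a": FakeGDB([1, 2, 3]), "gdb_b": FakeGDB([1, 2])},
    "g.V().count()")
print(report)   # disagreement between the two backends
```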

DOI: 10.1145/3533767.3534409


RegMiner: Towards Constructing a Large Regression Dataset from Code Evolution History

Authors: Song, Xuezhi and Lin, Yun and Ng, Siang Hwee and Wu, Yijian and Peng, Xin and Dong, Jin Song and Mei, Hong
Keywords: bug collection, mining code repository, regression bug

Abstract

Bug datasets lay a significant empirical and experimental foundation for various SE/PL research topics such as fault localization, software testing, and program repair. All well-known datasets are constructed manually, which inevitably limits their scalability, representativeness, and support for emerging data-driven research. In this work, we propose an approach to automate the process of harvesting replicable regression bugs from the code evolution history. We focus on regression bugs, as they (1) manifest how a bug is introduced and fixed (unlike non-regression bugs), (2) support regression bug analysis, and (3) incorporate a much stronger specification (i.e., the original passing version) than a non-regression bug dataset for bug analysis. Technically, we address an information retrieval problem on code evolution history. Given a code repository, we search for regressions where a test can pass a regression-fixing commit, fail a regression-inducing commit, and pass a previous working commit. In this work, we address the challenges of (1) identifying potential regression-fixing commits from the code evolution history, (2) migrating the test and its code dependencies over the history, and (3) minimizing the compilation overhead during the regression search. We build our tool, RegMiner, which harvested 1035 regressions over 147 projects in 8 weeks, creating the largest replicable regression dataset within the shortest period, to the best of our knowledge. Our extensive experiments show that (1) RegMiner can construct the regression dataset with very high precision and acceptable recall, and (2) the constructed regression dataset is of high authenticity and diversity. We foresee that a continuously growing regression dataset opens many data-driven research opportunities in the SE/PL communities.

DOI: 10.1145/3533767.3534224


One step further: evaluating interpreters using metamorphic testing

Authors: Fan, Ming and Wei, Jiali and Jin, Wuxia and Xu, Zhou and Wei, Wenying and Liu, Ting
Keywords: Robustness, Metamorphic Testing, Interpreter Evaluation, DNN Model, Backdoor

Abstract

The black-box nature of Deep Neural Networks (DNNs) makes it difficult for people to understand why a specific decision is made, which restricts their applications in critical tasks. Recently, many interpreters (interpretation methods) have been proposed to improve the transparency of DNNs by providing relevant features in the form of a saliency map. However, different interpreters might provide different interpretation results for the same classification case, which motivates us to conduct a robustness evaluation of interpreters.

However, the biggest challenge of evaluating interpreters is the testing oracle problem, i.e., it is hard to label ground-truth interpretation results. To fill this critical gap, we first use images with bounding boxes from an object detection system and images inserted with backdoor triggers as our original ground-truth dataset. Then, we apply metamorphic testing to extend the dataset with three operators: inserting an object, deleting an object, and feature-squeezing the image background. Our key intuition is that after these three operations, which do not modify the primary detected objects, the interpretation results should not change for good interpreters. Finally, we measure the quality of interpretation results quantitatively with the Intersection-over-Minimum (IoMin) score and evaluate interpreters based on the statistics of metamorphic relation failures.
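
The IoMin score has a direct formulation: the size of the overlap of two saliency regions divided by the size of the smaller region. A worked sketch on binary masks (our own toy example, assuming boolean numpy arrays):

```python
import numpy as np

def io_min(mask_a, mask_b):
    """Intersection-over-Minimum of two binary saliency masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    denom = min(mask_a.sum(), mask_b.sum())
    return inter / denom if denom else 0.0

a = np.zeros((4, 4), dtype=bool); a[0:2, 0:2] = True   # 4 salient pixels
b = np.zeros((4, 4), dtype=bool); b[0:2, 0:3] = True   # 6 salient pixels
print(io_min(a, b))   # 4 / min(4, 6) = 1.0
```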

We evaluate seven popular interpreters on 877,324 metamorphic images in diverse scenes. The results show that our approach can quantitatively evaluate interpreters’ robustness, where Grad-CAM provides the most reliable interpretation results among the seven interpreters.

DOI: 10.1145/3533767.3534225


Artefact for SnapFuzz: High-Throughput Fuzzing of Network Applications

Authors: Andronidis, Anastasios and Cadar, Cristian
Keywords: afl, aflnet, artefact, binary rewriting, fuzzing, snapfuzz

Abstract

This is the artefact submitted for the SnapFuzz: High-Throughput Fuzzing of Network Applications paper. The artefact includes a README file with full details on how to run and reproduce our results.

The latest version can always be found at: https://github.com/srg-imperial/SnapFuzz-artefact

DOI: 10.1145/3533767.3534376


Almost Correct Invariants: Synthesizing Inductive Invariants by Fuzzing Proofs

Authors: Lahiri, Sumit and Roy, Subhajit
Keywords: fuzzing, inductive invariant synthesis, testing, verification

Abstract

The artifact is a compressed tar archive containing the README, an install script to set up the tool, and other assets, which are detailed in the README file.

DOI: 10.1145/3533767.3534381


SLIME: program-sensitive energy allocation for fuzzing

Authors: Lyu, Chenyang and Liang, Hong and Ji, Shouling and Zhang, Xuhong and Zhao, Binbin and Han, Meng and Li, Yun and Wang, Zhe and Wang, Wenhai and Beyah, Raheem
Keywords: Vulnerability discovery, Fuzzing, Data-driven technique

Abstract

The energy allocation strategy is one of the most popular techniques in fuzzing to improve code coverage and vulnerability discovery. The core intuition is that fuzzers should allocate more computational energy to the seed files that have high efficiency in triggering unique paths and crashes after mutation. Existing solutions usually define several properties, e.g., the execution speed, the file size, and the number of triggered edges in the control-flow graph, to serve as the key measurements in their allocation logic to estimate the potential of a seed. The efficiency of a property is usually assumed to be the same across different programs. However, we find that this assumption is not always valid. As a result, state-of-the-art energy allocation solutions with static allocation logic struggle to achieve desirable performance on different programs.

To address the above problem, we propose a novel program-sensitive solution, named SLIME, to enable adaptive energy allocation on seed files with various properties for each program. Specifically, SLIME first designs multiple property-aware queues, with each queue containing the seed files with a specific property. Second, to improve the return on investment, SLIME leverages a customized Upper Confidence Bound Variance-aware (UCB-V) algorithm to statistically select the property queue with the highest estimated reward, i.e., the one expected to find the most new unique execution paths and crashes. Finally, SLIME mutates the seed files in the selected property queue to perform property-adaptive fuzzing on a program. We evaluate SLIME against the state-of-the-art open-source fuzzers AFL, MOPT, AFL++, AFL++HIER, EcoFuzz, and TortoiseFuzz on 9 real-world programs. The results demonstrate that SLIME discovers 3.53x, 0.24x, 0.62x, 1.54x, 0.88x, and 3.81x more unique vulnerabilities compared to the above fuzzers, respectively. We will open source the prototype of SLIME to facilitate future fuzzing research.
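
A minimal sketch of UCB-V-style queue selection (using the standard UCB-V bound from Audibert et al.; SLIME's customized variant may differ, and the reward statistics below are toy values):

```python
import math

def ucb_v(mean, var, n, t, b=1.0, zeta=1.2):
    """Upper confidence bound from the empirical mean and variance after
    n pulls out of t total rounds; b bounds the reward range."""
    bonus = (math.sqrt(2 * var * zeta * math.log(t) / n)
             + 3 * b * zeta * math.log(t) / n)
    return mean + bonus

def select_queue(stats, t):
    """stats: {queue_name: (mean_reward, variance, pulls)} -> chosen queue."""
    return max(stats, key=lambda q: ucb_v(*stats[q], t))

stats = {"speed": (0.10, 0.02, 50),
         "size":  (0.25, 0.10, 20),
         "edges": (0.30, 0.01, 5)}
print(select_queue(stats, t=75))   # "edges": high mean, few pulls -> explore it
```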

DOI: 10.1145/3533767.3534385


Reproduction package for article “MDPFuzz: Testing Models Solving Markov Decision Processes”

Authors: Pang, Qi and Yuan, Yuanyuan and Wang, Shuai
Keywords: Deep learning testing, Markov decision procedure

Abstract

Official implementation of ISSTA 2022 paper: MDPFuzz: Testing Models Solving Markov Decision Processes.

DOI: 10.1145/3533767.3534388


TensileFuzz: facilitating seed input generation in fuzzing via string constraint solving

Authors: Liu, Xuwei and You, Wei and Zhang, Zhuo and Zhang, Xiangyu
Keywords: software testing, fuzzing, dynamic analysis

Abstract

Seed inputs are critical to the performance of mutation-based fuzzers. Existing techniques make use of symbolic execution and gradient descent to generate seed inputs. However, these techniques are not particularly suitable for input growth (i.e., making inputs longer and longer), a key step in seed input generation. Symbolic execution models very low-level constraints and prefers fixed-size inputs, whereas gradient descent only handles cases where path conditions are arithmetic functions of inputs. We observe that growing an input requires considering a number of relations: length, offset, and count, in which a field is the length, the offset, or the count of some pattern in another field, respectively. String solver theory is particularly suitable for addressing these relations. We hence propose a novel technique called TensileFuzz, in which we identify input fields and denote them as string variables such that a seed input is the concatenation of these string variables. Additional padding string variables are inserted in between field variables. The aforementioned relations are reverse-engineered and lead to string constraints, solving which instantiates the padding variables and hence grows the input. Our technique also integrates linear regression and gradient descent to ensure the grown inputs satisfy path constraints that lead to path exploration. Our comparison with AFL, and a number of state-of-the-art fuzzers that have similar target applications, including Qsym, Angora, and SLF, shows that TensileFuzz substantially outperforms the others, by 39%-98% in terms of path coverage.
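
The length relation can be pictured with Z3's string theory. This is an illustrative sketch only (the field names and the concrete constraints are our own assumptions; TensileFuzz's actual constraint derivation is reverse-engineered from the target program):

```python
from z3 import Length, Solver, String, StringVal, sat

header, pad, body = String("header"), String("pad"), String("body")
s = Solver()
s.add(header == StringVal("LEN="))        # fixed field recovered from the seed
s.add(Length(body) == 16)                  # grow the body to a target size
s.add(Length(pad) == Length(body) - 8)     # one field tied to another's length
if s.check() == sat:
    m = s.model()
    grown = m[header].as_string() + m[pad].as_string() + m[body].as_string()
    print(repr(grown), len(grown))         # a longer, relation-respecting input
```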

DOI: 10.1145/3533767.3534403


PrIntFuzz: fuzzing Linux drivers via automated virtual device simulation

Authors: Ma, Zheyu and Zhao, Bodong and Ren, Letu and Li, Zheming and Ma, Siqi and Luo, Xiapu and Zhang, Chao
Keywords: Interrupt, Fuzz, Device Driver

Abstract

Linux drivers share the same address space and privilege with the core of the kernel but have a much larger code base and attack surface. Linux drivers are not well tested and have weaker security guarantees than the kernel. Lacking support from hardware devices, existing fuzzing solutions fail to cover a large portion of the driver code, e.g., the initialization code and interrupt handlers.
In this paper, we present PrIntFuzz, an efficient and universal fuzzing framework that can test the overlooked driver code, including the PRobing code and INTerrupt handlers. PrIntFuzz first extracts knowledge from the driver through inter-procedural, field-sensitive, path-sensitive, and flow-sensitive static analysis. Then it utilizes this information to build a flexible and efficient simulator, which supports device probing, hardware interrupt emulation, and device I/O interception. Lastly, PrIntFuzz applies a multi-dimensional fuzzing strategy to explore the overlooked code.
We have developed a prototype of PrIntFuzz and successfully simulated 311 virtual PCI (Peripheral Component Interconnect) devices, 472 virtual I2C (Inter-Integrated Circuit) devices, 169 virtual USB (Universal Serial Bus) devices, and found 150 bugs in the corresponding device drivers. We have submitted patches for these bugs to the Linux kernel community, and 59 patches have been merged so far. In a control experiment of Linux 5.10-rc6, PrIntFuzz found 99 bugs, while the state-of-the-art fuzzer only found 50. PrIntFuzz covers 11,968 basic blocks on the latest Linux kernel, while the state-of-the-art fuzzer Syzkaller only covers 2,353 basic blocks.

DOI: 10.1145/3533767.3534226


Reproduction Package for ‘Efficient Greybox Fuzzing of Applications in Linux-Based IoT Devices via Enhanced User-Mode Emulation’

Authors: Zheng, Yaowen and Li, Yuekang and Zhang, Cen and Zhu, Hongsong and Liu, Yang and Sun, Limin
Keywords: Enhanced User-mode Emulation, Greybox Fuzzing, Linux-based IoT Devices

Abstract

To easily use our prototype, you can follow the README.md to run the docker images and perform the testing. Source code is also included in EQUAFL_code.zip, so that others can extend it for further research.

DOI: 10.1145/3533767.3534414


Dataset for ISSTA’22 Understanding Device Integration Bugs in Smart Home System

Authors: Wang, Tao and Zhang, Kangkang and Chen, Wei and Dou, Wensheng and Zhu, Jiaxin and Wei, Jun and Huang, Tao
Keywords: Home Assistant, integration bug, smart home system

Abstract

This is the dataset for the ISSTA '22 submission “Understanding Device Integration Bugs in Smart Home System”. It contains 330 device integration bugs collected from the most popular open-source smart home system, Home Assistant.

DOI: 10.1145/3533767.3534365


A large-scale empirical analysis of the vulnerabilities introduced by third-party components in IoT firmware

Authors: Zhao, Binbin and Ji, Shouling and Xu, Jiacheng and Tian, Yuan and Wei, Qiuyang and Wang, Qinying and Lyu, Chenyang and Zhang, Xuhong and Lin, Changting and Wu, Jingzheng and Beyah, Raheem
Keywords: Vulnerability, Third-party component, IoT firmware

Abstract

As the core of IoT devices, firmware is undoubtedly vital. Currently, the development of IoT firmware heavily depends on third-party components (TPCs), which significantly improves development efficiency and reduces cost. Nevertheless, TPCs are not secure, and vulnerabilities in TPCs will in turn influence the security of IoT firmware. Existing works pay little attention to the vulnerabilities caused by TPCs, and we still lack a comprehensive understanding of the security impact of TPC vulnerabilities on firmware.
To fill in the knowledge gap, we design and implement FirmSec, which leverages syntactical features and control-flow graph features to detect TPCs at the version level in firmware, and then recognizes the corresponding vulnerabilities. Based on FirmSec, we present the first large-scale analysis of the usage of TPCs and the corresponding vulnerabilities in firmware. More specifically, we perform an analysis on 34,136 firmware images, including 11,086 publicly accessible firmware images and 23,050 private firmware images from TSmart. We successfully detect 584 TPCs and identify 128,757 vulnerabilities caused by 429 CVEs. Our in-depth analysis reveals the diversity of security issues across different kinds of firmware from various vendors, and discovers that some well-known vulnerabilities are still deeply rooted in many firmware images. We also find that the TPCs used in firmware have fallen behind by five years on average. Besides, we explore the geographical distribution of vulnerable devices, and confirm that the security situation of devices in several regions, e.g., South Korea and China, is more severe than in other regions. Further analysis shows that 2,478 commercial firmware images have potentially violated GPL/AGPL licensing terms.

DOI: 10.1145/3533767.3534366


Deadlock prediction via generalized dependency

Authors: Zhou, Jinpeng and Yang, Hanmei and Lange, John and Liu, Tongping
Keywords: Multithreaded Programs, Deadlock Prediction, Condition Variables

Abstract

Deadlocks are notorious bugs in multithreaded programs, causing serious reliability issues. However, they are difficult to fully expunge before deployment, as their appearance typically depends on specific inputs and thread schedules, which requires the assistance of dynamic tools. Existing deadlock detection tools, however, mainly focus on locks and cannot detect deadlocks related to condition variables. This paper presents a novel approach to fill this gap. It extends the classic lock dependency to a generalized dependency by abstracting the signal of a condition variable as a special resource, so that communication deadlocks can be modeled as hold-and-wait cycles as well. It further designs multiple practical mechanisms to record and analyze generalized dependencies. Finally, this paper presents the implementation of the tool, called UnHang. Experimental results on real applications show that UnHang is able to find all known deadlocks and uncover two new deadlocks. Overall, UnHang only imposes around 3% performance overhead and 8% memory overhead, making it a practical tool for deployment environments.
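
The hold-and-wait cycle view can be made concrete with a small graph check. A minimal sketch (our own illustration of the idea, not UnHang's recording machinery): locks and condition-variable signals become nodes, and a cycle in the dependency graph signals a potential deadlock.

```python
def find_cycle(edges):
    """edges maps a held resource to the resources waited on while holding it;
    a cycle means a hold-and-wait deadlock is possible."""
    visiting, done = set(), set()

    def dfs(node, path):
        visiting.add(node)
        for nxt in edges.get(node, ()):
            if nxt in visiting:
                return path + [nxt]            # found a cycle back to nxt
            if nxt not in done:
                cycle = dfs(nxt, path + [nxt])
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        return None

    for node in list(edges):
        if node not in done:
            cycle = dfs(node, [node])
            if cycle:
                return cycle
    return None

# Thread 1 holds lock L and waits for the signal on condition variable C;
# thread 2 must acquire L before it can signal C: a communication deadlock.
deps = {"lock:L": {"signal:C"}, "signal:C": {"lock:L"}}
print(find_cycle(deps))   # ['lock:L', 'signal:C', 'lock:L']
```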

DOI: 10.1145/3533767.3534377


Automated testing of image captioning systems

Authors: Yu, Boxi and Zhong, Zhiqing and Qin, Xinran and Yao, Jiayi and Wang, Yuancheng and He, Pinjia
Keywords: testing, image captioning, Metamorphic testing, AI software

Abstract

Image captioning (IC) systems, which automatically generate a text description of the salient objects in an image (real or synthetic), have seen great progress over the past few years due to the development of deep neural networks. IC plays an indispensable role in human society, for example, labeling massive photos for scientific studies and assisting visually-impaired people in perceiving the world. However, even the top-notch IC systems, such as Microsoft Azure Cognitive Services and IBM Image Caption Generator, may return incorrect results, leading to the omission of important objects, deep misunderstanding, and threats to personal safety. To address this problem, we propose MetaIC, the first metamorphic testing approach to validate IC systems. Our core idea is that the object names should exhibit directional changes after object insertion. Specifically, MetaIC (1) extracts objects from existing images to construct an object corpus; (2) inserts an object into an image via novel object resizing and location tuning algorithms; and (3) reports image pairs whose captions do not exhibit differences in an expected way. In our evaluation, we use MetaIC to test one widely-adopted image captioning API and five state-of-the-art (SOTA) image captioning models. Using 1,000 seeds, MetaIC successfully reports 16,825 erroneous issues with high precision (84.9%-98.4%). There are three kinds of errors: misclassification, omission, and incorrect quantity. We visualize the errors reported by MetaIC, which shows that flexible overlapping setting facilitates IC testing by increasing and diversifying the reported errors. In addition, MetaIC can be further generalized to detect label errors in the training dataset, which has successfully detected 151 incorrect labels in MS COCO Caption, a standard dataset in image captioning.
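
MetaIC's relation ("object names should exhibit directional changes after object insertion") can be sketched as a simple set check. The code below is our own toy illustration (the noun vocabulary and caption matching are crude stand-ins for the paper's caption analysis):

```python
def caption_objects(caption, vocabulary):
    """Return the known object nouns mentioned in a caption."""
    words = set(caption.lower().replace(".", "").split())
    return {noun for noun in vocabulary if noun in words}

def metaic_check(orig_caption, new_caption, inserted, vocabulary):
    """After inserting `inserted`, the new caption should mention it and
    keep all originally mentioned objects; otherwise report an issue."""
    before = caption_objects(orig_caption, vocabulary)
    after = caption_objects(new_caption, vocabulary)
    missing_insert = inserted not in after          # omission of the new object
    lost_objects = before - after                   # previously seen, now gone
    return missing_insert or bool(lost_objects)     # True -> erroneous issue

vocab = {"dog", "cat", "frisbee", "person"}
print(metaic_check("a dog catches a frisbee",
                   "a dog catches a frisbee", "cat", vocab))   # True: omission
```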

DOI: 10.1145/3533767.3534389


LiRTest: augmenting LiDAR point clouds for automated testing of autonomous driving systems

Authors: Guo, An and Feng, Yang and Chen, Zhenyu
Keywords: Software Testing, Metamorphic Testing, Data Augmentation, Autonomous Driving System

Abstract

With the tremendous advancement of Deep Neural Networks (DNNs), autonomous driving systems (ADS) have achieved significant development and been applied to assist in many safety-critical tasks. However, despite this spectacular progress, several real-world accidents involving autonomous cars have even resulted in fatalities. The high complexity and low interpretability of the DNN models that empower the perception capability of ADS make conventional testing techniques inapplicable to ADS perception, while existing testing techniques that depend on manual data collection and labeling are time-consuming and prohibitively expensive.

In this paper, we design and implement LiRTest, the first automated LiDAR-based autonomous vehicle testing tool. LiRTest implements an ADS-specific metamorphic relation and is equipped with affine and weather transformation operators that reflect the impact of various environmental factors to implement the relation. We evaluate LiRTest on multiple 3D object detection models to assess its performance on different tasks. The experimental results show that LiRTest can activate different neurons of the object detection models and effectively detect their erroneous behaviors under various driving conditions. Also, the results confirm that LiRTest can improve object detection precision by retraining with the generated data.

DOI: 10.1145/3533767.3534397


AIasd/FusED

Authors: Zhong, Ziyuan and Hu, Zhisheng and Guo, Shengjian and Zhang, Xinyang and Zhong, Zhenyu and Ray, Baishakhi
Keywords: advanced driving assistance system, causal analysis, multi-sensor fusion, software testing

Abstract

MIT License

Copyright (c) 2022 Ziyuan Zhong


DOI: 10.1145/3533767.3534223


Precise and efficient atomicity violation detection for interrupt-driven programs via staged path pruning

Authors: Li, Chao and Chen, Rui and Wang, Boxiang and Yu, Tingting and Gao, Dongdong and Yang, Mengfei
Keywords: static analysis, interrupt-driven programs, embedded software, concurrency bug, atomicity violation

Abstract

Interrupt-driven programs are widely used in aerospace and other safety-critical areas. However, uncertain interleaved execution of interrupts may cause concurrency bugs, which could result in serious safety problems. Most previous research on detecting interrupt concurrency bugs focuses on data races, which are usually benign, as shown in empirical studies. Some studies focus on pattern-based atomicity violations, which are most likely harmful. However, they cannot achieve high precision and scalability simultaneously. This paper presents intAtom, a precise and efficient static detection technique for interrupt atomicity violations, described by the access interleaving pattern. The key point is that it eliminates false violations by staged path pruning with constraint solving. It first identifies all the violation candidates using data-flow analysis and access interleaving pattern matching. intAtom then analyzes the path feasibility between two consecutive accesses in the preempted task/interrupt, in order to recognize the atomicity intention of developers, with the help of which it filters out some candidates. Finally, it performs modular path pruning by constructing symbolic summaries and selecting representative preemption points to eliminate infeasible paths in concurrent contexts efficiently. All the path feasibility checking processes are based on sparse value-flow analysis, which makes intAtom scalable. intAtom is evaluated on a benchmark and 6 real-world aerospace embedded programs. The experimental results show that intAtom reduces false positives by 72% and improves detection speed by a factor of 3, compared to state-of-the-art methods. Furthermore, it can analyze real-world aerospace embedded software very quickly, with an average false-positive rate of 19.6%, while finding 19 bugs that were confirmed by developers.

DOI: 10.1145/3533767.3534412


Path-sensitive code embedding via contrastive learning for software vulnerability detection

作者: Cheng, Xiao and Zhang, Guanqin and Wang, Haoyu and Sui, Yulei
关键词: vulnerabilities, contrastive learning, code embedding, Path sensitive

Abstract

Machine learning and its promising branch deep learning have shown success in a wide range of application domains. Recently, much effort has been expended on applying deep learning techniques (e.g., graph neural networks) to static vulnerability detection as an alternative to conventional bug detection methods. To obtain the structural information of code, current learning approaches typically abstract a program in the form of graphs (e.g., data-flow graphs, abstract syntax trees), and then train an underlying classification model based on the (sub)graphs of safe and vulnerable code fragments for vulnerability prediction. However, these models are still insufficient for precise bug detection, because their objective is to produce classification results rather than to comprehend the semantics of vulnerabilities, e.g., by pinpointing bug-triggering paths, which are essential for static bug detection. This paper presents ContraFlow, a selective yet precise contrastive value-flow embedding approach to statically detect software vulnerabilities. The novelty of ContraFlow lies in selecting and preserving feasible value-flow (aka program dependence) paths through a pretrained path embedding model using self-supervised contrastive learning, thus significantly reducing the amount of labeled data required for training expensive downstream models for path-based vulnerability detection. We evaluated ContraFlow using 288 real-world projects, comparing it against eight recent learning-based approaches. ContraFlow outperforms these eight baselines by up to 334.1%, 317.9%, and 58.3% in informedness, markedness, and F1 score, and achieves up to 450.0%, 192.3%, and 450.0% improvement in mean statement recall, mean statement precision, and mean IoU, respectively, for locating buggy statements.
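
The self-supervised contrastive objective at the core of such path embedding can be sketched with a generic InfoNCE-style loss (an assumption for illustration; the paper's exact loss and encoder are not reproduced here):

```python
# Generic InfoNCE-style contrastive loss over path embeddings
# (illustrative; not ContraFlow's actual implementation).
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """anchor/positive: (d,) embeddings of two views of the same
    value-flow path; negatives: (n, d) embeddings of other paths."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
a = rng.normal(size=16)
p = a + 0.01 * rng.normal(size=16)   # near-duplicate view: a positive pair
negs = rng.normal(size=(8, 16))      # unrelated paths: negatives
print(float(info_nce_loss(a, p, negs)))  # low loss for the similar pair
```

Training with such a loss pulls embeddings of the same path's views together and pushes unrelated paths apart, which is what lets a cheap downstream classifier work with little labeled data.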

DOI: 10.1145/3533767.3534371


A large-scale study of usability criteria addressed by static analysis tools

作者: Nachtigall, Marcus and Schlichtig, Michael and Bodden, Eric
关键词: user experience, tool support, static analysis, program analysis, explainability

Abstract

Static analysis tools support developers in detecting potential coding issues, such as bugs or vulnerabilities. Research on static analysis emphasizes its technical challenges but also mentions severe usability shortcomings. These shortcomings hinder the adoption of static analysis tools, and in some cases, user dissatisfaction even leads to tool abandonment.

To comprehensively assess the current state of the art, this paper presents the first systematic usability evaluation of a wide range of static analysis tools. We derived a set of 36 relevant criteria from the scientific literature and gathered a collection of 46 static analysis tools complying with our inclusion and exclusion criteria, a representative set of mainly non-proprietary tools. Then, we evaluated how well these tools fulfill the aforementioned criteria.

The evaluation shows that more than half of the considered tools offer poor warning messages, while about three-quarters of the tools provide hardly any fix support. Furthermore, the integration of user knowledge is strongly neglected, even though it could be used to improve the handling of false positives and to tune the results for the corresponding developer. Finally, the study identifies workflow integration and specialized user interfaces as further problem areas.

These findings should prove useful in guiding and focusing further research and development in the area of user experience for static code analyses.

DOI: 10.1145/3533767.3534374


Artifacts for the ISSTA 2022 Paper: An Empirical Study on the Effectiveness of Static C Code Analyzers for Vulnerability Detection

作者: Lipp, Stephan and Banescu, Sebastian and Pretschner, Alexander
关键词: empirical study, static code analysis, vulnerability detection

Abstract

This artifact contains the evaluation script and the corresponding data of the ISSTA’22 paper “An Empirical Study on the Effectiveness of Static C Code Analyzers for Vulnerability Detection”. It can be used to replicate the evaluation results as well as to perform further analyses on the effectiveness of static code analyzers.

DOI: 10.1145/3533767.3534380


Testing Dafny (experience paper)

作者: Irfan, Ahmed and Porncharoenwase, Sorawee and Rakamarić, Zvonimir
关键词: program verification, fuzzing, differential testing

Abstract

Verification toolchains are widely used to prove the correctness of critical software systems. To build confidence in their results, it is important to develop testing frameworks that help detect bugs in these toolchains. Inspired by the success of fuzzing in finding bugs in compilers and SMT solvers, we have built the first fuzzing and differential testing framework for Dafny, a high-level programming language with a Floyd-Hoare-style program verifier and compilers to C#, Java, Go, and JavaScript. This paper presents our experience building and using XDsmith, a testing framework that targets the entire Dafny toolchain, from verification to compilation. XDsmith randomly generates annotated programs in a subset of Dafny that is free of loops and heap-mutating operations. The generated programs include preconditions, postconditions, and assertions, and they have a known verification outcome. These programs are used to test the soundness and precision of the Dafny verifier, and to perform differential testing on the four Dafny compilers. Using XDsmith, we uncovered 31 bugs across the Dafny verifier and compilers, each of which has been confirmed by the Dafny developers. Moreover, 8 of these bugs have been fixed in the mainline release of Dafny.
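
A differential-testing loop of the kind described can be sketched as follows (a hypothetical harness, not XDsmith itself; the `dafny run --target` invocation and the file name are assumptions about the CLI):

```python
# Hypothetical differential-testing loop in the style of XDsmith:
# run the same generated program through each backend and compare
# outputs (command names and flags are assumptions, not XDsmith code).
import subprocess

BACKENDS = ["cs", "java", "go", "js"]  # the four Dafny compile targets

def run_backend(program_path, backend):
    """Compile and run `program_path` with one backend, returning its
    stdout (or the error text if compilation/execution fails)."""
    result = subprocess.run(
        ["dafny", "run", program_path, "--target", backend],
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout if result.returncode == 0 else f"ERR:{result.stderr}"

def differential_test(program_path):
    outputs = {b: run_backend(program_path, b) for b in BACKENDS}
    if len(set(outputs.values())) > 1:   # disagreement => candidate bug
        print(f"backends disagree on {program_path}: {outputs}")

differential_test("generated_0001.dfy")
```

Because the generated programs have a known verification outcome, the same harness shape also works for the verifier: compare the expected verdict against the one the verifier reports.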

DOI: 10.1145/3533767.3534382


Artefact for the ISSTA 2022 Paper: “Combining Static Analysis Error Traces with Dynamic Symbolic Execution (Experience Paper)”

作者: Busse, Frank and Gharat, Pritam and Cadar, Cristian and Donaldson, Alastair F.
关键词: Clang Static Analyzer, Infer, KLEE, software testing, static analysis, symbolic execution

Abstract

The artefact provides a Docker image that contains the source code and binaries of our instrumentation/bug injection tools and KLEE extension, all benchmarks with their bitcode files, scripts to reproduce our experiments, and the static analysis reports for investigated real-world applications.

DOI: 10.1145/3533767.3534384


The Raise of Machine Learning Hyperparameter Constraints in Python Code (Artifact)

作者: Rak-amnouykit, Ingkarat and Milanova, Ana and Baudart, Guillaume and Hirzel, Martin and Dolby, Julian
关键词: interprocedural analysis, machine learning libraries, Python

Abstract

The artifact for the paper “The Raise of Machine Learning Hyperparameter Constraints in Python Code”.

DOI: 10.1145/3533767.3534400


Reproduction package for Article `PermDroid: Automatically Testing Permission-Related Behaviour of Android Applications’

作者: Yang, Shuaihao and Zeng, Zigang and Song, Wei
关键词: Android, dynamic permission model, permission combinations, software testing, static analysis

Abstract

We propose PermDroid, an automated testing approach and open-source tool for testing the permission-related behavior of Android apps. For more information, please read the README file in the uploaded zip archive.

DOI: 10.1145/3533767.3534221


Reproduction Package for Article ‘Detecting and Fixing Data Loss Issues in Android Apps’

作者: Guo, Wunan and Dong, Zhen and Shen, Liwei and Tian, Wei and Su, Ting and Peng, Xin
关键词: dynamic analysis, mobile testing, patching

Abstract

This tool (iFixDataloss) consists of three components: a static analyzer, a dynamic explorer, and a patch generator. It works as follows. First, iFixDataloss uses the static analyzer to build an activity transition graph and collect the persistent data of the app under test. Second, it runs the dynamic explorer to detect data loss issues in the app; the relevant data for each data loss issue is stored in a database. Lastly, iFixDataloss takes the source code of the app and the data obtained during dynamic exploration to generate a patch.

DOI: 10.1145/3533767.3534402


Reproduction Packages for “Automatically Detecting API-induced Compatibility Issues in Android Apps: A Comparative Analysis (Replicability Study)”

作者: Liu, Pei and Zhao, Yanjie and Cai, Haipeng and Fazzini, Mattia and Grundy, John and Li, Li
关键词: Android, API, Compatibility Issues, Fragmentation

Abstract

This artefact provides the experimental datasets and results presented in our paper.

DOI: 10.1145/3533767.3534407


Artifacts for “NCScope: Hardware-Assisted Analyzer for Native Code in Android Apps”

作者: Zhou, Hao and Wu, Shuohan and Luo, Xiapu and Wang, Ting and Zhou, Yajin and Zhang, Chao and Cai, Haipeng
关键词: Android, App Analysis, Dynamic Analysis

Abstract

The artifact contains the code and dataset associated with the ISSTA’22 paper “NCScope: Hardware-Assisted Analyzer for Native Code in Android Apps”.

DOI: 10.1145/3533767.3534410


The reproduction package for ‘Detecting Resource Utilization Bugs Induced by Variant Lifecycles in Android’

作者: Lu, Yifei and Pan, Minxue and Pei, Yu and Li, Xuandong
关键词: Android applications, resource utilization bugs, static analysis, variant lifecycles

Abstract

VALA is a static analyzer that detects resource utilization bugs induced by variant lifecycles in Android applications. The VALA artifact has three components: (1) the executable jar file with configuration files, in the folder ‘bin’; (2) the benchmark of 35 apps together with all the experiment results, in the folder ‘BenchmarkAndResults’; and (3) a ‘README.md’ and a demonstration video that teach users how to use VALA and reproduce the experiment. With these three components, users can easily reproduce the experiment and obtain the results claimed in our paper ‘Detecting Resource Utilization Bugs Induced by Variant Lifecycles in Android’ within around 10 minutes. If provided with the APK files of other Android apps as input, VALA can also detect resource utilization bugs induced by variant lifecycles in those apps.

DOI: 10.1145/3533767.3534413


Patch correctness assessment in automated program repair based on the impact of patches on production and test code

作者: Ghanbari, Ali and Marcus, Andrian
关键词: Similarity, Patch Correctness Assessment, Branch Coverage, Automated Program Repair

Abstract

Test-based generate-and-validate automated program repair (APR) systems often generate many patches that pass the test suite without fixing the bug.
The generated patches must be manually inspected by the developers, so previous research proposed various techniques for automatic correctness assessment of APR-generated patches.
Among them, dynamic patch correctness assessment techniques rely on the assumption that, when running the originally passing test cases, the correct patches will not alter the program behavior in a significant way, e.g., removing the code implementing correct functionality of the program.
In this paper, we propose and evaluate a novel technique, named Shibboleth, for automatic correctness assessment of the patches generated by test-based generate-and-validate APR systems.
Unlike existing works, the impact of the patches is captured along three complementary facets, allowing more effective patch correctness assessment.
Specifically, we measure the impact of patches on both production code (via syntactic and semantic similarity) and test code (via code coverage of passing tests) to separate the patches that result in similar programs and that do not delete desired program elements.
Shibboleth assesses the correctness of patches via both ranking and classification.
We evaluated Shibboleth on 1,871 patches, generated by 29 Java-based APR systems for Defects4J programs. The technique outperforms state-of-the-art ranking and classification techniques.
Specifically, in our ranking data set, in 43% (66%) of the cases, Shibboleth ranks the correct patch in top-1 (top-2) positions, and in classification mode applied on our classification data set, it achieves an accuracy and F1-score of 0.887 and 0.852, respectively.
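
As a rough illustration of scoring along these facets (a hypothetical sketch, not Shibboleth's actual formula), a patch can be ranked by combining production-code similarity with the coverage overlap of originally passing tests:

```python
# Hypothetical patch-ranking sketch in the spirit of Shibboleth:
# prefer patches that keep the program similar to the original and
# preserve the branch coverage of originally passing tests.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def patch_score(orig_tokens, patched_tokens, orig_cov, patched_cov):
    syntactic = jaccard(orig_tokens, patched_tokens)   # production-code facet
    coverage = jaccard(orig_cov, patched_cov)          # test-code facet
    return 0.5 * syntactic + 0.5 * coverage            # equal weights assumed

orig = ["if", "x", ">", "0", "return", "x"]
patch_a = ["if", "x", ">=", "0", "return", "x"]        # small, plausible fix
patch_b = ["return", "0"]                              # deletes functionality
cov = {"b1", "b2", "b3"}
print(patch_score(orig, patch_a, cov, cov))            # high score
print(patch_score(orig, patch_b, cov, {"b1"}))         # low score
```

The intuition matches the abstract: a correct patch should neither rewrite the program wholesale nor silently drop behavior that the passing tests exercised.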

DOI: 10.1145/3533767.3534368


ATR: template-based repair for Alloy specifications

作者: Zheng, Guolong and Nguyen, ThanhVu and Brida, Simón
关键词: Template-based Repair and Synthesis, Counterexamples, Automatic Program Repair, Alloy specification

Abstract

Automatic Program Repair (APR) is a practical research topic that studies techniques to automatically repair programs to fix bugs. Most existing APR techniques are designed for imperative programming languages, such as C and Java, and rely on analyzing correct and incorrect executions of programs to identify and repair suspicious statements. We introduce a new APR approach for software specifications written in the Alloy declarative language, where specifications are not “executed”, but rather converted into logical formulas and analyzed using backend constraint solvers, to find specification instances and counterexamples to assertions. We present ATR, a technique that takes as input an Alloy specification with some violated assertion and returns a repaired specification that satisfies the assertion. The key ideas are (i) analyzing the differences between counterexamples that do not satisfy the assertion and instances that do satisfy the assertion to guide the repair and (ii) generating repair candidates from specific templates and pruning the space of repair candidates using the counterexamples and satisfying instances. Experimental results using existing large Alloy benchmarks show that ATR is effective in generating difficult repairs. ATR repairs 66.3% of 1,974 faulty specifications, including specification repairs that cannot be handled by existing Alloy repair techniques. ATR and all benchmarks are open-source and available in the following GitHub repository: https://github.com/guolong-zheng/atmprep.

DOI: 10.1145/3533767.3534369


CIRCLE: continual repair across programming languages

作者: Yuan, Wei and Zhang, Quanjun and He, Tieke and Fang, Chunrong and Hung, Nguyen Quoc Viet and Hao, Xiaodong and Yin, Hongzhi
关键词: Neural Machine Translation, Lifelong Learning, Automatic Program Repair, AI and Software Engineering

Abstract

Automatic Program Repair (APR) aims at fixing buggy source code with less manual debugging effort, playing a vital role in improving software reliability and development productivity. Recent APR work has achieved remarkable progress by applying deep learning (DL), particularly neural machine translation (NMT) techniques. However, we observe that existing DL-based APR models suffer from at least two severe drawbacks: (1) most of them can only generate patches for a single programming language, so repairing multiple languages requires building and training many repair models; (2) most of them are developed offline and therefore cannot cope with newly arriving requirements. To address these problems, we propose a T5-based APR framework equipped with continual learning ability across multiple programming languages, namely ContInual Repair aCross Programming LanguagEs (CIRCLE). Specifically, (1) CIRCLE utilizes a prompting function to narrow the gap between natural language processing (NLP) pre-training tasks and APR; (2) CIRCLE adopts a difficulty-based rehearsal strategy to achieve lifelong learning for APR without access to the full historical data; (3) an elastic regularization method is employed to further strengthen CIRCLE's continual learning ability, preventing catastrophic forgetting; (4) CIRCLE applies a simple but effective re-repairing method to revise errors introduced by crossing multiple programming languages. We train CIRCLE on four languages (i.e., C, Java, JavaScript, and Python) and evaluate it on five commonly used benchmarks. The experimental results demonstrate that CIRCLE not only effectively and efficiently repairs multiple programming languages in continual learning settings, but also achieves state-of-the-art performance (e.g., fixing 64 Defects4J bugs) with a single repair model.
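
The difficulty-based rehearsal idea can be sketched as follows (a generic illustration under the assumption that training loss proxies for difficulty; CIRCLE's actual strategy and measure may differ):

```python
# Minimal sketch of difficulty-based rehearsal for continual learning:
# keep only the hardest past examples and replay them with new data.
import heapq

class RehearsalBuffer:
    """Keep the k hardest past examples (highest training loss) and
    replay them alongside new-language data to limit forgetting."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (difficulty, id, example)

    def add(self, example, difficulty, uid):
        item = (difficulty, uid, example)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            heapq.heappushpop(self._heap, item)  # evict the easiest

    def replay(self):
        return [ex for _, _, ex in self._heap]

buf = RehearsalBuffer(capacity=2)
for uid, (ex, loss) in enumerate([("fix_java_1", 0.2), ("fix_c_7", 1.5),
                                  ("fix_py_3", 0.9)]):
    buf.add(ex, loss, uid)
print(buf.replay())  # the two hardest examples are kept for rehearsal
```

Replaying a small, hard subset approximates training on the full history without storing it, which is the point of rehearsal-based continual learning.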

DOI: 10.1145/3533767.3534219


Reproduction Package for Article `Program Vulnerability Repair via Inductive Inference’

作者: Zhang, Yuntong and Gao, Xiang and Duck, Gregory J. and Roychoudhury, Abhik
关键词: Automated program repair, Inductive inference, Snapshot fuzzing

Abstract

Contains source code, experimental subjects, and instructions to reproduce main results for the article `Program Vulnerability Repair via Inductive Inference’.

DOI: 10.1145/3533767.3534387


WASAI: Uncovering Vulnerabilities in Wasm Smart Contracts

作者: Chen, Weimin and Sun, Zihan and Wang, Haoyu and Luo, Xiapu and Cai, Haipeng and Wu, Lei
关键词: concolic fuzzing, dynamic software analysis, smart contracts

Abstract

WASAI is a concolic fuzzer for identifying vulnerabilities in Wasm smart contracts, targeting EOSIO, the main blockchain favoring Wasm. The source code is available at DOI 10.5281/zenodo.6517515.

DOI: 10.1145/3533767.3534218


Replication Data for: Finding Permission Bugs in Smart Contracts with Role Mining

作者: Liu, Ye and Li, Yi and Lin, Shang-Wei and Artho, Cyrille
关键词: access control, information flow policy, role mining, Smart contract

Abstract

Smart contracts deployed on permissionless blockchains, such as Ethereum, are accessible to any user in a trustless environment. Therefore, most smart contract applications implement access control policies to protect their valuable assets from unauthorized access. A difficulty in validating conformance to such policies, i.e., whether the contract implementation adheres to the expected behaviors, is the lack of policy specifications. In this paper, we mine past transactions of a contract to recover a likely access control model, which can then be checked against various information flow policies to identify potential bugs related to user permissions. We implement our role mining and security policy validation in a tool called SPCon. An experimental evaluation on a labeled smart-contract role-mining benchmark demonstrates that SPCon mines user roles more accurately than state-of-the-art role mining tools. Moreover, an experimental evaluation on a real-world smart contract benchmark and access-control CVEs indicates that SPCon effectively detects potential permission bugs while having better scalability and a lower false-positive rate than state-of-the-art security tools, finding 11 previously unknown bugs and detecting six CVEs that no other tool can find.
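
The core role-mining intuition can be sketched in a few lines (an illustrative simplification; SPCon's mining algorithm is more sophisticated): users who invoked the same set of contract functions in past transactions are grouped into one likely role.

```python
# Minimal role-mining sketch: group contract users by the set of
# functions they invoked, so each distinct permission set is a role.
from collections import defaultdict

def mine_roles(transactions):
    """transactions: iterable of (caller, function_name) pairs."""
    perms = defaultdict(set)
    for caller, func in transactions:
        perms[caller].add(func)
    roles = defaultdict(set)
    for caller, funcs in perms.items():
        roles[frozenset(funcs)].add(caller)   # same permissions => same role
    return roles

txs = [("0xa1", "mint"), ("0xa1", "pause"),
       ("0xb2", "transfer"), ("0xc3", "transfer"),
       ("0xd4", "mint"), ("0xd4", "pause")]
for funcs, users in mine_roles(txs).items():
    print(sorted(funcs), "->", sorted(users))
# ['mint', 'pause'] -> ['0xa1', '0xd4']   (admin-like role)
# ['transfer']      -> ['0xb2', '0xc3']   (regular users)
```

A recovered role model like this can then be checked against information flow policies, e.g., flagging a "regular user" address that ever succeeded in calling an admin-only function.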

DOI: 10.1145/3533767.3534372


Reproduction Package for Article ‘eTainter: Detecting Gas-Related Vulnerabilities in Smart Contracts’

作者: Ghaleb, Asem and Rubin, Julia and Pattabiraman, Karthik
关键词: Ethereum, security, Solidity, taint analysis

Abstract

The execution of smart contracts on the Ethereum blockchain consumes gas paid for by users submitting contracts’ invocation requests. A contract execution proceeds as long as the users dedicate enough gas for execution and the total gas for the execution is under the block gas limit set by Ethereum. Otherwise, the contract execution halts, and changes made during execution get reverted. Unfortunately, smart contracts may contain code patterns that increase their execution gas cost, causing them to run out of gas. These patterns can be manipulated by malicious attackers to induce unwanted behavior in the targeted victim contracts, e.g., Denial-of-Service (DoS) attacks. We call these gas-related vulnerabilities. The paper proposes eTainter, a static analyzer for detecting gas-related vulnerabilities based on taint tracking in the bytecode of smart contracts.

In this artifact, we provide the implementation of the proposed approach and the scripts to reproduce the results shown in the paper. Further, we provide the 3 datasets we used in our experiments, one of which is annotated.
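
The underlying taint-tracking technique can be illustrated with a toy propagation over a bytecode-like instruction list (illustration only; eTainter analyzes real EVM bytecode, and this sketch shows just the propagation idea):

```python
# Toy taint propagation: taint flows from any tainted argument to the
# destination of an instruction (not eTainter's actual analysis).
def propagate_taint(instructions, sources):
    """instructions: list of (op, dest, args) in execution order."""
    tainted = set(sources)
    for op, dest, args in instructions:
        if any(a in tainted for a in args):
            tainted.add(dest)
    return tainted

# A user-controlled value flows into a stored loop bound, the classic
# setup for a gas-related DoS: an attacker can grow the loop until
# execution exceeds the block gas limit.
prog = [
    ("CALLDATALOAD", "v1", ["input"]),   # v1 := attacker-controlled input
    ("ADD",          "v2", ["v1", "c1"]),
    ("SSTORE",       "len", ["v2"]),     # stored loop bound is tainted
]
tainted = propagate_taint(prog, sources={"input"})
if "len" in tainted:
    print("loop bound is attacker-controlled: potential gas-related DoS")
```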

DOI: 10.1145/3533767.3534378


Park: accelerating smart contract vulnerability detection via parallel-fork symbolic execution

作者: Zheng, Peilin and Zheng, Zibin and Luo, Xiapu
关键词: Symbolic Execution, Smart Contract, Blockchain

Abstract

Symbolic execution has been widely used to detect vulnerabilities in smart contracts. Unfortunately, as reported, existing symbolic tools cost too much time, since they need to execute all paths to detect vulnerabilities, so their accuracy is limited by the available time. To tackle this problem, in this paper we propose Park, the first general framework of parallel-fork symbolic execution for smart contracts. The main idea is to use multiple processes during symbolic execution, leveraging multiple CPU cores to enhance efficiency. First, we propose a fork-operation-based dynamic forking algorithm to achieve parallel symbolic contract execution. Second, to address the SMT performance loss problem in parallelization, we propose an adaptive process restriction and adjustment algorithm. Third, we design a shared-memory-based global variable reconstruction method to collect and rebuild global variables from different processes. We implement Park as a plug-in and apply it to two popular symbolic execution tools for smart contracts: Oyente and Mythril. The experimental results with third-party datasets show that Park-Oyente and Park-Mythril can provide up to 6.84x and 7.06x speedup compared to the original tools, respectively.
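
The parallel-fork idea can be sketched with Python's multiprocessing (conceptual only; Park forks OS processes inside real symbolic executors and manages SMT solver load, which this sketch does not):

```python
# Conceptual sketch of parallel path exploration: each worker process
# explores one subtree of the branch tree independently.
from multiprocessing import Pool

def explore(path_prefix, depth=3):
    """Stand-in for symbolic execution of one path: each branch point
    splits the path into a 'T' (true) and 'F' (false) successor."""
    if len(path_prefix) == depth:
        return [path_prefix]            # a complete path to check
    paths = []
    for branch in ("T", "F"):
        paths += explore(path_prefix + branch, depth)
    return paths

if __name__ == "__main__":
    # Fork one worker per top-level branch; each explores its subtree.
    with Pool(processes=2) as pool:
        results = pool.starmap(explore, [("T",), ("F",)])
    all_paths = [p for sub in results for p in sub]
    print(len(all_paths), "paths explored in parallel")  # 8 paths
```

The hard part Park addresses, and this sketch does not, is that naive forking multiplies concurrent SMT solver instances, so the number of live processes must be adapted to avoid thrashing.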

DOI: 10.1145/3533767.3534395


SmartDagger: a bytecode-based static analysis approach for detecting cross-contract vulnerability

作者: Liao, Zeqin and Zheng, Zibin and Chen, Xiao and Nan, Yuhong
关键词: static analysis, smart contract, interprocedure analysis, bug finding

Abstract

With the increasing popularity of blockchain, automatically detecting vulnerabilities in smart contracts is becoming a significant problem. Prior research mainly identifies smart contract vulnerabilities without considering the interactions between multiple contracts. Lacking fine-grained contextual information about cross-contract invocations, existing approaches often produce large numbers of false positives and false negatives. This paper proposes SmartDagger, a new framework for detecting cross-contract vulnerabilities through static analysis at the bytecode level. SmartDagger integrates a set of novel mechanisms to ensure its effectiveness and efficiency for cross-contract vulnerability detection. In particular, SmartDagger effectively recovers contract attribute information from the smart contract bytecode, which is critical for accurately identifying cross-contract vulnerabilities. Besides, instead of performing the typical whole-program analysis, which is heavyweight and time-consuming, SmartDagger selectively analyzes a subset of functions and reuses data-flow results, which helps to improve its efficiency. Our further evaluation over a manually labelled dataset shows that SmartDagger significantly outperforms other state-of-the-art tools (i.e., Oyente, Slither, Osiris, and Mythril) at detecting cross-contract vulnerabilities. In addition, when run over a randomly selected dataset of 250 real-world smart contracts, SmartDagger detects 11 cross-contract vulnerabilities, all of which are missed by prior tools.

DOI: 10.1145/3533767.3534222


Reproduction package for “Automated, Cost-effective, and Update-driven App Testing” paper

作者: Ngo, Chanh-Duc and Pastore, Fabrizio and Briand, Lionel C.
关键词: Android Testing, Model-based Testing, Regression Testing, Upgrade Testing

Abstract

This repository contains the replicability package for the evaluation conducted in the paper “Chanh Duc Ngo, Fabrizio Pastore, and Lionel Briand. 2021. Automated, Cost-effective, and Update-driven App Testing. ACM Trans. Softw. Eng. Methodol. Just Accepted (November 2021). https://dl.acm.org/doi/10.1145/3502297”. The most recent version of the tool can also be found at https://github.com/SNTSVV/ATUA

DOI: 10.1145/3533767.3543293


Automatic generation of smoke test suites for kubernetes

作者: Cannavacciuolo, Cecilio and Mariani, Leonardo
关键词: micro-services, kubernetes, cloud, automation, Smoke test

Abstract

Setting up a reliable and automated testing process can be challenging in a cloud environment, due to the many ways automatic and repeated system deployment may unexpectedly fail. Imperfect deployments may cause spurious test failures, resulting in a waste of test resources and effort. To address this issue, developers can implement smoke test suites: shallow test suites that are executed before any other test suite to verify that the system under test is fully operational and can thus be reliably tested.

This paper presents KubeSmokeTester, a tool that alleviates the effort necessary to implement smoke test suites by providing automated generation capabilities for Kubernetes applications. The tool has been evaluated on 60 versions of two industrial systems, demonstrating its suitability for anticipating spurious test failures.
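
A single generated smoke check might boil down to something like the following sketch (hypothetical; KubeSmokeTester generates richer suites automatically, and the deployment names here are invented):

```python
# Minimal smoke-test sketch for Kubernetes deployments: gate the full
# test run on rollout readiness. Shells out to kubectl, so it assumes
# kubectl is installed and configured for the target cluster.
import subprocess
import sys

def deployment_ready(name, namespace="default", timeout="120s"):
    """Return True iff the rollout completes within the timeout."""
    result = subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{name}",
         "-n", namespace, f"--timeout={timeout}"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def smoke_test(deployments):
    for name in deployments:
        if not deployment_ready(name):
            print(f"smoke test failed: {name} not ready; skipping test run")
            sys.exit(1)
    print("all deployments ready; safe to run the full test suites")

smoke_test(["frontend", "orders-service"])   # invented deployment names
```

Running such checks first is what prevents an imperfect deployment from surfacing later as a wall of spurious functional-test failures.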

DOI: 10.1145/3533767.3543298


ESBMC-CHERI: towards verification of C programs for CHERI platforms with ESBMC

作者: Brauße, Franz
关键词: formal methods, capability hardware, bounded model checking, CHERI, ARM Morello

Abstract

This paper presents ESBMC-CHERI – the first bounded model checker capable of formally verifying C programs for CHERI-enabled platforms. CHERI provides run-time protection for memory-unsafe programming languages such as C/C++ at the hardware level. At the same time, it introduces new semantics to C programs, making some safe C programs cause hardware exceptions on CHERI-extended platforms. Hence, it is crucial to detect memory safety violations and compatibility issues ahead of compilation. However, there are no current verification tools for reasoning about CHERI-C programs. We describe the work undertaken towards implementing support for CHERI-C in our state-of-the-art bounded model checker ESBMC, along with plans for future work and an extensive evaluation of ESBMC-CHERI. The ESBMC-CHERI demonstration and the source code are available at https://github.com/esbmc/esbmc/tree/cheri-clang.

DOI: 10.1145/3533767.3543289


ESBMC-Jimple: verifying Kotlin programs via jimple intermediate representation

作者: Menezes, Rafael and Moura, Daniel and Cavalcante, Helena and de Freitas, Rosiane and Cordeiro, Lucas C.
关键词: Software Model Checking, Kotlin, Jimple, Formal Verification

Abstract

We describe and evaluate the first model checker for verifying Kotlin programs through the Jimple intermediate representation. The verifier, named ESBMC-Jimple, is built on top of the Efficient SMT-based Context-Bounded Model Checker (ESBMC). It uses the Soot framework to obtain the Jimple IR, representing a simplified version of the Kotlin source code, containing a maximum of three operands per instruction. ESBMC-Jimple processes Kotlin source code together with a model of the standard Kotlin libraries and checks a set of safety properties. Experimental results show that ESBMC-Jimple can correctly verify a set of Kotlin benchmarks from the literature; it is competitive with state-of-the-art Java bytecode verifiers. A demonstration is available at https://youtu.be/J6WhNfXvJNc.

DOI: 10.1145/3533767.3543294


Faster mutation analysis with MeMu

作者: Ghanbari, Ali and Marcus, Andrian
关键词: Test Case, Mutation Analysis, Mutant, Method, Memoization

Abstract

Mutation analysis is a program analysis method with applications in assessing the quality of test cases, fault localization, test input generation, security analysis, etc. The method involves repeatedly running test suites against a large number of program mutants, often leading to poor scalability. A large body of research aims at accelerating mutation analysis via a variety of approaches, such as reducing the number of mutants, reducing the number of test cases to run, or reducing the execution time of individual mutants. This paper presents the implementation of a novel technique, named MeMu, for reducing mutant execution time by memoizing the most expensive methods in the system. Memoization is a program optimization technique that allows bypassing the execution of expensive methods and reusing pre-calculated results when repeated inputs are detected. MeMu can be used on its own or alongside existing mutation analysis acceleration techniques. The current implementation of MeMu achieves, on average, an 18.15% speed-up for the PITest JVM-based mutation testing tool.
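
The memoization technique MeMu builds on is easy to illustrate in isolation (a generic sketch; MeMu itself instruments expensive JVM methods rather than using a Python decorator):

```python
# Generic memoization sketch: cache results of an expensive,
# deterministic, side-effect-free method so repeated calls with the
# same input are served from the cache (only sound for such methods).
import functools
import time

@functools.lru_cache(maxsize=None)
def expensive_method(n):
    """Stand-in for a costly method that many mutants re-execute
    unchanged during mutation analysis."""
    time.sleep(0.1)           # simulate expensive work
    return sum(i * i for i in range(n))

start = time.perf_counter()
expensive_method(10_000)      # computed once...
expensive_method(10_000)      # ...then served from the cache
print(f"two calls took {time.perf_counter() - start:.2f}s")  # ~0.1s
```

The payoff in mutation analysis comes from the observation that most mutants leave most methods untouched, so their expensive calls repeat across mutant runs.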

DOI: 10.1145/3533767.3543288


iFixDataloss: a tool for detecting and fixing data loss issues in Android apps

作者: Guo, Wunan and Dong, Zhen and Shen, Liwei and Tian, Wei and Su, Ting and Peng, Xin
关键词: patching, mobile testing, dynamic analysis

Abstract

Android apps are event-driven, and their execution is often interrupted by external events. This interruption can cause data loss issues that annoy users. For instance, when the screen is rotated, the current app page will be destroyed and recreated; if the app state is improperly preserved, user data will be lost. In this work, we present a tool, iFixDataloss, that automatically detects and fixes data loss issues in Android apps. To achieve this, we identify scenarios in which data loss issues may occur by analyzing the Android life cycle, develop strategies to reveal data loss issues, and design patch templates to fix them. Our experiments on 66 Android apps show that iFixDataloss detected 374 data loss issues (284 of them previously unknown) and successfully generated patches for 188 of the 374 issues. Out of 20 submitted patches, 16 have been accepted by developers. Compared with state-of-the-art techniques, iFixDataloss performed significantly better in terms of the number of detected data loss issues and the quality of the generated patches. Video: https://www.youtube.com/watch?v=MAPsCo-dRKs GitHub: https://github.com/iFixDataLoss/iFixDataloss22

DOI: 10.1145/3533767.3543297


Maestro: a platform for benchmarking automatic program repair tools on software vulnerabilities

作者: Pinconschi, Eduard and Bui, Quang-Cuong and Abreu, Rui and Adão, Pedro
关键词: program repair, Vulnerability

Abstract

Automating the repair of vulnerabilities is emerging in the field of software security. Previous efforts have leveraged Automated Program Repair (APR) for the task. Reproducible pipelines of repair tools on vulnerability benchmarks can promote advances in the field, such as new repair techniques. We propose Maestro, a decentralized platform with RESTful APIs for performing automated software vulnerability repair. Our platform connects benchmarks of vulnerabilities with APR tools for performing controlled experiments. It also promotes fair comparisons among different APR tools. We compare the performance of Maestro with previous studies on four APR tools in finding repairs for ten projects. Our execution time results indicate an overhead of 23 seconds for projects in C and a reduction of 14 seconds for Java projects. We introduce an agnostic platform for vulnerability repair with preliminary tools/datasets for both C and Java. Maestro is modular and can accommodate tools, benchmarks, and repair workflows with dedicated plugins.

DOI: 10.1145/3533767.3543291


Pytest-Smell: a smell detection tool for Python unit tests

作者: Bodea, Alexandru
关键词: Unit test, Test smell detection, Software defects, Software Testing

Abstract

Code quality and design are key factors in building a successful software application. It is known that a good internal structure assures a good external quality. To improve code quality, several guidelines and best practices are defined. Along with these, a key contribution is brought by unit testing. Just like the source code, unit test code is subject to bad programming practices, known as defects or smells, that have a negative impact on the quality of the software system. As a consequence, the system becomes harder to understand, maintain, and more prone to issues and bugs. In this respect, methods and tools that automate the detection of the aforementioned unit test smells are of the utmost importance.

While there are several tools that aim to automate the detection of unit test smells, the majority of them focus on Java software systems. Moreover, the only known such framework designed for applications written in Python performs detection only for the Unittest testing library. In addition, it relies on an IDE to run, which heavily restricts its usage. The tool proposed in this paper aims to close this gap by introducing a new framework that detects test smells in Python unit tests built with the Pytest testing framework. As far as we know, a similar tool automating test smell detection for unit tests written in Pytest has not been developed yet. The proposed solution also addresses the portability issue, being a cross-platform, easy-to-install, and easy-to-use Python library.
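
A minimal detector for one classic smell, assertion roulette (several assertions without explanatory messages), can be written with Python's ast module (an illustration of the general approach, not Pytest-Smell's implementation):

```python
# Minimal AST-based detector for the "assertion roulette" test smell:
# a test with many message-less asserts is hard to diagnose on failure.
import ast

SOURCE = '''
def test_user_profile():
    user = make_user()
    assert user.name == "alice"
    assert user.age == 30
    assert user.active
'''

def assertion_roulette(tree, threshold=2):
    smells = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test"):
            # Count plain asserts with no message (node.msg is None).
            asserts = [n for n in ast.walk(node)
                       if isinstance(n, ast.Assert) and n.msg is None]
            if len(asserts) > threshold:
                smells.append((node.name, len(asserts)))
    return smells

print(assertion_roulette(ast.parse(SOURCE)))  # [('test_user_profile', 3)]
```

Other smells (e.g., sleepy tests or redundant prints) follow the same pattern: walk the AST of each test function and match the node shapes that characterize the smell.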

DOI: 10.1145/3533767.3543290


QMutPy: a mutation testing tool for Quantum algorithms and applications in Qiskit

作者: Fortunato, Daniel and Campos, José
关键词: Quantum software testing, Quantum software engineering, Quantum mutation testing, Quantum computing

Abstract

There is an inherent lack of knowledge and technology to test a quantum program properly. In this paper, building on the definition of syntactically equivalent quantum gates, we describe our efforts in developing a tool, coined QMutPy, leveraging the well-known open-source mutation tool MutPy. We further discuss the design and implementation of QMutPy, and the usage of a novel set of mutation operators that generate mutants for qubit measurements and gates. To evaluate QMutPy’s performance, we conducted a preliminary study on 11 real quantum programs written using IBM’s Qiskit library. QMutPy has proven to be an effective quantum mutation tool, providing insight into the current state of quantum tests. QMutPy is publicly available at https://github.com/danielfobooss/mutpy. Tool demo: https://youtu.be/fC4tOY5trqc.
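
A gate-replacement mutation of the kind described can be sketched over a program's AST (hypothetical; the gate-equivalence table and the mutator below are invented for illustration, and QMutPy's operators live inside MutPy's machinery):

```python
# Hypothetical sketch of a quantum gate-replacement mutation: swap a
# single-qubit gate call for a syntactically equivalent one (same
# argument shape), in the spirit of QMutPy's mutation operators.
import ast

# Single-qubit gates that take the same (qubit,) argument list
# (an assumed equivalence table, not QMutPy's).
EQUIVALENT_GATES = {"x": ["y", "z", "h"], "h": ["x"]}

class GateMutator(ast.NodeTransformer):
    def __init__(self, target, replacement):
        self.target, self.replacement = target, replacement

    def visit_Call(self, node):
        self.generic_visit(node)
        if (isinstance(node.func, ast.Attribute)
                and node.func.attr == self.target):
            node.func.attr = self.replacement   # e.g. qc.x(0) -> qc.y(0)
        return node

code = "qc.x(0)\nqc.h(1)\nqc.measure_all()"
tree = ast.parse(code)
mutant = GateMutator("x", EQUIVALENT_GATES["x"][0]).visit(tree)
print(ast.unparse(mutant))   # qc.y(0) / qc.h(1) / qc.measure_all()
```

Because the replacement gate has the same call shape, the mutant still compiles and runs; a good quantum test suite should nonetheless detect the changed circuit semantics.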

DOI: 10.1145/3533767.3543296


SpecChecker-ISA: a data sharing analyzer for interrupt-driven embedded software

作者: Wang, Boxiang and Chen, Rui and Li, Chao and Yu, Tingting and Gao, Dongdong and Yang, Mengfei
关键词: interrupt, embedded software, data sharing analysis, abstract interpretation

Abstract

Concurrency bugs are common in interrupt-driven programs, which are widely used in safety-critical areas. These bugs are often caused by incorrect data sharing among tasks and interrupts. Therefore, data sharing analysis is crucial for reasoning about the concurrency behaviours of interrupt-driven programs. Due to the variety of data access forms, existing tools suffer from both extensive false positives and false negatives when applied to interrupt-driven programs. This paper presents SpecChecker-ISA, a tool that provides sound and precise data sharing analysis for interrupt-driven embedded software. The tool uses a memory access model parameterized by numerical invariants, which are computed by value analysis based on abstract interpretation, to describe data accesses of various kinds, and then uses numerical meet operations to obtain the final data sharing results. Our experiments on 4 real-world aerospace embedded programs show that SpecChecker-ISA can find all shared data accesses with few false positives, significantly outperforming other existing tools. The demo can be accessed at https://github.com/wangilson/specchecker-isa.
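
The numerical flavor of the analysis can be illustrated with a toy overlap check on access ranges (illustration only; SpecChecker-ISA derives such ranges soundly via abstract interpretation rather than taking them as given):

```python
# Toy data-sharing check: summarize each task/ISR access as an address
# interval [lo, hi] and report variables whose intervals intersect.
def overlaps(r1, r2):
    """Non-empty intersection of two closed intervals."""
    return max(r1[0], r2[0]) <= min(r1[1], r2[1])

# Address ranges accessed by the main task and by an interrupt handler,
# e.g. from array indices whose bounds were inferred as ranges.
task_accesses = {"buf": (0x2000, 0x2007), "cfg": (0x3000, 0x3003)}
isr_accesses = {"buf": (0x2004, 0x200B)}

shared = [name for name, r in task_accesses.items()
          if name in isr_accesses and overlaps(r, isr_accesses[name])]
print("shared data:", shared)  # ['buf'] - the overlapping byte range
```

The interval intersection here plays the role of the numerical meet operation mentioned in the abstract: sharing is reported only where the two access summaries actually intersect.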

DOI: 10.1145/3533767.3543295


UniRLTest: universal platform-independent testing with reinforcement learning via image understanding

作者: Zhang, Ziqian and Liu, Yulei and Yu, Shengcheng and Li, Xin and Yun, Yexiao and Fang, Chunrong and Chen, Zhenyu
关键词: Reinforcement Learning, Image Analysis, Cross-platform Testing

Abstract

GUI testing has been prevailing in software testing. However, existing automated GUI testing tools mostly rely on the frameworks of a specific platform, so testers have to fully understand platform features before developing platform-dependent GUI testing tools. Starting from the perspective of the tester's vision, we observe that GUIs on different platforms share commonalities in widget images and layout designs, which can be leveraged to achieve platform-independent testing. We propose UniRLTest, an automated software testing framework for platform-independent testing. UniRLTest utilizes computer vision techniques to capture all the widgets in a screenshot and constructs a widget tree for each page; a set of all executable actions in each tree is generated accordingly. UniRLTest adopts a Deep Q-Network, a reinforcement learning (RL) method, for the exploration process, formalizing the Android GUI testing problem as a Markov Decision Process (MDP) in which RL can operate. We conducted evaluation experiments on 25 applications from different platforms. The results show that UniRLTest outperforms the baselines in terms of efficiency and effectiveness.
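
The RL loop can be sketched in tabular form (a conceptual simplification; UniRLTest uses a Deep Q-Network over image-derived widget states rather than a Q-table):

```python
# Tabular Q-learning sketch of GUI exploration: states are pages,
# actions are widget interactions, rewards favor new coverage.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
Q = defaultdict(float)           # Q[(state, action)] -> expected reward

def choose_action(state, actions):
    if random.random() < EPSILON:                      # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploit

def update(state, action, reward, next_state, next_actions):
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next
                                   - Q[(state, action)])

# One hypothetical step: tapping a widget reveals a brand-new page,
# which earns a positive reward (new coverage discovered).
update("login_page", "tap:submit", reward=1.0,
       next_state="home_page", next_actions=["tap:menu", "scroll"])
print(Q[("login_page", "tap:submit")])   # 0.5
```

In the full system, the state is an embedding of the screenshot-derived widget tree rather than a page name, which is what makes the same loop work across platforms.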

DOI: 10.1145/3533767.3543292


