ISSTA 2023 | 逸翎清晗🌈

CydiOS: A Model-Based Testing Framework for iOS Apps

Authors: Wu, Shuohan and Li, Jianfeng and Zhou, Hao and Fang, Yongsheng and Zhao, Kaifa and Wang, Haoyu and Qian, Chenxiong and Luo, Xiapu
Keywords: iOS, UI Testing

Abstract

This is the artifact for the paper “CydiOS: A Model-Based Testing Framework for iOS Apps”.

DOI: 10.1145/3597926.3598033


Improving Bit-Blasting for Nonlinear Integer Constraints

Authors: Jia, Fuqi and Han, Rui and Huang, Pei and Liu, Minghao and Ma, Feifei and Zhang, Jian
Keywords: nonlinear integer constraints, satisfiability modulo theories

Abstract

This Artifact Evaluation document provides an assessment of the artifact submitted with the paper titled “Improving Bit-Blasting for Nonlinear Integer Constraints”, which was accepted at ISSTA 2023. The purpose of this evaluation is to verify the artifact’s reproducibility and usefulness in advancing the field.

The tool is named BLAN, i.e., Bit-bLAst for Nonlinear integer constraints. In the paper, we combine it with an SMT-LIB frontend so that it can solve QF_NIA (quantifier-free nonlinear integer arithmetic) constraints. It is available at

https://github.com/MRVAPOR/BLAN
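
A minimal sketch of the underlying idea (not BLAN itself): encode a nonlinear integer constraint over fixed-width bit-vectors so the solver can bit-blast it to SAT. It assumes the `z3-solver` Python package; the bit-width choice below is arbitrary, whereas BLAN selects and adjusts widths far more carefully.

```python
# Minimal illustration of bit-blasting a nonlinear integer constraint
# with fixed-width bit-vectors (not BLAN itself; uses the z3-solver package).
from z3 import BitVec, Solver, sat

WIDTH = 16  # assumed bit-width; BLAN chooses widths far more carefully

x, y = BitVec('x', WIDTH), BitVec('y', WIDTH)
s = Solver()
# Nonlinear constraint x*y == 60 with side conditions, over 16-bit integers.
s.add(x * y == 60, x > 1, y > 1, x < 60, y < 60)

if s.check() == sat:
    m = s.model()
    print('x =', m[x], 'y =', m[y])
else:
    print('unsat under the chosen bit-width')
```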

DOI: 10.1145/3597926.3598034


CONCORD: Clone-Aware Contrastive Learning for Source Code

Authors: Ding, Yangruibo and Chakraborty, Saikat and Buratti, Luca and Pujar, Saurabh and Morari, Alessandro and Kaiser, Gail and Ray, Baishakhi
Keywords: Bug Detection, Code Clone, Source Code Pre-training

Abstract

Deep Learning (DL) models to analyze source code have shown immense promise during the past few years.
More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection.

While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day when learning general-purpose representations. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., the Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. On the other hand, a deviant clone by mistake might trigger malicious program behaviors.

Thus, as a proxy to incorporate developers’ coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised pre-training strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD’s clone-aware pre-training drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
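
A toy illustration (plain Python, invented vectors; not the authors’ implementation) of the clone-aware contrastive objective: pull a benign clone’s embedding toward the anchor while pushing a buggy deviant away, via a margin-based triplet loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def clone_aware_triplet_loss(anchor, benign_clone, deviant, margin=0.2):
    """Loss is zero once the benign clone is at least `margin` more
    similar to the anchor than the buggy deviant is."""
    return max(0.0, cosine(anchor, deviant) - cosine(anchor, benign_clone) + margin)

# Illustrative embeddings of an original function, a semantic clone,
# and a deviant that introduces a bug.
anchor  = [0.9, 0.1, 0.3]
clone   = [0.8, 0.2, 0.3]
deviant = [0.1, 0.9, 0.4]
print(clone_aware_triplet_loss(anchor, clone, deviant))
```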

DOI: 10.1145/3597926.3598035


Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond

Authors: Shi, Ensheng and Wang, Yanlin and Zhang, Hongyu and Du, Lun and Han, Shi and Zhang, Dongmei and Sun, Hongbin
Keywords: Efficient Fine-tuning, Empirical study, Pre-Trained Language Models, Probing Techniques, Representational Similarity Analysis

Abstract

Recently, fine-tuning pre-trained code models such as CodeBERT on downstream tasks has achieved great success in many software testing and analysis tasks. While effective and prevalent, fine-tuning the pre-trained parameters incurs a large computational cost. In this paper, we conduct an extensive experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning. We then propose efficient alternatives to fine-tune the large pre-trained code model based on the above findings. Our experimental study shows that (1) lexical, syntactic and structural properties of source code are encoded in the lower, intermediate, and higher layers, respectively, while the semantic property spans across the entire model. (2) The process of fine-tuning preserves most of the code properties. Specifically, the basic code properties captured by lower and intermediate layers are still preserved during fine-tuning. Furthermore, we find that only the representations of the top two layers change most during fine-tuning for various downstream tasks. (3) Based on the above findings, we propose Telly to efficiently fine-tune pre-trained code models via layer freezing. The extensive experimental results on five diverse downstream tasks demonstrate that training parameters and the corresponding time cost are greatly reduced, while performance is similar or better.
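
A minimal sketch of the layer-freezing idea, assuming PyTorch is available; the toy 6-layer encoder and the number of frozen layers are placeholders for a real pre-trained code model such as CodeBERT.

```python
# Sketch of freezing the bottom layers of a pre-trained encoder so only the
# top layers (and a task head) are updated, as in layer-freezing fine-tuning.
# Toy 6-layer encoder stands in for a real pre-trained code model.
import torch.nn as nn

encoder = nn.Sequential(*[nn.Linear(64, 64) for _ in range(6)])
head = nn.Linear(64, 2)  # downstream classification head

FREEZE_UP_TO = 4  # freeze layers 0..3; tune per task
for idx, layer in enumerate(encoder):
    if idx < FREEZE_UP_TO:
        for p in layer.parameters():
            p.requires_grad = False

trainable = [p for p in list(encoder.parameters()) + list(head.parameters())
             if p.requires_grad]
print(f"trainable tensors: {len(trainable)}")
# An optimizer would then be built over `trainable` only, e.g.
# torch.optim.AdamW(trainable, lr=5e-5)
```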

DOI: 10.1145/3597926.3598036


Understanding and Tackling Label Errors in Deep Learning-Based Vulnerability Detection (Experience Paper)

Authors: Nie, Xu and Li, Ningke and Wang, Kailong and Wang, Shangguang and Luo, Xiapu and Wang, Haoyu
Keywords: deep learning, denoising, vulnerability detection

Abstract

Software system complexity and security vulnerability diversity are plausible sources of the persistent challenges in software vulnerability research. Applying deep learning methods for automatic vulnerability detection has been proven an effective means to complement traditional detection approaches. Unfortunately, the lack of well-qualified benchmark datasets can critically restrict the effectiveness of deep learning-based vulnerability detection techniques. Specifically, the long-term existence of erroneous labels in the existing vulnerability datasets may lead to inaccurate, biased, and even flawed results. In this paper, we aim to obtain an in-depth understanding and explanation of the label error causes. To this end, we systematically analyze the diversified datasets used by state-of-the-art learning-based vulnerability detection approaches, and examine their techniques for collecting vulnerable source code datasets. We find that label errors heavily impact the mainstream vulnerability detection models, with a worst-case average F1 drop of 20.7%. As mitigation, we introduce two approaches to dataset denoising, which enhance model performance by an average of 10.4%. Leveraging dataset denoising methods, we provide a feasible solution to obtain high-quality labeled datasets.

DOI: 10.1145/3597926.3598037


Pattern-Based Peephole Optimizations with Java JIT Tests

Authors: Zang, Zhiqiang and Thimmaiah, Aditya and Gligoric, Milos
Keywords: Just-in-time compilers, code generation, peephole optimizations

Abstract

We present JOG, a framework that facilitates developing Java JIT peephole optimizations alongside JIT tests. JOG enables developers to write a pattern, in Java itself, that specifies desired code transformations by writing code before and after the optimization, as well as any necessary preconditions. Such patterns can be written in the same way that tests of the optimization are already written in OpenJDK. JOG translates each pattern into C/C++ code that can be integrated as a JIT optimization pass. JOG also generates Java tests for optimizations from patterns. Furthermore, JOG can automatically detect a possible shadow relation between a pair of optimizations, where the effect of the shadowed optimization is overridden by another. Our evaluation shows that JOG makes it easier to write readable JIT optimizations alongside tests without decreasing the effectiveness of JIT optimizations. We wrote 162 patterns, including 68 existing optimizations in OpenJDK, 92 new optimizations adapted from LLVM, and two new optimizations that we proposed. We opened eight pull requests (PRs) for OpenJDK, including six for new optimizations, one on removing shadowed optimizations, and one for newly generated JIT tests; seven PRs have already been integrated into the master branch of OpenJDK.
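
JOG patterns are written in Java against OpenJDK’s IR and translated to C/C++; the toy below (plain Python over nested tuples, all names invented) only conveys the before/after flavor of a peephole pattern, here folding two constant shifts into one.

```python
# Toy peephole rewrite over nested expression tuples, illustrating the
# before/after style of a pattern (not JOG itself, which emits C/C++ for
# HotSpot): fold (x << c1) << c2 into x << (c1 + c2).

def peephole(expr):
    if isinstance(expr, tuple) and expr[0] == 'shl':
        _, inner, c2 = expr
        inner = peephole(inner)          # rewrite the operand first
        if isinstance(inner, tuple) and inner[0] == 'shl':
            _, x, c1 = inner
            return ('shl', x, c1 + c2)   # merged shift
        return ('shl', inner, c2)
    return expr

print(peephole(('shl', ('shl', 'x', 3), 2)))   # ('shl', 'x', 5)
```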

DOI: 10.1145/3597926.3598038


Icicle: A Re-designed Emulator for Grey-Box Firmware Fuzzing

Authors: Chesser, Michael and Nepal, Surya and Ranasinghe, Damith C.
Keywords: Fuzzing, embedded systems, emulation

Abstract

Emulation-based fuzzers enable testing binaries without source code and facilitate testing embedded applications where automated execution on the target hardware architecture is difficult and slow. The instrumentation techniques added to extract feedback and guide input mutations towards generating effective test cases are at the core of modern fuzzers. However, modern emulation-based fuzzers have evolved by re-purposing general-purpose emulators; consequently, developing and integrating fuzzing techniques, such as instrumentation methods, is difficult and often done in an ad-hoc manner, specific to an instruction set architecture (ISA). This limits state-of-the-art fuzzing techniques to a few ISAs such as x86/x86-64 or ARM/AArch64; a significant problem for firmware fuzzing of diverse ISAs. This study presents our efforts to re-think emulation for fuzzing. We design and implement a fuzzing-specific, multi-architecture emulation framework—Icicle. We demonstrate the capability to add instrumentation once, in an architecture-agnostic manner, with low execution overhead. We employ Icicle as the emulator for a state-of-the-art ARM firmware fuzzer—Fuzzware—and replicate results. Significantly, we demonstrate that the availability of new instrumentation in Icicle enabled the discovery of new bugs. We demonstrate the fidelity of Icicle and the efficacy of architecture-agnostic instrumentation by discovering bugs in benchmarks that require a known and specific operational capability of instrumentation techniques, across a diverse set of instruction set architectures (x86-64, ARM/AArch64, RISC-V, MIPS). Further, to demonstrate the effectiveness of Icicle at discovering bugs in an architecture currently unsupported by emulation-based fuzzers, we perform a fuzzing campaign with real-world firmware binaries for Texas Instruments’ MSP430 ISA and discover 7 new bugs.

DOI: 10.1145/3597926.3598039


Fine-Grained Code Clone Detection with Block-Based Splitting of Abstract Syntax Tree

Authors: Hu, Tiancheng and Xu, Zijing and Fang, Yilin and Wu, Yueming and Yuan, Bin and Zou, Deqing and Jin, Hai
Keywords: Abstract Syntax Tree, Clone Detection, Fine-grained, Splitting

Abstract

Code clone detection aims to find similar code fragments and gains increasing importance in the field of software engineering. There are several types of techniques for detecting code clones. Text-based or token-based code clone detectors are scalable and efficient but lack consideration of syntax, thus resulting in poor performance in detecting syntactic code clones. Although some tree-based methods have been proposed to detect syntactic or semantic code clones with decent performance, they are mostly time-consuming and lack scalability. In addition, these detection methods cannot perform fine-grained code clone detection: they are unable to distinguish the concrete code blocks that are cloned. In this paper, we design Tamer, a scalable and fine-grained tree-based syntactic code clone detector. Specifically, we propose a novel method to transform the complex abstract syntax tree into simple subtrees. It can accelerate the process of detection and implement the fine-grained analysis of clone pairs to locate the concrete clone parts of the code. To examine the detection performance and scalability of Tamer, we evaluate it on the widely used dataset BigCloneBench. Experimental results show that Tamer outperforms ten state-of-the-art code clone detection tools (i.e., CCAligner, SourcererCC, Siamese, NIL, NiCad, LVMapper, Deckard, Yang2018, CCFinder, and CloneWorks).
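
A rough sketch of block-based splitting (using Python’s built-in ast module on Python code, whereas Tamer works on Java ASTs): split each function into statement-level subtrees and compare them pairwise to locate which concrete blocks look cloned. The signature and threshold are illustrative.

```python
# Sketch: split functions into statement-level subtrees and report which
# blocks of two fragments look alike (Tamer targets Java ASTs; this uses
# Python's ast module purely to illustrate block-based splitting).
import ast
from collections import Counter

def block_signatures(source):
    """One bag-of-node-types signature per top-level statement of each function."""
    sigs = []
    for func in (n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)):
        for stmt in func.body:
            sigs.append(Counter(type(n).__name__ for n in ast.walk(stmt)))
    return sigs

def similarity(a, b):
    shared = sum((a & b).values())
    total = max(sum(a.values()), sum(b.values()))
    return shared / total

frag1 = "def f(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
frag2 = "def g(v):\n    t = 0\n    for e in v:\n        t += e\n    return t * 2\n"

for i, s1 in enumerate(block_signatures(frag1)):
    for j, s2 in enumerate(block_signatures(frag2)):
        if similarity(s1, s2) > 0.8:
            print(f"block {i} of frag1 ~ block {j} of frag2")
```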

DOI: 10.1145/3597926.3598040


Reducing the Memory Footprint of IFDS-based Data-Flow Analyses Using Fine-Grained Garbage Collection (Artifact)

Authors: He, Dongjie and Gui, Yujiang and Gao, Yaoqing and Xue, Jingling
Keywords: IFDS, Path Edge Collection, Taint Analysis

Abstract

The artifact contains our implementation of the Fine-grained Garbage Collection algorithm introduced in our paper “Reducing the Memory Footprint of IFDS-based Data-Flow Analyses Using Fine-Grained Garbage Collection”. The artifact includes all scripts and benchmarks for reproducing the results and claims made in our paper.

DOI: 10.1145/3597926.3598041


Hybrid Inlining: A Framework for Compositional and Context-Sensitive Static Analysis

Authors: Liu, Jiangchao and Liu, Jierui and Di, Peng and Wu, Diyu and Zheng, Hengjie and Liu, Alex X. and Xue, Jingling
Keywords: compositional static analysis, context sensitivity, pointer analysis

Abstract

Context-sensitivity is essential for achieving good precision in inter-procedural static analysis. To be context-sensitive, top-down analysis needs to fully inline all the statements in a callee at all its callsites, leading to statement explosion. Compositional analysis, which inlines summaries of all the callees, scales up but often loses precision, as it is not strictly context-sensitive. We propose a compositional and strictly context-sensitive framework for static analysis. This framework is based on a key observation: a compositional analysis often loses precision only on some critical statements that need to be analyzed context-sensitively. Our approach hybridly inlines the critical statements and the summaries of non-critical statements of each callee, thus avoiding re-analyzing non-critical ones. In addition, our analysis lazily summarizes the critical statements, by stopping propagating the critical statements once the calling context accumulated is adequate. We have designed and implemented several analyses (including a pointer analysis) based on this framework. Our evaluation on the pointer analysis shows that it can analyze large Java programs from the DaCapo benchmark suite and industry in minutes. Compared to context-insensitive analysis, Hybrid Inlining introduces only 65% and 1% additional time overheads on DaCapo and industrial applications, respectively.

DOI: 10.1145/3597926.3598042


Artifacts for the paper: “Green Fuzzing: A Saturation-Based Stopping Criterion using Vulnerability Prediction”

Authors: Lipp, Stephan and Elsner, Daniel and Kacianka, Severin and Pretschner, Alexander and Böhme, Marcel
Keywords: fuzzing, stopping criteria, vulnerability prediction

Abstract

This repository contains the training and evaluation data, including the analysis script and machine-learned vulnerability prediction models, of the paper “Green Fuzzing: A Saturation-Based Stopping Criterion using Vulnerability Prediction”.

DOI: 10.1145/3597926.3598043


Reproduction artifact for “Testing Graph Database Engines via Query Partitioning”

Authors: Kamm, Matteo and Rigger, Manuel and Zhang, Chengyu and Su, Zhendong
Keywords: automatic testing, database testing, graph databases, test oracle

Abstract

The artifact consists of two main components:
- GDBMeter, the tool which implements Predicate Partitioning and was used to find all bugs reported in the paper.
- A SQLite database with a list of bugs that we reported and additional meta information.
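
A minimal illustration of the Predicate Partitioning oracle (plain Python over an in-memory node list, not GDBMeter’s Cypher implementation): for any predicate P, the records matching P, NOT P, and “P is unknown” must together re-compose the unpartitioned result.

```python
# Predicate Partitioning oracle, illustrated on in-memory "nodes"
# (GDBMeter issues the equivalent Cypher queries against a real engine).
nodes = [{"age": 25}, {"age": 40}, {"age": None}]

def run(predicate):
    """Stand-in for a MATCH ... WHERE query under ternary logic."""
    return [n for n in nodes if predicate(n) is True]

p      = lambda n: None if n["age"] is None else n["age"] > 30
not_p  = lambda n: None if n["age"] is None else not (n["age"] > 30)
null_p = lambda n: n["age"] is None

partitioned = run(p) + run(not_p) + run(null_p)
assert sorted(map(id, partitioned)) == sorted(map(id, nodes)), "likely a query bug"
print("partitions re-compose the full result:", len(partitioned), "of", len(nodes))
```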

DOI: 10.1145/3597926.3598044


Semantic-Based Neural Network Repair

Authors: Schumi, Richard and Sun, Jun
Keywords: AI model generation, Prolog, TensorFlow, automatic AI model repair, deep learning models, neural network generation, semantics, specification

Abstract

Recently, neural networks have spread into numerous fields including many safety-critical systems. Neural networks are built (and trained) by programming in frameworks such as TensorFlow and PyTorch. Developers apply a rich set of pre-defined layers to manually program neural networks or to automatically generate them (e.g., through AutoML). Composing neural networks with different layers is error-prone due to the non-trivial constraints that must be satisfied in order to use those layers. In this work, we propose an approach to automatically repair erroneous neural networks. The challenge is in identifying a minimal modification to the network so that it becomes valid. Modifying a layer might have cascading effects on subsequent layers and thus our approach must search recursively to identify a “globally” minimal modification. Our approach is based on an executable semantics of deep learning layers and focuses on four kinds of errors which are common in practice. We evaluate our approach for two usage scenarios, i.e., repairing automatically generated neural networks and manually written ones suffering from common model bugs. The results show that we are able to repair 100% of a set of randomly generated neural networks (which are produced with an existing AI framework testing approach) effectively and efficiently (with an average repair time of 21.08s) and 93.75% of a collection of real neural network bugs (with an average time of 3min 40s).

DOI: 10.1145/3597926.3598045


GDsmith: Detecting Bugs in Cypher Graph Database Engines

Authors: Hua, Ziyue and Lin, Wei and Ren, Luyao and Li, Zongyang and Zhang, Lu and Jiao, Wenpin and Xie, Tao
Keywords: Cypher, Differential testing, Graph database systems

Abstract

Graph database engines stand out in the era of big data for their efficiency of modeling and processing linked data. To assure the high quality of graph database engines, it is highly critical to conduct automatic test generation for graph database engines, e.g., random test generation, the most commonly adopted approach in practice. However, random test generation faces the challenge of generating complex inputs (i.e., property graphs and queries) for producing non-empty query results; generating such inputs is important especially for detecting wrong-result bugs. To address this challenge, in this paper, we propose GDsmith, the first approach for testing Cypher graph database engines. GDsmith ensures that each randomly generated query satisfies the semantic requirements. To increase the probability of producing complex queries that return non-empty results, GDsmith includes two new techniques: graph-guided generation of complex pattern combinations and data-guided generation of complex conditions. Our evaluation results demonstrate that GDsmith is effective and efficient at producing complex queries that return non-empty results for bug detection, and substantially outperforms the baselines. GDsmith successfully detects 28 bugs on the released versions of three highly popular open-source graph database engines and receives positive feedback from their developers.

DOI: 10.1145/3597926.3598046


The tool LIPuS and its experiment package of the paper “Loop Invariant Inference through SMT Solving Enhanced Reinforcement Learning”

Authors: Yu, Shiwen and Wang, Ting and Wang, Ji
Keywords: loop invariant, reinforcement learning

Abstract

This is the repository of LIPuS, which is a loop invariant inference tool based on SMT Solving Enhanced Reinforcement Learning. LIPuS is the implemented tool for the method proposed in the paper: “Loop Invariant Inference through SMT Solving Enhanced Reinforcement Learning”.

DOI: 10.1145/3597926.3598047


CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation

Authors: Dong, Yihong and Li, Ge and Jin, Zhi
Keywords: Code generation, Code intelligence, PDA, PL, Seq2Seq

Abstract

General-purpose code generation aims to automatically convert a natural language description into code snippets in a general-purpose programming language (GPL) such as Python. In the process of code generation, it is essential to guarantee that the generated code satisfies the grammar constraints of the GPL. However, existing sequence-to-sequence (Seq2Seq) approaches neglect grammar rules when generating GPL code. In this paper, we devise a pushdown automaton (PDA)-based methodology to make the first attempt to consider grammatical Seq2Seq models for general-purpose code generation, exploiting the principle that a PL is a subset of the language recognizable by a PDA and code accepted by the PDA is grammatical. Specifically, we construct a PDA module and design an algorithm to constrain the generation of Seq2Seq models to ensure grammatical correctness. Guided by this methodology, we further propose CODEP, a code generation framework equipped with a PDA module, to integrate the deduction of the PDA into deep learning. This framework leverages the state of PDA deduction (including state representation, state prediction task, and joint prediction with state) to assist models in learning PDA deduction. To comprehensively evaluate CODEP, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CODEP can employ existing sequence-based models as base models, and we show that it achieves a 100% grammatical correctness percentage on these benchmark datasets. Consequently, CODEP achieves relative improvements of 17% CodeBLEU on CONALA, 8% EM on DJANGO, and 15% CodeBLEU on JUICE-10K over the base models. Moreover, the PDA module also brings significant improvements to pre-trained models.
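
A toy sketch of PDA-constrained decoding with a balanced-bracket automaton only (plain Python; CODEP’s PDA covers the full grammar of a GPL and is coupled to a neural Seq2Seq model): at each step, candidate tokens the automaton cannot accept are masked out.

```python
# Toy PDA mask for constrained decoding: only tokens the pushdown automaton
# can accept given the current stack survive; a real model would renormalize
# its probabilities over the surviving tokens (illustrative, not CODEP).
VOCAB = ["(", ")", "x", "<eos>"]

def allowed_tokens(stack):
    ok = ["(", "x"]                      # these never violate bracket grammar
    if stack:                            # ')' only if something is open
        ok.append(")")
    else:
        ok.append("<eos>")               # may stop only when brackets balance
    return ok

def step(stack, token):
    if token == "(":
        stack.append("(")
    elif token == ")":
        stack.pop()
    return stack

stack, output = [], []
for tok in ["(", "x", ")", "<eos>"]:     # pretend these are the model's picks
    assert tok in allowed_tokens(stack), f"{tok} masked out by the PDA"
    if tok == "<eos>":
        break
    output.append(tok)
    stack = step(stack, tok)

print("".join(output))   # (x)
```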

DOI: 10.1145/3597926.3598048


Concept-Based Automated Grading of CS-1 Programming Assignments

Authors: Fan, Zhiyu and Tan, Shin Hwei and Roychoudhury, Abhik
Keywords: automated grading, concept graph, programming education

Abstract

Due to the increasing enrolments in Computer Science programs, teaching of introductory programming needs to be scaled up. This places significant strain on teaching resources for programming courses for tasks such as grading of submitted programming assignments. Conventional attempts at automated grading of programming assignments rely on test-based grading, which assigns scores based on the number of passing tests in a given test-suite. Since test-based grading may not adequately capture the student’s understanding of the programming concepts needed to solve a programming task, we propose the notion of a concept graph, which is essentially an abstracted control flow graph. Given the concept graphs extracted from a student’s solution and a reference solution, we define concept graph matching and a comparison of differing concepts. Our experiments on 1540 student submissions from a publicly available dataset show the efficacy of concept-based grading vis-a-vis test-based grading. Specifically, concept-based grading is (experimentally) shown to be closer to the grade manually assigned by the tutor. Apart from grading, the concept graph used by our approach is also useful for providing feedback to struggling students, as confirmed by our user study among tutors.

DOI: 10.1145/3597926.3598049


Artifact for “Beware of the Unexpected: Bimodal Taint Analysis”

Authors: Chow, Yiu Wai and Schäfer, Max and Pradel, Michael
Keywords: AI4SE, software security

Abstract

This artifact contains supplementary material for the paper “Beware of the Unexpected: Bimodal Taint Analysis” (ISSTA’23).

DOI: 10.1145/3597926.3598050


DeUEDroid system

Authors: Chen, Zhuo and Liu, Jie and Hu, Yubo and Wu, Lei and Zhou, Yajin and He, Yiling and Liao, Xianhao and Wang, Ke and Li, Jinku and Qin, Zhan
Keywords: machine learning, UTG

Abstract

DeUEDroid is a detection system designed for underground economy apps. It consists of two parts: a static analysis part and a machine learning part.

DOI: 10.1145/3597926.3598051


Dependency-Aware Metamorphic Testing of Datalog Engines

Authors: Mansur, Muhammad Numair and Wüstholz, Valentin and Christakis, Maria
Keywords: Datalog, fuzzing, metamorphic testing

Abstract

Datalog is a declarative query language with wide applicability, especially in program analysis. Queries are evaluated by Datalog engines, which are complex and thus prone to returning incorrect results. Such bugs, called query bugs, may compromise the soundness of upstream program analyzers, having potentially detrimental consequences in safety-critical settings.

To address this issue, we develop a metamorphic testing approach for detecting query bugs in Datalog engines. In comparison to existing work, our approach is based on rich precedence information capturing dependencies among relations in the program. This enables much more general and effective metamorphic transformations. We implement our approach in DLSmith, which detected 16 previously unknown query bugs in four Datalog engines.

DOI: 10.1145/3597926.3598052


Artifact for the ISSTA2023 Paper Fuzzing Deep Learning Compilers with HirGen

Authors: Ma, Haoyang and Shen, Qingchao and Tian, Yongqiang and Chen, Junjie and Cheung, Shing-Chi
Keywords: Fuzzer, Program Generator

Abstract

This is the artifact of HirGen. It contains about 3K lines of C++ code and CMake files for building the software. Its main purpose is to build the hirgen executable, which generates computational graphs and uses them to test DL compilers.

DOI: 10.1145/3597926.3598053


API2Vec: Learning Representations of API Sequences for Malware Detection

Authors: Cui, Lei and Cui, Jiancong and Ji, Yuede and Hao, Zhiyu and Li, Lun and Ding, Zhenquan
Keywords: Deep Learning, Embedding, Malware Detection, Random Walk

Abstract

Analyzing malware based on API call sequences is an effective approach, as the sequence reflects the dynamic execution behavior of malware. Recent advancements in deep learning have led to the application of these techniques for mining useful information from API call sequences. However, these methods mainly operate on raw sequences and may not effectively capture important information, especially for multi-process malware, mainly due to the API call interleaving problem.

Motivated by that, this paper presents API2Vec, a graph based API embedding method for malware detection. First, we build a graph model to represent the raw sequence. In particular, we design the temporal process graph (TPG) to model inter-process behavior and temporal API graph (TAG) to model intra-process behavior. With such graphs, we design a heuristic random walk algorithm to generate a number of paths that can capture the fine-grained malware behavior. By pre-training the paths using the Doc2Vec model, we are able to generate the embeddings of paths and APIs, which can further be used for malware detection. The experiments on a real malware dataset demonstrate that API2Vec outperforms the state-of-the-art embedding methods and detection methods for both accuracy and robustness, especially for multi-process malware.
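
A small standard-library sketch of the path-generation step (API names, graphs, and the hop probability are invented): random walks inside each process’s API graph (TAG), with occasional hops between processes guided by the process graph (TPG); the resulting paths would then be embedded with Doc2Vec.

```python
# Heuristic random walk over per-process API graphs (TAGs) with occasional
# hops between processes via the process graph (TPG). API names and the
# hop probability are illustrative; real paths would be embedded with Doc2Vec.
import random

TAG = {   # intra-process API transitions
    "p1": {"CreateFile": ["WriteFile"], "WriteFile": ["CloseHandle"], "CloseHandle": []},
    "p2": {"RegOpenKey": ["RegSetValue"], "RegSetValue": []},
}
TPG = {"p1": ["p2"], "p2": ["p1"]}   # inter-process (temporal) edges

def walk(start_proc, start_api, length=6, hop_prob=0.3, seed=0):
    rng = random.Random(seed)
    proc, api, path = start_proc, start_api, []
    for _ in range(length):
        path.append(f"{proc}:{api}")
        nxt = TAG[proc].get(api, [])
        if not nxt or rng.random() < hop_prob:
            proc = rng.choice(TPG[proc])            # hop to another process
            api = rng.choice(list(TAG[proc]))       # restart inside its TAG
        else:
            api = rng.choice(nxt)
    return path

print(walk("p1", "CreateFile"))
```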

DOI: 10.1145/3597926.3598054


June: A Type Testability Transformation for Improved ATG Performance

Authors: Bruce, Dan and Kelly, David and Menendez, Hector and Barr, Earl T. and Clark, David
Keywords: Automated Test Generator, Java, June, Search-based Testing, Testing

Abstract

Strings are universal containers: they are flexible to use, abundant in code, and difficult to test. String-controlled programs are programs that make branching decisions based on string input. Automatically generating valid test inputs for these programs, considering only character sequences rather than any underlying string-encoded structures, can be prohibitively expensive.
We present June, a tool that enables Java developers to expose any present latent string structure to test generation tools. June is an annotation-driven testability transformation and an extensible library, JuneLib, of structured string definitions. The core JuneLib definitions are empirically derived and provide templates for all structured strings in our test set.
June takes lightly annotated source code and injects code that permits an automated test generator (ATG) to focus on the creation of mutable substrings inside a structured string. Using June costs the developer little, with an average of 2.1 annotations per string-controlled class. June uses standard Java build tools and therefore deploys seamlessly within a Java project.
By feeding string structure information to an ATG tool, June dramatically reduces wasted effort; branches are effortlessly covered that would otherwise be extremely difficult, or impossible, to cover. This waste reduction both increases and speeds coverage. EvoSuite, for example, achieves the same coverage on June-ed classes in 1 minute, on average, as it does in 9 minutes on the un-June-ed class. These gains increase over time. On our corpus, June-ing a program compresses 24 hours of execution time into ca. 2 hours. We show that many ATG tools can reuse the same June-ed code: a few June annotations, a one-off cost, benefit many different testing regimes.

DOI: 10.1145/3597926.3598055


A Comprehensive Study on Quality Assurance Tools for Java

Authors: Liu, Han and Chen, Sen and Feng, Ruitao and Liu, Chengwei and Li, Kaixuan and Xu, Zhengzi and Nie, Liming and Liu, Yang and Chen, Yixiang
Keywords: Bug finding, CWE, Quality assurance tools, Scanning rules

Abstract

Quality assurance (QA) tools are receiving more and more attention and are widely used by developers. Given the wide range of solutions for QA technology, how to evaluate QA tools remains an open question. Most existing research is limited in the following ways: (i) They compare tools without considering scanning rules analysis. (ii) They disagree on the effectiveness of tools due to the study methodology and benchmark dataset. (iii) They do not separately analyze the role of the warnings. (iv) There is no large-scale study on the analysis of time performance. To address these problems, in the paper, we systematically select 6 free or open-source tools for a comprehensive study from a list of 148 existing Java QA tools. To carry out a comprehensive study and evaluate tools in multi-level dimensions, we first mapped the scanning rules to the CWE and analyzed the coverage and granularity of the scanning rules. Then we conducted an experiment on 5 benchmarks, including 1,425 bugs, to investigate the effectiveness of these tools. Furthermore, we made a substantial effort to investigate the effectiveness of warnings by comparing the real labeled bugs with the warnings and investigating their role in bug detection. Finally, we assessed these tools’ time performance on 1,049 projects. The useful findings based on our comprehensive study can help developers improve their tools and provide users with suggestions for selecting QA tools.

DOI: 10.1145/3597926.3598056


IcyChecker-Artifact: Detecting State Inconsistency Bugs in DApps via On-Chain Transaction Replay and Fuzzing

Authors: Ye, Mingxi and Nan, Yuhong and Zheng, Zibin and Wu, Dongpeng and Li, Huizhong
Keywords: Decentralized Application, Fuzz Testing, Smart Contract, Vulnerability detection

Abstract

This repository contains a preliminary version of the IcyChecker artifact, a state inconsistency bug checker for Ethereum smart contracts.

DOI: 10.1145/3597926.3598057


FairRec: Fairness Testing for Deep Recommender Systems

Authors: Guo, Huizhong and Li, Jinfeng and Wang, Jingyi and Liu, Xiangyu and Wang, Dongxia and Hu, Zehong and Zhang, Rong and Xue, Hui
Keywords: AI Ethics, Fairness Testing, Recommender Systems

Abstract

Deep learning-based recommender systems (DRSs) are increasingly and widely deployed in the industry, which brings significant convenience to people’s daily life in different ways. However, recommender systems are also shown to suffer from multiple issues, e.g., the echo chamber and the Matthew effect, in which the notion of “fairness” plays a core role. For instance, the system may be regarded as unfair to 1) a specific user, if the user gets worse recommendations than other users, or 2) an item (to recommend), if the item is much less likely to be exposed to the users than other items. While many fairness notions and corresponding fairness testing approaches have been developed for traditional deep classification models, they are hardly applicable to DRSs. One major challenge is that there still lacks a systematic understanding and mapping between the existing fairness notions and the diverse testing requirements for deep recommender systems, not to mention further testing or debugging activities. To address the gap, we propose FairRec, a unified framework that supports fairness testing of DRSs from multiple customized perspectives, e.g., model utility, item diversity, item popularity, etc. We also propose a novel, efficient search-based testing approach to tackle the new challenge, i.e., a double-ended discrete particle swarm optimization (DPSO) algorithm, to effectively search for hidden fairness issues in the form of certain disadvantaged groups from a vast number of candidate groups. Given the testing report, by adopting a simple re-ranking mitigation strategy on these identified disadvantaged groups, we show that the fairness of DRSs can be significantly improved. We conducted extensive experiments on multiple industry-level DRSs adopted by leading companies. The results confirm that FairRec is effective and efficient in identifying the deeply hidden fairness issues, e.g., achieving ∼95% testing accuracy with about half to one-eighth of the time.

DOI: 10.1145/3597926.3598058


ItyFuzz: Snapshot-Based Fuzzer for Smart Contract

Authors: Shou, Chaofan and Tan, Shangyin and Sen, Koushik
Keywords: DeFi security, blockchain, fuzzing, on-chain testing, smart contract

Abstract

Smart contracts are critical financial instruments, and their security is of utmost importance. However, smart contract programs are difficult to fuzz due to the persistent blockchain state behind all transactions. Mutating sequences of transactions is complex and often leads to a suboptimal exploration of both the input and program spaces. In this paper, we introduce a novel snapshot-based fuzzer ItyFuzz for testing smart contracts. In ItyFuzz, instead of storing sequences of transactions and mutating from them, we snapshot states and singleton transactions. To explore interesting states, ItyFuzz introduces a dataflow waypoint mechanism to identify states with more potential momentum. ItyFuzz also incorporates comparison waypoints to prune the space of states. By maintaining snapshots of the states, ItyFuzz can synthesize concrete exploits like reentrancy attacks quickly. Because ItyFuzz has second-level response time to test a smart contract, it can be used for on-chain testing, which has many benefits compared to local development testing. Finally, we evaluate ItyFuzz on real-world smart contracts and some hacked on-chain DeFi projects. ItyFuzz outperforms existing fuzzers in terms of instructional coverage and can find and generate realistic exploits for on-chain projects quickly.
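
A toy loop conveying the snapshot idea (plain Python over a counter-like “contract”; nothing resembling ItyFuzz’s EVM-level engine): the corpus stores state snapshots with singleton transactions, so each iteration mutates one transaction from a saved state instead of replaying a whole sequence.

```python
# Snapshot-based fuzzing sketch: the corpus stores (state, tx) snapshots,
# so each iteration replays one mutated transaction from a saved state
# rather than a whole sequence (toy "contract", illustrative only).
import copy
import random

def execute(state, tx):
    """Toy contract: the bug triggers once the balance exceeds 100."""
    state = copy.deepcopy(state)
    state["balance"] += tx["amount"]
    return state, state["balance"] > 100   # (new state, bug found?)

rng = random.Random(1)
corpus = [({"balance": 0}, {"amount": 1})]     # initial snapshot + seed tx
seen_balances = {0}

for _ in range(1000):
    state, tx = rng.choice(corpus)
    mutated = {"amount": tx["amount"] + rng.randint(-5, 50)}
    new_state, buggy = execute(state, mutated)
    if buggy:
        print("exploit state reached:", new_state)
        break
    if new_state["balance"] not in seen_balances:   # "interesting" waypoint
        seen_balances.add(new_state["balance"])
        corpus.append((new_state, mutated))
```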

DOI: 10.1145/3597926.3598059


TrickyBugs

Authors: Liu, Kaibo and Han, Yudong and Zhang, Jie M. and Chen, Zhenpeng and Sarro, Federica and Harman, Mark and Huang, Gang and Ma, Yun
Keywords: Online judge platform, Software testing, Test assessment

Abstract

This is TrickyBugs, the dataset of the ISSTA’23 paper entitled “Who Judges the Judge: An Empirical Study on Online Judge Tests”. This dataset contains the detected false positive solutions (bugs) and the corresponding generated hack test inputs and hack test outputs in the paper. Read the paper for detailed information.

DOI: 10.1145/3597926.3598060


Reproduction Package for Article “Precise and Efficient Patch Presence Test for Android Applications against Code Obfuscation”

Authors: Xie, Zifan and Wen, Ming and Jia, Haoxiang and Guo, Xiaochen and Huang, Xiaotong and Zou, Deqing and Jin, Hai
Keywords: Android Security, Library Detection, Patch Presence Test

Abstract

This is the repository for paper submission “Precise and Efficient Patch Presence Test for Android Applications against Code Obfuscation”. It introduces PHunter, which is a precise and efficient patch presence test tool for Android applications against code obfuscation, including identifier renaming, package flattening, control flow randomization, and dead code removal. PHunter does not rely on debug information and uses fine-grained anti-obfuscation

DOI: 10.1145/3597926.3598061


Reproduction Package for Article “Detecting Vulnerabilities in Linux-based Embedded Firmware with SSE-based On-demand Alias Analysis”

Authors: Cheng, Kai and Zheng, Yaowen and Liu, Tao and Guan, Le and Liu, Peng and Li, Hong and Zhu, Hongsong and Ye, Kejiang and Sun, Limin
Keywords: Embedded firmware, On-demand alias analysis, Taint analysis

Abstract

EmTaint is a novel static analysis tool for accurate and fast detection of taint-style vulnerabilities in embedded firmware. In EmTaint, we design a structured symbolic expression-based (SSE-based) on-demand alias analysis technique, which serves as a basis for resolving both implicit data flow and control flow on potentially vulnerable paths. Based on it, we devise an indirect call resolution and accurate taint analysis scheme. Combined with sanitization rule checking, EmTaint can discover a large number of taint-style vulnerabilities accurately within a limited time.

DOI: 10.1145/3597926.3598062


Definition and Detection of Defects in NFT Smart Contracts

Authors: Yang, Shuo and Chen, Jiachi and Zheng, Zibin
Keywords: NFTs, defects definition and detection, smart contracts, symbolic execution

Abstract

Recently, the birth of non-fungible tokens (NFTs) has attracted great attention. NFTs are capable of representing users’ ownership on the blockchain and have experienced tremendous market sales due to their popularity. Unfortunately, the high value of NFTs also makes them a target for attackers. The defects in NFT smart contracts could be exploited by attackers to harm the security and reliability of the NFT ecosystem. Despite the significance of this issue, there is a lack of systematic work that focuses on analyzing NFT smart contracts, which may raise worries about the security of users’ NFTs. To address this gap, in this paper, we introduce 5 defects in NFT smart contracts. Each defect is defined and illustrated with a code example highlighting its features and consequences, paired with possible solutions to fix it. Furthermore, we propose a tool named NFTGuard to detect our defined defects based on a symbolic execution framework. Specifically, NFTGuard extracts the information of the state variables from the contract abstract syntax tree (AST), which is critical for identifying variable-loading and storing operations during symbolic execution. Furthermore, NFTGuard recovers source-code-level features from the bytecode to effectively locate defects and report them based on predefined detection patterns. We run NFTGuard on 16,527 real-world smart contracts and perform an evaluation based on the manually labeled results. We find that 1,331 contracts contain at least one of the 5 defects, and the overall precision achieved by our tool is 92.6%.

DOI: 10.1145/3597926.3598063


Eunomia: Enabling User-Specified Fine-Grained Search in Symbolically Executing WebAssembly Binaries

Authors: He, Ningyu and Zhao, Zhehao and Wang, Jikai and Hu, Yubin and Guo, Shengjian and Wang, Haoyu and Liang, Guangtai and Li, Ding and Chen, Xiangqun and Guo, Yao
Keywords: Domain Specific Language, Path Explosion, Symbolic Execution, WebAssembly

Abstract

Although existing techniques have proposed automated approaches to alleviate the path explosion problem of symbolic execution, users still need to optimize symbolic execution by applying various searching strategies carefully. As existing approaches mainly support only coarse-grained global searching strategies, they cannot efficiently traverse through complex code structures. In this paper, we propose Eunomia, a symbolic execution technique that supports fine-grained search with local domain knowledge. Eunomia uses Aes, a DSL that lets users specify local searching strategies for different parts of the program. Eunomia also isolates the context of variables for different local searching strategies, avoiding conflicts. We implement Eunomia for WebAssembly, which can analyze applications written in various languages. Eunomia is the first symbolic execution engine that supports the full features of WebAssembly. We evaluate Eunomia with a microbenchmark suite and six real-world applications. Our evaluation shows that Eunomia improves bug detection by up to three orders of magnitude. We also conduct a user study that shows the benefits of using Aes. Moreover, Eunomia verifies six known bugs and detects two new zero-day bugs in Collections-C.

DOI: 10.1145/3597926.3598064


Reproduction tool and data for article “Type Batched Program Reduction”

Authors: Gharachorlu, Golnaz and Sumner, Nick
Keywords: Delta Debugging, Machine Learning, Program Reduction

Abstract

This artifact contains both the data and the underlying implementation for the paper “Type Batched Program Reduction”. The implementation of Type Batched Reducer, a tool for simplifying programs in multiple programming languages while preserving a property of interest, is written in C++. The artifact also includes the training data required to train the logistic regression models used in this reducer. The included benchmark is from the “Perses: Syntax-Guided Program Reduction” paper.

DOI: 10.1145/3597926.3598065


Reproduction Package for Article “Automatically Reproducing Android Bug Reports using Natural Language Processing and Reinforcement Learning”

Authors: Zhang, Zhaoxu and Winn, Robert and Zhao, Yu and Yu, Tingting and Halfond, William G.J.
Keywords: Android Bug Report Reproduction

Abstract

This is the artifact of our work “Automatically Reproducing Android Bug Reports using Natural Language Processing and Reinforcement Learning”, accepted at the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 2023. This artifact contains the source code of the research prototype, as well as the evaluation data and results of the paper. We provide detailed instructions for running our tool in the README file.

DOI: 10.1145/3597926.3598066


ISSTA2023 Artifact for “Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models”

Authors: Deng, Yinlin and Xia, Chunqiu Steven and Peng, Haoran and Yang, Chenyuan and Zhang, Lingming
Keywords: Fuzz Testing, Large Language Model, Test Generation

Abstract

The artifact provides the source code of the ISSTA’2023 paper “Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models”. Specifically, it contains TitanFuzz’s implementation for fuzzing PyTorch and TensorFlow.

DOI: 10.1145/3597926.3598067


Exploring Missed Optimizations in WebAssembly Optimizers

Authors: Liu, Zhibo and Xiao, Dongwei and Li, Zongjie and Wang, Shuai and Meng, Wei
Keywords: Compiler Optimization, Software Testing, WebAssembly

Abstract

The prosperous trend of deploying complex applications to web browsers has boosted the development of WebAssembly (wasm) compilation toolchains. Software written in different high-level programming languages is compiled into wasm executables, which can be executed fast and safely in a virtual machine. The performance of wasm executables depends highly on compiler optimizations. Despite the prosperous use of wasm executables, recent research has indicated that real-world wasm applications are slower than anticipated, suggesting deficiencies in wasm optimizations.

This paper aims to present the first systematic and in-depth understanding of the status quo of wasm optimizations. To do so, we present DITWO, a differential testing framework to uncover missed optimizations (MO) of wasm optimizers. DITWO compiles a C program into both a native x86 executable and a wasm executable, and differentiates the optimization indication traces (OITraces) logged by running each executable to uncover MO. Each OITrace is composed of global variable writes and function calls, two performance indicators that practically and systematically reflect the optimization degree across wasm and native executables. Our analysis of the official wasm optimizer, wasm-opt, successfully identifies 1,293 inputs triggering MO of wasm-opt. With extensive manual effort, we identify nine root causes for all MO, and we estimate that fixing the discovered MO can result in a performance improvement of at least 17.15%. We also summarize four lessons from our findings to deliver better wasm optimizations.
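
A simplified sketch of the differential step (plain Python, synthetic traces): reduce each build’s run to an OITrace of global-variable writes and function calls, then flag cases where the wasm executable performs noticeably more of either than the native one.

```python
# Differential comparison of optimization indication traces (OITraces):
# counts of global writes and calls from the native build vs. the wasm
# build; the example traces are synthetic and purely illustrative.
from collections import Counter

native_trace = ["call:init", "write:g_sum", "call:exit"]
wasm_trace   = ["call:init", "write:g_sum", "write:g_sum",
                "call:helper", "write:g_sum", "call:exit"]

def summarize(trace):
    return Counter(event.split(":")[0] for event in trace)

native, wasm = summarize(native_trace), summarize(wasm_trace)
for kind in ("write", "call"):
    if wasm[kind] > native[kind]:
        print(f"possible missed optimization: {wasm[kind] - native[kind]} "
              f"extra {kind} events in the wasm executable")
```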

DOI: 10.1145/3597926.3598068


PhysCov: Physical Test Coverage for Autonomous Vehicles

Authors: Hildebrandt, Carl and von Stein, Meriel and Elbaum, Sebastian
Keywords: Autonomous Systems, Coverage Metrics, Test Adequacy

Abstract

Adequately exercising the behaviors of autonomous vehicles is fundamental to their validation. However, quantifying an autonomous vehicle’s testing adequacy is challenging as the system’s behavior is influenced by both its state and its physical environment. To address this challenge, our work builds on two insights. First, data sensed by an autonomous vehicle provides a unique spatial signature of the physical environment inputs. Second, given the vehicle’s current state, inputs residing outside the autonomous vehicle’s physically reachable regions are less relevant to its behavior. Building on those insights, we introduce an abstraction that enables the computation of a physical environment-state coverage metric, PhysCov. The abstraction combines the sensor readings with a physical reachability analysis based on the vehicle’s state and dynamics to determine the region of the environment that may affect the autonomous vehicle. It then characterizes that region through a parameterizable geometric approximation that can trade quality for cost. Tests with the same characterizations are deemed to have had similar internal states and been exposed to similar environments, and are thus likely to exercise the same set of behaviors, while tests with distinct characterizations will increase PhysCov. A study on two simulated and one real system’s dataset examines PhysCov’s ability to quantify an autonomous vehicle’s test suite, showcases its characterization cost and precision, investigates its correlation with failures found and potential for test selection, and assesses its ability to distinguish among real-world scenarios.

DOI: 10.1145/3597926.3598069


Building Critical Testing Scenarios for Autonomous Driving from Real Accidents

Authors: Zhang, Xudong and Cai, Yan
Keywords: Autonomous Driving Systems, Panoptic Segmentation, Parameter Mutation, Scene Recovery, Testing

Abstract

One of the aims of the development and spread of autonomous driving technology is to reduce traffic accidents caused by human factors. However, recently reported data on fatal accidents involving autonomous driving systems (ADSs) shows that this important goal has not been achieved, so there is an emerging requirement for more comprehensive and targeted testing, especially on safe driving. In this paper, we propose an approach to automatically build critical testing scenarios from real-world accident data. First, we propose a new model called M-CPS (Multi-channel Panoptic Segmentation) to extract the effective information from accident records (such as images or videos) and separate the independent individuals of different traffic participants for further scene recovery. Compared with traditional panoptic segmentation models, the M-CPS model is able to effectively handle segmentation challenges due to the shooting angle, image quality, pixel overlap, and other problems in accident records. The extracted core information is then connected with a virtual testing platform to generate the original scene set. In addition, we design a mutation testing solution on the basis of the original scene set, thus greatly enriching the scene library for testing. In our experiments, the M-CPS model reaches 66.1% PQ on the CityScapes test set, showing that our model has only slight fluctuations in performance compared with the best benchmark model on the pure panoptic segmentation task. It also reaches 84.5% IoU for the semantic segmentation branch and 40.3% mAP for the instance segmentation branch on the SHIFT dataset. We then use the UCF-Crime, CADP, and US-Accidents datasets to generate the original and mutated scene sets. The generated scene sets are connected to the Apollo and Carla simulation platforms to test ADS prototypes. We find three types of scenarios that can lead to accidents of ADS prototypes, which indicates that existing ADS prototypes have defects. Our solution provides a new possible direction for the recovery of key scenarios in ADS testing and can improve efficiency in related fields.

DOI: 10.1145/3597926.3598070


CILIATE: Towards Fairer Class-Based Incremental Learning by Dataset and Training Refinement

Authors: Gao, Xuanqi and Zhai, Juan and Ma, Shiqing and Shen, Chao and Chen, Yufei and Wang, Shiwei
Keywords: fairness, incremental learning, neural network

Abstract

Due to the model aging problem, Deep Neural Networks (DNNs) need updates to adjust them to new data distributions. The common practice leverages incremental learning (IL), e.g., Class-based Incremental Learning (CIL) that updates output labels, to update the model with new data and a limited number of old data. This avoids heavyweight training (from scratch) using conventional methods and saves storage space by reducing the number of old data to store. But it also leads to poor performance in fairness. In this paper, we show that CIL suffers both dataset and algorithm bias problems, and existing solutions can only partially solve the problem. We propose a novel framework, CILIATE, that fixes both dataset and algorithm bias in CIL. It features a novel differential analysis guided dataset and training refinement process that identifies unique and important samples overlooked by existing CIL and enforces the model to learn from them. Through this process, CILIATE improves the fairness of CIL by 17.03%, 22.46%, and 31.79% compared to state-of-the-art methods, iCaRL, BiC, and WA, respectively, based on our evaluation on three popular datasets and widely used ResNet models. Our code is available at https://github.com/Antimony5292/CILIATE.

DOI: 10.1145/3597926.3598071


BehAVExplor: Behavior Diversity Guided Testing for Autonomous Driving Systems

Authors: Cheng, Mingfei and Zhou, Yuan and Xie, Xiaofei
Keywords: Apollo, Autonomous driving systems, behavior diversity, critical scenarios, fuzzing

Abstract

Testing Autonomous Driving Systems (ADSs) is a critical task for ensuring the reliability and safety of autonomous vehicles. Existing methods mainly focus on searching for safety violations while the diversity of the generated test cases is ignored, which may generate many redundant test cases and failures. Such redundant failures can reduce testing performance and increase failure analysis costs. In this paper, we present a novel behavior-guided fuzzing technique (BehAVExplor) to explore the different behaviors of the ego vehicle (i.e., the vehicle controlled by the ADS under test) and detect diverse violations. Specifically, we design an efficient unsupervised model, called BehaviorMiner, to characterize the behavior of the ego vehicle. BehaviorMiner extracts the temporal features from the given scenarios and performs a clustering-based abstraction to group behaviors with similar features into abstract states. A new test case will be added to the seed corpus if it triggers new behaviors (e.g., covers new abstract states). Due to the potential conflict between the behavior diversity and the general violation feedback, we further propose an energy mechanism to guide the seed selection and the mutation. The energy of a seed quantifies how good it is. We evaluated BehAVExplor on Apollo, an industrial-level ADS, and the LGSVL simulation environment. Empirical evaluation results show that BehAVExplor can effectively find more diverse violations than the state-of-the-art.

DOI: 10.1145/3597926.3598072


In Defense of Simple Techniques for Neural Network Test Case Selection

Authors: Bao, Shenglin and Sha, Chaofeng and Chen, Bihuan and Peng, Xin and Zhao, Wenyun
Keywords: deep learning, k-nearest neighbor, test case selection

Abstract

Although deep learning (DL) software has been pervasive in various applications, the brittleness of deep neural networks (DNNs) hinders their deployment in many tasks, especially high-stakes ones. To mitigate the risk accompanying DL software faults, a variety of DNN testing techniques have been proposed, such as test case selection. Among those test case selection or prioritization methods, the uncertainty-based ones such as DeepGini have demonstrated their effectiveness in finding DNN faults. Recently, TestRank, a learning-based test ranking method, has been shown to outperform simple uncertainty-based test selection methods. However, this is achieved with a more complicated design that needs to train a graph convolutional network and a multi-layer perceptron. In this paper, we propose a novel and lightweight DNN test selection method to enhance the effectiveness of existing simple ones. Besides the DNN model’s uncertainty on the test case itself, we take into account the model’s uncertainty on its neighbors. This could diversify the selected test cases and improve the effectiveness of existing uncertainty-based test selection methods. Extensive experiments on 5 datasets demonstrate the effectiveness of our approach.
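
A sketch of this flavor of selection (plain Python; the paper’s exact neighbor space and weighting may differ): score each test by the model’s Gini impurity on the input itself plus the average impurity over its k nearest neighbors, then rank.

```python
# Neighbor-aware uncertainty scoring for test selection (illustrative; the
# paper's exact neighbor space and weighting may differ). Each test case has
# a feature vector and the DNN's softmax output.
import math

def gini(probs):
    """DeepGini-style impurity: high when the model is uncertain."""
    return 1.0 - sum(p * p for p in probs)

def knn(idx, feats, k):
    dists = sorted(
        (math.dist(feats[idx], feats[j]), j) for j in range(len(feats)) if j != idx
    )
    return [j for _, j in dists[:k]]

def score(idx, feats, softmax, k=2, alpha=0.5):
    own = gini(softmax[idx])
    nbrs = knn(idx, feats, k)
    neighborhood = sum(gini(softmax[j]) for j in nbrs) / len(nbrs)
    return alpha * own + (1 - alpha) * neighborhood

feats   = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
softmax = [(0.5, 0.5), (0.9, 0.1), (0.6, 0.4), (0.55, 0.45)]
ranked = sorted(range(len(feats)), key=lambda i: score(i, feats, softmax), reverse=True)
print("selection order:", ranked)
```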

DOI: 10.1145/3597926.3598073


ConfFix: Repairing Configuration Compatibility Issues in Android Apps

Authors: Huang, Huaxun and Xu, Chi and Wen, Ming and Liu, Yepang and Cheung, Shing-Chi
Keywords: Android, Automated Repair, Compatibility, XML Configuration

Abstract

XML configuration files are widely-used to specify the user interfaces (UI) of Android apps. Configuration compatibility (CC) issues are induced owing to the inconsistent handling of such XML configuration files across different Android framework versions. CC issues can cause software crashes and inconsistent look-and-feels, severely impacting the user experience of Android apps. However, there is no universal solution to resolve CC issues and app developers need to handle CC issues case by case. Existing tools are designed based on predefined rules or visual features that are possibly manifested by CC issues. Unfortunately, they can fail or generate overfitting patches when the CC issues are beyond their capabilities. To fill the above research gaps, we first empirically studied the app developers’ common strategies in patching real-world CC issues. Based on the findings, we propose ConfFix, an automatic approach to repair CC issues in Android apps. ConfFix is driven by the knowledge of how an XML element is handled inconsistently in different versions of the Android framework and generates patches to eliminate such inconsistencies. We evaluated ConfFix on a set of 77 reproducible CC issues in 13 open-source Android apps. The results show that ConfFix outperforms baselines in successfully repairing 64 CC issues with a high precision. Encouragingly, the patches for 38 CC issues have been confirmed and merged by app developers.

DOI: 10.1145/3597926.3598074


Vectorizing Program Ingredients for Better JVM Testing

Authors: Gao, Tianchang and Chen, Junjie and Zhao, Yingquan and Zhang, Yuqun and Zhang, Lingming
Keywords: JVM Testing, Java Virtual Machine, Program Synthesis, Test Oracle

Abstract

JVM testing is one of the most widely-used methodologies for guaranteeing the quality of JVMs. Among various JVM testing techniques, synthesis-based JVM testing, which constructs a test program by synthesizing various code snippets (also called program ingredients), has been demonstrated to be state-of-the-art. The existing synthesis-based JVM testing work puts more effort into ensuring the validity of synthesized test programs, but ignores the influence of the huge ingredient space, which largely limits the ingredient exploration efficiency as well as JVM testing performance. In this work, we propose Vectorized JVM Testing (called VECT) to further promote the performance of synthesis-based JVM testing. Its key insight is to reduce the huge ingredient space by clustering semantically similar ingredients via vectorizing ingredients using state-of-the-art code representation. To make VECT complete and more effective, based on vectorized ingredients, VECT further designs a feedback-driven ingredient selection strategy and an enhanced test oracle. We conducted an extensive study to evaluate VECT on three popular JVMs (i.e., HotSpot, OpenJ9, and Bisheng JDK) involving five OpenJDK versions. The results demonstrate that VECT detects 115.03% ~ 776.92% more unique inconsistencies than the state-of-the-art JVM testing technique during the same testing time. In particular, VECT detects 26 previously unknown bugs for them, 15 of which have already been confirmed/fixed by developers.

DOI: 10.1145/3597926.3598075


What You See Is What You Get? It Is Not the Case! Detecting Misleading Icons for Mobile Applications

Authors: Li, Linlin and Wang, Ruifeng and Zhan, Xian and Wang, Ying and Gao, Cuiyun and Wang, Sinan and Liu, Yepang
Keywords: Android Apps, Deep Learning, Discrepancy Detection, Icon Design

Abstract

With the prevalence of smartphones, people nowadays can access a wide variety of services through diverse apps. A good Graphical User Interface (GUI) can make an app more appealing and competitive in app markets. Icon widgets, as an essential part of an app’s GUI, leverage icons to visually convey their functionalities to facilitate user interactions. However, designing intuitive icon widgets can be a non-trivial job. Developers should follow a series of guidelines and make appropriate choices from a plethora of possibilities. Inappropriately designed or misused icons may cause user confusion, lead to wrong operations, and even result in security risks (e.g., revenue loss and privacy leakage). To investigate the problem, we manually checked 9,075 icons of 1,111 top-ranked commercial apps from Google Play and found 640 misleading icons in 312 (28%) of these apps. This shows that misleading icons are prevalent among real-world apps, even the top ones. Manually identifying misleading icons to improve app quality is time-consuming and laborious. In this work, we propose the first framework, IconSeer, to automatically detect misleading icons for mobile apps. Our basic idea is to find the discrepancies between the commonly perceived intentions of an icon and the actual functionality of the corresponding icon widget. IconSeer takes an Android app as input and reports potentially misleading icons. It is powered by a comprehensive icon-intention mapping constructed by analyzing 268,353 icons collected from 15,571 popular Android apps in Google Play. The mapping includes 179 icon classes and 852 intention classes. Given an icon widget under analysis, IconSeer first employs a pre-trained open-set deep learning model to infer the possible icon class and the potential intentions. IconSeer then extracts developer-specified text properties of the icon widget, which indicate the widget’s actual functionality. Finally, IconSeer determines whether an icon is misleading by comparing the semantic similarity between the inferred intentions and the extracted text properties of the widget. We have evaluated IconSeer on the 1,111 Android apps with manually established ground truth. IconSeer successfully identified 1,172 inconsistencies (with an accuracy of 0.86), among which we further found 482 real misleading icons.
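
A bag-of-words stand-in for the final consistency check (plain Python; IconSeer uses a learned open-set classifier and an icon-intention mapping mined from 268,353 icons, and the words and threshold below are invented): compare the intentions inferred for an icon against the widget’s declared text and flag low similarity as potentially misleading.

```python
# Toy mismatch check between an icon's inferred intentions and the widget's
# declared text (IconSeer uses a deep open-set classifier and a mined
# icon-intention mapping; the words and threshold here are illustrative).
def cosine_bow(a_words, b_words):
    a, b = set(a_words), set(b_words)
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) ** 0.5 * len(b) ** 0.5)

inferred_intentions = ["share", "send", "forward"]      # from the icon image
widget_text = ["delete", "account", "permanently"]      # e.g., from contentDescription

sim = cosine_bow(inferred_intentions, widget_text)
print("similarity:", round(sim, 2), "-> misleading" if sim < 0.3 else "-> consistent")
```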

DOI: 10.1145/3597926.3598076


Testing the Compiler for a New-Born Programming Language: An Industrial Case Study (Experience Paper)

作者: Zhao, Yingquan and Chen, Junjie and Fu, Ruifeng and Ye, Haojie and Wang, Zan
关键词: Compiler Testing, Metamorphic Testing, Program Synthesis

Abstract

Due to the critical role of compilers, many compiler testing techniques have been proposed, the two most notable categories of which are grammar-based and metamorphic-based techniques. All of them have been extensively studied for testing mature compilers. However, in practice it is common to develop a new compiler for a new-born programming language. In this scenario, the existing techniques are hardly applicable for several major reasons: (1) no reference compilers to support differential testing, (2) lack of program analysis tools to support most metamorphic-based compiler testing, and (3) substantial implementation effort incurred by different programming language features. Hence, it is unknown how the existing techniques perform in this new scenario.

In this work, we conduct the first exploration (i.e., an industrial case study) to investigate the performance of the existing techniques in this new scenario with substantial adaptations. We adapted grammar-based compiler testing to this scenario by synthesizing new test programs based on code snippets and using compilation crash as test oracle due to the lack of reference compilers for differential testing. We also adapted metamorphic-based compiler testing to this scenario by constructing equivalent test programs under any inputs to relieve the dependence on program analysis tools. We call the adapted techniques SynFuzz and MetaFuzz, respectively.

We evaluated both SynFuzz and MetaFuzz on two versions of a new compiler for a new-born programming language in a global IT company. By comparing with the testing practice adopted by the testing team and the general fuzzer (AFL), SynFuzz can detect more bugs during the same testing time, and both SynFuzz and MetaFuzz can complement the other two techniques. In particular, SynFuzz and MetaFuzz have detected 11 previously unknown bugs, all of which have been fixed by the developers. From the industrial case study, we summarized a series of lessons and suggestions for practical use and future research.

DOI: 10.1145/3597926.3598077


Quantitative Policy Repair for Access Control on the Cloud

作者: Eiers, William and Sankaran, Ganesh and Bultan, Tevfik
关键词: access control, policy analysis, policy repair, quantitative analysis

Abstract

With the growing prevalence of cloud computing, providing secure access to information stored in the cloud has become a critical problem. Due to the complexity of access control policies, administrators may inadvertently allow unintended access to private information, which is a common source of data breaches in cloud-based services. In this paper, we present a quantitative symbolic analysis approach for automated policy repair in order to fix overly permissive policies. We encode the semantics of the access control policies using SMT formulas and assess their permissiveness using model counting. Given a policy, a permissiveness bound, and a set of requests that should be allowed, we iteratively repair the policy through permissiveness reduction and refinement, so that the permissiveness bound is reached while the given set of requests is still allowed. We demonstrate the effectiveness of our automated policy repair technique by applying it to policies written in Amazon’s AWS Identity and Access Management (IAM) policy language.
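
A toy sketch of the iterative repair loop, with a finite request universe so that set cardinality can stand in for SMT model counting; the request names, bound, and deny-one-request refinement step are illustrative, not the paper's IAM encoding.

```python
# Toy sketch of the quantitative repair loop over a finite request universe:
# set cardinality stands in for SMT-based model counting, and removing one
# request stands in for a real policy refinement. All names are illustrative.
def repair(allowed, must_allow, bound):
    """Shrink the allowed set until its size is within `bound`,
    while never removing a request from `must_allow`."""
    assert must_allow <= allowed and len(must_allow) <= bound
    removable = sorted(allowed - must_allow)   # candidates for denial, deterministic order
    while len(allowed) > bound and removable:
        allowed = allowed - {removable.pop()}  # one refinement step: deny one more request
    return allowed

universe = {f"s3:GetObject:key{i}" for i in range(10)}
repaired = repair(allowed=set(universe),
                  must_allow={"s3:GetObject:key0", "s3:GetObject:key1"},
                  bound=4)
print(len(repaired), sorted(repaired))
```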

DOI: 10.1145/3597926.3598078


Validating Multimedia Content Moderation Software via Semantic Fusion

作者: Wang, Wenxuan and Huang, Jingyuan and Chen, Chang and Gu, Jiazhen and Zhang, Jianping and Wu, Weibin and He, Pinjia and Lyu, Michael
关键词: Software testing, metamorphic testing, multimedia content moderation, semantic fusion

Abstract

The exponential growth of social media platforms, such as Facebook, Instagram, YouTube, and TikTok, has revolutionized communication and content publication in human society. Users on these platforms can publish multimedia content that delivers information via a combination of text, audio, images, and video. Meanwhile, the multimedia content release facility has been increasingly exploited to propagate toxic content, such as hate speech, malicious advertisements, and pornography. To this end, content moderation software has been widely deployed on these platforms to detect and block toxic content. However, due to the complexity of content moderation models and the difficulty of understanding information across multiple modalities, existing content moderation software can fail to detect toxic content, which often leads to extremely negative impacts (e.g., harmful effects on teen mental health).
We introduce Semantic Fusion, a general, effective methodology for validating multimedia content moderation software. Our key idea is to fuse two or more existing single-modal inputs (e.g., a textual sentence and an image) into a new input that combines the semantics of its ancestors in a novel manner and is toxic by construction. This fused input is then used for validating multimedia content moderation software. We realized Semantic Fusion as DUO, a practical content moderation software testing tool. In our evaluation, we employ DUO to test five commercial content moderation software products and two state-of-the-art models against three kinds of toxic content. The results show that DUO achieves up to 100% error finding rate (EFR) when testing moderation software and up to 94.1% EFR when testing the state-of-the-art models. In addition, we leverage the test cases generated by DUO to retrain the two models we explored, which largely improves model robustness (2.5%∼5.7% EFR) while maintaining the accuracy on the original test set.
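
One plausible fusion operator is rendering a (toxic) sentence onto a benign image so that the fused input is toxic by construction; the sketch below shows such an overlay with Pillow, using placeholder text and a blank image rather than real inputs.

```python
# Sketch of one semantic-fusion operator: render a (placeholder) toxic sentence
# onto a benign image so the fused input carries toxicity across modalities.
# The image, text and output path are placeholders; real inputs come from a toxic corpus.
from PIL import Image, ImageDraw

benign = Image.new("RGB", (512, 256), color="white")   # stands in for a benign photo
draw = ImageDraw.Draw(benign)
draw.text((20, 110), "PLACEHOLDER_TOXIC_SENTENCE", fill="black")

benign.save("fused_test_case.png")
# Metamorphic expectation: a sound moderation system should flag the fused image,
# since the fused input is toxic by construction.
```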

DOI: 10.1145/3597926.3598079


Towards More Realistic Evaluation for Neural Test Oracle Generation

作者: Liu, Zhongxin and Liu, Kui and Xia, Xin and Yang, Xiaohu
关键词: Neural Network, Realistic Evaluation, Test Oracle Generation

Abstract

Unit testing has become an essential practice during software development and maintenance. Effective unit tests can help guard and improve software quality but require a substantial amount of time and effort to write and maintain. A unit test consists of a test prefix and a test oracle. Synthesizing test oracles, especially functional oracles, is a well-known challenging problem. Recent studies proposed to leverage neural models to generate test oracles, i.e., neural test oracle generation (NTOG), and obtained promising results. However, after a systematic inspection, we find that there are some inappropriate settings in existing evaluation methods for NTOG. These settings could mislead the understanding of existing NTOG approaches’ performance. We summarize them as 1) generating test prefixes from bug-fixed program versions, 2) evaluating with an unrealistic metric, and 3) lacking a straightforward baseline. In this paper, we first investigate the impacts of these settings on evaluating and understanding the performance of NTOG approaches. We find that 1) unrealistically generating test prefixes from bug-fixed program versions inflates the number of bugs found by the state-of-the-art NTOG approach TOGA by 61.8%, 2) FPR (False Positive Rate) is not a realistic evaluation metric and the Precision of TOGA is only 0.38%, and 3) a straightforward baseline NoException, which simply expects no exception to be raised, can find 61% of the bugs found by TOGA with twice the Precision. Furthermore, we introduce an additional ranking step to existing evaluation methods and propose an evaluation metric named Found@K to better measure the cost-effectiveness of NTOG approaches in terms of bug finding. We propose a novel unsupervised ranking method to instantiate this ranking step, significantly improving the cost-effectiveness of TOGA. Eventually, based on our experimental results and observations, we propose a more realistic evaluation method, TEval+, for NTOG and summarize seven rules of thumb for pushing NTOG approaches toward practical usage.
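
A small sketch of how a Found@K-style metric could be computed from a ranked list of generated oracles; the ranking and bug labels are made up, and this is not the paper's exact definition.

```python
# Sketch of a Found@K-style computation: rank generated oracles by a score,
# inspect only the top-K, and count how many distinct bugs those reveal.
# The ranking and bug labels below are made up for illustration.
def found_at_k(ranked_oracles, k):
    """ranked_oracles: list of (oracle_id, bug_id_or_None), best-ranked first."""
    return len({bug for _, bug in ranked_oracles[:k] if bug is not None})

ranked = [("o1", "BUG-7"), ("o2", None), ("o3", "BUG-7"), ("o4", "BUG-2"), ("o5", None)]
print(found_at_k(ranked, k=3))  # 1 distinct bug in the top 3
print(found_at_k(ranked, k=4))  # 2 distinct bugs in the top 4
```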

DOI: 10.1145/3597926.3598080


Back Deduction Based Testing for Word Sense Disambiguation Ability of Machine Translation Systems

作者: Wang, Jun and Li, Yanhui and Huang, Xiang and Chen, Lin and Zhang, Xiaofang and Zhou, Yuming
关键词: Back Deduction, Machine Translation, Software Testing, Word Sense Disambiguation

Abstract

Machine translation systems have penetrated our daily lives, providing translation services from source language to target language to millions of users online daily. Word Sense Disambiguation (WSD) is one of the essential functional requirements of machine translation systems, which aims to determine the exact sense of polysemes in the given context. Commercial machine translation systems (e.g., Google Translate) have been shown to fail in identifying the proper sense and consequently cause translation errors. However, to our knowledge, no prior studies focus on testing such WSD bugs for machine translation systems. To tackle this challenge, we propose a novel testing method Back Deduction based Testing for Word Sense Disambiguation (BDTD). Our method’s main idea is to obtain the hidden senses of source words via back deduction from the target language, i.e., employ translation words in the target language to deduce senses of original words identified in the translation procedure. To evaluate BDTD, we conduct an extensive empirical study with millions of sentences under three popular translators, including Google Translate and Bing Microsoft Translator. The experimental results indicate that BDTD can identify a considerable number of WSD bugs with high accuracy, more than 80%, under all three translators.
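
A minimal sketch of the back-deduction idea, assuming a tiny hand-written sense dictionary; real systems would derive such mappings from bilingual resources.

```python
# Sketch of back deduction: map the translation of a polyseme back to the sense
# the translator must have used, then compare with the sense the context implies.
# The miniature sense dictionary and example are illustrative only.
SENSES = {  # (source polyseme, target-language word) -> deduced sense
    ("bank", "银行"): "financial institution",
    ("bank", "河岸"): "river side",
}

def check_wsd(polyseme, translated_word, expected_sense):
    deduced = SENSES.get((polyseme, translated_word))
    return deduced == expected_sense  # False signals a potential WSD bug

# "He sat on the bank of the river" -> a translation using 银行 would be a WSD bug.
print(check_wsd("bank", "银行", expected_sense="river side"))  # False -> bug
print(check_wsd("bank", "河岸", expected_sense="river side"))  # True  -> ok
```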

DOI: 10.1145/3597926.3598081


Reproduction package for “DyCL: Dynamic Neural Network Compilation Via Program Rewriting and Graph Optimization”

作者: Chen, Simin and Wei, Shiyi and Liu, Cong and Yang, Wei
关键词: Deep Learning Compiler, Dynamic Neural Networks, Static analysis.

Abstract

This artifact contains three parts: (1) the dynamic neural networks used in our evaluation, (2) the core implementation of DyCL, and (3) the script to automatically launch the experiments and test the compilation results.

DOI: 10.1145/3597926.3598082


Systematically Producing Test Orders to Detect Order-Dependent Flaky Tests

作者: Li, Chengpeng and Khosravi, M. Mahdi and Lam, Wing and Shi, August
关键词: flaky test detection, order-dependent flaky test

Abstract

Software testing suffers from the presence of flaky tests, which can pass or fail when run on the same version of code. Order-dependent tests (OD tests) are flaky tests whose outcome depends on the order in which they are run. An OD test can be detected if specific tests are run or not run before it, resulting in a difference in test outcome. While prior work has proposed rerunning tests in different random test orders, this approach does not provide guarantees toward detecting all OD tests. Later work that proposed a more systematic approach to ordering tests still fails to account for the relationships between all tests in the test suite.
We propose three new techniques to detect OD tests through a more systematic means of producing test orders. Our techniques build upon prior work in Tuscan squares to cover test pairs in a minimal set of test orders while also obeying the constraints of how tests can be positioned in a test order w.r.t. their test classes. Further, as there are many test pairs that need to be covered, we develop a technique that can take a specified set of test pairs to cover and produce test orders that aim to cover just those test pairs. Our evaluation with 289 known OD tests across 47 test suites from open-source projects shows that our most cost-effective technique can detect 97.2% of the known OD tests with 104.7 test orders, on average, per subject. While all techniques produce a relatively large number of test orders, our analysis of the minimal set of test orders needed to detect OD tests shows a tremendous reduction in the test orders needed to detect OD tests – representing an opportunity for future work to prioritize test orders.
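
A simplified greedy sketch of the underlying coverage objective, covering every ordered test pair with few test orders; it omits the paper's Tuscan-square construction and test-class placement constraints, and enumerating permutations only scales to toy suites.

```python
# Simplified greedy sketch of covering every ordered test pair (A before B) with
# few test orders. The paper's Tuscan-square construction and test-class
# constraints are omitted; this only illustrates the coverage objective.
from itertools import permutations

def greedy_orders(tests):
    uncovered = {(a, b) for a in tests for b in tests if a != b}
    orders = []
    while uncovered:
        # Pick the permutation covering the most still-uncovered ordered pairs.
        best = max(permutations(tests),
                   key=lambda p: sum((p[i], p[j]) in uncovered
                                     for i in range(len(p)) for j in range(i + 1, len(p))))
        orders.append(best)
        uncovered -= {(best[i], best[j])
                      for i in range(len(best)) for j in range(i + 1, len(best))}
    return orders

for order in greedy_orders(["t1", "t2", "t3", "t4"]):
    print(order)
```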

DOI: 10.1145/3597926.3598083


Security Checking of Trigger-Action-Programming Smart Home Integrations

作者: Bu, Lei and Zhang, Qiuping and Li, Suwan and Dai, Jinglin and Bai, Guangdong and Chen, Kai and Li, Xuandong
关键词: IFTTT, IoT, Model Checking, Security Modeling and Verification

Abstract

Internet of Things (IoT) has become prevalent in various fields, especially in the context of home automation (HA). To better control HA-IoT devices, and especially to integrate several devices for rich smart functionalities, trigger-action programming, such as If This Then That (IFTTT), has become a popular paradigm. Leveraging it, novice users can easily specify their intent in applets regarding how to control a device/service through another once a specific condition is met. Nevertheless, users may design IFTTT-style integrations inappropriately, due to a lack of security experience or unawareness of the security impact of cyber-attacks against individual devices. This has caused financial loss, privacy leakage, unauthorized access, and other security issues. To address these problems, this work proposes a systematic framework named MEDIC to model smart home integrations and check their security. It automatically generates models incorporating the service/device behaviors and the action rules of the applets, while taking into consideration external attacks and in-device vulnerabilities. Our approach takes around one second to complete the modeling and checking of one integration. We carried out experiments on 200 integrations created from a user study and a dataset crawled from ifttt.com. To our surprise, nearly 83% of these integrations have security issues.
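
A minimal sketch of the kind of check such a framework performs, assuming applets modeled as guarded state updates and a breadth-first search for a reachable unsafe state; the applets, events, and unsafe predicate are invented and far simpler than MEDIC's models.

```python
# Sketch of checking a trigger-action integration: applets are transitions over a
# device-state vector, and a breadth-first search looks for a reachable unsafe
# state. The applets, events and unsafe predicate are invented examples.
from collections import deque

APPLETS = [  # (trigger event, action) pairs, IFTTT-style
    ("motion_detected", {"camera": "on"}),
    ("owner_left_home", {"door": "locked", "camera": "off"}),   # risky: disables camera
    ("temperature_high", {"window": "open"}),
]

def unsafe(state):
    return state["door"] == "locked" and state["camera"] == "off" and state["window"] == "open"

def reachable_unsafe(initial, depth=4):
    queue, seen = deque([(initial, [])]), set()
    while queue:
        state, trace = queue.popleft()
        if unsafe(state):
            return trace                      # sequence of events leading to the issue
        key = tuple(sorted(state.items()))
        if len(trace) >= depth or key in seen:
            continue
        seen.add(key)
        for event, action in APPLETS:
            queue.append(({**state, **action}, trace + [event]))
    return None

print(reachable_unsafe({"door": "unlocked", "camera": "on", "window": "closed"}))
```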

DOI: 10.1145/3597926.3598084


LiResolver: License Incompatibility Resolution for Open Source Software

作者: Xu, Sihan and Gao, Ya and Fan, Lingling and Li, Linyu and Cai, Xiangrui and Liu, Zheli
关键词: License, License Incompatibility Resolution, Open Source Software

Abstract

Open source software (OSS) licenses regulate the conditions under which OSS can be legally reused, distributed, and modified. However, a common issue arises when incorporating third-party OSS accompanied by licenses, i.e., license incompatibility, which occurs when multiple licenses exist in one project and there are conflicts between them. Despite being problematic, fixing license incompatibility issues requires substantial effort due to the lack of license understanding and complex package dependencies. In this paper, we propose LiResolver, a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software. Specifically, it first understands the semantics of licenses through fine-grained entity extraction and relation extraction. Then, it detects and resolves license incompatibility issues by recommending official licenses in priority. When no official licenses can satisfy the constraints, it generates a custom license as an alternative solution. Comprehensive experiments demonstrate the effectiveness of LiResolver, with a 4.09% false positive (FP) rate and a 0.02% false negative (FN) rate for incompatibility issue localization, and 62.61% of 230 real-world incompatible projects resolved by LiResolver. We discuss the feedback from OSS developers and the lessons learned from this work. All the datasets and the replication package of LiResolver have been made publicly available to facilitate follow-up research.

DOI: 10.1145/3597926.3598085


More Precise Regression Test Selection via Reasoning about Semantics-Modifying Changes

作者: Liu, Yu and Zhang, Jiyang and Nie, Pengyu and Gligoric, Milos and Legunsen, Owolabi
关键词: Regression test selection, change-impact analysis, regression testing, semantics-modifying changes

Abstract

Regression test selection (RTS) speeds up regression testing by only re-running tests that might be affected by code changes. Ideal RTS safely selects all affected tests and precisely selects only affected tests. But, aiming for this ideal is often slower than re-running all tests. So, recent RTS techniques use program analysis to trade precision for speed, i.e., lower regression testing time, or even use machine learning to trade safety for speed. We seek to make recent analysis-based RTS techniques more precise, to further speed up regression testing. Independent studies suggest that these techniques reached a “performance wall” in the speed-ups that they provide.
We manually inspect code changes to discover those that do not require re-running tests that are only affected by such changes. We categorize 29 kinds of changes that we find from five projects into 13 findings, 11 of which are semantics-modifying. We enhance two RTS techniques—Ekstazi and STARTS—to reason about our findings. Using 1,150 versions of 23 projects, we evaluate the impact on safety and precision of leveraging such changes. We also evaluate if our findings from a few projects can speed up regression testing in other projects. The results show that our enhancements are effective and they can generalize. On average, they result in selecting 41.7% and 31.8% fewer tests, and take 33.7% and 28.7% less time than Ekstazi and STARTS, respectively, with no loss in safety.
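
A sketch of the enhancement idea under strong simplifications: a stub classifier marks each changed class as semantics-modifying or not, and only tests depending on semantics-modifying classes are selected; the dependency map and heuristic are hypothetical, not Ekstazi's or STARTS' implementation.

```python
# Sketch of the enhancement idea: ignore changed classes whose edits cannot affect
# test outcomes (e.g., comment-only edits), then select tests that depend on the
# remaining classes. The change classifier stub and dependency map are hypothetical.
def is_semantics_modifying(diff_text):
    # Stub heuristic: treat comment-only edits as semantics-preserving.
    changed = [l[1:].strip() for l in diff_text.splitlines() if l.startswith(("+", "-"))]
    return any(l and not l.startswith("//") for l in changed)

def select_tests(changed_diffs, deps):
    """changed_diffs: {class_name: diff_text}; deps: {test_name: set(classes it uses)}."""
    impactful = {c for c, d in changed_diffs.items() if is_semantics_modifying(d)}
    return {t for t, used in deps.items() if used & impactful}

diffs = {"Util": "-// old comment\n+// new comment", "Parser": "+int limit = 10;"}
deps = {"UtilTest": {"Util"}, "ParserTest": {"Parser", "Util"}}
print(select_tests(diffs, deps))  # only ParserTest is selected
```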

DOI: 10.1145/3597926.3598086


Silent Compiler Bug De-duplication via Three-Dimensional Analysis

作者: Yang, Chen and Chen, Junjie and Fan, Xingyu and Jiang, Jiajun and Sun, Jun
关键词: Bug Deduplication, Compiler Bugs, Fuzzing

Abstract

Compiler testing is an important task for assuring the quality of compilers, but investigating test failures is very time-consuming, because many test failures are caused by the same compiler bug (known as the bug duplication problem). This problem becomes much more challenging for silent compiler bugs (also called wrong-code bugs), since these bugs provide little information for bug de-duplication (unlike crash bugs, which produce error messages). In this work, we propose a novel technique (called D3) to solve the duplication problem for silent compiler bugs. Its key insight is to characterize silent bugs from the testing process and identify three-dimensional information (i.e., test program, optimizations, and test execution) for bug de-duplication. However, there is a huge amount of bug-irrelevant detail in these three dimensions, so D3 systematically conducts causal analysis to identify bug-causal features from each dimension for more accurate bug de-duplication. Finally, D3 ranks higher the test failures that are more likely to be caused by different silent bugs, by measuring the distance among test failures based on the three-dimensional bug-causal features. Our experimental results on four datasets (including duplicate bugs of both GCC and LLVM) demonstrate the significant superiority of D3 over two state-of-the-art compiler bug de-duplication techniques, achieving average improvements of 19.36% and 51.43% in identifying unique silent compiler bugs when analyzing the same number of test failures.
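
A sketch of the ranking step, assuming each failure has already been summarized as a numeric feature vector over the three dimensions; farthest-point selection is used here as a simple stand-in for D3's distance-based ranking.

```python
# Sketch of the de-duplication ranking step: each test failure is summarized as a
# feature vector over the three dimensions (program, optimizations, execution),
# and failures farthest from those already ranked are surfaced first.
# The feature vectors below are invented; D3 derives them via causal analysis.
import numpy as np

def rank_failures(features):
    remaining = list(range(len(features)))
    ranked = [remaining.pop(0)]                       # seed with the first failure
    while remaining:
        dists = [min(np.linalg.norm(features[i] - features[j]) for j in ranked)
                 for i in remaining]
        ranked.append(remaining.pop(int(np.argmax(dists))))  # farthest-point selection
    return ranked

failures = np.array([[1.0, 0.0, 0.2],   # likely bug A
                     [0.9, 0.1, 0.3],   # duplicate of A
                     [0.0, 1.0, 0.8],   # likely bug B
                     [0.1, 0.9, 0.7]])  # duplicate of B
print(rank_failures(failures))  # the two distinct bugs (0 and 2) rank before their duplicates
```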

DOI: 10.1145/3597926.3598087


ACETest: Automated Constraint Extraction for Testing Deep Learning Operators

作者: Shi, Jingyi and Xiao, Yang and Li, Yuekang and Li, Yeting and Yu, Dongsong and Yu, Chendong and Su, Hui and Chen, Yufeng and Huo, Wei
关键词: Constraint Extraction, Deep Learning Library Testing, Symbolic Execution, Test Generation

Abstract

Deep learning (DL) applications are prevalent nowadays as they can help with multiple tasks. DL libraries are essential for building DL applications, and DL operators are the important building blocks of these libraries: they compute on multi-dimensional data (tensors). Therefore, bugs in DL operators can have great impact. Testing is a practical approach for detecting bugs in DL operators. In order to test DL operators effectively, it is essential that test cases pass the input validity check and are able to reach the core function logic of the operators. Hence, extracting the input validation constraints is required for generating high-quality test cases. Existing techniques rely on either human effort or the documentation of DL library APIs to extract the constraints. They cannot extract complex constraints, and the extracted constraints may differ from the actual code implementation.
To address the challenge, we propose ACETest, a technique to automatically extract input validation constraints from the code to build valid yet diverse test cases which can effectively unveil bugs in the core function logic of DL operators. For this purpose, ACETest can automatically identify the input validation code in DL operators, extract the related constraints and generate test cases according to the constraints. The experimental results on popular DL libraries, TensorFlow and PyTorch, demonstrate that ACETest can extract constraints with higher quality than state-of-the-art (SOTA) techniques. Moreover, ACETest is capable of extracting 96.4% more constraints and detecting 1.95 to 55 times more bugs than SOTA techniques. In total, we have used ACETest to detect 108 previously unknown bugs on TensorFlow and PyTorch, with 87 of them confirmed by the developers. Lastly, five of the bugs were assigned with CVE IDs due to their security impacts.
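
A sketch of the final test-generation step, assuming constraints have already been extracted; Z3 is used to produce a tensor shape that satisfies a few invented validation constraints.

```python
# Sketch of the last step: once input-validation constraints have been extracted,
# a solver generates concrete tensor shapes that pass the checks and reach the
# operator's core logic. The constraints below are invented, not extracted ones.
from z3 import Ints, Solver, sat

rank, dim0, dim1 = Ints("rank dim0 dim1")
s = Solver()
s.add(rank == 2)                      # e.g., the operator validates a 2-D input
s.add(dim0 > 0, dim1 > 0)             # positive dimensions
s.add(dim0 * dim1 <= 1_000_000)       # size bound to keep the test cheap
s.add(dim0 % 8 == 0)                  # e.g., an alignment check in the validation code

if s.check() == sat:
    m = s.model()
    shape = (m[dim0].as_long(), m[dim1].as_long())
    print("valid test shape:", shape)  # feed a random tensor of this shape to the operator
```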

DOI: 10.1145/3597926.3598088


Reproduction Package for Article “DDLDroid: Efficiently Detecting Data Loss Issues in Android Apps”

作者: Zhou, Yuhao and Song, Wei
关键词: Android apps, bug detection, data flow analysis, data loss

Abstract

DDLDroid is a static analyzer for detecting data loss issues in Android apps during activity restart or app relaunch. It is bootstrapped by a saving-restoring bipartite graph which correlates variables that need saving to those that need restoring according to their carrier widgets, and is based on the analysis of saving and restoring data flows. It reports data loss issues once missed or broken data flows are identified.

Based on a set of available tools (e.g., Soot, FlowDroid, ApkTool), DDLDroid is implemented in Java and has three analyzers: pretreatment analyzer, static analyzer, and data loss reporter.

DOI: 10.1145/3597926.3598089


Reproduction Package for An Empirical Study of Mutation Testing Kills

作者: Du, Hang and Palepu, Vijay Krishna and Jones, James A.
关键词: empirical study, mutant detection, mutation testing, test failure classification

Abstract

This project provides the experimental replication setup and source code for “To Kill a Mutant: An Empirical Study of Mutation Testing Kills”. The artifact’s data structure, the general experiment setup, and detailed instructions are provided.

DOI: 10.1145/3597926.3598090


作者: Fang, Pengcheng and Zou, Zhenhua and Xiao, Xusheng and Liu, Zhuotao
关键词: Natural Language Processing, Program Synthesis, Smart Contracts

Abstract

Embracing software-driven smart contracts to fulfill legal agreements is a promising direction for digital transformation in the legal sector. Existing solutions mostly consider smart contracts as simple add-ons, without leveraging the programmability of smart contracts to realize complex semantics of legal agreements. In this paper, we propose iSyn, the first end-to-end system that synthesizes smart contracts to fulfill the semantics of financial legal agreements, with minimal human interventions. The design of iSyn centers around a novel intermediate representation (SmartIR) that closes the gap between the natural language sentences and smart contract statements. Specifically, iSyn includes a synergistic pipeline that unifies multiple NLP-techniques to accurately construct SmartIR instances given legal agreements, and performs template-based synthesis based on the SmartIR instances to synthesize smart contracts. We also design a validation framework to verify the correctness and detect known vulnerabilities of the synthesized smart contracts.We evaluate iSyn using legal agreements centering around financial transactions. The results show that iSyn-synthesized smart contracts are syntactically similar and semantically correct (or within a few edits), compared with the “ground truth” smart contracts manually developed by inspecting the legal agreements.

DOI: 10.1145/3597926.3598091


RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring

作者: Liu, Hao and Wang, Yanlin and Wei, Zhao and Xu, Yong and Wang, Juhong and Li, Hui and Ji, Rongrong
关键词: bag-of-tokens loss, contrastive learning, language modeling, rename refactoring

Abstract

Refactoring is an indispensable practice of improving the quality and maintainability of source code in software evolution. Rename refactoring is the most frequently performed refactoring that suggests a new name for an identifier to enhance readability when the identifier is poorly named. However, most existing works only identify renaming activities between two versions of source code, while few works express concern about how to suggest a new name. In this paper, we study automatic rename refactoring on variable names, which is considered more challenging than other rename refactoring activities. We first point out the connections between rename refactoring and various prevalent learning paradigms and the difference between rename refactoring and general text generation in natural language processing. Based on our observations, we propose RefBERT, a two-stage pre-trained framework for rename refactoring on variable names. RefBERT first predicts the number of sub-tokens in the new name and then generates sub-tokens accordingly. Several techniques, including constrained masked language modeling, contrastive learning, and the bag-of-tokens loss, are incorporated into RefBERT to tailor it for automatic rename refactoring on variable names. Through extensive experiments on our constructed refactoring datasets, we show that the generated variable names of RefBERT are more accurate and meaningful than those produced by the existing method. Our implementation and data are available at https://github.com/KDEGroup/RefBERT.

DOI: 10.1145/3597926.3598092


CoopHance: Cooperative Enhancement for Robustness of Deep Learning Systems

作者: Zhang, Quan and Tian, Yongqiang and Ding, Yifeng and Li, Shanshan and Sun, Chengnian and Jiang, Yu and Sun, Jiaguang
关键词: Deep Learning System, Robustness Enhancement

Abstract

Adversarial attacks have become a threat to be reckoned with for Deep Learning (DL) systems. By adding human-imperceptible perturbations to benign inputs, adversarial attacks can cause DL systems to behave incorrectly. Considering the popularity of DL systems in industry, it is critical and urgent for developers to enhance the robustness of DL systems against adversarial attacks.

In this study, we propose a novel enhancement technique for DL systems, namely CoopHance. CoopHance leverages two specifically customized components, Regulator and Inspector, to cooperatively enhance the DL systems’ robustness against adversarial examples with different distortions. Regulator can purify adversarial examples with low or moderate distortions, while Inspector is responsible for detecting these adversarial examples with high distortion by capturing the abnormal status of DL systems. Our evaluation using various attacks shows that, on average, CoopHance can successfully resist 90.62% and 96.56% of the adversarial examples that are generated for the unprotected systems on CIFAR-10 and SVHN datasets separately, which is 188.14% more effective than five state-of-the-art enhancement techniques, including Feature Squeeze, LID, SOAP, Adversarial Training, and MagNet. Meanwhile, when attackers generate new adversarial examples on the enhanced systems, CoopHance can reject 78.06% of attacks, which outperforms the best of five enhancement techniques by 82.71% on average.

DOI: 10.1145/3597926.3598093


Reproduction Package for Article “ROME: Testing Image Captioning Systems via Recursive Object Melting”

作者: Yu, Boxi and Zhong, Zhiqing and Li, Jiaqi and Yang, Yixing and He, Shilin and He, Pinjia
关键词: AI software, image captioning, Metamorphic testing, testing

Abstract

This artifact is the package for ROME, along with video tutorials for reproducing the experiments. In Tutorial_1, we demonstrate how to perform object selection, image mutation, and fine-tuning.

DOI: 10.1145/3597926.3598094


GPUHarbor: Testing GPU Memory Consistency At Large (Experience Paper): Artifact

作者: Levine, Reese and Cho, Mingun and McKee, Devon and Quinn, Andrew and Sorensen, Tyler
关键词: GPUs, memory consistency, mutation testing

Abstract

Artifact for the ISSTA 2023 paper “GPUHarbor: Testing GPU Memory Consistency At Large (Experience Paper)”, containing the data used in the paper as well as the tools we used to collect and analyze the data.

DOI: 10.1145/3597926.3598095


COME: Commit Message Generation with Modification Embedding

作者: He, Yichen and Wang, Liran and Wang, Kaiyi and Zhang, Yupeng and Zhang, Hang and Li, Zhoujun
关键词: Automatic Commit Message Generation, Contextualized Code Change Representation Learning, Self-supervised Learning

Abstract

Commit messages concisely describe code changes in natural language and are important for program comprehension and maintenance. Previous studies proposed approaches for automatic commit message generation, but their performance is limited due to inappropriate representations of code changes and improper combination of translation-based and retrieval-based approaches. To address these problems, this paper introduces a novel framework named COME, in which modification embeddings represent code changes in a fine-grained way, a self-supervised generative task is designed to learn contextualized code change representations, and retrieval-based and translation-based methods are combined through a decision algorithm. The average improvement of COME over the state-of-the-art approaches is 9.2% on automatic evaluation metrics and 8.0% on human evaluation metrics. We also analyze the effectiveness of COME’s three main components, each of which contributes an improvement of 8.6%, 8.7%, and 5.2%, respectively.
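
A sketch of what such a decision step could look like, using difflib similarity and a threshold as placeholders for the paper's actual decision algorithm.

```python
# Sketch of a decision step combining retrieval and generation: use the retrieved
# commit message only when a sufficiently similar past code change exists, otherwise
# keep the generated message. The similarity function and threshold are placeholders.
from difflib import SequenceMatcher

def decide(code_change, corpus, generate, threshold=0.8):
    """corpus: list of (past_change, past_message); generate: fallback generator."""
    best_change, best_msg = max(
        corpus, key=lambda pair: SequenceMatcher(None, code_change, pair[0]).ratio())
    if SequenceMatcher(None, code_change, best_change).ratio() >= threshold:
        return best_msg                      # retrieval-based result
    return generate(code_change)             # translation-based result

corpus = [("- return a + b\n+ return a - b", "fix operator in add()")]
print(decide("- return x + y\n+ return x - y", corpus,
             generate=lambda change: "generated: update arithmetic in helper"))
# prints the retrieved message: fix operator in add()
```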

DOI: 10.1145/3597926.3598096


OCFI: Make Function Entry Identification Hard Again

作者: Pang, Chengbin and Zhang, Tiantai and Xu, Xuelan and Wang, Linzhang and Mao, Bing
关键词: binary disassembly, function entry detection, obfuscation

Abstract

We introduce OCFI, a modified LLVM/Clang compiler that offers the capability to obfuscate the .eh_frame section of compiled binaries. This obfuscation process aims to make it more challenging for disassemblers to identify function entries.

By leveraging OCFI, C/C++ projects can be compiled with the obfuscation feature enabled. This means that the resulting binaries will have their .eh_frame sections modified, enhancing their resistance to reverse engineering attempts and making the analysis of function boundaries more difficult for disassemblers. The application of OCFI as a compiler tool provides an additional layer of security for C/C++ projects, safeguarding sensitive code and intellectual property from potential attackers or unauthorized access.

DOI: 10.1145/3597926.3598097


Artifact Package for Article ‘Catamaran: Low-Overhead Memory Safety Enforcement via Parallel Acceleration’

作者: Zhang, Yiyu and Liu, Tianyi and Sun, Zewen and Chen, Zhe and Li, Xuandong and Zuo, Zhiqiang
关键词: memory safety enforcement, parallel acceleration, program analysis

Abstract

This artifact contains the main implementation of Catamaran, as well as the scripts used for running it. This artifact claims the availability and the functionality of Catamaran.

DOI: 10.1145/3597926.3598098


Latent Imitator: Generating Natural Individual Discriminatory Instances for Black-Box Fairness Testing

作者: Xiao, Yisong and Liu, Aishan and Li, Tianlin and Liu, Xianglong
关键词: Fairness Testing, Individual Discrimination, Latent Space, Natural Individual Discriminatory Instances

Abstract

Machine learning (ML) systems have achieved remarkable performance across a wide range of applications. However, they frequently exhibit unfair behaviors in sensitive application domains (e.g., employment and loans), raising severe fairness concerns. To evaluate and test fairness, engineers often generate individual discriminatory instances to expose unfair behaviors before model deployment. However, existing baselines ignore the naturalness of generation and produce instances that deviate from the real data distribution, which may fail to reveal the actual model fairness, since such unnatural discriminatory instances are unlikely to appear in practice. To address the problem, this paper proposes a framework named Latent Imitator (LIMI) to generate more natural individual discriminatory instances with the help of a generative adversarial network (GAN), in which we imitate the decision boundary of the target model in the semantic latent space of the GAN and further sample latent instances on it. Specifically, we first derive a surrogate linear boundary to coarsely approximate the decision boundary of the target model, which reflects the nature of the original data distribution. Subsequently, to obtain more natural instances, we move random latent vectors onto the surrogate boundary with a one-step movement, and further conduct vector calculations to probe two potential discriminatory candidates that may lie closer to the real decision boundary. Extensive experiments on various datasets demonstrate that LIMI outperforms other baselines by a large margin in effectiveness.
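
A small numeric sketch of the latent-space geometry, with an invented surrogate boundary (w, b): one-step projection onto the hyperplane, then two probe candidates along it.

```python
# Sketch of the latent-space step: move a random latent vector onto a surrogate
# linear boundary w.z + b = 0 in one step, then probe two nearby candidates.
# The surrogate (w, b), latent dimension and step size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
w = rng.normal(size=dim); b = 0.5          # surrogate boundary of the target model
z = rng.normal(size=dim)                    # random latent vector from the GAN prior

# One-step movement: orthogonal projection of z onto the hyperplane w.z + b = 0.
z_on_boundary = z - ((w @ z + b) / (w @ w)) * w

# Probe two candidates by sliding along the boundary (a direction orthogonal to w).
tangent = np.eye(dim)[0] - (w[0] / (w @ w)) * w   # component of e_0 orthogonal to w
tangent /= np.linalg.norm(tangent)
candidates = [z_on_boundary + 0.1 * tangent, z_on_boundary - 0.1 * tangent]

print(round(float(w @ z_on_boundary + b), 8))      # ~0: the vector lies on the boundary
# Each candidate would be decoded by the GAN and checked for individual discrimination.
```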

DOI: 10.1145/3597926.3598099


Reproduction Package for `Simulation-Based Validation for Autonomous Driving Systems’

作者: Li, Changwen and Sifakis, Joseph and Wang, Qiang and Yan, Rongjie and Zhang, Jian
关键词: Autonomous driving systems, Formal specification, LGSVL, Runtime verification, Simulation-based validation, Temporal logic

Abstract

The artifact provides RvADS, a simulation-based validation framework for autonomous driving systems that contains three components: 1) Simulator, 2) Scenario Generator, and 3) Monitor.

DOI: 10.1145/3597926.3598100


SimAPR framework used in “Automated Program Repair from Fuzzing Perspective”

作者: Kim, YoungJae and Han, Seungheon and Khamit, Askar Yeltayuly and Yi, Jooyong
关键词: Automated Program Repair, Fuzzing, Multi-Armed Bandit, Patch Scheduling

Abstract

This artifact contains the SimAPR framework designed to simulate the existing and new patch-scheduling algorithms of APR tools. SimAPR enables users to easily assess the efficiency of a patch-scheduling algorithm under study without the need to run APR tools. Currently, SimAPR supports six APR tools: AlphaRepair, Recoder, TBar, Avatar, FixMiner, and kPar. Furthermore, SimAPR can be expanded to include additional APR tools.

SimAPR also supports the new patch-scheduling algorithm named Casino, as presented in the paper titled “Automated Program Repair from Fuzzing Perspective.”

DOI: 10.1145/3597926.3598101


1dFuzz: Reproduce 1-Day Vulnerabilities with Directed Differential Fuzzing

作者: Yang, Songtao and He, Yubo and Chen, Kaixiang and Ma, Zheyu and Luo, Xiapu and Xie, Yong and Chen, Jianjun and Zhang, Chao
关键词: 1-day vulnerability, differential testing, directed fuzzing, patch

Abstract

1-day vulnerabilities are common in practice and have posed severe threats to end users, as adversaries can learn from released patches to find and exploit them. Reproducing 1-day vulnerabilities is also crucial for defenders, e.g., to block attack traffic against 1-day vulnerabilities. A core question that affects the effectiveness of recognizing and triggering 1-day vulnerabilities is: what is the unique feature of a security patch? After conducting a large-scale empirical study, we point out that a common and unique feature of patches is the trailing call sequence (TCS), and we present a novel directed differential fuzzing solution, 1dFuzz, to efficiently reproduce 1-day vulnerabilities in this paper. Based on the TCS feature, we present a locator, 1dLoc, able to find candidate patch locations via static analysis, a novel TCS-based distance metric for directed fuzzing, and a novel sanitizer, 1dSan, able to catch PoCs for 1-day vulnerabilities during fuzzing. We have systematically evaluated 1dFuzz on a set of real-world software vulnerabilities in 11 different settings. Results show that 1dFuzz significantly outperforms state-of-the-art (SOTA) baselines and can find up to 2.26x more 1-day vulnerabilities in 43% less time.
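
A sketch of a TCS-style distance that a directed fuzzer could minimize, assuming invented call sequences; 0 means the execution already ends with the patch's trailing call sequence.

```python
# Sketch of a trailing-call-sequence (TCS) style distance: measure how much of the
# patch's trailing call sequence the current execution already matches at its tail,
# so the fuzzer can prefer inputs that get closer. The call sequences are invented.
def tcs_distance(observed_calls, patch_tcs):
    matched = 0
    for want, got in zip(reversed(patch_tcs), reversed(observed_calls)):
        if want != got:
            break
        matched += 1
    return len(patch_tcs) - matched   # 0 means the patched location was reached

patch_tcs = ["parse_header", "check_length", "copy_payload"]
print(tcs_distance(["init", "parse_header", "check_length"], patch_tcs))                   # 3
print(tcs_distance(["init", "parse_header", "check_length", "copy_payload"], patch_tcs))   # 0
```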

DOI: 10.1145/3597926.3598102


A Bayesian Framework for Automated Debugging

作者: Kang, Sungmin and Choi, Wonkeun and Yoo, Shin
关键词: automated debugging, automated program repair, bayesian statistics, fault localization

Abstract

Debugging takes up a significant portion of developer time. As a result, automated debugging techniques, including Fault Localization (FL) and Automated Program Repair (APR), have garnered significant attention due to their potential to aid developers in debugging tasks. With the recent advance of techniques that treat the two tasks as closely coupled, such as Unified Debugging, a framework to formally express these two tasks together would heighten our understanding of automated debugging and provide a way to formally analyze techniques and approaches. To this end, we propose a Bayesian framework for understanding automated debugging. We find that the Bayesian framework, along with a concrete statement of the objective of automated debugging, can recover maximal fault localization formulae from prior work, as well as analyze existing APR techniques and their underlying assumptions.
As a means of empirically demonstrating our framework, we further propose BAPP, a Bayesian Patch Prioritization technique that incorporates intermediate program values to analyze likely patch locations and repair actions, with its core equations being derived by our Bayesian framework. We find that incorporating program values allows BAPP to identify correct patches more precisely: the rankings produced by BAPP reduced the number of required patch evaluations by 68% and consequently reduced the repair time by 34 minutes on average. Further, our Bayesian framework suggests a number of changes to the way fault localization information is used in program repair, which we validate is useful for BAPP. These results highlight the potential of value-cognizant automated debugging techniques, and further verifies our theoretical framework.
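
A toy Bayesian update in the spirit of the framework: the posterior that a candidate patch is correct after it passes previously failing tests, with made-up prior and likelihoods.

```python
# Toy Bayesian update in the spirit of the framework: the posterior that a patch
# is correct is updated after observing that it passes the previously failing tests.
# The prior and likelihoods below are made up for illustration.
def posterior_correct(prior, n_passing_tests,
                      p_pass_given_correct=0.99, p_pass_given_incorrect=0.30):
    p_obs_correct = p_pass_given_correct ** n_passing_tests
    p_obs_incorrect = p_pass_given_incorrect ** n_passing_tests
    evidence = prior * p_obs_correct + (1 - prior) * p_obs_incorrect
    return prior * p_obs_correct / evidence

print(round(posterior_correct(prior=0.05, n_passing_tests=1), 3))  # ~0.148
print(round(posterior_correct(prior=0.05, n_passing_tests=5), 3))  # ~0.954
```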

DOI: 10.1145/3597926.3598103


Artifact for “That’s a Tough Call: Studying the Challenges of Call Graph Construction for WebAssembly”

作者: Lehmann, Daniel and Thalakottur, Michelle and Tip, Frank and Pradel, Michael
关键词: call graphs, dataset, WebAssembly

Abstract

This artifact contains supplementary material for the paper “That’s a Tough Call: On Static Call Graph Construction for WebAssembly Binaries” (ISSTA’23).

DOI: 10.1145/3597926.3598104


Artifact for Paper “GenCoG: A DSL-Based Approach to Generating Computation Graphs for TVM Testing”

作者: Wang, Zihan and Nie, Pengbo and Miao, Xinyuan and Chen, Yuting and Wan, Chengcheng and Bu, Lei and Zhao, Jianjun
关键词: Computation Graph Generation, Constraint Solving, Deep Learning Compiler

Abstract

This is the artifact for the ISSTA ’23 paper “GenCoG: A DSL-Based Approach to Generating Computation Graphs for TVM Testing”. This artifact contains the implementation of GenCoG, the adapted versions or reimplementation of the baselines, and the bug-triggering cases.

DOI: 10.1145/3597926.3598105


Alligator in Vest: A Practical Failure-Diagnosis Framework via Arm Hardware Features

作者: Zhang, Yiming and Hu, Yuxin and Li, Haonan and Shi, Wenxuan and Ning, Zhenyu and Luo, Xiapu and Zhang, Fengwei
关键词: Debugging, ETM tracing, Failure Diagnosis

Abstract

Failure diagnosis in practical systems is difficult, and the main obstacle is that the information a developer has access to is limited. This information is usually not enough to help developers fix or even locate the related bug. Moreover, due to the vast difference between the development and production environments, it is not trivial to reproduce failures from the production environment in the development environment. When failures are caused by non-deterministic events such as race conditions or unforeseen inputs, reproducing them is even more challenging.

In this paper, we present Investigator, a failure diagnosis framework for practical systems running on Arm. At runtime, Investigator leverages the hardware tracing component called Embedded Trace Macrocell (ETM) and a lightweight event capturer to collect information with low overhead. With the collected trace and analysis, Investigator identifies the control and data flow related to the cause of a failure, which helps developers in bug fixing. We implemented a prototype of Investigator and evaluated it with real-world bugs. The results show that Investigator diagnoses these bugs effectively and efficiently while introducing a low performance overhead at runtime.

DOI: 10.1145/3597926.3598106


Mu2: Guiding Greybox Fuzzing with Mutation Testing (Artifact)

作者: Vikram, Vasudev and Laybourn, Isabella and Li, Ao and Nair, Nicole and OBrien, Kelton and Sanna, Rafaello and Padhye, Rohan
关键词: fuzz testing, mutation testing, test generation

Abstract

This artifact accompanies the paper “Guiding Greybox Fuzzing with Mutation Testing”, published at ISSTA 2023. It contains a replication package for experiments and evaluation data used to generate the figures in the paper. The evaluation data contains logs of the fuzzing experiments described in the paper.

DOI: 10.1145/3597926.3598107


Testing Automated Driving Systems by Breaking Many Laws Efficiently

作者: Zhang, Xiaodong and Zhao, Wei and Sun, Yang and Sun, Jun and Shen, Yulong and Dong, Xuewen and Yang, Zijiang
关键词: Automated Driving System, Baidu Apollo, Generative Flow Network, Testing Scenario Generation, Traffic Laws

Abstract

An automated driving system (ADS), as the brain of an autonomous vehicle (AV), should be tested thoroughly ahead of deployment. An ADS must satisfy a complex set of rules to ensure road safety, e.g., existing traffic laws and possibly future laws dedicated to AVs. To comprehensively test an ADS, we would like to systematically discover diverse scenarios in which certain traffic laws are violated. The challenge is that (1) there are many traffic laws (e.g., 13 testable articles in Chinese traffic laws and 16 testable articles in Singapore traffic laws, with 81 and 43 violation situations, respectively); and (2) many traffic laws are only relevant in complicated, specific scenarios.

Existing approaches to testing ADSs either focus on simple oracles, such as no-collision, or have limited capacity for generating diverse law-violating scenarios. In this work, we propose ABLE, a new ADS testing method inspired by the success of GFlowNet, which Aims to Break many Laws Efficiently by generating diverse scenarios. Different from vanilla GFlowNet, ABLE drives the testing process with dynamically updated testing objectives (based on the robustness semantics of signal temporal logic) as well as active learning, so as to effectively explore the vast search space. We evaluate ABLE based on Apollo and LGSVL, and the results show that ABLE outperforms the state of the art by violating 17% and 25% more laws when testing Apollo 6.0 and Apollo 7.0, respectively, most of which are hard-to-violate laws.
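
A minimal sketch of the robustness-as-objective idea for one simple rule, "always speed ≤ limit", over a synthetic trace; negative robustness signals a violation that the generator would steer toward.

```python
# Sketch of the robustness semantics used as a testing objective: for the rule
# "always (speed <= limit)", robustness over a trace is min(limit - speed), and a
# negative value means the law is violated. The trace below is synthetic.
import numpy as np

def robustness_always_leq(signal, limit):
    return float(np.min(limit - signal))

speed_trace = np.array([48.0, 52.0, 55.0, 61.0, 58.0])  # km/h over simulation steps
print(robustness_always_leq(speed_trace, limit=60.0))   # -1.0 -> speed-limit law violated
# A search-based generator minimizes this value to steer scenarios toward violations.
```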

DOI: 10.1145/3597926.3598108


DeepAtash: Focused Test Generation for Deep Learning systems

作者: Zohdinasab, Tahereh and Riccio, Vincenzo and Tonella, Paolo
关键词: deep learning, search based software engineering, software testing

Abstract

The source code and the data of the article “DeepAtash: Focused Test Generation for Deep Learning systems”

DOI: 10.1145/3597926.3598109


SBDT: Search-Based Differential Testing of Certificate Parsers in SSL/TLS Implementations

作者: Chen, Chu and Ren, Pinghong and Duan, Zhenhua and Tian, Cong and Lu, Xu and Yu, Bin
关键词: SSL/TLS, certificate parser, differential testing, search, syntax tree model

Abstract

Certificate parsers, which are critical components of Secure Sockets Layer or Transport Layer Security (SSL/TLS) implementations, parse incomprehensible certificates into comprehensible inputs for certificate validators and humans. Thus, certificate parsers profoundly affect the decision-making of validators and humans, which in turn affects security. To guarantee the correctness of certificate parsers, we put forward an approach for search-based differential testing of certificate parsers, namely SBDT. SBDT begins by modeling certificate structures, mutation operations, and bounds. Based on the initial model, SBDT searches for the most promising model node and mutation operator that trigger discrepancies, and generates a certificate from the node and operator it finds. Then, SBDT feeds the certificate to certificate parsers, and searches for multiple types of discrepancies after normalizing the results output by the parsers. Distinct discrepancies are employed as feedback to update and prune the model. SBDT starts the next iteration from the updated and pruned model, unless all nodes and mutation operators have been pruned due to reaching their upper bounds. Our work makes the following contributions: (1) To the best of our knowledge, this is the first time that testing of certificate parsers has been clearly distinguished from testing of certificate validators, which will facilitate accurate testing of certificate parsers and validators; (2) SBDT is the first systematic and efficient approach for differential testing of certificate parsers by searching, updating, and pruning models; and (3) we have implemented an open-source prototype tool of SBDT, and experimental results show that SBDT is effective and efficient in finding new bugs and enhancements of certificate parsers.

DOI: 10.1145/3597926.3598110


SmartState: Detecting State-Reverting Vulnerabilities in Smart Contracts via Fine-Grained State-Dependency Analysis

作者: Liao, Zeqin and Hao, Sicheng and Nan, Yuhong and Zheng, Zibin
关键词: bug finding, smart contract, state dependency, static analysis

Abstract

Smart contracts written in Solidity are widely used in different blockchain platforms such as Ethereum, TRON and BNB Chain. One of the unique designs in Solidity smart contracts is its state-reverting mechanism for error handling and access control. Unfortunately, a number of recent security incidents showed that adversaries also utilize this mechanism to manipulate critical states of smart contracts, and hence bring security consequences such as illegal profit-gain and Denial-of-Service (DoS). In this paper, we refer to such vulnerabilities as State-Reverting Vulnerabilities (SRVs). Automatically identifying SRVs poses unique challenges, as it requires an in-depth analysis and understanding of the state-dependency relations in smart contracts.

This paper presents SmartState, a new framework for detecting state-reverting vulnerabilities in Solidity smart contracts via fine-grained state-dependency analysis. SmartState integrates a set of novel mechanisms to ensure its effectiveness. Particularly, SmartState extracts state dependencies from both contract bytecode and historical transactions; both are critical for inferring dependencies related to SRVs. Further, SmartState models the generic patterns of SRVs (i.e., profit-gain and DoS) as SRV indicators, and hence effectively identifies SRVs based on the constructed state-dependency graph. To evaluate SmartState, we manually annotated a ground-truth dataset which contains 91 real-world SRVs. Evaluation results show that SmartState achieves a precision of 87.23% and a recall of 89.13%. In addition, SmartState successfully identifies 406 new SRVs from 47,351 real-world smart contracts; 11 of these SRVs are from popular smart contracts with high transaction amounts (i.e., top 2000). Overall, our reported SRVs affect digital assets worth a total of 428,600 USD.

DOI: 10.1145/3597926.3598111


Artifact for wTest

作者: Hu, Jiajun and Wei, Lili and Liu, Yepang and Cheung, Shing-Chi
关键词: Android, Testing, WebView

Abstract

This is the artifact for wTest.

DOI: 10.1145/3597926.3598112


ModelObfuscator: Obfuscating Model Information to Protect Deployed ML-Based Systems

作者: Zhou, Mingyi and Gao, Xiang and Wu, Jing and Grundy, John and Chen, Xiao and Chen, Chunyang and Li, Li
关键词: AI safety, SE for AI, model deployment, model obfuscation

Abstract

More and more edge devices and mobile apps are leveraging deep learning (DL) capabilities. Deploying such models on devices – referred to as on-device models – rather than as remote cloud-hosted services has gained popularity because it avoids transmitting users’ data off the device and achieves fast response times. However, on-device models can be easily attacked, as they can be accessed by unpacking the corresponding apps, and the model is fully exposed to attackers. Recent studies show that attackers can easily generate white-box-like attacks for an on-device model or even invert its training data. To protect on-device models from white-box attacks, we propose a novel technique called model obfuscation. Specifically, model obfuscation hides and obfuscates the key information – structure, parameters and attributes – of models by renaming, parameter encapsulation, neural structure obfuscation, shortcut injection, and extra layer injection. We have developed a prototype tool, ModelObfuscator, to automatically obfuscate on-device TFLite models. Our experiments show that the proposed approach can dramatically improve model security by significantly increasing the difficulty of parsing models’ inner information, without increasing the latency of DL models. Our proposed on-device model obfuscation has the potential to become a fundamental technique for on-device model deployment. Our prototype tool is publicly available at https://github.com/zhoumingyi/ModelObfuscator.

DOI: 10.1145/3597926.3598113


[Supplementary material] AGORA: Automated Generation of Test Oracles for REST APIs

作者: Alonso, Juan C. and Segura, Sergio and Ruiz-Cortés, Antonio
关键词: automated testing, invariant detection, REST APIs, test oracle

Abstract

In order to enable reproducibility of the results reported in this paper, we provide supplementary material containing the source code of the scripts and projects developed, videos explaining how to use the provided software, the data generated in our experiments, bug reports with the corresponding responses from the developers, as well as a Docker image and an Ubuntu virtual machine with all the projects configured. With these resources, we aim to provide a robust foundation for replicating and validating our findings.

To use the most up-to-date version of AGORA, please refer to the official GitHub repository: https://github.com/isa-group/Beet

DOI: 10.1145/3597926.3598114


Replication Package for ‘Fuzzing Embedded Systems using Debugger Interfaces’

作者: Eisele, Max and Ebert, Daniel and Huth, Christopher and Zeller, Andreas
关键词: embedded fuzzing, embedded systems, fuzzing, gdb, ghidra, hardware breakpoint

Abstract

The idea of GDBFuzz is to leverage hardware breakpoints of microcontrollers as feedback for coverage-guided fuzzing. To this end, GDB is used as a generic interface to enable broad applicability, and Ghidra is used for binary analysis of the firmware. The code contains a benchmark setup for evaluating the method, and example firmware files are included. The replication package allows users to reproduce and extend the results reported in the paper.

DOI: 10.1145/3597926.3598115


Splendor: Static Detection of Stored XSS in Modern Web Applications

作者: Su, He and Li, Feng and Xu, Lili and Hu, Wenbo and Sun, Yujie and Sun, Qing and Chao, Huina and Huo, Wei
关键词: Static Taint Analysis, Stored XSS, Vulnerability Detection, Web Application Security

Abstract

In modern websites, stored Cross-Site Scripting (XSS) is the most dangerous XSS vulnerability, as its payloads can be stored in the web system and triggered directly by the victim. The database (DB), as the most commonly used storage medium for website data, is therefore also the most common place where stored XSS occurs. Due to the modularity of modern programming architectures, complex underlying database operations are often encapsulated and abstracted as a Data Access Layer (DAL) to provide unified data access services to the business layer. The heavy use of Object-Oriented (OO) and dynamic language features involved in this encapsulation makes it increasingly challenging for static taint analysis tools to understand how tainted data flows between the source code and the exact locations in the database. In this paper, we propose the first static analysis framework for detecting stored XSS in modern web applications using a DAL, and implement a prototype, Splendor, for PHP code analysis. The highlight of the framework is a heuristic but precise token-matching method to locate the flows of tainted data between the database and the source code. The precisions of the identified DB read and write (R/W) locations are 91.3% and 82.6%, respectively. With the identified R/W locations, the disconnected taint paths can be statically stitched to obtain a complete taint propagation path for stored XSS. Comparisons with existing works on 5 real-world applications and large-scale experiments on PHP web applications from GitHub show that Splendor significantly outperforms both the state-of-the-art static and dynamic approaches to stored-XSS detection, and detects 17 zero-day vulnerabilities.
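
A much simplified sketch of the token-matching idea: link a DB write and a DB read when their (table, column) tokens coincide, so two partial taint paths can be stitched; the SQL strings and regexes are illustrative only.

```python
# Simplified sketch of the token-matching idea: a DB write and a DB read are linked
# when their (table, column) tokens match, which lets two partial taint paths be
# stitched into one stored-XSS path. The SQL strings below are invented examples.
import re

def db_location(sql):
    write = re.search(r"insert\s+into\s+(\w+)\s*\(([^)]*)\)", sql, re.I)
    if write:
        table, cols = write.group(1), [c.strip() for c in write.group(2).split(",")]
        return {(table.lower(), c.lower()) for c in cols}
    read = re.search(r"select\s+(.+?)\s+from\s+(\w+)", sql, re.I)
    if read:
        cols, table = [c.strip() for c in read.group(1).split(",")], read.group(2)
        return {(table.lower(), c.lower()) for c in cols}
    return set()

write_sql = "INSERT INTO comments (author, body) VALUES (?, ?)"   # sink of taint path 1
read_sql = "SELECT body FROM comments"                            # source of taint path 2
if db_location(write_sql) & db_location(read_sql):
    print("stitch: tainted write and read touch the same DB location (comments.body)")
```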

DOI: 10.1145/3597926.3598116


Applying and Extending the Delta Debugging Algorithm for Elevator Dispatching Algorithms (Experience Paper)

作者: Valle, Pablo and Arrieta, Aitor and Arratibel, Maite
关键词: Cyber-Physical Systems, Delta Debugging, Simulation-based Testing

Abstract

Elevator systems are a kind of Cyber-Physical System (CPS), and as such, their test cases are usually complex and long-running. This is mainly because realistic test scenarios are employed (e.g., for testing elevator dispatching algorithms, typically a full day of passengers traveling through a system of elevators is used). However, in such a context, when a failure needs to be reproduced, it is highly beneficial to provide the minimal test input to the software developers. This way, analyzing and localizing the root cause of the failure is easier and more agile. Delta debugging has been found to be an efficient technique for reducing failure-inducing test inputs. In this paper, we enhance this technique by first monitoring the environment in which the CPS operates as well as its physical states. With the monitored information, we search for stable states of the CPS during the execution of the simulation. In a second step, we use the identified stable states to help the delta debugging algorithm isolate the failure-inducing test inputs more efficiently.

We report our experience of applying our approach to an industrial elevator dispatching algorithm. An empirical evaluation carried out with real operational data from a real elevator installation suggests that the proposed environment-wise delta debugging algorithm is between 1.3 and 1.8 times faster than traditional delta debugging, while producing a larger reduction in the failure-inducing test inputs. The results provided by the different delta debugging algorithm versions we implemented are qualitatively assessed with domain experts. This assessment provides new insights and lessons learned, such as potential applications of the delta debugging algorithm beyond debugging.
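
For reference, a compact version of the classical ddmin loop the approach builds on, applied to a day of passenger calls with a stub failure oracle; the paper's contribution of aligning partitions with monitored stable states is not reproduced here.

```python
# Classical delta-debugging (ddmin) loop that the paper builds on, applied to a
# day's worth of passenger calls. The failure oracle below is a stub; the paper's
# enhancement would align the chunk boundaries with monitored stable states.
def fails(calls):
    # Stub oracle: pretend the dispatcher fails whenever these two calls co-occur.
    return {"07:59 floor3->floor8", "08:01 floor8->floor1"} <= set(calls)

def ddmin(calls, n=2):
    while len(calls) >= 2:
        chunk = max(1, len(calls) // n)
        subsets = [calls[i:i + chunk] for i in range(0, len(calls), chunk)]
        for i, subset in enumerate(subsets):
            complement = [c for j, s in enumerate(subsets) if j != i for c in s]
            if fails(subset):                 # reduce to the failing subset
                calls, n = subset, 2
                break
            if fails(complement):             # reduce to the failing complement
                calls, n = complement, max(n - 1, 2)
                break
        else:
            if n >= len(calls):
                break
            n = min(len(calls), 2 * n)        # increase granularity
    return calls

day = [f"{h:02d}:{m:02d} floor{m % 9}->floor{(m + 5) % 9}"
       for h in (7, 8) for m in range(0, 60, 7)]
day += ["07:59 floor3->floor8", "08:01 floor8->floor1"]
print(ddmin(day))  # minimal failure-inducing pair of calls
```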

DOI: 10.1145/3597926.3598117


Artifact for “Finding Short Slow Inputs Faster with Grammar-Based Search”

作者: Alsaeed, Ziyad and Young, Michal
关键词: Input Generation, MCTS, Performance Analysis

Abstract

SlackLine and TreeLine are two grammar-based performance fuzzing tools that find short strings which trigger worst-case performance.

The artifact contains TreeLine and SlackLine, each in its own compressed file with a dedicated README. These are the ISSTA’23 snapshots of the projects, related to the ISSTA’23 publication titled “Finding Short Slow Inputs Faster with Grammar-Based Search.” The up-to-date versions can be found in the dedicated GitHub repository of each project:

TreeLine: https://github.com/uo-se-research/treeline
SlackLine: https://github.com/uo-se-research/slackline

In addition, each tool has its own Docker repository:

TreeLine: https://hub.docker.com/r/zalsaeed/treeline
SlackLine: https://hub.docker.com/r/zalsaeed/slackline

We also share all the experimental data in a compressed file (slightly larger than 7 GB when uncompressed):

https://doi.org/10.6084/m9.figshare.22114373.v1

DOI: 10.1145/3597926.3598118


Reproduction package of “Transforming Test Suites Into Croissants”

作者: Chen, Yang and Yildiz, Alperen and Marinov, Darko and Jabbarvand, Reyhaneh
关键词: Fault Injection, Mutation Testing, Software Testing, Test Flakiness

Abstract

This package includes all data and code to reproduce the results for paper “Transforming Test Suites Into Croissants”.

DOI: 10.1145/3597926.3598119


Tai-e: A Developer-Friendly Static Analysis Framework for Java by Harnessing the Good Designs of Classics (Artifact)

作者: Tan, Tian and Li, Yue
关键词: Java, static analysis

Abstract

This artifact is provided to reproduce the results of RQ4 in Section 6 of our companion paper, i.e., the data in: Table 1 (for pointer analysis) and Table 2 (for data flow analysis).

DOI: 10.1145/3597926.3598120


Artifact for DiEmph

作者: Xu, Xiangzhe and Feng, Shiwei and Ye, Yapeng and Shen, Guangyu and Su, Zian and Cheng, Siyuan and Tao, Guanhong and Shi, Qingkai and Zhang, Zhuo and Zhang, Xiangyu
关键词: Binary Similarity Analysis, Program Analysis, Transformer

Abstract

This repo contains the artifact for the paper “Improving Binary Code Similarity Transformer Models by Semantics-driven Instruction Deemphasis”, published at ISSTA’23. Please refer to README.md for details.

DOI: 10.1145/3597926.3598121


Data Constraint Mining for Automatic Reconciliation Scripts Generation

作者: Wang, Tianxiao and Zhi, Chen and Zhou, Xiaoqun and Wu, Jinjie and Yin, Jianwei and Deng, Shuiguang
关键词: data constraints, scripts generation, sql generation, symbolic regression

Abstract

Fund loss is an increasingly critical problem caused by the misbehavior of software, especially in fintech and e-commerce platforms. Data reconciliation is one of the most commonly used approaches to detecting and preventing fund loss by executing reconciliation scripts on data storage systems (e.g., database and cache systems). The core of reconciliation scripts is the data constraints, which can be expressed as implications with two parts: preconditions and assertions. However, due to the complexity and diversity of business logic, the construction of data constraints and reconciliation scripts usually relies heavily on business experts. To this end, we propose AutoReconciler to mine data constraints from business data and generate reconciliation scripts automatically. It can mine assertions via enhanced symbolic regression, discover preconditions via association rule mining, and generate reconciliation scripts in SQL form. We have performed extensive experiments on synthesized data. The results show that our approach outperforms the baseline by a large margin (an average improvement in precision and recall of 22.1% and 51.6%, respectively), especially for complex data constraints. Our solution has been implemented, deployed, and adopted in production, and we further conducted several case studies to confirm the benefits of our solution in industrial scenarios.
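
To make the precondition/assertion structure concrete, here is a minimal sketch of how a mined constraint could be rendered as a reconciliation check in SQL; the table and column names are hypothetical, and this only illustrates the general idea, not AutoReconciler's actual generator.

```python
def reconciliation_sql(table, precondition, assertion):
    """Render a mined data constraint (precondition -> assertion) as a
    reconciliation script: the query returns the rows violating it."""
    return (f"SELECT * FROM {table} "
            f"WHERE ({precondition}) AND NOT ({assertion});")

# Hypothetical constraint: paid orders must satisfy refund + income == total.
print(reconciliation_sql(
    table="orders",
    precondition="status = 'PAID'",
    assertion="refund_amount + income_amount = total_amount",
))
```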

DOI: 10.1145/3597926.3598122


Guided Retraining to Enhance the Detection of Difficult Android Malware

作者: Daoudi, Nadia and Allix, Kevin and Bissyandé, Tegawendé F.
关键词: Android, difficult samples, malware, retraining

Abstract

The popularity of Android OS has made it an appealing target for malware developers. To evade detection, including by ML-based techniques, attackers invest in creating malware that closely resembles legitimate apps, challenging the state of the art with difficult-to-detect samples. In this paper, we propose Guided Retraining, a supervised representation-learning method for boosting the performance of malware detectors. To that end, we first split the experimental dataset into subsets of “easy” and “difficult” samples, where difficulty is associated with the prediction probabilities yielded by a malware detector. For the subset of “easy” samples, the base malware detector is used to make the final predictions, since the error rate on that subset is low by construction. Our work targets the second subset containing “difficult” samples, for which the probabilities indicate that the classifier is not confident in its predictions and the error rate is high. We apply our Guided Retraining method on these difficult samples to improve their classification. Guided Retraining leverages the correct predictions and the errors made by the base malware detector to guide the retraining process. It learns new embeddings of the difficult samples using Supervised Contrastive Learning and trains an auxiliary classifier for the final predictions. We validate our method on four state-of-the-art Android malware detection approaches using over 265k malware and benign apps. Experimental results show that Guided Retraining can boost state-of-the-art detectors by eliminating up to 45.19% of the prediction errors that they make on difficult samples. We note furthermore that our method is generic and designed to enhance the performance of binary classifiers for other tasks beyond Android malware detection.
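
The easy/difficult split driven by prediction probabilities could look roughly like the sketch below; the thresholds and the `predict_proba` callable are illustrative assumptions, not the paper's exact criterion.

```python
def split_by_difficulty(samples, predict_proba, low=0.3, high=0.7):
    """Split samples into 'easy' and 'difficult' subsets using the base
    detector's prediction probability; thresholds here are illustrative."""
    easy, difficult = [], []
    for x in samples:
        p = predict_proba(x)   # assumed callable: probability that x is malware
        (easy if p <= low or p >= high else difficult).append(x)
    return easy, difficult

# `difficult` would then be re-embedded with supervised contrastive learning
# and classified by an auxiliary model, while `easy` keeps the base detector.
```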

DOI: 10.1145/3597926.3598123


DeFiTainter: Detecting Price Manipulation Vulnerabilities in DeFi Protocols

作者: Kong, Queping and Chen, Jiachi and Wang, Yanlin and Jiang, Zigui and Zheng, Zibin
关键词: smart contract, taint analysis, vulnerability detection

Abstract

DeFi protocols are programs that manage high-value digital assets on blockchain. The price manipulation vulnerability is one of the common vulnerabilities in DeFi protocols, which allows attackers to gain excessive profits by manipulating token prices. In this paper, we propose DeFiTainter, an inter-contract taint analysis framework for detecting price manipulation vulnerabilities. DeFiTainter features two innovative mechanisms to ensure its effectiveness. The first mechanism is to construct a call graph for inter-contract taint analysis by restoring call information, not only from code constants but also from contract storage and function parameters. The second mechanism is a high-level semantic induction tailored for detecting price manipulation vulnerabilities, which accurately identifies taint sources and sinks and tracks taint data across contracts. Extensive evaluation of real-world incidents and high-value DeFi protocols shows that DeFiTainter outperforms existing approaches and achieves state-of-the-art performance with a precision of 96% and a recall of 91.3% in detecting price manipulation vulnerabilities. Furthermore, DeFiTainter uncovers three previously undisclosed price manipulation vulnerabilities.

DOI: 10.1145/3597926.3598124


Beyond “Protected” and “Private”: An Empirical Security Analysis of Custom Function Modifiers in Smart Contracts

作者: Fang, Yuzhou and Wu, Daoyuan and Yi, Xiao and Wang, Shuai and Chen, Yufan and Chen, Mengjie and Liu, Yang and Jiang, Lingxiao
关键词: Access Control, Modifiers, Smart Contract Security, Taint Analysis

Abstract

A smart contract is a piece of application-layer code running on blockchain ledgers and it provides programmatic logic via transaction-based execution of pre-defined functions. Smart contract functions are by default invokable by any party. To safeguard them, the mainstream smart contract language, i.e., Solidity of the popular Ethereum blockchain, proposed a unique language-level keyword called “modifier,” which allows developers to define custom function access control policies beyond the traditional “protected” and “private” modifiers in classic programming languages.

In this paper, we aim to conduct a large-scale security analysis of the modifiers used in real-world Ethereum smart contracts. To achieve this, we design and implement a novel smart contract analysis tool called SoMo. Its main objective is to identify insecure modifiers that can be bypassed from one or more unprotected smart contract functions. This is challenging because of the complicated relationship between modifiers and their variables/functions and the ambiguity of attacker-accessible entry functions. To overcome them, we first propose a new structure, the Modifier Dependency Graph (MDG), to connect all the modifier-related control/data flows. Over MDGs, we then model system variables, generate symbolic path constraints, and iteratively test each candidate entry function. Our extensive evaluation shows that SoMo outperforms the state-of-the-art SPCon tool by detecting all its true positives and correctly avoiding 9 out of 11 false positives. It also achieves high precision of 91.2% when analyzing a large dataset of 62,464 contracts, over 400 of which were identified with bypassable modifiers. Our analysis further reveals three interesting security findings about modifiers and nine major types of modifier usage in the wild. SoMo has been integrated into an online security scanning service, MetaScan.

DOI: 10.1145/3597926.3598125


Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing

作者: Lau, Julia Kaiwen and Kong, Kelvin Kai Wen and Yong, Julian Hao and Tan, Per Hoong and Yang, Zhou and Yong, Zi Qian and Low, Joshua Chern Wey and Chong, Chun Yong and Lim, Mei Kuan and Lo, David
关键词: Automated Speech Recognition, False Alarms, Software Testing

Abstract

Recent studies have proposed the use of Text-To-Speech (TTS) systems to automatically synthesise speech test cases at scale and uncover a large number of failures in ASR systems. However, the failures uncovered by synthetic test cases may not reflect the actual performance of an ASR system when it transcribes human audio, which we refer to as false alarms. Given a failed test case synthesised from TTS systems, which consists of TTS-generated audio and the corresponding ground truth text, we feed the human audio stating the same text to an ASR system. If the human audio can be correctly transcribed, an instance of a false alarm is detected. In this study, we investigate false alarm occurrences in five popular ASR systems using synthetic audio generated from four TTS systems and human audio obtained from two commonly used datasets. Our results show that the fewest false alarms are identified when testing Deepspeech, and the most when testing Wav2vec2. On average, false alarm rates range from 21% to 34% across the five ASR systems. Among the TTS systems used, Google TTS produces the fewest false alarms (17%), and Espeak TTS produces the most (32%). Additionally, we build a false alarm estimator that flags potential false alarms, which achieves promising results: a precision of 98.3%, a recall of 96.4%, an accuracy of 98.5%, and an F1 score of 97.3%. Our study provides insight into the appropriate selection of TTS systems to generate high-quality speech to test ASR systems. Additionally, a false alarm estimator can be a way to minimise the impact of false alarms and help developers choose suitable test inputs when evaluating ASR systems. The source code used in this paper is publicly available on GitHub at https://github.com/julianyonghao/FAinASRtest.
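
The core false-alarm check, as described, boils down to re-transcribing the same ground-truth text from human audio. A minimal sketch is below; the `asr` callable and the strict WER == 0 criterion are our simplifying assumptions, not the paper's exact setup.

```python
def wer(reference, hypothesis):
    """Word error rate via edit distance over whitespace-separated tokens."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1])) # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def is_false_alarm(asr, tts_audio, human_audio, ground_truth):
    """A TTS-triggered failure is flagged as a false alarm if the same ASR
    transcribes a human utterance of the same text correctly."""
    failed_on_tts = wer(ground_truth, asr(tts_audio)) > 0
    fine_on_human = wer(ground_truth, asr(human_audio)) == 0
    return failed_on_tts and fine_on_human
```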

DOI: 10.1145/3597926.3598126


DualApp

作者: Xue, Zhiyi and Liu, Si and Zhang, Zhaodi and Wu, Yiting and Zhang, Min
关键词: Deep Neural Network, DualApp, Over approximation, Robustness Verification, Under approximation

Abstract

DualApp is a prototype tool for the robustness verification of neural networks. It is the official implementation of the paper “A Tale of Two Approximations: Tightening Over-Approximation for DNN Robustness Verification via Under-Approximation”. In this project, we propose a dual-approximation approach to tighten over-approximations, leveraging an activation function’s underestimated domain to define tight approximation bounds. We assess it on a comprehensive benchmark of DNNs with different architectures. Our experimental results show that DualApp significantly outperforms the state-of-the-art approaches on the verified robustness ratio and the certified lower bound.

DOI: 10.1145/3597926.3598127


SlipCover: Near Zero-Overhead Code Coverage for Python

作者: Altmayer Pizzorno, Juan and Berger, Emery D.
关键词: Code Coverage, Dynamic Code Instrumentation, Python, Testing

Abstract

Coverage analysis is widely used but can suffer from high overhead. This overhead is especially acute in the context of Python, which is already notoriously slow (a recent study observes a roughly 30x slowdown vs. native code). We find that the state-of-the-art coverage tool for Python, coverage.py, introduces a median overhead of 180% with the standard Python interpreter. Slowdowns are even more extreme when using PyPy, a JIT-compiled Python implementation, with coverage.py imposing a median overhead of 1,300%. This performance degradation reduces the utility of coverage analysis in most use cases, including testing and fuzzing, and precludes its use in deployment.
This paper presents SlipCover, a novel, near-zero overhead coverage analyzer for Python. SlipCover works without modifications to either the Python interpreter or PyPy. It first processes a program’s AST to accurately identify all branches and lines. SlipCover then dynamically rewrites Python bytecodes to add lightweight instrumentation to each identified branch and line. At run time, SlipCover periodically de-instruments already-covered lines and branches. The result is extremely low overheads – a median of just 5% – making SlipCover suitable for use in deployment. We show its efficiency can translate to significant increases in the speed of coverage-based clients. As a proof of concept, we integrate SlipCover into TPBT, a targeted property-based testing system, and observe a 22x speedup.
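
SlipCover itself rewrites bytecode, but the de-instrumentation idea can be illustrated with a much cruder `sys.settrace`-based sketch: stop tracing a function once it has stopped contributing new coverage. This is only an illustration of the concept under our own simplifying assumptions; it is nowhere near zero-overhead and is not SlipCover's implementation.

```python
import sys
from collections import defaultdict

covered = defaultdict(set)    # code object -> line numbers seen so far
new_hits = defaultdict(int)   # code object -> new lines found in the last traced call

def _line_tracer(frame, event, arg):
    if event == "line":
        code = frame.f_code
        if frame.f_lineno not in covered[code]:
            covered[code].add(frame.f_lineno)
            new_hits[code] += 1
    return _line_tracer

def _call_tracer(frame, event, arg):
    if event != "call":
        return None
    code = frame.f_code
    # Crude "de-instrumentation": skip functions that stopped yielding new lines.
    if covered[code] and new_hits[code] == 0:
        return None
    new_hits[code] = 0
    return _line_tracer

def run_with_coverage(fn, *args, **kwargs):
    """Run fn under the toy tracer and leave the covered-lines map in `covered`."""
    sys.settrace(_call_tracer)
    try:
        return fn(*args, **kwargs)
    finally:
        sys.settrace(None)
```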

DOI: 10.1145/3597926.3598128


Systematic Testing of the Data-Poisoning Robustness of KNN

作者: Li, Yannan and Wang, Jingbo and Wang, Chao
关键词: Abstract Interpretation, Certification, Data Poisoning, Nearest Neighbors, Robustness, Testing

Abstract

Data poisoning aims to compromise a machine learning based software component by contaminating its training set to change its prediction results for test inputs. Existing methods for deciding data-poisoning robustness have either poor accuracy or long running time and, more importantly, they can only certify some of the truly-robust cases, but remain inconclusive when certification fails. In other words, they cannot falsify the truly-non-robust cases. To overcome this limitation, we propose a systematic testing based method, which can falsify as well as certify data-poisoning robustness for a widely used supervised-learning technique named k-nearest neighbors (KNN). Our method is faster and more accurate than the baseline enumeration method, due to a novel over-approximate analysis in the abstract domain, to quickly narrow down the search space, and systematic testing in the concrete domain, to find the actual violations. We have evaluated our method on a set of supervised-learning datasets. Our results show that the method significantly outperforms state-of-the-art techniques, and can decide data-poisoning robustness of KNN prediction results for most of the test inputs.
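
For context, the baseline that the paper improves on is essentially brute-force enumeration over possible poisoned-point removals. Below is a minimal sketch of that falsification baseline (not the paper's abstract-domain analysis); the distance metric and data layout are our own assumptions.

```python
from collections import Counter
from itertools import combinations

def knn_predict(train, k, x):
    """train: list of (features, label) pairs; returns the majority label
    among the k nearest neighbours of x (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def falsify_robustness(train, k, x, max_removed=1):
    """Brute-force baseline: search for a set of at most `max_removed` training
    points whose removal flips the KNN prediction for x."""
    base = knn_predict(train, k, x)
    for r in range(1, max_removed + 1):
        for removed in combinations(range(len(train)), r):
            gone = set(removed)
            reduced = [p for i, p in enumerate(train) if i not in gone]
            if knn_predict(reduced, k, x) != base:
                return removed            # concrete non-robustness witness
    return None                           # robust up to max_removed removals
```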

DOI: 10.1145/3597926.3598129


Artifact of GrayC: Greybox Fuzzing of Compilers and Analysers for C

作者: Even-Mendoza, Karine and Sharma, Arindam and Donaldson, Alastair F. and Cadar, Cristian
关键词: Artifact, Bug Reports, Clang, code mutators, compilers, Frama-C, Fuzzing, GCC, GrayC, Greybox fuzzing, LibFuzzer, LLVM, MSVC, program analysers

Abstract

This is the official artifact of the paper: GrayC: Greybox Fuzzing of Compilers and Analysers for C (ISSTA 2023).

The artifact contains the data for the bug reports and the raw data for the whole paper, including the evaluation in Sections 4 and 5. In addition, we include all the test-program sets generated with the tools in the evaluation as 10-sets-of-test-programs-tool-name.zip.

Note 1: This work was supported by EPSRC (EP/R011605/1 and EP/R006865/1). Note 2: The first two authors contributed equally to this research. Note 3: Karine Even-Mendoza: a major part of this work was done while she was an Imperial College London employee.

DOI: 10.1145/3597926.3598130


Artifact for “Enhancing REST API Testing with NLP Techniques”

作者: Kim, Myeongsoo and Corradini, Davide and Sinha, Saurabh and Orso, Alessandro and Pasqua, Michele and Tzoref-Brill, Rachel and Ceccato, Mariano
关键词: Automated REST API Testing, Natural Language Processing for Testing, OpenAPI Specification Analysis

Abstract

This artifact includes the NLP2REST tool and the experimental data necessary for both replicating and extending our work. For the most recent version of the tool, we recommend visiting the following link: https://github.com/codingsoo/nlp2rest.

DOI: 10.1145/3597926.3598131


Reproduction package for article “Automated Generation of Security-Centric Descriptions for Smart Contract Bytecode”

作者: Pan, Yu and Xu, Zhichao and Li, Levi Taiji and Yang, Yunhe and Zhang, Mu
关键词: decentralized apps, natural language generation, program analysis, smart contracts, textual description

Abstract

This package reproduces the experiments in the paper “Automated Generation of Security-Centric Descriptions for Smart Contract Bytecode”.

DOI: 10.1145/3597926.3598132


Toward Automated Detecting Unanticipated Price Feed in Smart Contract

作者: Mo, Yifan and Chen, Jiachi and Wang, Yanlin and Zheng, Zibin
关键词: DeFi, Formal Verification, Price Oracle, Smart Contract

Abstract

Decentralized finance (DeFi) based on smart contracts has reached a total value locked (TVL) of over USD 200 billion in 2022. In DeFi ecosystems, price oracles play a critical role in providing real-time price feeds for cryptocurrencies to ensure accurate asset pricing in smart contracts. However, price oracles also face security issues, including the possibility of unanticipated price feeds, which can lead to imbalances in debt and assets in a DeFi protocol. Unfortunately, existing solutions cannot effectively combine transactions and code for real-time monitoring of price oracles.

To address this limitation, we first categorize price oracles as either DON oracles, DEX oracles, or internal oracles based on trusted parties, and analyze their security risks, data sources, price duration, and query fees. Then, we propose VeriOracle, a formal verification framework for the automated detection of unanticipated price feeds in smart contracts. VeriOracle can deploy a formal semantic model of the price oracle on the blockchain to detect the status of smart contracts and identify unanticipated price feed transactions in real time. We apply VeriOracle to verify over 500,000 transactions of 13 vulnerable DeFi protocols in the real world. The experimental results show that (1) VeriOracle is effective and it can detect unanticipated price feeds before DeFi attacks (33,714 blocks ahead of the attacker in the best case); (2) VeriOracle is efficient in that its verification time (about 4s) is less than the block time of Ethereum (about 14s), which means VeriOracle can detect unsafe transactions in real time; and (3) VeriOracle is extendable for verifying defense strategies. Attacks using unanticipated price feeds can only succeed in particular smart contract states. VeriOracle can verify which smart contract states can defend against attacks.

DOI: 10.1145/3597926.3598133


Replication package for the ISSTA2023 paper:  Virtual Reality (VR) Automated Testing in the Wild: a Case Study on Unity-Based VR Applications

作者: Rzig, Dhia Elhaq and Iqbal, Nafees and Attisano, Isabella and Qin, Xue and Hassan, Foyzul
关键词: Software Testing and verification, Test smells., Virtual Reality (VR) environment

Abstract

This package contains the dataset and the code we used within this research work.

The list of 314 VR projects we used is in the file "VR_Project_List.txt", which contains the git URLs for all the projects we attempted to download.

The source code of the tool we developed to generate the data used in this paper is available under the Source Code folder. The project is compatible with the IntelliJ and Eclipse IDEs (we specifically recommend IntelliJ). It requires Java 8 or newer and comes with all the external libraries needed for its execution. If any problems are encountered with the external libraries, Maven can be used to re-download any missing libraries on the system attempting to execute the code.

DOI: 10.1145/3597926.3598134


How Effective Are Neural Networks for Fixing Security Vulnerabilities

作者: Wu, Yi and Jiang, Nan and Pham, Hung Viet and Lutellier, Thibaud and Davis, Jordan and Tan, Lin and Babkin, Petr and Shah, Sameena
关键词: AI and Software Engineering, Automated Program Repair, Language Model, Vulnerability

Abstract

Security vulnerability repair is a difficult task that is in dire need of automation. Two groups of techniques have shown promise: (1) large code language models (LLMs) that have been pre-trained on source code for tasks such as code completion, and (2) automated program repair (APR) techniques that use deep learning (DL) models to automatically fix software bugs. This paper is the first to study and compare the Java vulnerability repair capabilities of LLMs and DL-based APR models. The contributions include that we (1) apply and evaluate five LLMs (Codex, CodeGen, CodeT5, PLBART and InCoder), four fine-tuned LLMs, and four DL-based APR techniques on two real-world Java vulnerability benchmarks (Vul4J and VJBench), (2) design code transformations to address the training and test data overlapping threat to Codex, (3) create a new Java vulnerability repair benchmark VJBench, and its transformed version VJBench-trans, to better evaluate LLMs and APR techniques, and (4) evaluate LLMs and APR techniques on the transformed vulnerabilities in VJBench-trans. Our findings include that (1) existing LLMs and APR models fix very few Java vulnerabilities. Codex fixes 10.2 vulnerabilities (20.4%), the most of all evaluated models, and many of the generated patches are uncompilable. (2) Fine-tuning with general APR data improves LLMs’ vulnerability-fixing capabilities. (3) Our new VJBench reveals that LLMs and APR models fail to fix many Common Weakness Enumeration (CWE) types, such as CWE-325 Missing cryptographic step and CWE-444 HTTP request smuggling. (4) Codex still fixes 8.7 transformed vulnerabilities, outperforming all the other LLMs and APR models on transformed vulnerabilities. The results call for innovations to enhance automated Java vulnerability repair, such as creating larger vulnerability repair training data, tuning LLMs with such data, and applying code simplification transformations to facilitate vulnerability repair.

DOI: 10.1145/3597926.3598135


Rare Path Guided Fuzzing

作者: Saha, Seemanta and Sarker, Laboni and Shafiuzzaman, Md and Shou, Chaofan and Li, Albert and Sankaran, Ganesh and Bultan, Tevfik
关键词: Concolic execution, Control flow analysis, Fuzz testing, Model counting, Probabilistic analysis

Abstract

Starting with a random initial seed, fuzzers search for inputs that trigger bugs or vulnerabilities. However, fuzzers often fail to generate inputs for program paths guarded by restrictive branch conditions. In this paper, we show that by first identifying rare paths in programs (i.e., program paths with path constraints that are unlikely to be satisfied by random input generation), and then generating inputs/seeds that trigger those rare paths, one can improve the coverage of fuzzing tools. In particular, we present techniques that 1) identify rare paths using quantitative symbolic analysis, and 2) generate inputs that can explore these rare paths using path-guided concolic execution. We provide these inputs as initial seed sets to three state-of-the-art fuzzers. Our experimental evaluation on a set of programs shows that the fuzzers achieve better coverage with the rare-path-based seed sets compared to random initial seeds.
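
As a toy illustration of what quantitative path rarity can mean, the sketch below scores paths by the probability that a uniformly random input satisfies all of their (here, interval-shaped) branch constraints. The constraint shape and numbers are invented for illustration and are far simpler than the paper's model-counting-based analysis.

```python
from functools import reduce

def branch_probability(lo, hi, domain=(0, 2**32 - 1)):
    """Probability that a uniformly random 32-bit input lands in [lo, hi]."""
    size = domain[1] - domain[0] + 1
    satisfied = max(0, min(hi, domain[1]) - max(lo, domain[0]) + 1)
    return satisfied / size

def path_rarity(constraints):
    """A path is modelled as a list of independent interval constraints, one
    per branch; rarer paths have a smaller probability under random inputs."""
    return reduce(lambda acc, c: acc * branch_probability(*c), constraints, 1.0)

paths = {
    "common": [(0, 2**31)],                          # about half of all inputs
    "rare":   [(1_000_000, 1_000_100), (0, 10)],     # two tiny target intervals
}
# Rarest paths first: these are the ones handed to path-guided concolic
# execution to produce the initial fuzzing seeds.
print(sorted(paths, key=lambda name: path_rarity(paths[name])))
```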

DOI: 10.1145/3597926.3598136


CGuard: Scalable and Precise Object Bounds Protection for C

作者: Kedia, Piyus and Purandare, Rahul and Agarwal, Udit and Rishabh
关键词: Buffer overflow, Spatial safety

Abstract

A tool to detect spatial safety bugs in C programs.

DOI: 10.1145/3597926.3598137


An Empirical Study of Functional Bugs in Android Apps

作者: Xiong, Yiheng and Xu, Mengqian and Su, Ting and Sun, Jingling and Wang, Jue and Wen, He and Pu, Geguang and He, Jifeng and Su, Zhendong
关键词: Android, Empirical study, Non-crashing functional bugs, Testing

Abstract

Android apps are ubiquitous and serve many aspects of our daily lives. Ensuring their functional correctness is crucial for their success. To date, we still lack a general and in-depth understanding of functional bugs, which hinders the development of practices and techniques to tackle functional bugs. To fill this gap, we conduct the first systematic study on 399 functional bugs from 8 popular open-source and representative Android apps to investigate the root causes, bug symptoms, test oracles, and the capabilities and limitations of existing testing techniques. This study took us substantial effort. It reveals several new interesting findings and implications which help shed light on future research on tackling functional bugs. Furthermore, findings from our study guided the design of a proof-of-concept differential testing tool, RegDroid, to automatically find functional bugs in Android apps. We applied RegDroid on 5 real-world popular apps, and successfully discovered 14 functional bugs, 10 of which were previously unknown and affected the latest released versions—all these 10 bugs have been confirmed and fixed by the app developers. Specifically, 10 out of these 14 found bugs cannot be found by existing testing techniques. We have made all the artifacts (including the dataset of 399 functional bugs and RegDroid) in our work publicly available at https://github.com/Android-Functional-bugs-study/home.

DOI: 10.1145/3597926.3598138


NodeRT: Detecting Races in Node.js Applications Practically

作者: Zhou, Jingyao and Xu, Lei and Lu, Gongzheng and Zhang, Weifeng and Zhang, Xiangyu
关键词: Node.js, event-driven architecture, race detection

Abstract

Node.js has become one of the most popular development platforms due to its superior concurrency support. However, races induced by the nondeterministic execution order of event handlers may occur in Node.js applications, causing serious runtime failures. The state-of-the-art Node.js race detector NRace builds a happens-before (HB) graph before detection with a set of HB relation rules. During detection, NRace uses a heavyweight BFS-based algorithm to query reachability between resource operations, which introduces substantial overhead in practice and makes NRace inapplicable to real-world Node.js application testing. This paper proposes a more practical Node.js dynamic race detection approach called NodeRT (Node.js Race Tracker). To reduce unnecessary overhead, NodeRT simplifies the HB relation rules and divides detection into three stages: trace collection, race candidate detection, and false positive removal. In the trace collection stage, NodeRT constructs a partial HB graph called the asynchronous call tree (ACTree), enabling efficient reachability queries between event handlers. In the race candidate detection stage, NodeRT performs detection on the ACTree, which effectively eliminates most non-racing event handlers and outputs race candidates. In the false positive removal stage, NodeRT utilizes matching rules derived from the HB relation rules and features of resources to reduce false positives among the race candidates. In experiments, NodeRT detects all known races and 9 unknown harmful races in real-world applications, whereas NRace only detects 3 of the unknown harmful races, with 64× …

DOI: 10.1145/3597926.3598139


An Empirical Study on Concurrency Bugs in Interrupt-Driven Embedded Software

作者: Li, Chao and Chen, Rui and Wang, Boxiang and Wang, Zhixuan and Yu, Tingting and Jiang, Yunsong and Gu, Bin and Yang, Mengfei
关键词: concurrency bugs, embedded software, empirical study, interrupt-driven programs

Abstract

Interrupt-driven embedded software is widely used in aerospace, automotive electronics, medical equipment, IoT, and other industrial fields. This type of software is usually programmed with interrupts to interact with hardware and respond to external stimuli in a timely manner. However, the uncertain interleaved execution of interrupts may cause concurrency bugs, resulting in task failures or serious safety issues. A deep understanding of real-world concurrency bugs in embedded software will significantly improve techniques for combating concurrency bugs, such as bug detection, testing, and fixing. This paper performs the first comprehensive and large-scale empirical study of concurrency bugs in industrial interrupt-driven embedded software. A total of 132 real-world concurrency bugs in 102 industrial embedded software systems have been rigorously analyzed. Not only have the root causes, impacts, and fix strategies of the bugs been studied, but also their manifestation, including triggering scopes, racing variables, access interleaving patterns, and variable correlations. This study reveals several significant findings, which can guide future research in developing techniques and tools to combat concurrency bugs in interrupt-driven embedded software.

DOI: 10.1145/3597926.3598140


CodeGrid: A Grid Representation of Code

作者: Kaboré, Abdoul Kader
关键词: Code TypeSetting, Spatial-Aware Neural Network

Abstract

Code representation is a key step in the application of AI to software engineering. Generic NLP representations are effective but do not exploit all the rich structure inherent to code. Recent work has focused on extracting abstract syntax trees (ASTs) and integrating their structural information into code representations. These AST-enhanced representations advanced the state of the art and accelerated new applications of AI to software engineering. ASTs, however, neglect important aspects of code structure, notably control and data flow, leaving some potentially relevant code signal unexploited. For example, purely image-based representations perform nearly as well as AST-based representations, despite the fact that they must learn to even recognize tokens, let alone their semantics. This result, from prior work, is strong evidence that these new code representations can still be improved; it also raises the question of just what signal image-based approaches are exploiting. We answer this question. We show that code is spatial and exploit this fact to propose CodeGrid, a new representation that embeds tokens into a grid that preserves code layout. Unlike some of the existing state of the art, CodeGrid is agnostic to the downstream task: whether that task is generation or classification, CodeGrid can complement the learning algorithm with spatial signal. For example, we show that CNNs, which are inherently spatially-aware models, can exploit CodeGrid outputs to effectively tackle fundamental software engineering tasks, such as code classification, code clone detection and vulnerability detection. PixelCNN leverages CodeGrid’s grid representations to achieve code completion. Through extensive experiments, we validate our spatial code hypothesis, quantifying model performance as we vary the degree to which the representation preserves the grid. To demonstrate its generality, we show that CodeGrid augments existing models, improving their performance on a range of tasks. On clone detection, CodeGrid improves ASTNN’s performance by 3.3% in F1 score.
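
The core "code is spatial" idea can be sketched in a few lines: lay tokens out in a 2D grid indexed by line and column. The whitespace tokenizer and padding token below are our own simplifications; the paper's pipeline presumably uses a proper lexer and learned token embeddings.

```python
def code_to_grid(source, pad="<pad>"):
    """Embed a code snippet into a rectangular token grid that preserves its
    layout: one row per source line, one column per token position."""
    rows = [line.split() for line in source.splitlines()]
    width = max((len(row) for row in rows), default=0)
    return [row + [pad] * (width - len(row)) for row in rows]

grid = code_to_grid("def add(a, b):\n    return a + b")
# [['def', 'add(a,', 'b):', '<pad>'], ['return', 'a', '+', 'b']]
```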

DOI: 10.1145/3597926.3598141


Detecting Condition-Related Bugs with Control Flow Graph Neural Network

作者: Zhang, Jian and Wang, Xu and Zhang, Hongyu and Sun, Hailong and Liu, Xudong and Hu, Chunming and Liu, Yang
关键词: Bug detection, deep learning, graph neural network

Abstract

Automated bug detection is essential for high-quality software development and has attracted much attention over the years. Among the various bugs, previous studies show that the condition expressions are quite error-prone and the condition-related bugs are commonly found in practice. Traditional approaches to automated bug detection are usually limited to compilable code and require tedious manual effort. Recent deep learning-based work tends to learn general syntactic features based on Abstract Syntax Tree (AST) or apply the existing Graph Neural Networks over program graphs. However, AST-based neural models may miss important control flow information of source code, and existing Graph Neural Networks for bug detection tend to learn local neighbourhood structure information. Generally, the condition-related bugs are highly influenced by control flow knowledge, therefore we propose a novel CFG-based Graph Neural Network (CFGNN) to automatically detect condition-related bugs, which includes a graph-structured LSTM unit to efficiently learn the control flow knowledge and long-distance context information.
We also adopt the API-usage attention mechanism to leverage the API knowledge. To evaluate the proposed approach, we collect real-world bugs in popular GitHub repositories and build a large-scale condition-related bug dataset. The experimental results show that our proposed approach significantly outperforms the state-of-the-art methods for detecting condition-related bugs.

DOI: 10.1145/3597926.3598142


Third-Party Library Dependency for Large-Scale SCA in the C/C++ Ecosystem: How Far Are We?

作者: Jiang, Ling and Yuan, Hengchen and Tang, Qiyi and Nie, Sen and Wu, Shi and Zhang, Yuqun
关键词: Code Clone Detection, Mining Software Repositories, Software Composition Analysis

Abstract

Existing software composition analysis (SCA) techniques for the C/C++ ecosystem tend to identify the reused components through feature matching between the target software project and collected third-party libraries (TPLs). However, feature duplication caused by internal code clones can cause inaccurate SCA results. To mitigate this issue, Centris, a state-of-the-art SCA technique for the C/C++ ecosystem, was proposed to adopt function-level code clone detection to derive the TPL dependencies for eliminating the redundant features before performing SCA tasks. Although Centris has been shown effective in the original paper, the accuracy of the derived TPL dependencies was not evaluated. Additionally, the dataset for evaluating the impact of TPL dependency on SCA is limited. To further investigate the efficacy and limitations of Centris, we first construct two large-scale ground-truth datasets for evaluating the accuracy of deriving TPL dependencies and SCA results, respectively. Then we extensively evaluate Centris, and the evaluation results suggest that the accuracy of the TPL dependencies derived by Centris may not generalize well to our evaluation dataset. We further infer that the key factors degrading performance are the inaccurate function birth times and the threshold-based recall. In addition, the impact on SCA of the TPL dependencies derived by Centris can be somewhat limited. Inspired by our findings, we propose TPLite with function-level origin TPL detection and graph-based dependency recall to enhance the accuracy of TPL reuse detection in the C/C++ ecosystem. Our evaluation results indicate that TPLite effectively increases the precision from 35.71% to 88.33% and the recall from 49.44% to 62.65% of deriving TPL dependencies compared with Centris. Moreover, TPLite increases the precision from 21.08% to 75.90% and the recall from 57.62% to 64.17% compared with the SOTA academic SCA tool B2SFinder and even outperforms the well-adopted commercial SCA tool BDBA, i.e., increasing the precision from 72.46% to 75.90% and the recall from 58.55% to 64.17%.

DOI: 10.1145/3597926.3598143


Green Fuzzer Benchmarking

作者: Ounjai, Jiradet and Wüstholz, Valentin
关键词: benchmarking, fuzzing, testing

Abstract

Over the last decade, fuzzing has been increasingly gaining traction due to its effectiveness in finding bugs. Nevertheless, fuzzer evaluations have been challenging during this time, mainly due to a lack of standardized benchmarking. Aiming to alleviate this issue, in 2020, Google released FuzzBench, an open-source benchmarking platform that is widely used for accurate fuzzer benchmarking.

However, a typical FuzzBench experiment takes CPU years to run. If we additionally consider that fuzzers under active development evaluate any changes empirically, benchmarking becomes prohibitive both in terms of computational resources and time. In this paper, we propose GreenBench, a greener benchmarking platform that, compared to FuzzBench, significantly speeds up fuzzer evaluations while maintaining very high accuracy.

In contrast to FuzzBench, GreenBench drastically increases the number of benchmarks while drastically decreasing the duration of fuzzing campaigns. As a result, the fuzzer rankings generated by GreenBench are almost as accurate as those by FuzzBench (with very high correlation), but GreenBench is 18 to 61 times faster. We discuss the implications of these findings for the fuzzing community.

DOI: 10.1145/3597926.3598144


Interpreters for GNN-Based Vulnerability Detection: Are We There Yet?

作者: Hu, Yutao and Wang, Suyuan and Li, Wenke and Peng, Junru and Wu, Yueming and Zou, Deqing and Jin, Hai
关键词: GNN Interpreters, Interpretation, Vulnerability Detection

Abstract

Traditional vulnerability detection methods have limitations due to their need for extensive manual labor. Automated vulnerability detection has therefore attracted research interest, especially deep learning, which has achieved remarkable results. Since graphs can convey the structural features of code better than text, graph neural network (GNN) based vulnerability detection is significantly better than text-based approaches, and GNN-based vulnerability detection approaches are becoming popular. However, GNN models are close to black boxes for security analysts, so the models cannot provide clear evidence to explain why a code sample is detected as vulnerable or secure. Many GNN interpreters have been proposed, but the explanations they provide for vulnerability detection models are highly inconsistent and unconvincing to security experts. To address these issues, we propose principled guidelines to assess the quality of interpretation approaches for GNN-based vulnerability detectors based on concerns in vulnerability detection, namely stability, robustness, and effectiveness. We conduct extensive experiments to evaluate the interpretation performance of six well-known interpreters (GNN-LRP, DeepLIFT, GradCAM, GNNExplainer, PGExplainer, and SubGraphX) on four vulnerability detectors (DeepWukong, Devign, IVDetect, and Reveal). The experimental results show that the target interpreters achieve poor performance in terms of effectiveness, stability, and robustness. For effectiveness, we find that the instance-independent methods outperform others due to their deep insight into the detection model. In terms of stability, the perturbation-based interpretation methods are more resilient to slight changes in model parameters, as they are model-agnostic. For robustness, the instance-independent approaches provide more consistent interpretation results for similar vulnerabilities.

DOI: 10.1145/3597926.3598145


Artifacts for the ISSTA 2023 Paper: An Empirical Study on the Effects of Obfuscation on Static Machine Learning-based Malicious JavaScript Detectors

作者: Ren, Kunlun and Qiang, Weizhong and Wu, Yueming and Zhou, Yi and Zou, Deqing and Jin, Hai
关键词: JavaScript obfuscation, machine learning, malicious JavaScript detector, web security

Abstract

This repository contains the evaluation scripts and the corresponding data of the ISSTA’23 paper “An Empirical Study on the Effects of Obfuscation on Static Machine Learning-Based Malicious JavaScript Detectors”. Detailed instructions:

detectors: The detectors under the folder detectors/ are the main projects evaluated in our paper: CUJO, ZOZZLE, JAST, and JSTAP. Detailed setup and usage instructions are described in the README.md of the corresponding folder.

samples: The files under the folder samples/ are the samples from a random tenth of the dataset used in our paper. Results can be obtained quickly using these samples. They will not be exactly the same as in the paper, but they are similar.

RQ1: The code under the folder RQ1/ investigates how obfuscation affects these detectors. RQ1_1_train.py trains the four detectors with unobfuscated samples. RQ1_1_test.py tests these trained detectors with unobfuscated and obfuscated samples. RQ1_2_train.py trains the detector ZOZZLE with different machine learning algorithms. RQ1_2_test.py tests these trained models with unobfuscated and obfuscated samples. RQ1_3_train.py trains the detectors on a training set with all unobfuscated benign samples and all obfuscated malicious samples, and on a training set with all obfuscated benign samples and all unobfuscated malicious samples. RQ1_3_test.py uses these detectors to detect unobfuscated benign samples, obfuscated benign samples, unobfuscated malicious samples, and obfuscated malicious samples, respectively.

RQ2: The code under the folder RQ2/ studies whether the two measures for mitigating the impact of obfuscation are effective. RQ2_1_train.py trains the four detectors with obfuscated samples. RQ2_1_test.py tests these detectors on the same type of obfuscated samples. RQ2_2_test.py tests these detectors on different types of obfuscated samples. RQ2_3.py uses the BERT variants to generate code representations of unobfuscated samples, trains the detector with these code representations, and tests the trained detectors with code representations of obfuscated samples.

RQ3: The code under the folder RQ3/ visualizes the vectors, extracts the ten most important features, and calculates the distance between different sets of vectors.

RQ4: There is no code related to RQ4 here because RQ4 consists of submitting the samples to VirusTotal.

DOI: 10.1145/3597926.3598146


Replication Package for Article “Understanding Breaking Changes in the Wild”

作者: Jayasuriya, Dhanushka and Terragni, Valerio and Dietrich, Jens and Ou, Samuel and Blincoe, Kelly
关键词: breaking changes, software dependency, software evolution, software libraries

Abstract

This artifact comprises the data used for the research and the scripts used to extract the data from GitHub repositories. We have included the dataset used for the manual analysis and the code used for the manual analysis process. Additionally, we include the transitive data we extracted for these repositories. The README.md file describes the steps needed to create the Python environment and to execute the queries on the dataset that answer each research question.

DOI: 10.1145/3597926.3598147


Replication package for article “Improving Spectrum-Based Localization of Multiple Faults by Iterative Test Suite Reduction”

作者: Callaghan, Dylan and Fischer, Bernd
关键词: Automated Debugging, Fault Localization

Abstract

This artifact contains all necessary components to replicate the experiments described in the paper “Improving Spectrum-Based Localization of Multiple Faults by Iterative Test Suite Reduction”. This includes the FLITSR tool described in the paper and additional scripts to run the full-scale experiments, as well as the two datasets used for evaluation in the paper.

DOI: 10.1145/3597926.3598148


Extracting Inline Tests from Unit Tests

作者: Liu, Yu and Nie, Pengyu and Guo, Anna and Gligoric, Milos and Legunsen, Owolabi
关键词: Inline tests, automatic test generation, unit tests

Abstract

We recently proposed inline tests for validating individual program statements; they allow developers to provide test inputs, expected outputs, and test oracles immediately after a target statement. But, existing code can have many target statements. So, automatic generation of inline tests is an important next step towards increasing their adoption. We propose ExLi, the first technique for automatically generating inline tests. ExLi extracts inline tests from unit tests; it first records all variable values at a target statement while executing unit tests. Then, ExLi uses those values as test inputs and test oracles in an initial set of generated inline tests. Target statements that are executed many times could have redundant initial inline tests. So, ExLi uses a novel coverage-then-mutants based reduction process to remove redundant inline tests. We implement ExLi for Java and use it to generate inline tests for 718 target statements in 31 open-source programs. ExLi reduces 17,273 initially generated inline tests to 905 inline tests. The final set of generated inline tests kills up to 25.1% more mutants on target statements than developer written and automatically generated unit tests. That is, ExLi generates inline tests that can improve the fault-detection capability of the test suites from which they are extracted.

DOI: 10.1145/3597926.3598149


DDLDroid: A Static Analyzer for Automatically Detecting Data Loss Issues in Android Applications

作者: Zhou, Yuhao and Song, Wei
关键词: Android apps, bug detection, data loss, taint analysis

Abstract

DDLDroid is a static analyzer for detecting data loss issues in Android apps during activity restart or app relaunch. It is bootstrapped by a saving-restoring bipartite graph which correlates variables that need saving to those that need restoring according to their carrier widgets, and is based on the analysis of saving and restoring data flows. It reports data loss issues once missed or broken data flows are identified. DDLDroid detects 302 true data loss issues out of 66 Android apps in 73 minutes, including 180 previously-unknown issues, demonstrating its effectiveness and efficiency.

DOI: 10.1145/3597926.3604916


Behaviorally Typed State Machines in TypeScript for Heterogeneous Swarms

作者: Kuhn, Roland and Darmasaputra, Alan
关键词: Distributed coordination, behavioural types, local-first software

Abstract

A heterogeneous swarm system is a distributed system where participants come and go, communication topology may change at any time, data replication is asynchronous and partial, and local agents behave differently between nodes. These systems are hard to design and reason about, mainly because we desire a particular class of behaviors to emerge from the interplay of heterogeneous individual agents. Nevertheless, mission-critical operations like manufacturing process orchestration in factories use such systems due to their uncompromising availability and resilience of computing services. This paper presents a set of TypeScript libraries to model peer-to-peer workflows as state machines, execute them using the Actyx middleware, and check the shape of these machines for conformance to a swarm protocol. The swarm protocol describes an idealized global view of the cooperation of machines of different roles. It directly corresponds to a diagram a product manager would sketch on a whiteboard; this allows for verifying that the coded state machines correctly implement the product specification. A well-formed swarm protocol also guarantees that conforming machines will achieve eventual consensus on the overall state progression even in the absence of further coordination. This tool is for developers of business logic for heterogeneous swarm systems, helping them verify that their protocols and implementations are correct. Tool repo: https://github.com/Actyx/machines

DOI: 10.1145/3597926.3604917


ECSTATIC: Automatic Configuration-Aware Testing and Debugging of Static Analysis Tools

作者: Mordahl, Austin and Soles, Dakota and Miao, Miao and Zhang, Zenong and Wei, Shiyi
关键词: configurable software, debugging, static analysis, testing

Abstract

Static analyses are powerful tools that can serve as a complement to dynamic approaches such as testing. In order to ensure generality, many static analysis tools are configurable. However, these configurations can make testing and debugging more difficult. To address this issue, we introduce a new tool, ECSTATIC, which leverages partial order relations between analysis configuration options to automatically test and debug static analyzers, even without ground truths. ECSTATIC’s results are reproducible by virtue of running within Docker containers, and ECSTATIC provides clear extension interfaces for users to add their own tools and input programs. We evaluated ECSTATIC on four popular dataflow analysis tools, and found 74 bugs in all four tools. We also found that ECSTATIC’s novel two-staged delta debugging was able to reduce real-world programs by 50%, compared to a baseline of 6%.
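
One way to picture the partial-order idea without ground truth: if one configuration value is documented to be at least as sound as another, the sounder run's findings should subsume the other's. The helper below is a minimal, hypothetical sketch of such a check (the finding identifiers and the CHA/RTA example are invented), not ECSTATIC's implementation, and the direction of the subset check depends on how the partial order is defined.

```python
def soundness_violations(findings_more_sound, findings_less_sound):
    """If configuration A is documented as at least as sound as configuration B,
    A's findings should subsume B's; anything B reports that A misses is a
    candidate soundness bug in the analyzer, with no ground truth required."""
    return sorted(set(findings_less_sound) - set(findings_more_sound))

# Hypothetical example for a call-graph option ordered as CHA >= RTA in soundness:
print(soundness_violations(
    findings_more_sound={"leak@A.java:10", "leak@B.java:22"},
    findings_less_sound={"leak@A.java:10", "leak@C.java:5"},
))  # ['leak@C.java:5'] -> reported only under the supposedly less sound setting
```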

DOI: 10.1145/3597926.3604918


RustSmith: Random Differential Compiler Testing for Rust

作者: Sharma, Mayank and Yu, Pingshi and Donaldson, Alastair F.
关键词: Compiler testing, Rust, differential testing, fuzzing, mutation testing

Abstract

We present RustSmith, the first Rust randomised program generator for end-to-end testing of Rust compilers. RustSmith generates programs that conform to the advanced type system of Rust, respecting rules related to borrowing and lifetimes, and that are guaranteed to yield a well-defined result. This makes RustSmith suitable for differential testing between compilers or across optimisation levels. By applying RustSmith to a series of versions of the official Rust compiler, rustc, we show that it can detect insidious historical bugs that evaded detection for some time. We have also used RustSmith to find previously-unknown bugs in an alternative Rust compiler implementation, mrustc. In a controlled experiment, we assess statement and mutation coverage achieved by RustSmith vs. the rustc optimisation test suite.
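
Differential compiler testing of the kind described can be scripted straightforwardly once a program generator exists. The sketch below (in Python, for illustration only) compiles one generated Rust program at two optimization levels with rustc and flags any divergence in program output. It assumes rustc is on PATH and that the generated program is deterministic and prints its result, mirroring RustSmith's well-definedness guarantee; it is a minimal harness, not RustSmith itself.

```python
import pathlib
import subprocess
import tempfile

def run_rustc(source: str, opt_level: str) -> str:
    """Compile `source` with rustc at the given -C opt-level and return the
    compiled program's stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp, "prog.rs")
        src.write_text(source)
        exe = pathlib.Path(tmp, "prog")
        subprocess.run(
            ["rustc", "-C", f"opt-level={opt_level}", "-o", str(exe), str(src)],
            check=True, capture_output=True)
        return subprocess.run([str(exe)], check=True, capture_output=True,
                              text=True).stdout

def differential_test(source: str, levels=("0", "3")) -> bool:
    """True if the program behaves identically across the given opt levels;
    any divergence is a candidate compiler bug worth reducing and reporting."""
    outputs = {lvl: run_rustc(source, lvl) for lvl in levels}
    return len(set(outputs.values())) == 1
```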

DOI: 10.1145/3597926.3604919


KeenTune: Automated Tuning Tool for Cloud Application Performance Testing and Optimization

作者: Wang, Qinglong and Wang, Runzhe and Hu, Yuxi and Shi, Xiaohai and Liu, Zheng and Ma, Tao and Song, Houbing and Shi, Heyuan
关键词: Performance testing, automated tuning, machine learning

Abstract

The performance testing and optimization of cloud applications are challenging because manual tuning of cloud computing stacks is tedious and automated tuning tools are rarely used for cloud services. To address this issue, we introduce KeenTune, an automated tuning tool designed to optimize application performance and facilitate performance testing. KeenTune is a lightweight and flexible tool that can be deployed alongside to-be-tuned applications with negligible impact on their performance. Specifically, KeenTune uses a surrogate model, which can be implemented with machine learning models, to filter out less relevant parameters for efficient tuning. Our empirical evaluation shows that KeenTune significantly enhances the throughput of Nginx web servers, with performance improvements of up to 90.43% and 117.23% in certain cases. This study highlights the benefits of using KeenTune for efficient and effective performance testing of cloud applications. The video and source code for KeenTune are provided as supplementary materials.

DOI: 10.1145/3597926.3604920


KDAlloc: The KLEE Deterministic Allocator: Deterministic Memory Allocation during Symbolic Execution and Test Case Replay

作者: Schemmel, Daniel and Büning, Julian
关键词: Symbolic Execution, memory allocation, test case replay

Abstract

The memory allocator can have an important impact on symbolic execution. Taking a user-centric view, this tool demonstration paper discusses some of the main benefits provided by KLEE’s new allocator, KDAlloc, in terms of improved deterministic execution and bug-finding capabilities. We then introduce a new replay tool for KLEE which enables native execution to integrate KDAlloc and receive the same heap addresses as during symbolic execution.

DOI: 10.1145/3597926.3604921


EDHOC-Fuzzer: An EDHOC Protocol State Fuzzer

作者: Sagonas, Konstantinos and Typaldos, Thanasis
关键词: CoAP, EDHOC, IoT protocols, OSCORE, Software security, fuzzing, model learning, model-based testing, protocol security

Abstract

EDHOC is a compact and lightweight authenticated key exchange protocol proposed by the IETF, whose design focuses on small message sizes, in order to be suitable for constrained IoT communication technologies. In this tool paper, we overview EDHOC-Fuzzer, a protocol state fuzzer for implementations of EDHOC clients and servers. It employs model learning to generate a state machine model of an EDHOC implementation, capturing its input/output behavior. This model can then be used for model-based testing, for fingerprinting, or can be analyzed for non-conformances, state machine bugs and security vulnerabilities. We overview the architecture and use of EDHOC-Fuzzer, and present some examples of models produced by the tool and our current findings.

DOI: 10.1145/3597926.3604922


Reproduction Package for Article `MetaData262: Automatic Test Suite Selection for Partial JavaScript Implementations’

作者: Ramos, Frederico and Reis, Diogo Costa and Trigo, Miguel and Morgado, António
关键词: ECMAScript, Metadata, Test262

Abstract

The overall architecture of the MetaData262 Computation Engine consists of five main modules: (1) the Frontmatter Parsing Module (M1) for parsing the frontmatter keys of each test and creating the base JSON object to which the parsed keys are to be added; (2) the Syntactic Constructs Module (M2) for computing the syntactic constructs used within each test; (3) the History Computation Module (M3) for determining the creation date and last-modified date associated with each test; (4) the Version Computation Module (M4) for determining the version of the ES standard to be associated with each test; and (5) the Built-Ins Computation Module (M5) for determining the built-in objects used within each test.

DOI: 10.1145/3597926.3604923


RobotBT: Behavior-Tree-Based Test-Case Specification for the Robot Framework

作者: Peldszus, Sven and Akopian, Noubar and Berger, Thorsten
关键词: Behavior Tree, Robot Framework, Test Case Specification

Abstract

The Robot Framework is a popular and widely used test automation framework that abstracts test case specifications toward natural language specifications. This makes it well suited for implementing high-level test cases, at least as long as the functions provided by Robot can support the intended functionality. For more complicated test cases, custom and often deeply nested functionality specifications are required, and the readability of Robot test cases tends to decrease. We present RobotBT, a library for the Robot Framework that addresses these shortcomings by adding support for specifying test cases using behavior trees. Behavior trees are a comprehensive method for specifying complex behaviors based on a control flow model that orchestrates the execution of functionality. We evaluated RobotBT on a test suite for GUI testing from G DATA CyberDefense AG and interviewed their engineers about the usability, readability, and applicability of RobotBT. Our results show that BTs improve the expressiveness and readability of Robot Framework test cases and are applicable to practical problems.

DOI: 10.1145/3597926.3604924


TreeLine and SlackLine: Grammar-Based Performance Fuzzing on Coffee Break

作者: Alsaeed, Ziyad and Young, Michal
关键词: input generation, mcts, performance analysis

Abstract

TreeLine and SlackLine are grammar-based fuzzers for quickly finding performance problems in programs driven by richly structured text that can be described by a context-free grammar. In contrast to long fuzzing campaigns for finding (mostly invalid) inputs that trigger security vulnerabilities, TreeLine and SlackLine are designed to search for performance problems in the space of valid inputs in minutes rather than hours. The TreeLine and SlackLine front-ends differ in search strategy (Monte Carlo Tree Search and derivation tree splicing, respectively) but accept the same grammar specifications and rely on a common back-end for instrumented execution. This separation of concerns should facilitate use by other researchers who wish to explore alternatives and extensions of either the front or back ends.
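
To give a feel for grammar-based performance fuzzing, the sketch below randomly derives strings from a toy context-free grammar and keeps the one with the highest execution cost per input byte. The grammar, the random derivation (instead of MCTS or tree splicing), and the stand-in target are all illustrative assumptions rather than the tools' actual machinery.

```python
import random
import time

GRAMMAR = {                       # toy grammar (hypothetical, not from the paper)
    "<expr>": [["<term>", "+", "<expr>"], ["(", "<expr>", ")"], ["<term>"]],
    "<term>": [["<num>", "*", "<term>"], ["<num>"]],
    "<num>":  [["1"], ["23"], ["456"]],
}

def derive(symbol="<expr>", depth=0, max_depth=10):
    """Random derivation from the grammar; at max depth, fall back to the last
    (non-recursive) alternative so the string stays short."""
    if symbol not in GRAMMAR:
        return symbol
    alts = GRAMMAR[symbol]
    alt = alts[-1] if depth >= max_depth else random.choice(alts)
    return "".join(derive(s, depth + 1, max_depth) for s in alt)

def target(text):
    compile(text, "<fuzz>", "eval")   # stand-in system under test: CPython's parser

def cost_per_byte(text):
    start = time.perf_counter()
    target(text)
    return (time.perf_counter() - start) / max(len(text), 1)

# Keep the input with the highest cost per byte: short strings that are slow.
best = max((derive() for _ in range(1000)), key=cost_per_byte)
print(repr(best))
```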

DOI: 10.1145/3597926.3604925


Oven: Safe and Live Communication Protocols in Scala, using Synthetic Behavioural Type Analysis

作者: Ferreira, Francisco and Jongmans, Sung-Shik
关键词: behavioural types, choreographies, multiparty session types

Abstract

We present Oven: a toolset to assure safety and liveness of communication protocols among threads in concurrent programs in Scala.

Oven is the first practical toolset that is built on top of new theoretical foundations of synthetic behavioural type analysis, recently developed by us to improve the expressiveness of existing work.

We explain Oven’s usage, summarise its design and implementation (main challenge: how to encode the new synthetic behavioural typing rules in Scala’s existing type system), and discuss a preliminary evaluation of expressiveness (the results provide first evidence that Oven is an improvement over two state-of-the-art tools).

DOI: 10.1145/3597926.3604926


SymRustC: A Hybrid Fuzzer for Rust

作者: Tuong, Frédéric
关键词: Hybrid fuzzing, Rust, concolic execution

Abstract

We present SymRustC, a hybrid fuzzer for Rust. SymRustC is hybrid in the sense that it combines fuzzing and concolic execution. SymRustC leverages an existing tool called SymCC for its concolic execution capability and another existing tool called LibAFL for its fuzzing capability. Since SymCC instruments LLVM IR (Intermediate Representation) for concolic execution and the Rust compiler uses LLVM as a backend, we integrate SymCC with the Rust compiler to instrument Rust programs for concolic execution. LibAFL provides a framework to develop a fuzzer, and we use it to develop a hybrid fuzzer that combines fuzzing and our concolic execution. We discuss our implementation as well as four case studies to demonstrate that SymRustC can generate inputs that discover errors in Rust programs.
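
The orchestration behind a hybrid fuzzer can be sketched in a few lines of Python; the toy target, the byte-flipping mutator, and the hard-coded “solver” below are assumptions standing in for LibAFL’s mutations and SymCC’s constraint solving, not SymRustC’s actual integration.

```python
import random
from typing import List, Optional, Set

def program_under_test(data: bytes) -> Set[str]:
    """Toy target: return the set of branch IDs covered by `data`."""
    covered = {"entry"}
    if len(data) > 4:
        covered.add("len>4")
        # A "magic bytes" branch: random mutation rarely hits it,
        # but a constraint solver derives the right prefix at once.
        if data[:4] == b"RUST":
            covered.add("magic")
    return covered

def mutate(seed: bytes) -> bytes:
    """Fuzzing side: cheap random byte flips and appends."""
    data = bytearray(seed)
    data[random.randrange(len(data))] ^= random.randrange(1, 256)
    if random.random() < 0.3:
        data.append(random.randrange(256))
    return bytes(data)

def concolic_step(seed: bytes) -> Optional[bytes]:
    """Concolic side (stub): 'solve' the path constraint guarding `magic`.

    A real hybrid fuzzer would replay `seed` under instrumentation and ask an
    SMT solver for bytes flipping an uncovered branch; here it is hard-coded.
    """
    return b"RUST!" if seed else None

def hybrid_fuzz(rounds: int = 200) -> Set[str]:
    corpus: List[bytes] = [b"seed"]
    coverage: Set[str] = set()
    for i in range(rounds):
        seed = random.choice(corpus)
        # Mostly fuzz; every 50th round, hand the seed to the concolic engine.
        candidate = concolic_step(seed) if i % 50 == 49 else mutate(seed)
        if candidate is None:
            continue
        new_cov = program_under_test(candidate)
        if not new_cov <= coverage:      # keep inputs that add coverage
            corpus.append(candidate)
            coverage |= new_cov
    return coverage

print(sorted(hybrid_fuzz()))
```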

DOI: 10.1145/3597926.3604927


EvoSpex: A Search-Based Tool for Postcondition Inference

作者: Molina, Facundo and Ponzio, Pablo and Aguirre, Nazareno and Frias, Marcelo F.
关键词: Oracle problem, evolutionary computation, specification inference

Abstract

Postconditions are predicates that specify the intended behavior of a program by capturing properties about the program state when the program finishes its execution. Although postconditions can help to improve many software reliability analyses, they are seldom found accompanying source code. Thus, tools that assist developers in specifying postconditions are useful. This tool demo paper presents EvoSpex, a tool based on evolutionary computation that automatically infers postconditions of Java methods. Given a target Java method and a test suite for it, our tool executes the test suite to obtain valid pre/post state pairs for the method under analysis. Then, these pairs are mutated to obtain (allegedly) invalid ones, and finally a postcondition assertion characterizing the current method behavior is produced, by using an evolutionary algorithm that searches for an assertion that is satisfied by the valid pre/post state pairs and leaves out the invalid ones. EvoSpex implements a classic genetic algorithm that explores the space of candidate postconditions over a JML-like specification language. The algorithm is guided by a fitness function that aims at precisely capturing the valid state pairs, rejecting the invalid ones, and that also favors more succinct assertions.
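
The search can be illustrated with a heavily simplified Python sketch (not EvoSpex’s implementation): candidate postconditions are scored by how many valid pre/post state pairs they accept, how many mutated ones they reject, and how succinct they are; the abs-like method, the hand-written candidate set, and the 0.01 succinctness weight are assumptions.

```python
from typing import Callable, List, Tuple

# A state pair is (pre, post); here states are just integers for illustration
# (e.g., the argument and return value of a hypothetical abs-like method).
Pair = Tuple[int, int]
Candidate = Tuple[str, Callable[[int, int], bool]]   # (pretty form, predicate)

valid_pairs: List[Pair] = [(-3, 3), (0, 0), (5, 5)]      # observed via tests
invalid_pairs: List[Pair] = [(-3, -3), (5, 4), (0, 1)]   # mutated "deviants"

candidates: List[Candidate] = [
    ("post >= 0",                 lambda pre, post: post >= 0),
    ("post == abs(pre)",          lambda pre, post: post == abs(pre)),
    ("post >= 0 and post >= pre", lambda pre, post: post >= 0 and post >= pre),
]

def fitness(cand: Candidate) -> float:
    """Reward satisfying valid pairs and rejecting invalid ones; prefer brevity."""
    text, pred = cand
    satisfied = sum(pred(pre, post) for pre, post in valid_pairs)
    rejected = sum(not pred(pre, post) for pre, post in invalid_pairs)
    return satisfied + rejected - 0.01 * len(text)   # small succinctness bonus

best = max(candidates, key=fitness)
print("inferred postcondition:", best[0])   # -> post == abs(pre)
```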

DOI: 10.1145/3597926.3604928


PExReport-Maven: Creating Pruned Executable Cross-Project Failure Reports in Maven Build System

作者: Huang, Sunzhou and Wang, Xiaoyin
关键词: cross-project failure, failure report, failure reproduction, maven, test failure

Abstract

Modern Java software development extensively depends on existing libraries written by other developer teams from the same or a different organization. When a developer executes tests, the execution trace may cross the boundaries of multiple dependencies and create cross-project failures (CPFs). A readable, executable, and concise CPF report may enable the most effective communication, but creating such a report is often challenging in Java ecosystems. We developed PExReport-Maven to automatically create such ideal CPF reports in the Maven build system. PExReport-Maven leverages the Maven build system to prune source code, dependencies, and the build environment, producing a concise stand-alone executable CPF reproduction package from the original CPF project. The reproduction package includes the source code, dependencies, and build environment necessary to reproduce the CPF, making it an ideal CPF report. We performed an evaluation on 74 software project issues with 198 cross-project failures, and the results show that PExReport can create pruned reproduction packages for 184 out of the 198 test failures in our dataset, with an average reduction of 72.97% in Java classes. A future study will be conducted based on user feedback from using this tool to report real-world CPFs.
PExReport-Maven is publicly available at https://github.com/wereHuang/PExReport-Maven. The tool demo is available on the PExReport website: https://sites.google.com/view/pexreport/home.
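
As a rough illustration of the pruning idea (only the dependency side, and not PExReport-Maven’s actual analysis), the Python sketch below keeps only the Maven artifacts whose classes appear on a failing stack trace; the class-to-artifact map, the trace, and the artifact names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical data for illustration: which Maven artifact owns which classes.
class_owner: Dict[str, str] = {
    "com.acme.app.OrderService": "com.acme:app",
    "com.acme.lib.PriceTable":   "com.acme:pricing-lib",
    "org.other.util.Cache":      "org.other:cache-utils",
}

def prune_for_failure(stack_trace: List[str], all_deps: Set[str]) -> Set[str]:
    """Keep only the dependencies whose classes appear on the failing trace.

    A real tool also prunes source files and the build environment; this
    sketch only shows the dependency side of the idea.
    """
    needed = {class_owner[frame] for frame in stack_trace if frame in class_owner}
    return all_deps & needed

trace = ["com.acme.app.OrderService", "com.acme.lib.PriceTable"]
deps = {"com.acme:app", "com.acme:pricing-lib", "org.other:cache-utils"}
print(prune_for_failure(trace, deps))   # cache-utils is dropped from the report
```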

DOI: 10.1145/3597926.3604929


Quantitative Robustness Analysis of Neural Networks

作者: Downing, Mara
关键词: Neural Network Verification, Quantitative Verification, Robustness

Abstract

Neural networks are an increasingly common tool for solving problems that require complex analysis and pattern matching, such as identifying stop signs or processing medical imagery. Accordingly, verification of neural networks for safety and correctness is of great importance, as mispredictions can have catastrophic results in safety critical domains. One metric for verification is robustness, which answers whether or not a misclassified input exists in a given input neighborhood. I am focusing my research on quantitative robustness—determining not only whether there exist misclassified inputs within a given neighborhood but also how many exist as a proportion of the neighborhood size. My overall goal is to expand the research on quantitative neural network robustness verification and create a variety of quantitative verification tools geared towards expanding our understanding of neural network robustness.
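
A Monte Carlo sketch in Python of what this proportion means is shown below; the linear toy classifier, the L-infinity neighborhood, and the sampling estimator are assumptions, whereas quantitative verification techniques compute or bound the proportion rather than sampling it.

```python
import random
from typing import List

def classify(x: List[float]) -> int:
    """Stand-in classifier: a simple linear decision rule over 2-D inputs."""
    return int(x[0] + x[1] > 1.0)

def quantitative_robustness(center: List[float], eps: float,
                            samples: int = 10_000) -> float:
    """Estimate the fraction of the L-infinity ball around `center` that keeps
    the center's label (1.0 = fully robust in the sampled neighborhood).

    Sampling only estimates the proportion; exact quantitative verification
    uses counting/verification techniques rather than Monte Carlo.
    """
    label = classify(center)
    agree = 0
    for _ in range(samples):
        perturbed = [c + random.uniform(-eps, eps) for c in center]
        agree += classify(perturbed) == label
    return agree / samples

print(quantitative_robustness(center=[0.6, 0.6], eps=0.2))
```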

DOI: 10.1145/3597926.3605231


Automatic Testing and Benchmarking for Configurable Static Analysis Tools

作者: Mordahl, Austin
关键词: benchmarking, configurable static analysis, debugging, testing

Abstract

Static analysis is an important tool for detecting bugs in real-world software. The advent of numerous analysis algorithms with their own tradeoffs has led to the proliferation of configurable static analysis tools, but their complex, undertested configuration spaces are obstacles to their widespread adoption. To improve the reliability of these tools, my research focuses on developing new approaches to automatically test and debug them. First, I describe an empirical study that helps to understand the performance and behavior of configurable taint analysis tools for Android. The findings of this study motivate the development of ECSTATIC, a framework for testing and debugging that goes beyond taint analysis to test any configurable static analysis tool. The next steps for this research involve the automatic creation of real-world benchmarks for static analysis with associated ground truths and analysis features.
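
One way such testing can work is differential testing over a documented partial order between configurations; the Python sketch below (not ECSTATIC itself) flags a configuration pair whose reports violate an assumed “deeper analysis should not lose warnings” expectation, with the analysis stub, the option name, and the seeded bug all hypothetical.

```python
from typing import Dict, FrozenSet

# Hypothetical analysis stub: maps a configuration to the set of reported warnings.
def run_analysis(program: str, config: Dict[str, int]) -> FrozenSet[str]:
    reports = {"w1"}
    if config["callgraph_depth"] >= 2:
        reports.add("w2")            # deeper analysis finds an extra warning
    if config["callgraph_depth"] >= 3 and program == "p2":
        reports.discard("w1")        # a seeded bug: raising depth loses a report
    return frozenset(reports)

def check_monotonicity(program: str, weaker: Dict[str, int],
                       stronger: Dict[str, int]) -> bool:
    """Assumed documented expectation: a 'stronger' setting of callgraph_depth
    should never report strictly fewer warnings for a sound, over-approximating
    analysis. A violation is a candidate bug in the tool or its documentation."""
    return run_analysis(program, weaker) <= run_analysis(program, stronger)

for prog in ["p1", "p2"]:
    ok = check_monotonicity(prog, {"callgraph_depth": 1}, {"callgraph_depth": 3})
    print(prog, "respects the configuration partial order" if ok else "VIOLATION")
```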

DOI: 10.1145/3597926.3605232


Type Automata

作者: Roth, Ori
关键词: automata theory, fluent API, metaprogramming, type systems

Abstract

PL researchers have a profound understanding of automata theory but fail to grasp the subtleties and nuances of the type systems used in modern programming languages.
My research pursues new insights into the computational power of type systems by connecting them with well-founded classes of automata through type automata—machines that employ program types as control and storage.
In addition, I demonstrate advanced type-level metaprogramming applications of type automata that coerce the compiler into performing computations at compile time.
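
The “types as control states” idea can be sketched with a fluent API whose return types encode a small automaton, so that a static type checker rejects out-of-order calls; the Python example below (checkable with a tool such as mypy) uses a hypothetical open/write/close protocol and illustrates the general technique, not the author’s actual constructions.

```python
# A fluent API whose return types encode the automaton
#   Start --open--> Opened --write--> Opened --close--> Closed
# so a static type checker (e.g. mypy) rejects out-of-order calls: the type of
# the current expression acts as the automaton's control state.

class Closed:
    pass

class Opened:
    def write(self, data: str) -> "Opened":
        print("write:", data)
        return self

    def close(self) -> Closed:
        print("close")
        return Closed()

class Start:
    def open(self) -> Opened:
        print("open")
        return Opened()

# Accepted by the type checker: follows the automaton's transitions.
Start().open().write("hello").write("world").close()

# Rejected statically (uncomment to see the type error):
# Start().open().close().write("late")   # `Closed` has no method `write`
```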

DOI: 10.1145/3597926.3605237


Harnessing Large Language Models for Simulink Toolchain Testing and Developing Diverse Open-Source Corpora of Simulink Models for Metric and Evolution Analysis

作者: Shrestha, Sohil Lal
关键词: Cyber-physical system development, GPT-2, Simulink, deep learning, mining software repositories, model evolution, open-source, programming language modeling, tool chain bugs

Abstract

MATLAB/Simulink is a de facto standard tool in several safety-critical industries such as automotive, aerospace, healthcare, and industrial automation for system modeling and analysis, compiling models to code, and deploying code to embedded hardware. On the one hand, testing cyber-physical system (CPS) development tools such as MathWorks’ Simulink is important, as a bug in the toolchain may propagate to the artifacts it produces. On the other hand, it is equally important to understand modeling practices and model evolution to support engineers and scientists, since Simulink models are widely used in the design, simulation, and verification of CPS. Existing work in this area is limited by two main factors: (1) inefficiencies of state-of-the-art testing schemes in finding critical toolchain bugs and (2) the lack of a reusable corpus of public Simulink models. In my thesis, I propose to (1) curate a large reusable corpus of Simulink models to help understand modeling practices and model evolution and (2) leverage such a corpus with deep-learning-based language models to test the toolchain.

DOI: 10.1145/3597926.3605233


Fairness Testing for Recommender Systems

作者: Guo, Huizhong
关键词: AI Ethics, Fairness Testing, Recommender Systems

Abstract

The topic of fairness in recommender systems (RSs) is gaining significant attention.
However, current fairness metrics and testing approaches primarily cater to classification systems and are not suitable for RSs.
To bridge this gap, we aim to address the specific challenges involved in fairness testing for RSs.
In this paper, we present a novel testing approach specifically designed for RSs, which enables us to achieve accurate results while maintaining high efficiency.
Additionally, we suggest potential avenues for further research in the realm of fairness testing for RSs.

DOI: 10.1145/3597926.3605235


Quantitative Symbolic Similarity Analysis

作者: Sarker, Laboni
关键词: equivalence, model counting, quantitative analysis, similarity, symbolic execution

Abstract

Similarity analysis plays a crucial role in various software engineering tasks, such as detecting software changes, version merging, identifying plagiarism, and analyzing binary code. Equivalence analysis, a stricter form of similarity, focuses on determining whether different programs or versions of the same program behave identically. While extensive research exists on code and binary similarity as well as equivalence analysis, there is a lack of quantitative reasoning in these areas. Non-equivalence is a spectrum that requires deeper exploration, as it can manifest in different ways across the input domain space. This paper emphasizes the importance of quantitative reasoning on non-equivalence which arises due to semantic differences. By quantitatively reasoning about non-equivalence, it becomes possible to identify specific input ranges for which programs are equivalent or non-equivalent. We aim to address the gap in quantitative reasoning in symbolic similarity analysis, enabling a more comprehensive understanding of program behavior.
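
A brute-force Python sketch of the quantitative view is given below: over a bounded integer domain it reports the fraction of inputs on which two versions disagree, which is what model counting over symbolic path conditions would compute symbolically; the two versions and the domain bounds are assumptions for illustration.

```python
from typing import List, Tuple

def v1(x: int) -> int:
    """Original version: clamp negatives to zero."""
    return x if x >= 0 else 0

def v2(x: int) -> int:
    """Patched version: (inadvertently) also clamps values above 100."""
    if x < 0:
        return 0
    return min(x, 100)

def quantify_difference(lo: int, hi: int) -> Tuple[float, List[int]]:
    """Brute-force stand-in for model counting over the domain [lo, hi]:
    return the fraction of inputs where the versions disagree, plus the
    disagreeing inputs themselves (so ranges of non-equivalence are visible)."""
    domain = range(lo, hi + 1)
    diffs = [x for x in domain if v1(x) != v2(x)]
    return len(diffs) / len(domain), diffs

ratio, witnesses = quantify_difference(-50, 150)
print(f"non-equivalent on {ratio:.1%} of the domain; "
      f"first/last differing inputs: {witnesses[0]}, {witnesses[-1]}")
```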

DOI: 10.1145/3597926.3605238


Reasoning about MLIR Semantics through Effects and Handlers

作者: Yu, Pingshi
关键词: MLIR, Semantics, algebraic effects, interpreters

Abstract

MLIR is a novel framework for developing intermediate representations (IRs) of compilers. At its core, MLIR is a framework for the specification of syntax fragments (dialects) and optimisations, which can be combined à la carte.

DOI: 10.1145/3597926.3605239


