ISSTA 2020 | 逸翎清晗🌈

WEIZZ: automatic grey-box fuzzing for structured binary formats

Authors: Fioraldi, Andrea and D’Elia, Daniele Cono and Coppa, Emilio
Keywords: structural mutations, chunk-based formats, binary testing, Fuzzing

Abstract

Fuzzing technologies have evolved at a fast pace in recent years, revealing bugs in programs with ever increasing depth and speed. Applications working with complex formats are however more difficult to take on, as inputs need to meet certain format-specific characteristics to get through the initial parsing stage and reach deeper behaviors of the program. Unlike prior proposals based on manually written format specifications, we propose a technique to automatically generate and mutate inputs for unknown chunk-based binary formats. We identify dependencies between input bytes and comparison instructions, and use them to assign tags that characterize the processing logic of the program. Tags become the building block for structure-aware mutations involving chunks and fields of the input. Our technique can perform comparably to structure-aware fuzzing proposals that require human assistance. Our prototype implementation WEIZZ revealed 16 unknown bugs in widely used programs.
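
To make the tagging idea concrete, here is a minimal sketch of how comparison-to-input dependencies can be turned into fields: each input byte is tagged with the comparison instruction that consumed it, and contiguous runs of equal tags approximate the fields and chunk headers of the unknown format. This is an illustrative reconstruction, not WEIZZ's implementation; `comparisons` and the offsets are hypothetical.

```python
def tag_input(comparisons, length):
    """Tag each input byte with the id of the comparison instruction
    that consumed it, then group contiguous equally-tagged bytes into
    approximate fields of the unknown format."""
    tags = [None] * length
    for cmp_id, offsets in comparisons:
        for off in offsets:
            tags[off] = cmp_id
    fields, start = [], 0
    for i in range(1, length + 1):
        if i == length or tags[i] != tags[start]:
            fields.append((start, i, tags[start]))  # [start, i) shares one tag
            start = i
    return fields

# Bytes 0-3 flow into comparison #7 (say, a magic-number check),
# bytes 4-5 into comparison #9; bytes 6-7 are never compared.
print(tag_input([(7, [0, 1, 2, 3]), (9, [4, 5])], 8))
# [(0, 4, 7), (4, 6, 9), (6, 8, None)]
```

Structure-aware mutations can then operate on these `(start, end, tag)` fields, e.g., duplicating or removing whole chunks instead of flipping individual bits.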

DOI: 10.1145/3395363.3397372


Active fuzzing for testing and securing cyber-physical systems

Authors: Chen, Yuqi and Xuan, Bohan and Poskitt, Christopher M. and Sun, Jun and Zhang, Fan
Keywords: testing defence mechanisms, fuzzing, benchmark generation, active learning, Cyber-physical systems

Abstract

Cyber-physical systems (CPSs) in critical infrastructure face a pervasive threat from attackers, motivating research into a variety of countermeasures for securing them. Assessing the effectiveness of these countermeasures is challenging, however, as realistic benchmarks of attacks are difficult to manually construct, blindly testing is ineffective due to the enormous search spaces and resource requirements, and intelligent fuzzing approaches require impractical amounts of data and network access. In this work, we propose active fuzzing, an automatic approach for finding test suites of packet-level CPS network attacks, targeting scenarios in which attackers can observe sensors and manipulate packets, but have no existing knowledge about the payload encodings. Our approach learns regression models for predicting sensor values that will result from sampled network packets, and uses these predictions to guide a search for payload manipulations (i.e. bit flips) most likely to drive the CPS into an unsafe state. Key to our solution is the use of online active learning, which iteratively updates the models by sampling payloads that are estimated to maximally improve them. We evaluate the efficacy of active fuzzing by implementing it for a water purification plant testbed, finding it can automatically discover a test suite of flow, pressure, and over/underflow attacks, all with substantially less time, data, and network access than the most comparable approach. Finally, we demonstrate that our prediction models can also be utilised as countermeasures themselves, implementing them as anomaly detectors and early warning systems.
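
The loop below is a loose sketch of that search under stated assumptions: a linear least-squares surrogate stands in for the paper's regression models, a synthetic `observe` function stands in for the plant, and all names and thresholds are hypothetical. It illustrates the predict-select-query-update cycle, not the paper's exact active-learning criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y):
    """Least-squares regression from payload bits to a sensor value."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def active_fuzz(observe, n_bits=16, rounds=50, unsafe=80.0):
    """Iteratively: fit a model, pick the candidate payload predicted to
    push the sensor furthest, send it, and learn from the real response."""
    X = rng.integers(0, 2, (8, n_bits)).astype(float)  # seed payloads
    y = np.array([observe(x) for x in X])
    for _ in range(rounds):
        w = fit(X, y)
        cands = rng.integers(0, 2, (64, n_bits)).astype(float)
        best = cands[np.argmax(cands @ w)]   # predicted most-unsafe payload
        val = observe(best)                  # query the (simulated) testbed
        X, y = np.vstack([X, best]), np.append(y, val)
        if val > unsafe:
            return best, val                 # found an unsafe-state attack
    return None

# Hypothetical plant response: payload bits 3 and 7 drive the flow sensor.
def observe(bits):
    return 50.0 + 30.0 * bits[3] + 25.0 * bits[7] + rng.normal(0.0, 1.0)

print(active_fuzz(observe))
```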

DOI: 10.1145/3395363.3397376


Replication Package for lFuzzer - Learning Input Tokens for Effective Fuzzing

Authors: Mathis, Björn and Gopinath, Rahul and Zeller, Andreas
Keywords: fuzzing, parser, test input generation

Abstract

This package contains the data and tools used for the experiments run to evaluate lFuzzer. The artifact contains two packages: one for reproducing the evaluation and conducting new experiments, and one containing the evaluation results of the paper. The reproduction package is a Docker container in which the experiments from the paper, as well as new experiments, can be performed.

DOI: 10.1145/3395363.3397348


Fast bit-vector satisfiability

Authors: Yao, Peisen and Shi, Qingkai and Huang, Heqing and Zhang, Charles
Keywords: program analysis, Satisfiability modulo theory, SAT solving

Abstract

SMT solving is often a major source of cost in a broad range of techniques such as symbolic program analysis. Thus, speeding up SMT solving is still an urgent requirement. A dominant approach, known as eager SMT solving, is to reduce a first-order formula to a pure Boolean formula, which is handed to an expensive SAT solver to determine satisfiability. We observe that the SAT solver can utilize the knowledge in the first-order formula to boost its solving efficiency. Unfortunately, despite much progress, it is still not clear how to make use of this knowledge in an eager SMT solver. This paper addresses the problem by introducing a new and fast method, which utilizes the interval and data-dependence information learned from the first-order formulas. We have implemented the approach as a tool called Trident and evaluated it on three symbolic analyzers (Angr, Qsym, and Pinpoint). The experimental results, based on seven million SMT solving instances generated for thirty real-world software systems, show that Trident significantly reduces the total solving time, achieving speedups of 2.9X to 7.9X over three state-of-the-art SMT solvers (Z3, CVC4, and Boolector), without sacrificing the number of solved instances. We also demonstrate that Trident achieves end-to-end speedups for three program analysis clients of 1.9X, 1.6X, and 2.4X, respectively.
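
The flavor of the optimization can be sketched in a few lines: if intervals learned from the first-order formula already rule a query out, the expensive bit-blasted SAT search never runs. The toy expression language and the unsigned, non-wrapping arithmetic below are simplifying assumptions, not Trident's algorithm.

```python
def interval_of(expr, env):
    """Over-approximate a tiny bit-vector expression by an interval,
    using variable bounds learned from the first-order formula."""
    if isinstance(expr, str):                 # a variable
        return env[expr]
    op, a, b = expr
    (alo, ahi), (blo, bhi) = interval_of(a, env), interval_of(b, env)
    if op == "+":
        return (alo + blo, ahi + bhi)         # ignoring wrap-around here
    raise ValueError(f"unsupported operator: {op}")

def settled_unsat(expr, env, lo, hi):
    """If the interval of `expr` misses [lo, hi] entirely, answer
    'unsat' without ever bit-blasting to SAT."""
    elo, ehi = interval_of(expr, env)
    return ehi < lo or elo > hi

# x in [0, 10], y in [0, 5]: can x + y equal 100?
env = {"x": (0, 10), "y": (0, 5)}
print(settled_unsat(("+", "x", "y"), env, 100, 100))  # True: skip SAT
```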

DOI: 10.1145/3395363.3397378


Relocatable addressing model for symbolic execution

Authors: Trabish, David and Rinetzky, Noam
Keywords: Symbolic execution, Memory partitioning, Addressing model

Abstract

Symbolic execution (SE) is a widely used program analysis technique. Existing SE engines model the memory space by associating memory objects with concrete addresses, where the representation of each allocated object is determined during its allocation. We present a novel addressing model where the underlying representation of an allocated object can be dynamically modified even after its allocation, by using symbolic addresses rather than concrete ones. We demonstrate the benefits of our model in two application scenarios: dynamic inter- and intra-object partitioning. In the former, we show how the recently proposed segmented memory model can be improved by dynamically merging several object representations into a single one, rather than doing so a priori using static pointer analysis. In the latter, we show how the cost of solving array theory constraints can be reduced by splitting the representations of large objects into multiple smaller ones. Our preliminary results show that our approach can significantly improve the overall effectiveness of the symbolic exploration.

DOI: 10.1145/3395363.3397363


Replication package for: Running Symbolic Execution Forever

Authors: Busse, Frank and Nowack, Martin and Cadar, Cristian
Keywords: binutils, coreutils, findutils, grep, KLEE, libspng, memoization, MoKlee, software testing, symbolic execution, tcpdump

Abstract

The artefact contains a Docker image with MoKlee, our memoization extension of KLEE, all benchmarks in LLVM bitcode format, the raw experiment results and scripts to re-create our evaluation and to re-run all experiments.

DOI: 10.1145/3395363.3397360


Tool Package for paper "Can Automated Program Repair Refine Fault Localization? A Unified Debugging Approach"

Authors: Lou, Yiling and Ghanbari, Ali and Li, Xia and Zhang, Lingming and Zhang, Haotian and Hao, Dan and Zhang, Lu
Keywords: Automated Debugging Tool, Automated Program Repair, Fault Localization, Unified Debugging

Abstract

The package includes the tool from the paper "Can Automated Program Repair Refine Fault Localization? A Unified Debugging Approach", along with its installation and execution documentation. For more updates, please refer to the tool homepage: https://github.com/yilinglou/proFL

DOI: 10.1145/3395363.3397351


Automated repair of feature interaction failures in automated driving systems

Authors: Abdessalem, Raja Ben and Panichella, Annibale and Nejati, Shiva and Briand, Lionel C. and Stifter, Thomas
Keywords: Search-based Software Testing, Feature Interaction Problem, Automated Software Repair, Automated Driving Systems

Abstract

In the past years, several automated repair strategies have been proposed to fix bugs in individual software programs without any human intervention. There has been, however, little work on how automated repair techniques can resolve failures that arise at the system-level and are caused by undesired interactions among different system components or functions. Feature interaction failures are common in complex systems such as autonomous cars that are typically built as a composition of independent features (i.e., units of functionality). In this paper, we propose a repair technique to automatically resolve undesired feature interaction failures in automated driving systems (ADS) that lead to the violation of system safety requirements. Our repair strategy achieves its goal by (1) localizing faults spanning several lines of code, (2) simultaneously resolving multiple interaction failures caused by independent faults, (3) scaling repair strategies from the unit-level to the system-level, and (4) resolving failures based on their order of severity. We have evaluated our approach using two industrial ADS containing four features. Our results show that our repair strategy resolves the undesired interaction failures in these two systems in less than 16h and outperforms existing automated repair techniques.

DOI: 10.1145/3395363.3397386


CoCoNuT: combining context-aware neural translation models using ensemble for program repair

Authors: Lutellier, Thibaud and Pham, Hung Viet and Pang, Lawrence and Li, Yitong and Wei, Moshi and Tan, Lin
Keywords: Neural Machine Translation, Deep Learning, Automated program repair, AI and Software Engineering

Abstract

Automated generate-and-validate (G&V) program repair (APR) techniques typically rely on hard-coded rules, thus only fixing bugs following specific fix patterns. These rules require a significant amount of manual effort to discover, and it is hard to adapt them to different programming languages. To address these challenges, we propose a new G&V technique, CoCoNuT, which uses ensemble learning on the combination of convolutional neural networks (CNNs) and a new context-aware neural machine translation (NMT) architecture to automatically fix bugs in multiple programming languages. To better represent the context of a bug, we introduce a new context-aware NMT architecture that represents the buggy source code and its surrounding context separately. CoCoNuT uses CNNs instead of recurrent neural networks (RNNs), since CNN layers can be stacked to extract hierarchical features and better model source code at different granularity levels (e.g., statements and functions). In addition, CoCoNuT takes advantage of the randomness in hyperparameter tuning to build multiple models that fix different bugs and combines these models using ensemble learning to fix more bugs. Our evaluation on six popular benchmarks for four programming languages (Java, C, Python, and JavaScript) shows that CoCoNuT correctly fixes (i.e., the first generated patch is semantically equivalent to the developer’s patch) 509 bugs, including 309 bugs that are fixed by none of the 27 techniques with which we compare.

DOI: 10.1145/3395363.3397369


Detecting and diagnosing energy issues for mobile applications

Authors: Li, Xueliang and Yang, Yuming and Liu, Yepang and Gallagher, John P. and Wu, Kaishun
Keywords: Mobile Applications, Energy Issues, Energy Bugs, Android

Abstract

Energy efficiency is an important criterion to judge the quality of mobile apps, but one third of our randomly sampled apps suffer from energy issues that can quickly drain battery power. To understand these issues, we conducted an empirical study on 27 well-maintained apps such as Chrome and Firefox, whose issue tracking systems are publicly accessible. Our study revealed that the main root causes of energy issues include unnecessary workload and excessively frequent operations. Surprisingly, these issues are beyond the reach of present energy-issue detection techniques. We also found that 25.0% of energy issues can only manifest themselves under specific contexts such as poor network performance, but such contexts are again neglected by present techniques. In this paper, we propose a novel testing framework for detecting energy issues in real-world mobile apps. Our framework examines apps with well-designed input sequences and runtime contexts. To identify the root causes mentioned above, we employed a machine learning algorithm to cluster the workloads and further evaluate their necessity. For the issues concealed by specific contexts, we carefully set up several execution contexts to catch them. More importantly, we designed leading-edge techniques, e.g., pre-designing input sequences with potential energy overuse and tuning tests on the fly, to achieve high efficacy in detecting energy issues. A large-scale evaluation shows that 91.6% of the issues detected in our experiments were previously unknown to developers. On average, these issues double the energy costs of the apps. Our testing technique also achieves a low number of false positives.

DOI: 10.1145/3395363.3397350


Automated classification of actions in bug reports of mobile apps

Authors: Liu, Hui and Shen, Mingzhu and Jin, Jiahao and Jiang, Yanjie
Keywords: Test Case Generation, Mobile Testing, Classification, Bug report

Abstract

When users encounter problems with mobile apps, they may report such problems to developers as bug reports. To facilitate the processing of bug reports, researchers have proposed approaches to validate the reported issues automatically according to the steps to reproduce specified in bug reports. Although such approaches have achieved a high success rate in reproducing the reported issues, they often rely on a predefined vocabulary to identify and classify actions in bug reports. However, such manually constructed vocabularies and classifications have significant limitations. It is challenging for the vocabulary to cover all potential action words because users may describe the same action with different words. Besides that, classification of actions solely based on the action words could be inaccurate because the same action word, appearing in different contexts, may have different meanings and thus belong to different action categories. To this end, in this paper we propose an automated approach, called MaCa, to identify and classify action words in mobile apps’ bug reports. For a given bug report, it first identifies action words based on natural language processing. For each of the resulting action words, MaCa extracts its contexts, i.e., its enclosing segment, the associated UI target, and the type of its target element, by both natural language processing and static analysis of the associated app. The action word and its contexts are then fed into a machine learning based classifier that predicts the category of the given action word in the given context. To train the classifier, we manually labelled 1,202 action words from 525 bug reports that are associated with 207 apps. Our evaluation results on manually labelled data suggest that MaCa achieves high accuracy, varying from 95% to 96.7%. We also investigated to what extent MaCa could further improve existing approaches (i.e., Yakusu and ReCDroid) in reproducing bug reports. Our evaluation results suggest that integrating MaCa into existing approaches significantly improved the success rates of ReCDroid and Yakusu by 22.7% = (69.2%-56.4%)/56.4% and 22.9% = (62.7%-51%)/51%, respectively.

DOI: 10.1145/3395363.3397355


DLD: Data Loss Detector

Authors: Riganelli, Oliviero and Mottadelli, Simone Paolo and Rota, Claudio and Micucci, Daniela and Mariani, Leonardo
Keywords: Android, Data Loss, Mobile Apps, Test Case Generation, Validation

Abstract

Android apps must work correctly even if their execution is interrupted by external events. For instance, an app must work properly even if a phone call is received, or after its layout is redrawn because the smartphone has been rotated. Since these events may require the foreground activity of the app to be destroyed when execution is interrupted and recreated when execution is resumed, the only way to prevent the loss of state information is to save and restore it. This behavior must be explicitly implemented by app developers, who often fail to implement it properly, releasing apps affected by data loss problems, that is, apps that may lose state information when their execution is interrupted. Although several techniques can be used to automatically generate test cases for Android apps, the obtained test cases seldom include the interactions and the checks necessary to exercise and reveal data loss faults. Data Loss Detector (DLD) is a test case generation technique that integrates an exploration strategy, data-loss-revealing actions, and two customized oracle strategies for the detection of data loss failures.

DOI: 10.1145/3395363.3397379


Reinforcement learning based curiosity-driven testing of Android applications

Authors: Pan, Minxue and Huang, An and Wang, Guoxin and Zhang, Tian and Li, Xuandong
Keywords: reinforcement learning, functional scenario division, Android app testing

Abstract

Mobile applications play an important role in our daily life, yet it remains a challenge to guarantee their correctness. Model-based and systematic approaches have been applied to Android GUI testing. However, they do not show significant advantages over random approaches because of limitations such as imprecise models and poor scalability. In this paper, we propose Q-testing, a reinforcement learning based approach which benefits from both random and model-based approaches to automated testing of Android applications. Q-testing explores Android apps with a curiosity-driven strategy that utilizes a memory set to record part of the previously visited states and guides the testing towards unfamiliar functionalities. A state comparison module, a neural network trained on plenty of collected samples, is employed to distinguish different states at the granularity of functional scenarios. It determines the reinforcement learning reward in Q-testing and helps the curiosity-driven strategy explore different functionalities efficiently. We conduct experiments on 50 open-source applications where Q-testing outperforms state-of-the-art and state-of-practice Android GUI testing tools in terms of code coverage and fault detection. So far, 22 of our reported faults have been confirmed, among which 7 have been fixed.
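
A minimal sketch of the curiosity-driven reward follows, with Jaccard similarity over on-screen widget identifiers standing in for the trained state-comparison network; the memory set, threshold, and reward values are illustrative assumptions.

```python
def similarity(state_a, state_b):
    """Stand-in for the learned state comparison: Jaccard similarity
    over the sets of widget ids visible on screen."""
    union = len(state_a | state_b) or 1
    return len(state_a & state_b) / union

def curiosity_reward(state, memory, threshold=0.8):
    """High reward for states unlike anything in the memory set,
    steering the search toward unfamiliar functional scenarios."""
    if any(similarity(state, seen) >= threshold for seen in memory):
        return 0.1        # familiar scenario: little curiosity
    memory.append(state)  # remember the newly discovered scenario
    return 1.0

memory = []
print(curiosity_reward({"btn_login", "txt_user"}, memory))  # 1.0 (new)
print(curiosity_reward({"btn_login", "txt_user"}, memory))  # 0.1 (seen)
```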

DOI: 10.1145/3395363.3397354


Replication Package for Article: Effective White-box Testing of Deep Neural Networks with Adaptive Neuron-Selection Strategy

Authors: Lee, Seokhyun and Cha, Sooyoung and Lee, Dain and Oh, Hakjoo
Keywords: Deep neural networks, Online learning, White-box testing

Abstract

This artifact contains the source code and data for the paper "Effective White-box Testing of Deep Neural Networks with Adaptive Neuron-Selection Strategy". It also contains experiment scripts that can reproduce the results in the paper.

DOI: 10.1145/3395363.3397346


DeepGini: prioritizing massive tests to enhance the robustness of deep neural networks

Authors: Feng, Yang and Shi, Qingkai and Gao, Xinyu and Wan, Jun and Fang, Chunrong and Chen, Zhenyu
Keywords: Test Case Prioritization, Deep Learning Testing, Deep Learning

Abstract

Deep neural networks (DNN) have been deployed in many software systems to assist in various classification tasks. Alongside their fantastic effectiveness in classification, DNNs can also exhibit incorrect behaviors and result in accidents and losses. Therefore, testing techniques that can detect incorrect DNN behaviors and improve DNN quality are extremely necessary and critical. However, the testing oracle, which defines the correct output for a given input, is often not available in automated testing. To obtain the oracle information, the testing tasks of DNN-based systems usually require expensive human effort to label the testing data, which significantly slows down the process of quality assurance. To mitigate this problem, we propose DeepGini, a test prioritization technique designed from a statistical perspective of DNNs. Such a statistical perspective allows us to reduce the problem of measuring misclassification probability to the problem of measuring set impurity, which in turn lets us quickly identify possibly-misclassified tests. For evaluation, we conduct an extensive empirical study on popular datasets and prevalent DNN models. The experimental results demonstrate that DeepGini outperforms existing coverage-based techniques in prioritizing tests regarding both effectiveness and efficiency. Meanwhile, we observe that the tests prioritized at the front by DeepGini are more effective in improving the DNN quality in comparison with the coverage-based techniques.
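
The statistical core is compact enough to sketch: score each test input by the Gini impurity of the model's output distribution, 1 − Σᵢ pᵢ², and label the most impure (least confident) inputs first. A minimal illustration, assuming softmax outputs are available:

```python
import numpy as np

def deepgini_score(probs: np.ndarray) -> np.ndarray:
    """Impurity of each softmax vector: 1 - sum_i p_i^2. Higher means
    the model is less certain, so misclassification is more likely."""
    return 1.0 - np.sum(probs ** 2, axis=1)

def prioritize(probs: np.ndarray) -> np.ndarray:
    """Test indices ordered from most to least impure."""
    return np.argsort(-deepgini_score(probs))

# Three test inputs for a 3-class model.
probs = np.array([[0.98, 0.01, 0.01],   # confident -> low priority
                  [0.40, 0.35, 0.25],   # uncertain -> high priority
                  [0.70, 0.20, 0.10]])
print(prioritize(probs))  # [1 2 0]
```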

DOI: 10.1145/3395363.3397357


DPFuzz: Fuzzing and Debugging for Differential Performance Bugs in Machine Learning Libraries

Authors: Tizpaz-Niari, Saeid and Černý, Pavol and Trivedi, Ashutosh
Keywords: Debugging, Differential Performance Bugs, Fuzzing, Machine Learning Libraries, Performance

Abstract

DPFuzz is a tool for fuzzing and debugging differential performance bugs. The paper and its overview are included in this folder. Please see 'DPFuzz-Overview.pdf' for an overall picture of DPFuzz.

DOI: 10.1145/3395363.3404540


Higher income, larger loan? monotonicity testing of machine learning models

Authors: Sharma, Arnab and Wehrheim, Heike
Keywords: Monotonicity, Machine Learning Testing, Decision Tree

Abstract

Today, machine learning (ML) models are increasingly applied in decision making. This induces an urgent need for quality assurance of ML models with respect to (often domain-dependent) requirements. Monotonicity is one such requirement. It requires the software "learned" by an ML algorithm to give increasing predictions as certain attribute values increase. While there exist multiple ML algorithms for ensuring monotonicity of the generated model, approaches for checking monotonicity, in particular of black-box models, are largely lacking. In this work, we propose verification-based testing of monotonicity, i.e., the formal computation of test inputs on a white-box model via verification technology, and the automatic inference of this approximating white-box model from the black-box model under test. On the white-box model, the space of test inputs can be systematically explored by a directed computation of test cases. The empirical evaluation on 90 black-box models shows that verification-based testing can outperform adaptive random testing as well as property-based techniques with respect to effectiveness and efficiency.
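
For contrast with the verification-based approach, the property itself is easy to state as a black-box, property-based check, one of the baselines the paper compares against. The `model` callable, feature index, and input ranges below are hypothetical.

```python
import random

def violates_monotonicity(model, x, feature, delta):
    """One monotonicity instance: raising `feature` by `delta`
    must not decrease the prediction."""
    x_hi = list(x)
    x_hi[feature] += delta
    return model(x_hi) < model(x)

def random_monotonicity_test(model, dim, feature, trials=1000):
    """Property-based baseline: sample points, perturb the supposedly
    monotone feature upward, and look for a violating pair."""
    for _ in range(trials):
        x = [random.uniform(0.0, 1.0) for _ in range(dim)]
        delta = random.uniform(0.01, 0.5)
        if violates_monotonicity(model, x, feature, delta):
            return x  # counterexample found
    return None       # no violation observed (not a proof!)
```

The paper's technique instead infers an approximating white-box model and computes such violating pairs directly with verification technology, rather than hoping to sample them.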

DOI: 10.1145/3395363.3397352


Detecting flaky tests in probabilistic and machine learning applications

Authors: Dutta, Saikat and Shi, August and Choudhary, Rutvik and Zhang, Zhekun and Jain, Aryaman and Misailovic, Sasa
Keywords: Randomness, Probabilistic Programming, Non-Determinism, Machine Learning, Flaky tests

Abstract

Probabilistic programming systems and machine learning frameworks like Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, such operations are non-deterministic. Hence, it is challenging for developers to write tests for applications that depend on such frameworks, often resulting in flaky tests – tests which fail non-deterministically when run on the same version of code. In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study the projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports/commits that deal with flaky tests, and we categorize the common causes and fixes for them. This study provides developers with useful insights on dealing with flaky tests in this domain. Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests due to assertions passing and failing in different runs on the same code. These assertions fail due to differences in the sequence of random numbers in different runs of the same test. FLASH exposes such failures, and our evaluation on 20 projects results in 11 previously-unknown flaky tests that we reported to developers.
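
The core detection idea can be sketched as follows, with a hypothetical probabilistic test whose assertion tolerance is too tight; FLASH itself systematically controls the seeds of the frameworks' random number generators rather than Python's global one.

```python
import random

def flash_detect(test_fn, runs=30):
    """Re-run a test under different seeds and report it as flaky if
    its assertion both passes and fails on the same code."""
    outcomes = set()
    for seed in range(runs):
        random.seed(seed)            # vary only the source of randomness
        try:
            test_fn()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    return outcomes == {"pass", "fail"}

def test_mean_estimate():
    # Sample mean of 100 N(0, 1) draws has std ~0.1, so this
    # tolerance fails for a sizable fraction of seeds.
    sample = [random.gauss(0.0, 1.0) for _ in range(100)]
    assert abs(sum(sample) / len(sample)) < 0.1

print(flash_detect(test_mean_estimate))  # expected: True (flaky)
```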

DOI: 10.1145/3395363.3397366


Scaffle: bug localization on millions of files

Authors: Pradel, Michael and Murali, Vijayaraghavan and Qian, Rebecca and Machalica, Mateusz and Meijer, Erik and Chandra, Satish
Keywords: software crashes, machine learning, Bug localization

Abstract

Despite all efforts to avoid bugs, software sometimes crashes in the field, leaving crash traces as the only information to localize the problem. Prior approaches on localizing where to fix the root cause of a crash do not scale well to ultra-large scale, heterogeneous code bases that contain millions of code files written in multiple programming languages. This paper presents Scaffle, the first scalable bug localization technique, which is based on the key insight to divide the problem into two easier sub-problems. First, a trained machine learning model predicts which lines of a raw crash trace are most informative for localizing the bug. Then, these lines are fed to an information retrieval-based search engine to retrieve file paths in the code base, predicting which file to change to address the crash. The approach does not make any assumptions about the format of a crash trace or the language that produces it. We evaluate Scaffle with tens of thousands of crash traces produced by a large-scale industrial code base at Facebook that contains millions of possible bug locations and that powers tools used by billions of people. The results show that the approach correctly predicts the file to fix for 40% to 60% (50% to 70%) of all crash traces within the top-1 (top-5) predictions. Moreover, Scaffle improves over several baseline approaches, including an existing classification-based approach, a scalable variant of existing information retrieval-based approaches, and a set of hand-tuned, industrially deployed heuristics.

DOI: 10.1145/3395363.3397356


Replication package for Abstracting Failure Inducing Inputs

Authors: Gopinath, Rahul and Kampmann, Alexander and Havrikov, Nikolas and Soremekun, Ezekiel O. and Zeller, Andreas
Keywords: debugging, error diagnosis, failure-inducing inputs, grammars

Abstract

This artifact contains the implementation of the algorithm in the paper "Abstracting Failure Inducing Inputs". The artifact is a Vagrant box (a virtual machine) that contains the complete implementation and the subjects that can be evaluated directly. A complete worked out example in a Jupyter notebook is included in the VM along with a complete Jupyter installation so that the notebook can be viewed directly.

DOI: 10.1145/3395363.3397349


Debugging the performance of Maven’s test isolation: experience report

Authors: Nie, Pengyu and Celik, Ahmet and Coley, Matthew and Milicevic, Aleksandar and Bell, Jonathan and Gligoric, Milos
Keywords: test isolation, Maven, Build system

Abstract

Testing is the most common approach used in industry for checking software correctness. Developers frequently practice reliable testing (executing individual tests in isolation from each other) to avoid test failures caused by test-order dependencies and shared state pollution (e.g., when tests mutate static fields). A common way of doing this is by running each test as a separate process. Unfortunately, this is known to introduce substantial overhead. This experience report describes our efforts to better understand the sources of this overhead and to create a system that confirms the minimal overhead possible. We found that different build systems use different mechanisms for communicating between these multiple processes, and that because of this design decision, running tests with some build systems could be faster than with others. Through this inquiry we discovered a significant performance bug in Apache Maven’s test running code, which slowed down test execution by an average of 350 milliseconds per test when compared to a competing build system, Ant. When used for testing real projects, this can result in a significant reduction in testing time. We submitted a patch for this bug which has been integrated into the Apache Maven build system, and describe our ongoing efforts to improve Maven’s test execution tooling.

DOI: 10.1145/3395363.3397381


Feedback-driven side-channel analysis for networked applications

Authors: Kadron, İsmet Burak and Rosner, Nicolás and Bultan, Tevfik
Keywords: network traffic analysis, input generation, dynamic program analysis, Side-channel analysis

Abstract

Information leakage in software systems is a problem of growing importance. Networked applications can leak sensitive information even when they use encryption. For example, some characteristics of network packets, such as their size, timing and direction, are visible even for encrypted traffic. Patterns in these characteristics can be leveraged as side channels to extract information about secret values accessed by the application. In this paper, we present a new tool called AutoFeed for detecting and quantifying information leakage due to side channels in networked software applications. AutoFeed profiles the target system and automatically explores the input space, explores the space of output features that may leak information, quantifies the information leakage, and identifies the top-leaking features. Given a set of input mutators and a small number of initial inputs provided by the user, AutoFeed iteratively mutates inputs and periodically updates its leakage estimations to identify the features that leak the greatest amount of information about the secret of interest. AutoFeed uses a feedback loop for incremental profiling, and a stopping criterion that terminates the analysis when the leakage estimation for the top-leaking features converges. AutoFeed also automatically assigns weights to mutators in order to focus the search of the input space on exploring dimensions that are relevant to the leakage quantification. Our experimental evaluation on the benchmarks shows that AutoFeed is effective in detecting and quantifying information leaks in networked applications.

DOI: 10.1145/3395363.3397365


Scalable analysis of interaction threats in IoT systems

Authors: Alhanahnah, Mohannad and Stevens, Clay and Bagheri, Hamid
Keywords: IoT Safety, Interaction Threats, Formal Verification

Abstract

The ubiquity of the Internet of Things (IoT) and our growing reliance on IoT apps are leaving us more vulnerable to safety and security threats than ever before. Many of these threats are manifested at the interaction level, where undesired or malicious coordinations between apps and physical devices can lead to intricate safety and security issues. This paper presents IoTCOM, an approach to automatically discover such hidden and unsafe interaction threats in a compositional and scalable fashion. It is backed by automated program analysis and formally rigorous violation detection engines. IoTCOM relies on program analysis to automatically infer the relevant app behavior. Leveraging a novel strategy to trim the extracted app behaviors prior to translating them to analyzable formal specifications, IoTCOM mitigates the state explosion associated with formal analysis. Our experiments with numerous bundles of real-world IoT apps have corroborated IoTCOM’s ability to effectively detect a broad spectrum of interaction threats triggered through cyber and physical channels, many of which were previously unknown, and to significantly outperform the existing techniques in terms of scalability.

DOI: 10.1145/3395363.3397347


DeepSQLi: deep semantic learning for testing SQL injection

Authors: Liu, Muyang and Li, Ke and Chen, Tao
Keywords: test case generation, natural language processing, deep learning, Web security, SQL injection

Abstract

Security is unarguably the most serious concern for Web applications, for which the SQL injection (SQLi) attack is one of the most devastating attacks. Automatically testing SQLi vulnerabilities is of ultimate importance, yet is unfortunately far from trivial to implement. This is because of the existence of a huge, or potentially infinite, number of variants and semantic possibilities of SQL leading to SQLi attacks on various Web applications. In this paper, we propose a deep natural language processing based tool, dubbed DeepSQLi, to generate test cases for detecting SQLi vulnerabilities. By adopting a deep learning based neural language model and sequence-of-words prediction, DeepSQLi is equipped with the ability to learn the semantic knowledge embedded in SQLi attacks, allowing it to translate user inputs (or a test case) into a new test case, which is semantically related and potentially more sophisticated. Experiments are conducted to compare DeepSQLi with SQLmap, a state-of-the-art SQLi testing automation tool, on six real-world Web applications that are of different scales, characteristics and domains. Empirical results demonstrate the effectiveness and the remarkable superiority of DeepSQLi over SQLmap, such that more SQLi vulnerabilities can be identified using fewer test cases, whilst running much faster.

DOI: 10.1145/3395363.3397375


Dependent-test-aware regression testing techniques

Authors: Lam, Wing and Shi, August and Oei, Reed and Zhang, Sai and Ernst, Michael D. and Xie, Tao
Keywords: regression testing, order-dependent test, flaky test

Abstract

Developers typically rely on regression testing techniques to ensure that their changes do not break existing functionality. Unfortunately, these techniques suffer from flaky tests, which can both pass and fail when run multiple times on the same version of code and tests. One prominent type of flaky tests is order-dependent (OD) tests, which are tests that pass when run in one order but fail when run in another order. Although OD tests may cause flaky-test failures, OD tests can help developers run their tests faster by allowing them to share resources. We propose to make regression testing techniques dependent-test-aware to reduce flaky-test failures. To understand the necessity of dependent-test-aware regression testing techniques, we conduct the first study on the impact of OD tests on three regression testing techniques: test prioritization, test selection, and test parallelization. In particular, we implement 4 test prioritization, 6 test selection, and 2 test parallelization algorithms, and we evaluate them on 11 Java modules with OD tests. When we run the orders produced by the traditional, dependent-test-unaware regression testing algorithms, 82% of human-written test suites and 100% of automatically-generated test suites with OD tests have at least one flaky-test failure. We develop a general approach for enhancing regression testing algorithms to make them dependent-test-aware, and apply our approach to 12 algorithms. Compared to traditional, unenhanced regression testing algorithms, the enhanced algorithms use provided test dependencies to produce orders with different permutations or extra tests. Our evaluation shows that, in comparison to the orders produced by unenhanced algorithms, the orders produced by enhanced algorithms (1) have overall 80% fewer flaky-test failures due to OD tests, and (2) may add extra tests but run only 1% slower on average. Our results suggest that enhancing regression testing algorithms to be dependent-test-aware can substantially reduce flaky-test failures with only a minor slowdown to run the tests.
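
The enhancement step can be sketched as a small order-repair pass: given the order a traditional algorithm produced and a map of provided test dependencies (assumed acyclic), pull each test's prerequisites in front of it, possibly adding extra tests. Names are hypothetical.

```python
def enhance_order(order, deps):
    """Rewrite a prioritized order so that every test runs after the
    tests it depends on; prerequisites not in `order` are added."""
    result, placed = [], set()

    def place(test):
        if test in placed:
            return
        for prereq in deps.get(test, []):
            place(prereq)            # prerequisites first
        result.append(test)
        placed.add(test)

    for test in order:
        place(test)
    return result

# testB reads state that only testA sets up.
print(enhance_order(["testB", "testC"], {"testB": ["testA"]}))
# ['testA', 'testB', 'testC']
```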

DOI: 10.1145/3395363.3397364


Differential regression testing for REST APIs

Authors: Godefroid, Patrice and Lehmann, Daniel and Polishchuk, Marina
Keywords: specification regression, service regression, differential regression testing, client/service version matrix, REST APIs

Abstract

Cloud services are programmatically accessed through REST APIs. Since REST APIs are constantly evolving, an important problem is how to prevent breaking changes of APIs, while supporting several different versions. To find such breaking changes in an automated way, we introduce differential regression testing for REST APIs. Our approach is based on two observations. First, breaking changes in REST APIs involve two software components, namely the client and the service. As such, there are also two types of regressions: regressions in the API specification, i.e., in the contract between the client and the service, and regressions in the service itself, i.e., previously working requests are “broken” in later versions of the service. Finding both kinds of regressions involves testing along two dimensions: when the service changes and when the specification changes. Second, to detect such bugs automatically, we employ differential testing. That is, we compare the behavior of different versions on the same inputs against each other, and find regressions in the observed differences. For generating inputs (sequences of HTTP requests) to services, we use RESTler, a stateful fuzzer for REST APIs. Comparing the outputs (HTTP responses) of a cloud service involves several challenges, like abstracting over minor differences, handling out-of-order requests, and non-determinism. Differential regression testing across 17 different versions of the widely-used Azure networking APIs deployed between 2016 and 2019 detected 14 regressions in total, 5 of those in the official API specifications and 9 regressions in the services themselves.
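
A simplified sketch of the response-comparison step: normalize away fields expected to differ between runs, then diff what remains across versions. The volatile-field names are assumptions, and the real comparison also copes with out-of-order requests and non-determinism.

```python
import json

VOLATILE = {"requestId", "etag", "timestamp"}  # assumed noise fields

def normalize(response):
    """Abstract over volatile fields so only meaningful differences
    remain (nested objects handled; lists compared verbatim here)."""
    if isinstance(response, dict):
        return {k: normalize(v) for k, v in response.items()
                if k not in VOLATILE}
    return response

def regression(resp_old, resp_new):
    """True if two versions disagree on the same request."""
    canon = lambda r: json.dumps(normalize(r), sort_keys=True)
    return canon(resp_old) != canon(resp_new)

old = {"status": "Succeeded", "etag": "abc", "ipAddress": "10.0.0.4"}
new = {"status": "Failed", "etag": "xyz", "ipAddress": "10.0.0.4"}
print(regression(old, new))  # True: a service-regression candidate
```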

DOI: 10.1145/3395363.3397374


Empirically revisiting and enhancing IR-based test-case prioritization

Authors: Peng, Qianyang and Shi, August and Zhang, Lingming
Keywords: information retrieval, continuous integration, Test-case prioritization

Abstract

Test-case prioritization (TCP) aims to detect regression bugs faster via reordering the tests run. While TCP has been studied for over 20 years, it was almost always evaluated using seeded faults/mutants as opposed to using real test failures. In this work, we study the recent change-aware information retrieval (IR) technique for TCP. Prior work has shown it performing better than traditional coverage-based TCP techniques, but it was only evaluated on a small-scale dataset with a cost-unaware metric based on seeded faults/mutants. We extend the prior work by conducting a much larger and more realistic evaluation as well as proposing enhancements that substantially improve the performance. In particular, we evaluate the original technique on a large-scale, real-world software-evolution dataset with real failures using both cost-aware and cost-unaware metrics under various configurations. Also, we design and evaluate hybrid techniques combining the IR features, historical test execution time, and test failure frequencies. Our results show that the change-aware IR technique outperforms state-of-the-art coverage-based techniques in this real-world setting, and our hybrid techniques improve even further upon the original IR technique. Moreover, we show that flaky tests have a substantial impact on evaluating the change-aware TCP techniques based on real test failures.
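
A minimal sketch of the hybrid idea: blend each test's IR similarity to the code change with its historical failure frequency and (inverse) execution time into one priority score. The weights and names below are illustrative assumptions, not the paper's tuned configuration.

```python
def hybrid_score(ir_score, fail_count, exec_time,
                 w_ir=0.6, w_fail=0.3, w_time=0.1):
    """Higher is better: textually related to the change, historically
    failure-prone, and cheap enough to run early."""
    return w_ir * ir_score + w_fail * fail_count + w_time / (1.0 + exec_time)

tests = {
    "testParser": (0.9, 2, 1.2),   # (ir_score, past failures, seconds)
    "testUtils":  (0.2, 0, 0.1),
    "testCore":   (0.7, 5, 4.0),
}
order = sorted(tests, key=lambda t: -hybrid_score(*tests[t]))
print(order)  # ['testCore', 'testParser', 'testUtils']
```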

DOI: 10.1145/3395363.3397383


Intermittently failing tests in the embedded systems domain

Authors: Strandberg, Per Erik and Ostrand, Thomas J. and Weyuker, Elaine J. and Afzal, Wasif and Sundmark, Daniel
Keywords: system level test automation, non-deterministic tests, intermittently failing tests, flaky tests, embedded systems

Abstract

Software testing is sometimes plagued with intermittently failing tests and finding the root causes of such failing tests is often difficult. This problem has been widely studied at the unit testing level for open source software, but there has been far less investigation at the system test level, particularly the testing of industrial embedded systems. This paper describes our investigation of the root causes of intermittently failing tests in the embedded systems domain, with the goal of better understanding, explaining and categorizing the underlying faults. The subject of our investigation is a currently-running industrial embedded system, along with the system level testing that was performed. We devised and used a novel metric for classifying test cases as intermittent. From more than a half million test verdicts, we identified intermittently and consistently failing tests, and identified their root causes using multiple sources. We found that about 1-3% of all test cases were intermittently failing. From analysis of the case study results and related work, we identified nine factors associated with test case intermittence. We found that a fix for a consistently failing test typically removed a larger number of failures detected by other tests than a fix for an intermittent test. We also found that more effort was usually needed to identify fixes for intermittent tests than for consistent tests. An overlap between root causes leading to intermittent and consistent tests was identified. Many root causes of intermittence are the same in industrial embedded systems and open source software. However, when comparing unit testing to system level testing, especially for embedded systems, we observed that the test environment itself is often the cause of intermittence.

DOI: 10.1145/3395363.3397359


Feasible and Stressful Trajectory Generation for Mobile Robots - Artifact

Authors: Hildebrandt, Carl and Elbaum, Sebastian and Bezzo, Nicola and Dwyer, Matthew B.
Keywords: Kinematic and Dynamic Models, Robotics, Stress Testing, Test Generation

Abstract

This artifact can be used to replicate the results found in the paper: "Feasible and Stressful Trajectory Generation for Mobile Robots". For more information on the content consult the readme file.

DOI: 10.1145/3395363.3397387


Detecting cache-related bugs in Spark applications

Authors: Li, Hui and Wang, Dong and Huang, Tianze and Gao, Yu and Dou, Wensheng and Xu, Lijie and Wang, Wei and Wei, Jun and Zhong, Hua
Keywords: bug detection, cache, performance, Spark

Abstract

This artifact contains the source code of CacheCheck. It provides general instructions to use CacheCheck and to reproduce the experimental results in our paper. More details and the newest version are provided on GitHub (https://github.com/Icysandwich/cachecheck).

DOI: 10.1145/3395363.3397353


Patch based vulnerability matching for binary programs

Authors: Xu, Yifei and Xu, Zhengzi and Chen, Bihuan and Song, Fu and Liu, Yang and Liu, Ting
Keywords: Vulnerability Matching, Security, Patch Presence Identification, Binary Analysis

Abstract

Binary-level function matching has been widely used to detect whether released programs contain 1-day vulnerabilities. However, high false positive rates are a challenge for current function matching solutions, since a vulnerable function is highly similar to its corresponding patched version. In this paper, Binary X-Ray (BinXray), a patch-based vulnerability matching approach, is proposed to identify specific 1-day vulnerabilities in target programs accurately and effectively. In the preparation step, a basic block mapping algorithm is designed to extract the signature of a patch, by comparing the given vulnerable and patched programs. The signature is represented as a set of basic block traces. In the detection step, the patching semantics is applied to reduce irrelevant basic block traces to speed up the signature searching. A trace similarity is also designed to identify whether a target program is patched. In experiments, 12 real software projects related to 479 CVEs are collected. BinXray achieves 93.31% accuracy and the analysis time cost is only 296.17 ms per function, outperforming state-of-the-art works.
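
The patched-or-vulnerable decision can be illustrated with a deliberately simple similarity, Jaccard over basic-block n-grams; BinXray's actual signatures and trace similarity are more involved, so treat this as a sketch of the idea only.

```python
def ngrams(trace, n=3):
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def looks_patched(target_trace, vuln_trace, patch_trace, n=3):
    """Compare the target's basic-block trace against both signature
    traces and pick the closer one."""
    def jaccard(a, b):
        return len(a & b) / (len(a | b) or 1)
    t = ngrams(target_trace, n)
    return jaccard(t, ngrams(patch_trace, n)) >= jaccard(t, ngrams(vuln_trace, n))

vuln   = ["bb1", "bb2", "bb4", "bb5"]
patch  = ["bb1", "bb2", "bb3", "bb4", "bb5"]  # patch inserts a bounds check
target = ["bb1", "bb2", "bb3", "bb4", "bb5"]
print(looks_patched(target, vuln, patch))     # True: patch is present
```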

DOI: 10.1145/3395363.3397361


Identifying Java Calls in Native Code via Binary Scanning (artifact)

Authors: Fourtounis, George and Triantafyllou, Leonidas and Smaragdakis, Yannis
Keywords: binary, Java, native code, static analysis

Abstract

This is the artifact for the paper "Identifying Java Calls in Native Code via Binary Scanning" (ISSTA 2020). It contains a Doop installation, the benchmarks used in the "Evaluation" section of the paper, and instructions on how to replicate the paper results.

DOI: 10.1145/3395363.3397368


An empirical study on ARM disassembly tools

Authors: Jiang, Muhui and Zhou, Yajin and Luo, Xiapu and Wang, Ruoyu and Liu, Yang and Ren, Kui
Keywords: Empirical Study, Disassembly Tools, ARM Architecture

Abstract

With the increasing popularity of embedded devices, ARM is becoming the dominant architecture for them. Meanwhile, there is a pressing need to perform security assessments for these devices. Due to the different types of peripherals, it is challenging to dynamically run the firmware of these devices in an emulated environment. Therefore, static analysis is still commonly used. Existing work usually leverages off-the-shelf tools to disassemble stripped ARM binaries and (implicitly) assumes that reliably disassembling binaries and recognizing functions are solved problems. However, whether this assumption really holds is unknown. In this paper, we conduct the first comprehensive study on ARM disassembly tools. Specifically, we build 1,896 ARM binaries (including 248 obfuscated ones) with different compilers, compiling options, and obfuscation methods. We then evaluate them using eight state-of-the-art ARM disassembly tools (including both commercial and non-commercial ones) on their capabilities to locate instructions and function boundaries. These two are fundamental capabilities, which are leveraged to build other primitives. Our work reveals some observations that have not been systematically summarized and/or confirmed. For instance, we find that the existence of both ARM and Thumb instruction sets, and the reuse of the BL instruction for both function calls and branches, bring serious challenges to disassembly tools. Our evaluation sheds light on the limitations of state-of-the-art disassembly tools and points out potential directions for improvement. To engage the community, we release the data set and the related scripts at https://github.com/valour01/arm_disasssembler_study.

DOI: 10.1145/3395363.3397377


Artifact for Article "How Effective Are Smart Contract Analysis Tools? Evaluating Smart Contract Static Analysis Tools using Bug Injection"

Authors: Ghaleb, Asem and Pattabiraman, Karthik
Keywords: bug injection, Ethereum, Ethereum security, fault injection, smart contracts dataset, smart contracts, smart contracts analysis, smart contracts security, solidity code analysis, static analysis tools evaluation

Abstract

This is the artifact for the ISSTA'20 paper "How Effective Are Smart Contract Analysis Tools? Evaluating Smart Contract Static Analysis Tools using Bug Injection". Two main things are covered in the artifact:

  1. How to use the introduced tool, SolidiFI, for injecting bugs and evaluating smart contract static analysis tools.
  2. How to reproduce the evaluation experiments presented in the paper.

DOI: 10.1145/3395363.3397385


A programming model for semi-implicit parallelization of static analyses

Authors: Helm, Dominik and Kübler, Florian and Reif, Michael and Eichberg, Michael and Mezini, Mira
Keywords: static analysis, parallelization, concurrency

Abstract

Parallelization of static analyses is necessary to scale to real-world programs, but it is a complex and difficult task and, therefore, often only done manually for selected high-profile analyses. In this paper, we propose a programming model for semi-implicit parallelization of static analyses which is inspired by reactive programming. Reusing the domain-expert knowledge on how to parallelize analyses encoded in the programming framework, developers do not need to think about parallelization and concurrency issues on their own. The programming model supports stateful computations, only requires monotonic computations over lattices, and is independent of specific analyses. Our evaluation shows the applicability of the programming model to different analyses and the importance of user-selected scheduling strategies. We implemented an IFDS solver that was able to outperform a state-of-the-art, specialized parallel IFDS solver both in absolute performance and scalability.

DOI: 10.1145/3395363.3397367


Recovering fitness gradients for interprocedural Boolean flags in search-based testing

Authors: Lin, Yun and Sun, Jun and Fraser, Gordon and Xiu, Ziheng and Liu, Ting and Dong, Jin Song
Keywords: testing, testability, search-based, program analysis

Abstract

In Search-based Software Testing (SBST), test generation is guided by fitness functions that estimate how close a test case is to reach an uncovered test goal (e.g., branch). A popular fitness function estimates how close conditional statements are to evaluating to true or false, i.e., the branch distance. However, when conditions read Boolean variables (e.g., if(x && y)), the branch distance provides no gradient for the search, since a Boolean can either be true or false. This flag problem can be addressed by transforming individual procedures such that Boolean flags are replaced with numeric comparisons that provide better guidance for the search. Unfortunately, defining a semantics-preserving transformation that is applicable in an interprocedural case, where Boolean flags are passed around as parameters and return values, is a daunting task. Thus, it is not yet supported by modern test generators. This work is based on the insight that fitness gradients can be recovered by using runtime information: Given an uncovered interprocedural flag branch, our approach (1) calculates context-sensitive branch distance for all control flows potentially returning the required flag in the called method, and (2) recursively aggregates these distances into a continuous value. We implemented our approach on top of the EvoSuite framework for Java, and empirically compared it with state-of-the-art testability transformations on non-trivial methods suffering from interprocedural flag problems, sampled from open source Java projects. Our experiment demonstrates that our approach achieves higher coverage on the subject methods with statistical significance and acceptable runtime overheads.
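
A compact sketch of the recovered gradient, assuming the standard branch-distance scheme from SBST and simplifying the paper's recursive aggregation over the callee's true-returning control flows to a minimum plus normalization:

```python
K = 1.0  # offset so an unsatisfied branch never has distance 0

def branch_distance(op, lhs, rhs):
    """Classic numeric branch distance for making a comparison true."""
    if op == "==":
        return 0.0 if lhs == rhs else abs(lhs - rhs) + K
    if op == "<":
        return 0.0 if lhs < rhs else (lhs - rhs) + K
    raise ValueError(op)

def flag_distance(callee_distances):
    """For `if (flag)` where `flag` comes from a called method:
    aggregate the context-sensitive distances of all control flows
    that could return true, normalized into [0, 1)."""
    d = min(callee_distances)
    return d / (d + 1.0)

# Inside the callee, two comparisons guard `return true`:
d1 = branch_distance("==", 5, 8)    # 4.0: operands are 3 apart (+ K)
d2 = branch_distance("<", 12, 9)    # 4.0: overshoots the bound by 3 (+ K)
print(flag_distance([d1, d2]))      # 0.8: a usable gradient, not a flat flag
```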

DOI: 10.1145/3395363.3397358


Scalable build service system with smart scheduling service

Authors: Wang, Kaiyuan and Tener, Greg and Gullapalli, Vijay and Huang, Xin and Gad, Ahmed and Rall, Daniel
Keywords: build system design, build scheduling service, Build service system

Abstract

Build automation is critical for developers to check if their code compiles, passes all tests and is safe to deploy to the server. Many companies adopt Continuous Integration (CI) services to make sure that the code changes from multiple developers can be safely merged at the head of the project. Internally, CI triggers builds to make sure that the new code change compiles and passes the tests. For any large company which has a monolithic code repository and thousands of developers, it is hard to make sure that all code changes are safe to submit in a timely manner. The reason is that each code change may involve multiple builds, and the company needs to run millions of builds every day to guarantee developers’ productivity. Google is one of those large companies that need a scalable build service to support developers’ work. More than 100,000 code changes are submitted to our repository on average each day, including changes from either human users or automated tools. More than 15 million builds are executed on average each day. In this paper, we first describe an overview of our scalable build service architecture. Then, we discuss more details about how we make build scheduling decisions. Finally, we discuss some experience in the scalability of the build service system and the performance of the build scheduling service.

DOI: 10.1145/3395363.3397371


Escaping dependency hell: finding build dependency errors with the unified dependency graph

Authors: Fan, Gang and Wang, Chengpeng and Wu, Rongxin and Xiao, Xiao and Shi, Qingkai and Zhang, Charles
Keywords: dependency verification, build tools, build maintenance

Abstract

Modern software projects rely on build systems and build scripts to assemble executable artifacts correctly and efficiently. However, developing build scripts is error-prone. Dependency-related errors in build scripts, mainly including missing dependencies and redundant dependencies, are common in various kinds of software projects. These errors lead to build failures, incorrect build results, or poor performance in incremental or parallel builds. To detect such errors, various techniques have been proposed, but they suffer from low efficiency and high false positive rates due to deficiencies in the underlying dependency graphs. In this work, we design a new dependency graph, the unified dependency graph (UDG), which leverages both static and dynamic information to uniformly encode the declared and actual dependencies between build targets and files. The construction of the UDG facilitates the efficient and precise detection of dependency errors via simple graph traversals. We implement the proposed approach as a tool, VeriBuild, and evaluate it on forty-two well-maintained open-source projects. The experimental results show that, without losing precision, VeriBuild incurs 58.2% less overhead than the state-of-the-art approach. By the time of writing, 398 detected dependency issues have been confirmed by the developers.
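
Once declared and actual dependencies live in one graph, the two error classes reduce to set differences over its edges, as in this minimal sketch (the edge representation is an assumption for illustration):

```python
def dependency_errors(declared, actual):
    """Compare declared edges (from build scripts) with actual edges
    (from observed file accesses); both are sets of (target, file)."""
    missing = actual - declared    # used but undeclared -> flaky builds
    redundant = declared - actual  # declared but unused -> slow builds
    return missing, redundant

declared = {("app", "util.h"), ("app", "legacy.h")}
actual   = {("app", "util.h"), ("app", "config.h")}
print(dependency_errors(declared, actual))
# ({('app', 'config.h')}, {('app', 'legacy.h')})
```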

DOI: 10.1145/3395363.3397388


How far we have come: testing decompilation correctness of C decompilers

Authors: Liu, Zhibo and Wang, Shuai
Keywords: Software Testing, Reverse Engineering, Decompiler

Abstract

A C decompiler converts an executable (the output from a C compiler) into source code. The recovered C source code, once recompiled, will produce an executable with the same functionality as the original executable. With over twenty years of development, C decompilers have been widely used in production to support reverse engineering applications, including legacy software migration, security retrofitting, software comprehension, and to act as the first step in launching adversarial software exploitations. As the paramount component and the trust base in numerous cybersecurity tasks, C decompilers have enabled the analysis of malware, ransomware, and promoted cybersecurity professionals’ understanding of vulnerabilities in real-world systems. In contrast to this flourishing market, our observation is that in academia, outputs of C decompilers (i.e., recovered C source code) are still not extensively used. Instead, the intermediate representations are often more desired for usage when developing applications such as binary security retrofitting. We acknowledge that such conservative approaches in academia are a result of widespread and pessimistic views on the decompilation correctness. However, in conventional software engineering and security research, how much of a problem is, for instance, reusing a piece of simple legacy code by taking the output of modern C decompilers? In this work, we test decompilation correctness to present an up-to-date understanding regarding modern C decompilers. We detected a total of 1,423 inputs that can trigger decompilation errors from four popular decompilers, and with extensive manual effort, we identified 13 bugs in two open-source decompilers. Our findings show that the overly pessimistic view of decompilation correctness leads researchers to underestimate the potential of modern decompilers; the state-of-the-art decompilers certainly care about the functional correctness, and they are making promising progress. However, some tasks that have been studied for years in academia, such as type inference and optimization, still impede C decompilers from generating quality outputs more than is reflected in the literature. These issues rarely receive enough attention and can lead to great confusion that misleads users.

DOI: 10.1145/3395363.3397370


FPDiff: Discovering Discrepancies in Numerical Libraries

Authors: Vanover, Jackson and Deng, Xuan and Rubio-González, Cindy
Keywords: correctness, differential testing, floating point, numerical libraries, numerical methods, software testing

Abstract

FPDiff is a tool for automated, end-to-end differential testing that, given only library source code as input, extracts numerical function signatures, synthesizes drivers, creates equivalence classes of functions that are synonymous, and executes differential tests over these classes to detect meaningful numerical discrepancies between implementations. FPDiff's current scope covers special functions across numerical libraries written in different programming languages. This artifact in particular includes the following libraries: the C library GSL (The GNU Scientific Library, version 2.6), the Python libraries SciPy (version 1.3.1) and mpmath (version 1.1.0), and the JavaScript library jmat (commit 21d15fc3eb5a924beca612e337f5cb00605c03f3).
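
A scaled-down illustration of the differential test FPDiff runs over an equivalence class of synonymous functions, here the gamma function from Python's math module against mpmath (one of the artifact's libraries); the inputs and tolerance are assumptions.

```python
import math
import mpmath

def rel_error(a, b):
    return abs(a - b) / max(abs(b), 1e-300)

def diff_test(inputs, tol=1e-13):
    """Flag inputs where two synonymous implementations disagree
    beyond a tolerance."""
    discrepancies = []
    for x in inputs:
        ours = math.gamma(x)
        ref = float(mpmath.gamma(x))   # higher-precision reference
        if rel_error(ours, ref) > tol:
            discrepancies.append((x, ours, ref))
    return discrepancies

print(diff_test([0.5, 1.5, 10.0, 170.0]))  # [] if both agree everywhere
```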

DOI: 10.1145/3395363.3397380


Testing high performance numerical simulation programs: experience, lessons learned, and open issues

作者: He, Xiao and Wang, Xingwei and Shi, Jia and Liu, Yi
关键词: Software testing, Numerical simulation, High performance computing, Experience

Abstract

High performance numerical simulation programs are widely used to simulate actual physical processes on high performance computers for the analysis of various physical and engineering problems. They are usually regarded as non-testable due to their high complexity. This paper reports our real experience and lessons learned from testing five simulation programs that will be used to design and analyze nuclear power plants. We applied five testing approaches and detected 33 bugs. We found that property-based testing and metamorphic testing are two effective methods. Nevertheless, we suffered from the lack of domain knowledge, the high test costs, the shortage of test cases, severe oracle issues, and inadequate automation support. Consequently, the five programs are not exhaustively tested from the perspective of software testing, and many existing software testing techniques and tools are not fully applicable due to scalability and portability issues. We need more collaboration and communication with other communities to promote the research and application of software testing techniques.
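
To make the metamorphic-testing idea concrete, here is a toy sketch (our own illustration, not one of the five studied programs): for a linear heat-diffusion step, scaling the initial temperatures by k must scale the simulated output by k, which gives a partial oracle where no exact one exists:

```python
# Hedged sketch of metamorphic testing for a simulator without an exact
# oracle. The simulator and relation are toy stand-ins.
def diffuse(temps, alpha=0.1, steps=100):
    """Toy 1-D explicit heat-diffusion simulation (illustrative only)."""
    t = list(temps)
    for _ in range(steps):
        t = [
            t[i] + alpha * (t[i - 1] - 2 * t[i] + t[i + 1])
            if 0 < i < len(t) - 1 else t[i]      # fixed boundary cells
            for i in range(len(t))
        ]
    return t

def check_scaling_relation(sim, source, k=3.0, tol=1e-9):
    # Metamorphic relation: sim(k * x) == k * sim(x) for a linear model.
    scaled_out = sim([k * x for x in source])
    out_scaled = [k * y for y in sim(source)]
    return all(abs(a - b) <= tol * max(1.0, abs(b))
               for a, b in zip(scaled_out, out_scaled))

assert check_scaling_relation(diffuse, [0.0, 0.0, 100.0, 0.0, 0.0])
```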

DOI: 10.1145/3395363.3397382


Replication Package for Article: Functional Code Clone Detection with Syntax and Semantics Fusion Learning

作者: Fang, Chunrong and Liu, Zixi and Shi, Yangyang and Huang, Jeff and Shi, Qingkai
关键词: code clone detection, code representation, functional clone detection

Abstract

FCDetector is a functional code clone detection tool based on syntax and semantics fusion learning.

DOI: 10.1145/3395363.3397362


Learning to detect table clones in spreadsheets

作者: Zhang, Yakun and Dou, Wensheng and Zhu, Jiaxin and Xu, Liang and Zhou, Zhiyong and Wei, Jun and Ye, Dan and Yang, Bo
关键词: table clone, structure, format, Spreadsheet

Abstract

To speed up spreadsheet development, end users can create a spreadsheet table by copying and modifying an existing one. The two tables then share similar computational semantics and form a table clone. End users may further modify the tables in a table clone, e.g., by adding new rows or deleting columns, thus introducing structure changes into the table clone. Our empirical study on real-world spreadsheets shows that about 58.5% of table clones involve structure changes. However, existing table clone detection approaches for spreadsheets can only detect table clones with identical structures, so many table clones with structure changes go undetected. We observe that, although the tables in a table clone may be modified, they usually retain similar structures and formats, e.g., headers, formulas, and background colors. Based on this observation, we propose LTC (Learning to detect Table Clones) to automatically detect table clones with or without structure changes. LTC uses the structure and format information of labeled table clones and non-clones to train a binary classifier. LTC first identifies tables in spreadsheets, and then uses the trained classifier to judge whether each pair of tables forms a table clone. Our experiments on real-world spreadsheets from the EUSES and Enron corpora show that LTC achieves a precision of 97.8% and a recall of 92.1% in table clone detection, significantly outperforming the state-of-the-art technique (a precision of 37.5% and a recall of 11.1%).
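
A minimal sketch of the classification step, under assumed features (header, formula, and color overlap are illustrative choices, not necessarily LTC's exact feature set):

```python
# Sketch: turn a candidate table pair into a vector of structure/format
# similarities and train a binary clone/non-clone classifier.
from sklearn.ensemble import RandomForestClassifier

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def pair_features(t1, t2):
    return [
        jaccard(t1["headers"], t2["headers"]),
        jaccard(t1["formulas"], t2["formulas"]),
        jaccard(t1["colors"], t2["colors"]),
        abs(len(t1["headers"]) - len(t2["headers"])),  # structure-change size
    ]

# Toy labeled pairs: 1 = table clone, 0 = non-clone.
pairs = [
    ({"headers": ["Year", "Total"], "formulas": ["SUM"], "colors": ["grey"]},
     {"headers": ["Year", "Total", "Avg"], "formulas": ["SUM"],
      "colors": ["grey"]}, 1),
    ({"headers": ["Name"], "formulas": [], "colors": ["white"]},
     {"headers": ["Qty", "Price"], "formulas": ["PRODUCT"],
      "colors": ["blue"]}, 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
clf = RandomForestClassifier(n_estimators=10).fit(X, y)
```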

DOI: 10.1145/3395363.3397384


ObjSim: lightweight automatic patch prioritization via object similarity

作者: Ghanbari, Ali
关键词: Test Case, Patch Prioritization, Object Similarity, Automatic Program Repair

Abstract

In the context of test-case-based automatic program repair (APR), patches that pass all the test cases but fail to fix the bug are called overfitted patches. Currently, patches generated by APR tools are inspected manually by users to find and adopt genuine fixes. Since this laborious activity hinders the widespread adoption of APR, automatic identification of overfitted patches has lately been a topic of active research. This paper presents the engineering details of ObjSim: a fully automatic, lightweight, similarity-based patch prioritization tool for JVM-based languages. The tool works by comparing the system state at the exit point(s) of the patched method before and after patching, and prioritizing patches that result in a state more similar to that of the original, unpatched version on passing tests while less similar on failing ones. Our experiments with patches generated by the recent APR tool PraPR for fixable bugs from Defects4J v1.4.0 show that ObjSim prioritizes 16.67% more genuine fixes in the top-1 position. A demo video of the tool is located at https://bit.ly/2K8gnYV.
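
The prioritization rule can be sketched as follows (a hedged illustration with a toy state-similarity function; ObjSim's real metric over JVM object graphs is more elaborate):

```python
# Sketch of ObjSim-style patch prioritization. States are modeled as flat
# field -> value dicts for illustration only.
def state_similarity(s1, s2):
    """Fraction of matching field/value pairs (toy metric)."""
    keys = set(s1) | set(s2)
    return sum(s1.get(k) == s2.get(k) for k in keys) / len(keys) if keys else 1.0

def score_patch(orig_states, patched_states, passing, failing):
    def avg_sim(tests):
        if not tests:
            return 0.0
        return sum(state_similarity(orig_states[t], patched_states[t])
                   for t in tests) / len(tests)
    # High similarity on passing tests, low on failing ones: likely genuine.
    return avg_sim(passing) - avg_sim(failing)

def prioritize(patches, orig_states, states_per_patch, passing, failing):
    return sorted(patches, reverse=True,
                  key=lambda p: score_patch(orig_states, states_per_patch[p],
                                            passing, failing))
```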

DOI: 10.1145/3395363.3404362


Crowdsourced requirements generation for automatic testing via knowledge graph

作者: Guo, Chao and He, Tieke and Yuan, Wei and Guo, Yue and Hao, Rui
关键词: Knowledge Graph, Crowdsourced Requirements, Android GUI Testing

Abstract

Crowdsourced testing provides an effective way to deal with the problem of Android system fragmentation, as well as the diversity of application scenarios faced by Android testing. The generation of test requirements is a significant part of crowdsourced testing. However, manually generating crowdsourced testing requirements is tedious, as it requires the issuers to have domain knowledge of the Android application under test. To solve these problems, we have developed a tool named KARA, short for Knowledge Graph Aided Crowdsourced Requirements Generation for Android Testing. KARA first analyzes the results of automated testing on the Android application, from which operation sequences are obtained. Then, the knowledge graph of the target application is constructed in a pay-as-you-go manner. Finally, KARA utilizes the knowledge graph and the automated testing results to generate crowdsourced testing requirements enriched with domain knowledge. Experiments show that the test requirements generated by KARA are easy to understand, and that KARA can improve the quality of crowdsourced testing. The demo video can be found at https://youtu.be/kE-dOiekWWM.

DOI: 10.1145/3395363.3404363


TauJud: test augmentation of machine learning in judicial documents

作者: Guo, Zichen and Liu, Jiawei and He, Tieke and Li, Zhuoyang and Zhangzhu, Peitian
关键词: Test Augmentation, Machine Learning, Judicial Documents

Abstract

The boom in big data has made machine learning ubiquitous in the legal field. Since a larger amount of test data better reflects model performance, the test data needs to be expanded. To address the high cost of labeling data in natural language processing, practitioners have improved the performance of text classification tasks through simple data augmentation techniques. However, augmentation of judgment documents must remain interpretable and logically consistent, as we observed from the CAIL2018 test data of over 200,000 judicial documents. Therefore, we have designed a test augmentation tool called TauJud specifically for generating more effective test data with a uniform distribution over time and location for model evaluation, saving the time spent labeling data. The demo can be found at https://github.com/governormars/TauJud.
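
A minimal sketch of the kind of augmentation described, under the assumption that time and location mentions are rewritten with uniformly sampled values (the regexes and city pool are illustrative, not TauJud's implementation):

```python
# Sketch: rewrite year and city mentions in a judicial document with values
# sampled uniformly, so the augmented test set covers these attributes evenly
# instead of mirroring the corpus skew.
import random
import re

CITIES = ["北京", "上海", "广州", "成都", "武汉"]  # illustrative pool

def augment(doc, n=3, seed=0):
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        year = rng.randint(2000, 2018)           # uniform over time
        city = rng.choice(CITIES)                # uniform over location
        v = re.sub(r"\d{4}年", f"{year}年", doc)
        v = re.sub("|".join(CITIES), city, v)
        variants.append(v)
    return variants

# "In 2015, the defendant committed fraud in Shanghai."
for v in augment("2015年，被告人在上海实施了诈骗行为。"):
    print(v)
```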

DOI: 10.1145/3395363.3404364


EShield: protect smart contracts against reverse engineering

作者: Yan, Wentian and Gao, Jianbo and Wu, Zhenhao and Li, Yue and Guan, Zhi and Li, Qingshan and Chen, Zhong
关键词: Smart Contract, Reverse Engineering, Program Analysis, Ethereum, Blockchain

Abstract

Smart contracts are the back-end programs of blockchain-based applications, and their execution results are deterministic and publicly visible. Developers may be unwilling to release the source code of some smart contracts, whether to protect randomness generation or for other security reasons; however, attackers can still use reverse engineering tools to decompile and analyze the code. In this paper, we propose EShield, an automated security enhancement tool for protecting smart contracts against reverse engineering. EShield replaces the original instructions that operate on jump addresses with anti-patterns to interfere with control flow recovery from bytecode. We have implemented four methods in EShield and conducted an experiment on over 20k smart contracts. The evaluation results show that all the protected smart contracts are resistant to three different reverse engineering tools at little extra gas cost.

DOI: 10.1145/3395363.3404365


Echidna: effective, usable, and fast fuzzing for smart contracts

作者: Grieco, Gustavo and Song, Will and Cygan, Artur and Feist, Josselin and Groce, Alex
关键词: test generation, smart contracts, fuzzing

Abstract

Ethereum smart contracts—autonomous programs that run on a blockchain—often control transactions of financial and intellectual property. Because of the critical role they play, smart contracts need complete, comprehensive, and effective test generation. This paper introduces an open-source smart contract fuzzer called Echidna that makes it easy to automatically generate tests to detect violations in assertions and custom properties. Echidna is easy to install and does not require a complex configuration or deployment of contracts to a local blockchain. It offers responsive feedback, captures many property violations, and its default settings are calibrated based on experimental data. To date, Echidna has been used in more than 10 large paid security audits, and feedback from those audits has driven the features and user experience of Echidna, both in terms of practical usability (e.g., smart contract frameworks like Truffle and Embark) and test generation strategies. Echidna aims to be good at finding real bugs in smart contracts, with minimal user effort and maximal speed.
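
The property-based fuzzing loop Echidna embodies can be sketched generically (in Python for illustration; Echidna itself executes Solidity properties on the EVM, and this toy model is not its API):

```python
# Generic sketch of property-based contract fuzzing: a property is a Boolean
# predicate over contract state that must hold after any transaction
# sequence; the fuzzer searches for a violating sequence.
import random

class TokenModel:
    """Toy stand-in for a contract with a deliberately buggy transfer."""
    def __init__(self):
        self.balances = {"alice": 100, "bob": 0}

    def transfer(self, src, dst, amount):
        if self.balances[src] >= amount:
            self.balances[src] -= amount
            self.balances[dst] += amount + 1   # bug: mints one extra token

def prop_total_supply_constant(c):
    return sum(c.balances.values()) == 100

def fuzz(prop, tries=1000, seq_len=5, seed=1):
    rng = random.Random(seed)
    for _ in range(tries):
        c, seq = TokenModel(), []
        for _ in range(seq_len):
            call = ("transfer", rng.choice(["alice", "bob"]),
                    rng.choice(["alice", "bob"]), rng.randint(0, 50))
            c.transfer(*call[1:])
            seq.append(call)
            if not prop(c):
                return seq        # counterexample sequence (unshrunk)
    return None

print(fuzz(prop_total_supply_constant))
```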

DOI: 10.1145/3395363.3404366


ProFL: a fault localization framework for Prolog

作者: Thompson, George and Sullivan, Allison K.
关键词: Prolog, Fault localization, Declarative programming

Abstract

Prolog is a declarative, first-order logic programming language that has been used in a variety of domains to implement heavily rule-based systems. However, it is challenging to write a Prolog program correctly. Fortunately, the SWI-Prolog environment supports a unit testing framework, plunit, which enables developers to systematically check for correctness. However, knowing a program is faulty is just the first step: to fix the program, the developer needs to determine which part of it is faulty. ProFL is a fault localization tool that adapts imperative-based fault localization techniques to Prolog's declarative environment. ProFL takes as input a faulty Prolog program and a plunit test suite; it then performs fault localization and returns a list of suspicious program clauses to the user. Our toolset encompasses two different techniques: ProFLs, a spectrum-based technique, and ProFLm, a mutation-based technique. This paper describes our Python implementation of ProFL, a command-line tool released as an open-source project on GitHub (https://github.com/geoorge1d127/ProFL). Our experimental results show ProFL is accurate at localizing faults in our benchmark programs.
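
For the spectrum-based variant ProFLs, the core computation can be sketched with a standard suspiciousness formula (Ochiai here; the formula ProFL actually uses may differ):

```python
# Spectrum-based fault localization over Prolog clauses: given per-test
# clause coverage and pass/fail outcomes, rank clauses by suspiciousness.
# ef/ep = number of failing/passing tests that executed the clause.
import math

def ochiai(ef, ep, total_failing):
    denom = math.sqrt(total_failing * (ef + ep))
    return ef / denom if denom else 0.0

def rank_clauses(coverage, outcomes):
    """coverage: test -> set of clause ids; outcomes: test -> True if passed."""
    total_failing = sum(not ok for ok in outcomes.values())
    scores = {}
    for c in set().union(*coverage.values()):
        ef = sum(c in coverage[t] and not outcomes[t] for t in coverage)
        ep = sum(c in coverage[t] and outcomes[t] for t in coverage)
        scores[c] = ochiai(ef, ep, total_failing)
    return sorted(scores.items(), key=lambda kv: -kv[1])

coverage = {"t1": {"sort/2#1", "split/4#1"}, "t2": {"sort/2#1"},
            "t3": {"sort/2#1", "split/4#2"}}
outcomes = {"t1": True, "t2": True, "t3": False}
print(rank_clauses(coverage, outcomes))  # split/4#2 ranks most suspicious
```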

DOI: 10.1145/3395363.3404367


FineLock: automatically refactoring coarse-grained locks into fine-grained locks

作者: Zhang, Yang and Shao, Shuai and Zhai, Juan and Ma, Shiqing
关键词: Static analysis, Refactoring, Read-write lock, Pushdown automaton, Fine-grained lock

Abstract

A lock is a frequently used synchronization mechanism for enforcing exclusive access to a shared resource. However, lock-based concurrent programs are susceptible to lock contention, which leads to low performance and poor scalability, and inappropriate lock granularity makes contention even worse. Compared to coarse-grained locks, fine-grained locks can mitigate lock contention but are more difficult to use. Converting coarse-grained locks into fine-grained locks manually is not only error-prone and tedious but also requires considerable expertise. In this paper, we propose to leverage program analysis techniques and pushdown automata to automatically convert coarse-grained locks into fine-grained locks in order to reduce lock contention. We developed a prototype, FineLock, and evaluated it on 5 projects. The evaluation results demonstrate that FineLock can refactor 1,546 locks in an average of 27.6 seconds, including converting 129 coarse-grained locks into fine-grained locks and 1,417 coarse-grained locks into read/write locks. By automatically providing potential refactoring recommendations, our tool saves developers substantial effort.
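
The before/after shape of the refactoring can be illustrated as follows (shown in Python for brevity; FineLock itself targets Java locks and can also introduce read/write locks, which Python's standard library lacks):

```python
# Two independent counters guarded by one coarse lock contend needlessly;
# per-field locks remove that contention.
import threading

class CoarseStats:
    def __init__(self):
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def record_hit(self):
        with self._lock:          # also blocks unrelated miss updates
            self.hits += 1

    def record_miss(self):
        with self._lock:
            self.misses += 1

class FineStats:
    def __init__(self):
        self._hits_lock = threading.Lock()
        self._misses_lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def record_hit(self):
        with self._hits_lock:     # independent fields, independent locks
            self.hits += 1

    def record_miss(self):
        with self._misses_lock:
            self.misses += 1
```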

DOI: 10.1145/3395363.3404368


CPSDebug: a tool for explanation of failures in cyber-physical systems

作者: Bartocci, Ezio and Manjunath, Niveditha and Mariani, Leonardo and Mateis, Cristinel and Ničković, Dejan
关键词: Testing, Specification Mining, Model-based Development, Failure Explanation, Debugging, Cyber-Physical Systems

Abstract

Debugging cyber-physical system models is often challenging, as it requires identifying a potentially long, complex, and heterogeneous combination of events that resulted in a violation of the expected behavior of the system. In this paper we present CPSDebug, a tool for supporting designers in debugging failures in MATLAB Simulink/Stateflow models. CPSDebug implements a gray-box approach that combines testing, specification mining, and failure analysis to identify the causes of failures and explain their propagation in time and space. The evaluation of the tool, based on multiple usage scenarios and faults and on direct feedback from engineers, shows that CPSDebug can effectively aid engineers during debugging tasks.
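
The mine-then-check idea can be sketched with the simplest possible specification class, per-signal range invariants (an illustration only; CPSDebug's analysis of Simulink/Stateflow signals is far richer):

```python
# Sketch: learn range invariants per internal signal from passing runs, then
# report which signals first violate them in a failing run, and when.
def mine_ranges(passing_runs):
    """passing_runs: list of {signal: [values over time]} dicts."""
    ranges = {}
    for run in passing_runs:
        for sig, values in run.items():
            lo, hi = ranges.get(sig, (float("inf"), float("-inf")))
            ranges[sig] = (min(lo, *values), max(hi, *values))
    return ranges

def explain_failure(failing_run, ranges, slack=0.0):
    violations = []
    for sig, values in failing_run.items():
        lo, hi = ranges[sig]
        for t, v in enumerate(values):
            if not (lo - slack <= v <= hi + slack):
                violations.append((t, sig, v))
                break                      # first violation per signal
    return sorted(violations)              # earliest violations come first

ranges = mine_ranges([{"valve": [0, 1, 1], "pressure": [1.0, 1.2, 1.1]}])
print(explain_failure({"valve": [0, 1, 1], "pressure": [1.0, 2.9, 3.4]},
                      ranges))             # [(1, 'pressure', 2.9)]
```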

DOI: 10.1145/3395363.3404369


Test recommendation system based on slicing coverage filtering

作者: Qian, Ruixiang and Zhao, Yuan and Men, Duo and Feng, Yang and Shi, Qingkai and Huang, Yong and Chen, Zhenyu
关键词: Test recommendation, Test guide, Static analysis, Program Slice

Abstract

Software testing plays a crucial role in the software lifecycle. As a basic approach to software testing, unit testing is a necessary skill for software practitioners. Since testers must understand the inner code of the software under test (SUT) while writing a test case, they need to learn how to detect bugs within the SUT effectively. When novice programmers start learning to write unit tests, they generally watch video lessons or read unit tests written by others. These learning approaches are either time-consuming or too hard for a novice. To solve these problems, we developed TeSRS, a test recommendation system that effectively assists test novices in learning unit testing. Using program slicing techniques, TeSRS has collected an enormous number of test snippets from high-quality crowdsourced test scripts. Based on these snippets, TeSRS offers novices an easier way to learn unit testing. In summary, TeSRS helps test novices (1) obtain high-level design ideas for unit test cases and (2) improve the capabilities (e.g., branch coverage rate and mutation coverage rate) of their test scripts. TeSRS has built a scalable corpus of over 8,000 test snippets from more than 25 test problems. Its stable performance shows effectiveness in unit test learning. A demo video can be found at https://youtu.be/xvrLdvU8zFA

DOI: 10.1145/3395363.3404370


Automated mobile apps testing from visual perspective

作者: Xue, Feng
关键词: Test automation, Software testing, Mobile applications, Computer vision

Abstract

Current implementations of automated mobile app testing generally rely on internal program information, such as reading code or GUI layout files and capturing event streams. This paper proposes an approach to automated mobile app testing from a purely visual perspective. It uses computer vision technology to enable the computer to infer internal functionality from the external GUI of a mobile app, as humans do, and to generate test strategies for execution, which improves the interactivity, flexibility, and authenticity of testing. We believe that this vision-based testing approach will help alleviate the tension between the huge testing requirements of today's mobile apps and the relative shortage of testers.

DOI: 10.1145/3395363.3402644


Program-aware fuzzing for MQTT applications

作者: Araujo Rodriguez, Luis Gustavo and Macêdo Batista, Daniel
关键词: Testing, Security, MQTT, Internet of Things, Fuzzing

Abstract

Over the last few years, MQTT applications have been widely exposed to vulnerabilities because of their weak protocol implementations. For our preliminary research, we conducted background studies to: (1) determine the main cause of vulnerabilities in MQTT applications; and (2) analyze existing MQTT-based testing frameworks. Our preliminary results confirm that MQTT is most susceptible to malformed packets, and its existing testing frameworks are based on blackbox fuzzing, meaning vulnerabilities are difficult and time-consuming to find. Thus, the aim of my research is to study and develop effective fuzzing strategies for the MQTT protocol, thereby contributing to the development of more robust MQTT applications in IoT and Smart Cities.
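
As a hedged illustration of the malformed-packet testing this research direction calls for, the sketch below builds a well-formed MQTT 3.1.1 CONNECT packet and corrupts its remaining-length field, one classic source of parser bugs; the broker address is a placeholder:

```python
# Sketch: construct an MQTT 3.1.1 CONNECT packet, then mutate its
# remaining-length byte so it disagrees with the actual payload.
# Assumes the body fits in a single-byte remaining length (< 128).
import random
import socket

def connect_packet(client_id=b"fuzz"):
    var_header = (
        b"\x00\x04MQTT"       # protocol name
        + b"\x04"             # protocol level 4 (MQTT 3.1.1)
        + b"\x02"             # connect flags: clean session
        + b"\x00\x3c"         # keep-alive: 60 s
    )
    payload = len(client_id).to_bytes(2, "big") + client_id
    body = var_header + payload
    return bytes([0x10, len(body)]) + body   # fixed header: CONNECT + length

def mutate_length(pkt, rng):
    # Overwrite the remaining-length byte with a random wrong value.
    return pkt[:1] + bytes([rng.randrange(256)]) + pkt[2:]

rng = random.Random(0)
for _ in range(100):
    try:
        # "broker.local" is a placeholder for the broker under test.
        with socket.create_connection(("broker.local", 1883), timeout=2) as s:
            s.sendall(mutate_length(connect_packet(), rng))
            s.recv(4)          # observe whether the broker answers or drops us
    except OSError as e:
        print("connection error:", e)
        break
```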

DOI: 10.1145/3395363.3402645


Automatic support for the identification of infeasible testing requirements

作者: Choma Neto, João
关键词: Structural Testing, Software Testing, Search Based Software Testing, Infeasible Path Problem

Abstract

Software testing is imperative for improving software quality. However, finding a set of test cases that satisfies a given test criterion is not trivial, because the overall input domain is very large and different test sets with different effectiveness can be derived. In the context of structural testing, infeasibility (non-executability) of testing requirements is present in most programs, increasing the cost and effort of the testing activity. When concurrent programs are tested, new challenges arise, mainly related to non-determinism: non-determinism can result in different possible test outputs for the same test input, which makes the infeasibility problem more complex and demands treatment. In this sense, our project intends to define an approach that supports the automatic identification of infeasible testing requirements. Hence, this proposal aims to identify the properties that cause infeasible testing requirements and to automate their application. Due to the complexity of the problem, we will apply search-based algorithms to automate the treatment of both concurrent and sequential programs.

DOI: 10.1145/3395363.3402646


