FSE 2021 | 逸翎清晗🌈

xurongchen/fse20: Artifacts for FSE2020 paper#633

Authors: Xu, Rongchen and He, Fei and Wang, Bow-Yaw
Keywords: Software Engineering

Abstract

artifacts track, fse20

DOI: 10.1145/3368089.3409752


java-ranger: v1.0.0

Authors: Sharma, Vaibhav and Hussein, Soha and Whalen, Michael W. and McCamant, Stephen and Visser, Willem
Keywords: Java, Software Engineering, Software Verification

Abstract

This is the version of Java Ranger that was used in the evaluation accepted to FSE 2020.

DOI: 10.1145/3368089.3409734


docable/docable v1.1

Authors: Mirhosseini, Samim and Parnin, Chris
Keywords: Software Engineering

Abstract

No description provided.

DOI: 10.1145/3368089.3409706


Programming and execution models for next generation code intelligence systems (keynote)

Author: Mezini, Mira
Keywords: Learning-based Code Intelligence, Programming and Execution Models, Static Program Analysis

Abstract

McCarthy defined Artificial Intelligence (AI) as “the science and engineering of making intelligent computer programs”. I define Code Intelligence (CI) by specializing his definition to intelligent computer programs that analyze other computer programs (PAPs for short), and I use this definition to structure the keynote in two parts. The first part argues for more research on the engineering foundations of PAPs. The second part looks ahead to new directions that combine the rule-based intelligence of current PAPs with learning capabilities. Together, novel engineering foundations and enhanced intelligence characterize what I think will be the next generation of PAPs.

DOI: 10.1145/3468264.3478688


The 4ps: product, process, people, and productivity: a data-driven approach to improve software engineering (keynote)

Author: Nagappan, Nachiappan
Keywords: Defect Prediction, Developer Productivity, Distributed Development, Effort Estimation, Empirical Software Engineering

Abstract

In this talk I will provide a broad overview of developer productivity and dive deep into specific analyses of how product, process, and people impact productivity. I will use examples from industry: effort estimation and defect prediction for product, distributed development for process, and the ramp-up of new employees for people. The talk will also cover interventions via tools and process changes, their impact, and future challenges. This talk is based on previously published work.

DOI: 10.1145/3468264.3478690


Interactive analysis of large code bases (invited talk)

Author: Holzmann, Gerard J.
Keywords: Interactive Analysis, Static Source Code Analysis

Abstract

Current static source code analyzers can be slow, hard to use correctly, and expensive. If not properly configured, they can also generate large amounts of output, even for well-written code. To fix this, we developed a new tool called Cobra. Cobra is fast enough to be used interactively, even on very large code bases. It is also designed to be easy to use, and it is free. The tool comes with a library of predefined queries to catch standard coding issues, including cyber-security-related risks from the CVE database. The query libraries can be used as provided, but you can also easily modify and extend those queries, or add your own. The Cobra tool is language neutral: you can use it to check C or C++ code, or Java, Ada, Python, or even English prose. I’ll show how the tool works, what makes it fast, and how you can write powerful queries in a couple of different ways.

DOI: 10.1145/3468264.3478691


Managers hate uncertainty: good and bad experiences with adaptive project management (invited talk)

Author: Schaminé
Keywords: Adaptive, Congestion, Leadership, Predictability, Predictive, Product Innovation, Productivity, Uncertainty

Abstract

There is intrinsic uncertainty in product development. Uncertainty is often confused with risk, where the unknowns can be identified in advance. Traditional predictive project management methods are based on risk management, and I will explain why these methods are often less effective in domains with higher uncertainty. Adaptive methods seem to be more effective, as they deal better with that uncertainty. Examples will be given where adaptive methods have led to higher predictability. Yet many managers still believe predictive methods lead to higher predictability. We formulate some hypotheses on why managers prefer predictive methods, based on our own experience and initial research. Further research is required to validate these hypotheses.

DOI: 10.1145/3468264.3478692


Industrial best practices for continuous integration (CI) and continuous delivery (CD) (invited talk)

Author: Micco, John
Keywords: Continuous Delivery, Continuous Integration, DevOps, Flaky Tests, Software Testing

Abstract

Industrial best practices for Continuous Integration (CI) and Continuous Delivery (CD) are constantly evolving, and the state of the art is advancing quickly across the industry. In this talk I will discuss the best practices that have been implemented by many top software companies, including Google, Netflix, and VMware, where I have worked. I will discuss how these systems are being optimized to reduce the human and machine resources required to qualify software releases and to reduce the risk of latent defects.

DOI: 10.1145/3468264.3478693


Huawei’s practices on trusted software engineering capability improvement (invited talk)

Author: Wang, Wilson
Keywords: Extra-Functional Properties, Software Engineering and Technologies, Trustworthiness, User Studies

Abstract

Human society is rapidly transforming into a digital society: new technologies drive digital and intelligent transformation in all industries. These technologies promise cost savings, efficiency gains, and new value. At the same time, we see growing challenges relating to cyber security, privacy protection, and functional safety. As a leading ICT product and service provider, Huawei has been committed to providing customers with high-quality and user-friendly products and services. Two years ago, Huawei initiated the Transformation Program for Software Engineering Capability Enhancement to improve company-wide software engineering capabilities, improve the trustworthiness of both processes and results, and provide trustworthy, quality products. This talk will share the practices, progress, and challenges of the transformation.

DOI: 10.1145/3468264.3478694


Hazard Trees for Human-on-the-Loop Interactions in sUAS Systems

Authors: Vierhauser, Michael and Islam, Md Nafee Al and Agrawal, Ankit and Cleland-Huang, Jane and Mason, James
Keywords: hazard analysis, Human-sUAS interaction, safety analysis, sUAS

Abstract

Traditional safety analysis for Cyber-Physical Systems in general, and for sUAS (small unmanned aerial systems) in particular, typically focuses on system-level hazards, with little focus on user-related or user-induced hazards that can cause critical system failures. To address this issue, we have constructed domain-level safety analysis assets for sUAS applications following a rigorous process, in which we explicitly and systematically identified Human Interaction Points (HiPs), Hazard Factors, and Mitigations from system hazards.

We have created eight different hazard trees, each covering a specific aspect of sUAS safety:

  • Collisions: Addresses hazards related to collisions between sUAS, other objects, and terrain.
  • Communication: Addresses hazards related to the loss of communication with sUAS during flight.
  • Hardware/Sensors: Addresses hazards related to sUAS hardware such as cameras used for object detection, GPS, parachutes, etc.
  • Mission Awareness: Addresses hazards related to a mission executed by an sUAS, its mission status, and decision-making during a mission.
  • Mission Planning: Addresses hazards related to mission planning before the mission is executed, such as planning and assigning flight routes and sUAS task allocation.
  • Preflight Configuration: Addresses hazards related to preflight configuration properties such as geofence settings or launch parameters.
  • Regulatory Compliance: Addresses hazards related to airspace, flight constraints, and regulations for operating sUAS in an airspace.
  • Weather: Addresses hazards related to weather conditions, temperature, wind, or reduced visibility due to adverse weather conditions.

DOI: 10.1145/3468264.3468534


UAV bugs dataset and taxonomy

Authors: Wang, Dinghua and Li, Shuqing and Xiao, Guanping and Liu, Yepang and Sui, Yulei
Keywords: Bug, Taxonomy, UAV

Abstract

In our paper, we conducted a large-scale empirical study to characterize UAV-specific bugs in two open-source UAV platforms, namely PX4 and Ardupilot. We identified 168 UAV-specific bugs from 569 collected real bugs on GitHub. By analyzing these bugs (including bug reports, patches, and project development history), we proposed a taxonomy of UAV-specific bugs and summarized five challenges for detecting and fixing bugs in UAV systems. We believe that this study can facilitate future research and the development of UAV systems. Both UAV developers and users can receive useful guidance from our study.

The link to our replication package is: https://doi.org/10.5281/zenodo.4898868

This data set contains 569 real-world bugs, 168 UAV-specific bugs, and their taxonomy. Our replication package consists of two main folders: bugSet and bugTaxonomy.

DOI: 10.1145/3468264.3468559


Code integrity attestation for PLCs using black box neural network predictions

Authors: Chen, Yuqi and Poskitt, Christopher M. and Sun, Jun
Keywords: Cyber-physical systems, adversarial attacks, attestation, code integrity checking, neural networks, programmable logic controllers

Abstract

Cyber-physical systems (CPSs) are widespread in critical domains, and significant damage can be caused if an attacker is able to modify the code of their programmable logic controllers (PLCs). Unfortunately, traditional techniques for attesting code integrity (i.e. verifying that it has not been modified) rely on firmware access or roots-of-trust, neither of which proprietary or legacy PLCs are likely to provide. In this paper, we propose a practical code integrity checking solution based on privacy-preserving black box models that instead attest the input/output behaviour of PLC programs. Using faithful offline copies of the PLC programs, we identify their most important inputs through an information flow analysis, execute them on multiple combinations to collect data, then train neural networks able to predict PLC outputs (i.e. actuator commands) from their inputs. By exploiting the black box nature of the model, our solution maintains the privacy of the original PLC code and does not assume that attackers are unaware of its presence. The trust instead comes from the fact that it is extremely hard to attack the PLC code and neural networks at the same time and with consistent outcomes. We evaluated our approach on a modern six-stage water treatment plant testbed, finding that it could predict actuator states from PLC inputs with near-100% accuracy, and thus could detect all 120 effective code mutations that we subjected the PLCs to. Finally, we found that it is not practically possible to simultaneously modify the PLC code and apply discreet adversarial noise to our attesters in a way that leads to consistent (mis-)predictions.
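As a toy illustration of this input/output attestation idea, the sketch below uses hypothetical data and a lookup table as a stand-in for the paper’s neural networks: a mismatch between the observed actuator command and the model’s prediction suggests the PLC code was modified.

```python
def train_attester(traces):
    """Learn a map from sensor inputs to the actuator command the
    genuine PLC program produced for them (stand-in for a neural net)."""
    return {inputs: command for inputs, command in traces}

def attest(model, inputs, observed_command):
    """Attestation passes only when the observed actuator command
    matches the model's prediction for these inputs."""
    return model.get(inputs) == observed_command

# Hypothetical genuine behavior: open the valve when the tank level is low.
traces = [(("low",), "open"), (("high",), "closed")]
model = train_attester(traces)
print(attest(model, ("low",), "open"))    # behavior as expected
print(attest(model, ("low",), "closed"))  # possible code modification
```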

DOI: 10.1145/3468264.3468617


Artifact for “PHYSFRAME: Type Checking Physical Frames of Reference for Robotic Systems”

Authors: Kate, Sayali and Chinn, Michael and Choi, Hongjun and Zhang, Xiangyu and Elbaum, Sebastian
Keywords: Frame Consistency, Physical Frame of Reference, ROS, Static Analysis, Type Checking, z-score Mining

Abstract

Summary

PHYSFRAME is a static analysis tool for detecting reference frame inconsistencies and violations of common practices (i.e., implicit frame conventions) in C/C++ projects that build against the Robot Operating System. It requires nothing from the developers except running the tool on their project. The tool automatically models the project, and checks for problems.
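A minimal sketch of the kind of inconsistency such a checker looks for, under a deliberately simplified, hypothetical model in which a transform is just a (source frame, target frame) pair: composing two transforms whose frames do not line up is a type error.

```python
def compose(t1, t2):
    """Compose two transforms (src -> dst); a mismatch between t1's
    target frame and t2's source frame is a frame inconsistency."""
    src1, dst1 = t1
    src2, dst2 = t2
    if dst1 != src2:
        raise TypeError(f"frame mismatch: {dst1} vs {src2}")
    return (src1, dst2)

# Consistent: map -> odom composed with odom -> base_link.
print(compose(("map", "odom"), ("odom", "base_link")))  # ('map', 'base_link')
```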

Contents

This repository contains the following files and folders:

  • README.md : This file.
  • LICENSE.txt : BSD 2-Clause license.
  • STATUS.txt : Describes the ACM artifact badges sought for this artifact.
  • REQUIREMENTS.md : Describes the hardware and software requirements.
  • INSTALL.txt : Describes how to install PHYSFRAME using Docker and a minimal working example.
  • HOWTO.txt: Describes how to use PHYSFRAME with the provided data.
  • HOWTO-WITH-SCRIPT.txt : Describes how to run scripts to evaluate PHYSFRAME with the provided data.
  • untar_data.sh : Script to untar files in data/.
  • evaluate_data.sh : Script to run PHYSFRAME on the provided data.
  • evaluate_z-score.sh : Script to run PHYSFRAME on the provided data with z-score = 2, 5 or 10.
  • Dockerfile : a Docker install file for PHYSFRAME.
  • requirements.txt : List of python dependencies required by PHYSFRAME. Referenced by the Dockerfile.
  • src/ : The python source code for PHYSFRAME, files containing implicit frame conventions.
  • data/ : Dataset of C/C++ projects (downloaded from public GitHub repositories) used to evaluate PHYSFRAME.
  • USER-GUIDE.txt : Helpful notes for user.

DOI: 10.1145/3468264.3468608


Automating Serverless Deployments for DevOps Organizations: Root Artifact

Authors: Sokolowski, Daniel and Weisenburger, Pascal and Salvaneschi, Guido
Keywords: DevOps, Infrastructure as Code, Software Dependencies, Software Engineering

Abstract

This artifact bundles all material supplementing:

[1] Daniel Sokolowski, Pascal Weisenburger, and Guido Salvaneschi. 2021. Automating Serverless Deployments for DevOps Organizations. In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’21), August 23–28, 2021, Athens, Greece. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3468264.3468575

  1. Dependencies in DevOps Survey 2021

https://doi.org/10.5281/zenodo.4873909 provides the dataset, a detailed report, and all analysis and content creation scripts for the contained technical report and all survey-related content in [1]. It supplements Section 2 in [1].

  2. µs Infrastructure as Code

https://doi.org/10.5281/zenodo.4902323 is the implementation of µs. It is reusable for IaC deployments and sets the base for future research on reactive IaC deployments. We suggest looking at the contained webpage example project and running it using the provided mjuz/mjuz Docker image. For this, follow the instructions in the README in the webpage’s subdirectory, showcasing an example setup using µs and plain Pulumi with both a centralized and a decentralized deployment. The “decentralized-mjuz” version uses the automated deployment coordination proposed in [1]. The Docker image is available on Docker Hub, but for long-term archiving, it is also included in this root artifact in mjuz-mjuz-docker-image.tar.zip. You can load and register it locally with the tags mjuz/mjuz:latest and mjuz/mjuz:1.0.0 by unzipping the file and running docker load -i mjuz-mjuz-docker-image.tar.

The µs implementation uses – and its Docker image builds upon – the Pulumi for µs CLI: https://doi.org/10.5281/zenodo.4902319. Its demonstration is already covered by the µs artifact in the previous paragraph; still, we include it here for completeness. Its Docker image is available on Docker Hub, too, and included in this artifact in mjuz-pulumi-docker-image.tar.zip. You can load and register it locally with the tags mjuz/pulumi:latest and mjuz/pulumi:1.0.0 by unzipping the file and running docker load -i mjuz-pulumi-docker-image.tar.

  3. µs Performance Evaluation

http://doi.org/10.5281/zenodo.4902330 contains the materials used for the performance evaluation of µs in Subsection 8.2 in [1]. It includes the deployment definitions, the measurement scripts, the measured data, and the scripts to generate the paper’s plots from the data.

  4. Pulumi TypeScript Projects using Stack References

https://doi.org/10.5281/zenodo.4878577 is the dataset of public GitHub repositories that contain Pulumi TypeScript projects using stack references. It supplements Subsection 8.3 in [1].

  5. Pulumi TypeScript Stack References to µs Converter

https://doi.org/10.5281/zenodo.4902171 converts existing stack references and outputs in Pulumi TypeScript projects to µs remotes, wishes, and offers. It supplements Subsection 8.3 in [1], where it is applied to the Pulumi TypeScript Projects using Stack References dataset.

DOI: 10.1145/3468264.3468575


Replication package for article: Algebraic-Datatype Taint Tracking, with Applications to Understanding Android Identifier Leaks

Authors: Rahaman, Sydur and Neamtiu, Iulian and Yin, Xin
Keywords: android, fingerprinting, identifier leak, mobile security, static analysis, taint analysis

Abstract

Given a list of sources, this TaintTracker tool produces algebraic leak signatures for the sources and also categorizes each leak as occurring in third-party or own code.

CFG-Generator.jar (an extension of Amandroid) generates a CFG (control flow graph) for a given APK.

App_Wise_Signature.py creates the leak signature, given the CFG file as text input.

DOI: 10.1145/3468264.3468550


Vet: identifying and avoiding UI exploration tarpits

Authors: Wang, Wenyu and Yang, Wei and Xu, Tianyin and Xie, Tao
Keywords: Android testing, UI testing, trace analysis

Abstract

Despite over a decade of research, it is still challenging for mobile UI testing tools to achieve satisfactory effectiveness, especially on industrial apps with rich features and large code bases. Our experiences suggest that existing mobile UI testing tools are prone to exploration tarpits, where the tools get stuck with a small fraction of app functionalities for an extensive amount of time. For example, a tool logs out of an app at an early stage without being able to log back in, and from then on the tool is stuck exploring the app’s pre-login functionalities (i.e., exploration tarpits) instead of its main functionalities. While tool vendors/users can manually hardcode rules for the tools to avoid specific exploration tarpits, these rules can hardly generalize, being fragile in the face of diverted testing environments, fast app iterations, and the demand of batch testing product lines. To identify and resolve exploration tarpits, we propose VET, a general approach including a supporting system for the given specific Android UI testing tool on the given specific app under test (AUT). VET runs the tool on the AUT for some time and records UI traces, based on which VET identifies exploration tarpits by recognizing their patterns in the UI traces. VET then pinpoints the actions (e.g., clicking logout) or the screens that lead to or exhibit exploration tarpits. In subsequent test runs, VET guides the testing tool to prevent or recover from exploration tarpits. From our evaluation with state-of-the-art Android UI testing tools on popular industrial apps, VET identifies exploration tarpits that cost up to 98.6% of the testing time budget. These exploration tarpits reveal not only limitations in UI exploration strategies but also defects in tool implementations. VET automatically addresses the identified exploration tarpits, enabling each evaluated tool to achieve higher code coverage and improve crash-triggering capabilities.
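The core idea, recognizing tarpit patterns in recorded UI traces, can be sketched with a toy heuristic: a long stretch of the trace that visits only a few distinct screens suggests the tool is stuck. The window size and threshold below are illustrative; VET’s actual pattern recognition is richer.

```python
def find_tarpit(trace, window=6, max_distinct=2):
    """Return the start index of the first window of the trace whose
    screen diversity is suspiciously low, or None if there is none."""
    for i in range(len(trace) - window + 1):
        if len(set(trace[i:i + window])) <= max_distinct:
            return i
    return None

# After index 2 the tool only bounces between the login and splash screens.
trace = ["home", "search", "login", "login", "splash", "login",
         "splash", "login", "splash", "login"]
print(find_tarpit(trace))  # 2
```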

DOI: 10.1145/3468264.3468554


Checking conformance of applications against GUI policies

Authors: Zhang, Zhen and Feng, Yu and Ernst, Michael D. and Porst, Sebastian and Dillig, Isil
Keywords: Android, ad fraud, mobile app, static analysis, user interface

Abstract

A good graphical user interface (GUI) is crucial for an application’s usability, so vendors and regulatory agencies increasingly place restrictions on how GUI elements should appear to and interact with users. Motivated by this concern, this paper presents a new technique (based on static analysis) for checking conformance between (Android) applications and GUI policies expressed in a formal specification language. In particular, this paper (1) describes a specification language for formalizing GUI policies, (2) proposes a new program abstraction called an event-driven layout forest, and (3) describes a static analysis for constructing this abstraction and checking it against a GUI policy. We have implemented the proposed approach in a tool called Venus, and we evaluate it on 2361 Android applications and 17 policies. Our evaluation shows that Venus can uncover malicious applications that perform ad fraud and identify violations of GUI design guidelines and GDPR laws.

DOI: 10.1145/3468264.3468561


Data-driven accessibility repair revisited: on the effectiveness of generating labels for icons in Android apps

Authors: Mehralian, Forough and Salehnamadi, Navid and Malek, Sam
Keywords: Accessibility, Alternative Text, Android, Deep Learning, Screen Reader

Abstract

Mobile apps are playing an increasingly important role in our daily lives, including the lives of approximately 304 million users worldwide who are either completely blind or suffer from some form of visual impairment. These users rely on screen readers to interact with apps. Screen readers, however, cannot describe the image icons that appear on the screen, unless those icons are accompanied by developer-provided textual labels. A prior study of over 5,000 Android apps found that in around 50% of the apps, less than 10% of the icons are labeled. To address this problem, a recent award-winning approach, called LabelDroid, employed deep-learning techniques to train a model on a dataset of existing icons with labels to automatically generate labels for visually similar, unlabeled icons. In this work, we empirically study the nature of icon labels in terms of distribution and their dependency on different sources of information. We then assess the effectiveness of LabelDroid in predicting labels for unlabeled icons. We find that icon images alone are insufficient for representing icon labels, while other sources of information from the icon usage context can enrich images in determining proper tokens for labels. We propose the first context-aware label generation approach, called COALA, that incorporates several sources of information from the icon in generating accurate labels. Our experiments show that although COALA significantly outperforms LabelDroid in both the user study and automatic evaluation, further research is needed. We suggest that future studies should be more cautious when basing their approach on automatically extracted labeled data.

DOI: 10.1145/3468264.3468604


Replication Package for Article: Benchmarking Automated GUI Testing for Android against Real-World Bugs

Authors: Su, Ting and Wang, Jue and Su, Zhendong
Keywords: Android apps, Benchmarking, Crash bugs, GUI testing

Abstract

Our artifact is named Themis. Themis is a collection of real-world, reproducible crash bugs (collected from open-source Android apps) and a unified, extensible infrastructure for benchmarking automated GUI testing for Android and beyond. Themis now contains 52 critical crash bugs and integrates six state-of-the-art/practice GUI testing tools.

DOI: 10.1145/3468264.3468620


Replication of checking LTL[F,G,X] on compressed traces in polynomial time

Authors: Zhang, Minjian and Mathur, Umang and Viswanathan, Mahesh
Keywords: Model checking, software engineering, verification

Abstract

The artifact contains all tools needed to reproduce the experimental results of the corresponding paper, including the algorithm introduced, the manually encoded automata, graphs and formal descriptions of the tested properties, and the compressed and uncompressed traces.
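The underlying idea, checking temporal properties without decompressing the trace, can be sketched for run-length-encoded traces: simple operators are decided in time proportional to the compressed length. This is only an illustration of the principle, not the paper’s algorithm, which handles full LTL[F,G,X].

```python
def globally(compressed, prop):
    """G prop: every event of the decompressed trace satisfies prop.
    Each run is checked once, regardless of its length."""
    return all(prop(sym) for sym, _count in compressed)

def finally_(compressed, prop):
    """F prop: some event of the decompressed trace satisfies prop."""
    return any(prop(sym) for sym, _count in compressed)

# 'a' * 1_000_000 + 'b' stored as two runs instead of 1,000,001 events.
trace = [("a", 1_000_000), ("b", 1)]
print(globally(trace, lambda s: s == "a"))  # False: the final 'b' breaks G a
print(finally_(trace, lambda s: s == "b"))  # True: F b holds
```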

DOI: 10.1145/3468264.3468557


Conditional interpolation: making concurrent program verification more effective

Authors: Su, Jie and Tian, Cong and Duan, Zhenhua
Keywords: CEGAR, concurrent program verification, conditional interpolation, software model checking, state-space reduction

Abstract

Due to the state-space explosion problem, efficient verification of real-world programs at scale is still a big challenge. In particular, thread alternation makes the verification of concurrent programs much more difficult, since it aggravates this problem. In this paper, an application of Craig interpolation, namely conditional interpolation, is proposed to work together with a CEGAR-based approach to reduce the state space of concurrent tasks. Specifically, conditional interpolation is formalized to confine the reachable region of states so that infeasible conditional branches can be pruned. Furthermore, the generated conditional interpolants are utilized to shorten the interpolation paths, which significantly reduces the time consumed by verification. We have implemented the proposed approach on top of an open-source software model checker. Empirical results show that conditional interpolation is effective in improving the verification efficiency of concurrent tasks.

DOI: 10.1145/3468264.3468602


Benchmark for Paper: AlloyMax: Bringing Maximum Satisfaction to Relational Specifications

Authors: Zhang, Changjian and Wagner, Ryan and Orvalho, Pedro and Garlan, David and Manquinho, Vasco and Martins, Ruben and Kang, Eunsuk
Keywords: Alloy, MaxSAT, Model synthesis, Relational specifications, SAT

Abstract

This is the reproduction package for the benchmarks used in the work AlloyMax: Bringing Maximum Satisfaction to Relational Specifications. The package contains the AlloyMax executable, the necessary libraries, the models used in the paper, and the scripts for running the benchmark.

DOI: 10.1145/3468264.3468587


MoD2: Model-guided Deviation Detector

Authors: Tong, Yanxiang and Qin, Yi and Jiang, Yanyan and Xu, Chang and Cao, Chun and Ma, Xiaoxing
Keywords: Control Theory, Model Deviation, Self-Adaptive Software

Abstract

A tool for timely and accurate detection of model deviation.

DOI: 10.1145/3468264.3468548


Artifact for “Lightweight and Modular Resource Leak Verification”

Authors: Kellogg, Martin and Shadab, Narges and Sridharan, Manu and Ernst, Michael D.
Keywords: accumulation analysis, pluggable type systems, resource leaks, static analysis, typestate analysis

Abstract

This upload is a docker image containing the artifact accompanying our ESEC/FSE’21 paper “Lightweight and Modular Resource Leak Verification”.

To run the image,

0.) Install Docker following the directions at https://www.docker.com/get-started for your OS, if it is not already installed. We have tested the artifact with Docker Desktop on macOS, but it should work on other operating systems.

1.) Unzip the provided Docker image. gunzip -c path/to/resource-leak-checker.tar.gz > resource-leak-checker.tar

2.) Load it into Docker. docker load < resource-leak-checker.tar

3.) Run the image. This should open a bash shell, at the home directory of user fse. docker run -it msridhar/rlc:latest

Instructions for how to run the paper’s experiments are inside the container in the object-construction-checker/fse-2021/README.md file in the fse user’s home directory.

DOI: 10.1145/3468264.3468576


zhangmx1997/fse21-jsisolate-artifact: JSIsolate version 1.1.0

Authors: Zhang, Mingxue and Meng, Wei
Keywords: JavaScript isolation, JSIsolate

Abstract

The artifact contains the implementation of JSIsolate, as well as our analysis scripts. A detailed description can be found in the README file. Please refer to the supplementary material for the pre-built version of JSIsolate and the dataset we collected.

DOI: 10.1145/3468264.3468577


Code and Data Repository for Article: Cross-Language Code Search using Static and Dynamic Analyses

Authors: Mathew, George and Stolee, Kathryn T.
Keywords: code-to-code search, cross-language code search, dynamic analysis, non-dominated sorting, static analysis

Abstract

Code repository for the COSAL tool, which performs code search across Java and Python. COSAL uses static and dynamic similarity measures to identify similar code across languages. For static similarity, COSAL uses token-based and AST-based similarity; for dynamic similarity, it uses the input-output behavior of code.

DOI: 10.1145/3468264.3468538


Automating the removal of obsolete TODO comments

Authors: Gao, Zhipeng and Xia, Xin and Lo, David and Grundy, John and Zimmermann, Thomas
Keywords: Bert, Code-Comment Inconsistency, TODO comment

Abstract

TODO comments are very widely used by software developers to describe their pending tasks during software development. However, after performing the task developers sometimes neglect or simply forget to remove the TODO comment, resulting in obsolete TODO comments. These obsolete TODO comments can confuse development teams and may cause the introduction of bugs in the future, decreasing the software’s quality and maintainability. Manually identifying obsolete TODO comments is time-consuming and expensive. It is thus necessary to detect obsolete TODO comments and remove them automatically before they cause any unwanted side effects. In this work, we propose a novel model, named TDCleaner, to identify obsolete TODO comments in software projects. TDCleaner can assist developers in just-in-time checking of TODO comments status and avoid leaving obsolete TODO comments. Our approach has two main stages: offline learning and online prediction. During offline learning, we first automatically establish <code_change, todo_comment, commit_msg> training samples and leverage three neural encoders to capture the semantic features of TODO comment, code change and commit message respectively. TDCleaner then automatically learns the correlations and interactions between different encoders to estimate the final status of the TODO comment. For online prediction, we check a TODO comment’s status by leveraging the offline trained model to judge the TODO comment’s likelihood of being obsolete. We built our dataset by collecting TODO comments from the top-10,000 Python and Java Github repositories and evaluated TDCleaner on them. Extensive experimental results show the promising performance of our model over a set of benchmarks. 
We also performed an in-the-wild evaluation with real-world software projects, we reported 18 obsolete TODO comments identified by TDCleaner to Github developers and 9 of them have already been confirmed and removed by the developers, demonstrating the practical usage of our approach.
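A crude, purely illustrative stand-in for the idea — TDCleaner itself uses neural encoders, not token overlap — is to flag a TODO comment whose text is largely covered by a later commit message:

```python
def token_overlap(a, b):
    """Fraction of a's tokens that also appear in b."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def likely_obsolete(todo, commit_msg, threshold=0.5):
    """Heuristic: the TODO's task was probably done if the commit
    message covers most of the TODO's tokens."""
    return token_overlap(todo, commit_msg) >= threshold

print(likely_obsolete("TODO add null check", "add null check for parser"))
print(likely_obsolete("TODO fix race", "update readme"))
```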

DOI: 10.1145/3468264.3468553


Estimating Residual Risk in Greybox Fuzzing - Artifacts

Author: Böhme
Keywords: estimation, fuzzing, probability, software testing, statistics

Abstract

We make publicly available the tool used to produce the data, the data used to validate the claims made in the paper, and the simulation and evaluation scripts that produce the paper’s figures from the data. In the context of our paper, we conducted several simulation studies and evaluated the performance of the classical and proposed estimators of residual risk in the presence of adaptive bias in greybox fuzzing.

The data for the empirical evaluation were generated through fuzzing campaigns with a modified version of LibFuzzer. In this experimental setup, we established the ground-truth discovery probability in order to evaluate estimator performance against it.

The workbooks and source code for experimental setup are available at https://github.com/Adaptive-Bias/fse21_paper270.

DOI: 10.1145/3468264.3468570


HeteroFuzz: fuzz testing to detect platform dependent divergence for heterogeneous applications

Authors: Zhang, Qian and Wang, Jiyuan and Kim, Miryung
Keywords: Fuzz testing, heterogeneous applications, platform-dependent divergence

Abstract

As specialized hardware accelerators like FPGAs become a prominent part of the current computing landscape, software applications are increasingly constructed to leverage heterogeneous architectures. Such a trend is already happening in the domain of machine learning and Internet-of-Things (IoT) systems built on edge devices. Yet, debugging and testing methods for heterogeneous applications are currently lacking. These applications may look similar to regular C/C++ code but include hardware synthesis details in terms of preprocessor directives. Therefore, their behavior under heterogeneous architectures may diverge significantly from CPU behavior due to these hardware synthesis details. Further, the compilation and hardware simulation cycle takes an enormous amount of time, prohibiting the frequent invocations required for fuzz testing. We propose a novel fuzz testing technique, called HeteroFuzz, designed to specifically target heterogeneous applications and to detect platform-dependent divergence. The key essence of HeteroFuzz is that it uses a three-pronged approach to reduce the long latency of repetitively invoking a hardware simulator on a heterogeneous application. First, in addition to monitoring code coverage as a fuzzing guidance mechanism, we analyze synthesis pragmas in kernel code and monitor accelerator-relevant value spectra. Second, we design dynamic probabilistic mutations to increase the chance of hitting divergent behavior under different platforms. Third, we memorize the boundaries of seen kernel inputs and skip HLS simulator invocation if it can expose only redundant divergent behavior. We evaluate HeteroFuzz on seven real-world heterogeneous applications with FPGA kernels. HeteroFuzz is 754X faster than naive fuzzing at exposing the same set of distinct divergence symptoms. Probabilistic mutations contribute a 17.5X speedup over fuzzing without them, and selective invocation of HLS simulation contributes an 8.8X speedup.
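The third idea, skipping the expensive simulator call for inputs inside already-seen boundaries, can be sketched as follows. The scalar inputs are illustrative; real kernel inputs are structured.

```python
class BoundaryMemo:
    """Memorize the min/max boundaries of inputs already simulated."""
    def __init__(self):
        self.lo = None
        self.hi = None

    def should_simulate(self, x):
        """True iff x falls outside every previously seen boundary;
        inputs inside the range likely expose only redundant divergence."""
        if self.lo is None or x < self.lo or x > self.hi:
            self.lo = x if self.lo is None else min(self.lo, x)
            self.hi = x if self.hi is None else max(self.hi, x)
            return True
        return False

memo = BoundaryMemo()
print([memo.should_simulate(x) for x in [5, 9, 7, 2, 6]])
# Only the boundary-extending inputs 5, 9, and 2 trigger simulation.
```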

DOI: 10.1145/3468264.3468610


Sound and efficient concurrency bug prediction

Authors: Cai, Yan and Yun, Hao and Wang, Jinqiu and Qiao, Lei and Palsberg, Jens
Keywords: Concurrency bugs, atomicity violations, data races, deadlocks

Abstract

Concurrency bugs are extremely difficult to detect. Recently, several dynamic techniques have achieved sound analysis; M2 is even complete for two threads. It is designed to decide whether two events can occur consecutively. However, real-world concurrency bugs can involve more events and threads; some can occur when the order of two or more events can be exchanged even if they do not occur consecutively. We propose a new technique, SeqCheck, to soundly decide whether a sequence of events can occur in a specified order. The ordered sequence represents a potential concurrency bug, and several known forms of concurrency bugs can be easily encoded into event sequences, where each sequence represents a way the bug can occur. To achieve this, SeqCheck explicitly analyzes branch events and includes a set of efficient algorithms. We show that SeqCheck is sound, and that it is also complete on traces of two threads. We have implemented SeqCheck to detect three types of concurrency bugs and evaluated it on 51 Java benchmarks producing up to billions of events. Compared with M2 and three other recent sound race detectors, SeqCheck detected 333 races in ~30 minutes, while the others detected 130 to 285 races in ~6 to ~12 hours. SeqCheck detected 20 deadlocks in ~6 seconds, only one fewer than Dirk, which spent more than an hour. SeqCheck also detected 30 atomicity violations in ~20 minutes. The evaluation shows that SeqCheck can significantly outperform existing concurrency bug detectors.

DOI: 10.1145/3468264.3468549


ObjLupAnsys

Authors: Li, Song and Kang, Mingqing and Hou, Jianwei and Cao, Yinzhi
Keywords: ESEC, FSE, NodeJS, ObjLupAnsys, prototype pollution, Vulnerability

Abstract

ObjLupAnsys is a tool to detect prototype pollution vulnerabilities in Node.js packages. This project is written in Python and JavaScript.

DOI: 10.1145/3468264.3468542


Detecting concurrency vulnerabilities based on partial orders of memory and thread events

Authors: Yu, Kunpeng and Wang, Chenxu and Cai, Yan and Luo, Xiapu and Yang, Zijiang
Keywords: concurrency vulnerability, multi-threaded programs, partial orders

Abstract

Memory vulnerabilities are the main causes of software security problems. However, detecting vulnerabilities in multi-threaded programs is challenging because many vulnerabilities occur only under specific executions, and it is hard to explore all possible executions of a multi-threaded program. Existing approaches are either computationally intensive or likely to miss some vulnerabilities due to complex thread interleaving. This paper introduces a novel approach to detect concurrency memory vulnerabilities based on partial orders of events. A partial order on a set of events represents the definite execution orders of events. It allows constructing feasible traces exposing specific vulnerabilities by exchanging the execution orders of vulnerability-potential events. It also reduces the search space of possible executions and thus improves computational efficiency. We propose new algorithms to extract vulnerability-potential event pairs for three kinds of memory vulnerabilities. We also design a novel algorithm to compute a potential event pair’s feasible set, which contains the relevant events required by a feasible trace. Our method extends existing approaches for data race detection by additionally considering whether two events are protected by the same lock. We implement a prototype of our approach and conduct experiments to evaluate its performance. Experimental results show that our tool outperforms state-of-the-art algorithms in both effectiveness and efficiency.
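
The partial-order idea can be illustrated with a toy happens-before relation over a trace: two conflicting events whose order is not fixed by program order or lock release/acquire edges are candidates for reordering. This is a simplified illustration, not the paper's algorithms; the event tuples, edge rules, and function names are invented here:

```python
def happens_before(trace):
    """Compute a toy happens-before partial order over trace events.
    Events are (tid, op, var) with op in {'read', 'write', 'acq', 'rel'}.
    Edges: program order per thread, and release -> subsequent acquire
    of the same lock. Returns an n x n boolean reachability matrix."""
    n = len(trace)
    hb = [[False] * n for _ in range(n)]
    last_of_thread, last_rel, edges = {}, {}, []
    for i, (tid, op, var) in enumerate(trace):
        if tid in last_of_thread:
            edges.append((last_of_thread[tid], i))  # program order
        last_of_thread[tid] = i
        if op == 'acq' and var in last_rel:
            edges.append((last_rel[var], i))        # rel -> acq on same lock
        if op == 'rel':
            last_rel[var] = i
    for a, b in edges:
        hb[a][b] = True
    for k in range(n):  # transitive closure (Floyd-Warshall style)
        for a in range(n):
            if hb[a][k]:
                for b in range(n):
                    if hb[k][b]:
                        hb[a][b] = True
    return hb

def may_race(trace, i, j):
    """Events i and j may be reordered if they conflict and neither
    happens-before the other."""
    hb = happens_before(trace)
    (t1, o1, v1), (t2, o2, v2) = trace[i], trace[j]
    return (t1 != t2 and v1 == v2 and 'write' in (o1, o2)
            and not hb[i][j] and not hb[j][i])
```

Two unsynchronized writes to the same variable are flagged; the same pair protected by a common lock is ordered by the rel→acq edge and therefore not flagged.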

DOI: 10.1145/3468264.3468572


Vulnerability detection with fine-grained interpretations

Authors: Li, Yi and Wang, Shaohua and Nguyen, Tien N.
Keywords: Deep Learning, Explainable AI (XAI), Intelligence Assistant, Interpretable AI, Vulnerability Detection

Abstract

Despite the successes of machine learning (ML) and deep learning (DL)-based vulnerability detectors (VD), they are limited to providing only the decision on whether a given code is vulnerable or not, without details on what part of the code is relevant to the detected vulnerability. We present IVDetect, an interpretable vulnerability detector with the philosophy of using Artificial Intelligence (AI) to detect vulnerabilities, while using Intelligence Assistant (IA) to provide VD interpretations in terms of vulnerable statements. For vulnerability detection, we separately consider the vulnerable statements and their surrounding contexts via data and control dependencies. This allows our model to discriminate vulnerable statements better than using the mixture of vulnerable code and contextual code as in existing approaches. In addition to the coarse-grained vulnerability detection result, we leverage interpretable AI to provide users with fine-grained interpretations that include the sub-graph in the Program Dependency Graph (PDG) with the crucial statements that are relevant to the detected vulnerability. Our empirical evaluation on vulnerability databases shows that IVDetect outperforms the existing DL-based approaches by 43%–84% and 105%–255% in top-10 nDCG and MAP ranking scores. IVDetect correctly points out the vulnerable statements relevant to the vulnerability via its interpretation in 67% of the cases with a top-5 ranked list. IVDetect improves over the baseline interpretation models by 12.3%–400% and 9%–400% in accuracy.
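
The ranking scores used in the evaluation are standard information-retrieval metrics; as a reference point, nDCG at a cutoff k can be computed as below (a textbook implementation, not code from IVDetect):

```python
import math

def ndcg_at_k(relevances, k):
    """nDCG@k for a ranked list of per-item relevance scores:
    DCG of the given ranking divided by the DCG of the ideal ranking."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that puts every relevant item first scores 1.0; pushing relevant items down the list lowers the score toward 0.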

DOI: 10.1145/3468264.3468597


Identifying casualty changes in software patches

Authors: Sejfia, Adriana and Zhao, Yixue and Medvidović, Nenad
Keywords: Change-based Analysis, Noise in Patches, Software Patches

Abstract

Noise in software patches impacts their understanding, analysis, and use for tasks such as change prediction. Although several approaches have been developed to identify noise in patches, the issue has persisted. An analysis of a dataset of security patches for the Tomcat web server, which we further expanded with security patches from five additional systems, uncovered several kinds of previously unreported noise, which we call nonessential casualty changes. These are changes that do not themselves alter the logic of the program but are necessitated by other changes made in the patch. In this paper, we provide a comprehensive taxonomy of casualty changes. We then develop CasCADe, an automated technique for identifying casualty changes. We evaluate CasCADe with several publicly available datasets of patches and tools that focus on them. Our results show that CasCADe is highly accurate, that the kinds of noise it identifies occur relatively commonly in patches, and that removing this noise improves the evaluation results of a previously published change-based approach.

DOI: 10.1145/3468264.3468624


Software tools for the paper - ACHyb: A Hybrid Analysis Approach to Detect Kernel Access Control Vulnerabilities

Authors: Hu, Yang and Wang, Wenxi and Hunger, Casen and Wood, Riley and Khurshid, Sarfraz and Tiwari, Mohit
Keywords: Access Control, Operating System, Program Analysis

Abstract

The artifact includes four software tools developed in our work: 1) a CVE analysis tool used to conduct our KACV study, 2) a static analysis tool to detect potentially vulnerable paths, 3) a clustering-based seed distillation tool to generate high-quality seed programs, and 4) a kernel fuzzer to reduce false positives among the potential paths reported by our static analysis tool. For each tool, we document setup procedures and usage, and provide the corresponding datasets.

DOI: 10.1145/3468264.3468627


Replication Package for Article: Context-Aware and Data-Driven Feedback Generation for Programming Assignments

Authors: Song, Dowon and Lee, Woosuk and Oh, Hakjoo
Keywords: Program Repair, Program Synthesis

Abstract

This is an artifact to reproduce the experimental results of the paper “Context-Aware and Data-Driven Feedback Generation for Programming Assignments”. The artifact contains the code, benchmarks, and Python scripts needed to reproduce the results easily.

The detailed description (e.g., Install, Usage) of the tool is available on the public repository: https://github.com/kupl/LearnML

DOI: 10.1145/3468264.3468598


A Replication of “A Syntax-Guided Edit Decoder for Neural Program Repair”

Authors: Zhu, Qihao and Sun, Zeyu and Xiao, Yuan-an and Zhang, Wenjie and Yuan, Kang and Xiong, Yingfei and Zhang, Lu
Keywords: Neural Network, Program Repair

Abstract

A PyTorch implementation of “A Syntax-Guided Edit Decoder for Neural Program Repair”.

DOI: 10.1145/3468264.3468544


VarFix: balancing edit expressiveness and search effectiveness in automated program repair

Authors: Wong, Chu-Pan and Santiesteban, Priscila and Kästner, Christian
Keywords: automatic program repair, variational execution

Abstract

Automatically repairing a buggy program is essentially a search problem, searching for code transformations that pass a set of tests. Various search strategies have been explored, but they either navigate the search space in an ad hoc way using heuristics, or do so systematically but at the cost of limited expressiveness in the kinds of supported program edits. In this work, we explore the possibility of systematically navigating the search space without sacrificing edit expressiveness. The key enabler of this exploration is variational execution, a dynamic analysis technique that has been shown to be effective at exploring many similar executions in large search spaces. We evaluate our approach on IntroClassJava and Defects4J, showing that a systematic search is effective at leveraging and combining fixing ingredients to find patches, including many high-quality patches and multi-edit patches.

DOI: 10.1145/3468264.3468600


Flaky test detection in Android via event order exploration

Authors: Dong, Zhen and Tiwari, Abhishek and Yu, Xiao Liang and Roychoudhury, Abhik
Keywords: concurrency, event order, flaky tests, non-determinism

Abstract

Validation of Android apps via testing is difficult owing to the presence of flaky tests. Due to non-deterministic execution environments, a sequence of events (a test) may lead to success or failure in unpredictable ways. In this work, we present an approach and tool, FlakeScanner, for detecting flaky tests through exploration of event orders. Our key observation is that for a test in a mobile app, there is a testing framework thread which creates the test events, a main User-Interface (UI) thread processing these events, and possibly several other background threads running asynchronously. For any event e whose execution involves potential non-determinism, we localize the earliest (latest) event after (before) which e must happen. We then efficiently explore the schedules between the upper/lower bound events while grouping events within a single statement, to find whether the test outcome is flaky. We also create a suite of subject programs called FlakyAppRepo (containing 33 widely-used Android projects) to study flaky tests in Android apps. Our experiments on the subject suite FlakyAppRepo show that FlakeScanner detected 45 out of 52 known flaky tests as well as 245 previously unknown flaky tests among 1444 tests.
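
The schedule space being pruned can be pictured as the set of interleavings of per-thread event sequences that preserve each thread's program order; the tool explores only the slice between the localized lower/upper bound events. A naive enumerator of that full space (illustrative only, not FlakeScanner's exploration strategy) looks like:

```python
def interleavings(t1, t2):
    """All schedules of two threads' event lists that preserve each
    thread's program order. The count grows combinatorially, which is
    why bounding the explored window matters."""
    if not t1:
        return [list(t2)]
    if not t2:
        return [list(t1)]
    # Either thread 1's next event or thread 2's next event goes first.
    return ([[t1[0]] + rest for rest in interleavings(t1[1:], t2)]
            + [[t2[0]] + rest for rest in interleavings(t1, t2[1:])])
```

For two events in one thread and one in another there are already three schedules; with m and n events there are C(m+n, m).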

DOI: 10.1145/3468264.3468584


Artifact for Article: SmartCommit: A Graph-Based Interactive Assistant for Activity-Oriented Commits

Authors: Shen, Bo and Zhang, Wei and Kästner, Christian
Keywords: changes decomposition, code commit, collaboration in software development, revision control system

Abstract

This artifact is the core algorithm of SmartCommit, an assistant tool that helps developers follow the best practice of cohesive commits, which is advocated by many companies (like Google and Facebook) and open source communities (like Git and Angular). A cohesive commit should focus on a specific development or maintenance activity, such as feature addition, bug fixing, or refactoring. Cohesive commits form a clear change history that facilitates software maintenance and team collaboration. To help developers make cohesive commits, SmartCommit suggests a decomposition of the code changes (groups of related and self-contained changes) and allows the developer to interactively adjust the suggested decomposition until it reaches a state in which the developer feels it reasonable to submit the code change groups as commits.
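
The decomposition idea — grouping related, self-contained code changes — amounts to partitioning a graph of change hunks into connected components. A minimal sketch with a union-find over hypothetical hunk IDs and "related" links (SmartCommit's real graph construction and partitioning are far richer):

```python
def group_changes(hunks, related):
    """Group change hunks into commit suggestions by taking connected
    components over 'related' links (a simplified stand-in for
    SmartCommit's graph partitioning; all names are illustrative)."""
    parent = {h: h for h in hunks}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in related:          # union the endpoints of each link
        parent[find(a)] = find(b)

    groups = {}
    for h in hunks:               # collect members per component root
        groups.setdefault(find(h), []).append(h)
    return sorted(sorted(g) for g in groups.values())
```

Hunks linked directly or transitively end up in one suggested commit; isolated hunks form their own group.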

DOI: 10.1145/3468264.3468551


Data for article: A First Look at Developers’ Live Chat on Gitter

Authors: Shi, Lin and Chen, Xiao and Yang, Ye and Jiang, Hanzhi and Jiang, Ziyou and Niu, Nan and Wang, Qing
Keywords: Empirical Study, Live chat, Open source, Team communication

Abstract

The artifact contains a dataset including the original chat history, automatically disentangled dialogs, and manually disentangled dialogs. We hope that the data we have uncovered will pave the way for further research, help drive a more in-depth understanding of OSS development collaboration, and promote better utilization and mining of the knowledge embedded in the massive chat history.

DOI: 10.1145/3468264.3468562


Reel life vs. real life: how software developers share their daily life through vlogs

Authors: Chattopadhyay, Souti and Zimmermann, Thomas and Ford, Denae
Keywords: Day in the Life, Software Developer Workdays, Vlogs

Abstract

Software developers are turning to vlogs (video blogs) to share what a day is like to walk in their shoes. Through these vlogs developers share a rich perspective of their technical work as well as their personal lives. However, does the type of activities portrayed in vlogs differ from the activities developers in industry perform? Would developers at a software company prefer to show activities to different extents if they were asked to share about their day through vlogs? To answer these questions, we analyzed 130 vlogs by software developers on YouTube and conducted a survey with 335 software developers at a large software company. We found that although vlogs present traditional development activities such as coding and code peripheral activities (11%), they also prominently feature wellness and lifestyle related activities (47.3%) that have not been reflected in previous software engineering literature. We also found that developers at the software company were inclined to share more non-coding tasks (e.g., personal projects, time spent with family and friends, and health) when asked to create a mock-up vlog to promote diversity. These findings demonstrate a shift in our understanding of how software developers are spending their time and what they find valuable to share publicly. We discuss how vlogs provide a more complete perspective of software development work and serve as a valuable source of data for empirical research.

DOI: 10.1145/3468264.3468599


An empirical study on challenges of application development in serverless computing

Authors: Wen, Jinfeng and Chen, Zhenpeng and Liu, Yi and Lou, Yiling and Ma, Yun and Huang, Gang and Jin, Xin and Liu, Xuanzhe
Keywords: Application Development, Empirical Study, Serverless Computing, Stack Overflow

Abstract

Serverless computing is an emerging paradigm for cloud computing, gaining traction in a wide range of applications such as video processing and machine learning. This new paradigm allows developers to focus on developing the logic of serverless computing based applications (abbreviated as serverless-based applications) at the granularity of functions, thereby freeing developers from tedious and error-prone infrastructure management. Meanwhile, it also introduces new challenges in the design, implementation, and deployment of serverless-based applications, and current serverless computing platforms are far from satisfactory. However, to the best of our knowledge, these challenges have not been well studied. To fill this knowledge gap, this paper presents the first comprehensive study on understanding the challenges in developing serverless-based applications from the developers’ perspective. We mine and analyze 22,731 relevant questions from Stack Overflow (a popular Q&A website for developers), and show the increasing popularity trend and the high difficulty level of serverless computing for developers. Through manual inspection of 619 sampled questions, we construct a taxonomy of challenges that developers encounter, and report a series of findings and actionable implications. Stakeholders including application developers, researchers, and cloud providers can leverage these findings and implications to better understand and further explore the serverless computing paradigm.

DOI: 10.1145/3468264.3468558


Bias in machine learning software: why? how? what to do?

Authors: Chakraborty, Joymallya and Majumder, Suvodeep and Menzies, Tim
Keywords: Bias Mitigation, Fairness Metrics, Software Fairness

Abstract

Increasingly, software is making autonomous decisions in areas such as criminal sentencing, approving credit cards, hiring employees, and so on. Some of these decisions show bias and adversely affect certain social groups (e.g., those defined by sex, race, age, marital status). Many prior works on bias mitigation take the following form: change the data or learners in multiple ways, then see if any of that improves fairness. Perhaps a better approach is to postulate root causes of bias and then apply some resolution strategy. This paper postulates that the root causes of bias are the prior decisions that affect (a) what data was selected and (b) the labels assigned to those examples. Our Fair-SMOTE algorithm removes biased labels and rebalances internal distributions such that, for each value of the sensitive attribute, examples are equally represented in both the positive and negative classes. On testing, this method was just as effective at reducing bias as prior approaches. Further, models generated via Fair-SMOTE achieve higher performance (measured in terms of recall and F1) than other state-of-the-art fairness improvement algorithms. To the best of our knowledge, measured in terms of the number of analyzed learners and datasets, this study is one of the largest studies on bias mitigation yet presented in the literature.
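
The balancing goal can be sketched as follows: make every (sensitive attribute, label) subgroup the same size. Fair-SMOTE synthesizes new examples SMOTE-style; this toy version merely duplicates rows, and the 'sex'/'label' keys are illustrative:

```python
import random

def balance_subgroups(rows, seed=0):
    """Oversample so each (sensitive attribute, label) subgroup reaches
    the size of the largest one. Sketch only: Fair-SMOTE generates
    synthetic neighbors instead of duplicating existing rows."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault((row['sex'], row['label']), []).append(row)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        # Top up the subgroup by resampling its own members.
        balanced.extend(rng.choice(g) for _ in range(target - len(g)))
    return balanced
```

After balancing, every combination of sensitive-attribute value and class label contributes equally to training.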

DOI: 10.1145/3468264.3468537


Artifact for Article (SIVAND): Understanding Neural Code Intelligence Through Program Simplification

Authors: Rabin, Md Rafiqul Islam and Hellendoorn, Vincent J. and Alipour, Mohammad Amin
Keywords: Interpretable AI, Models of Code, Program Simplification

Abstract

This artifact contains the code of prediction-preserving simplification and the simplified data produced using the DD module for our paper ‘Understanding Neural Code Intelligence Through Program Simplification’, accepted at ESEC/FSE’21. Delta Debugging (DD) was originally implemented in Python 2. We have modified the core modules (DD.py, MyDD.py) to run in Python 3 (i.e., Python 3.7.3), and then adapted the DD modules for prediction-preserving program simplification using different models. The approach, SIVAND, is model-agnostic and can be applied to any model by loading the model and making a prediction with it for a task.
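
The prediction-preserving simplification builds on delta debugging. A minimal ddmin-style sketch, assuming a boolean still_predicts oracle that re-runs the model and checks for the original prediction (the artifact's DD.py/MyDD.py handle far more detail):

```python
def ddmin(tokens, still_predicts):
    """Classic ddmin: repeatedly drop chunks of the input while the
    oracle still holds (here: the model still makes the original
    prediction), returning a much smaller prediction-preserving input."""
    assert still_predicts(tokens)
    n = 2  # current granularity: number of chunks
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        subsets = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Try removing one chunk: keep the complement if it still works.
            complement = [t for j, s in enumerate(subsets) if j != i for t in s]
            if complement and still_predicts(complement):
                tokens = complement
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(tokens):
                break  # already at single-token granularity
            n = min(len(tokens), n * 2)  # refine granularity
    return tokens
```

With an oracle that only needs the tokens 'c' and 'f', the eight-token input "abcdefgh" shrinks to exactly those two tokens.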

Following is the structure of the artifact:

├── ./                   # code for model-agnostic DD framework
├── data/
│   ├── selected_input   # randomly selected test inputs from different datasets
│   ├── simplified_input # traces of simplified inputs for different models
│   └── summary_result   # summary results of all experiments as csv
├── models/
│   ├── dd-code2seq      # DD module with code2seq model
│   ├── dd-code2vec      # DD module with code2vec model
│   └── dd-great         # DD module with RNN/Transformer model
├── others/              # related helper functions
└── save/                # images of SIVAND

How to Start: To apply SIVAND (using the MethodName task as an example), first update, in helper.py, the path to a file that contains all selected inputs and the delta type for DD (token or char). Then, modify load_model_M() to load a target model (i.e., code2vec/code2seq), and prediction_with_M() to get the predicted name, score, and loss value for an input. Also, check whether an input is parsable in is_parsable(), and load a method by language (i.e., Java) in load_method(). Finally, run MyDD.py, which will simplify programs one by one and save all simplified traces in the dd_data/ folder.

More Details: Check models/dd-code2vec/ and models/dd-code2seq/ folders to see how SIVAND works with code2vec and code2seq models for MethodName task on Java program. Similarly, for VarMisuse task (RNN \& Transformer models, Python program), check the models/dd-great/ folder for our modified code.

DOI: 10.1145/3468264.3468539


Package for Article: Multi-objectivizing Software Configuration Tuning

Authors: Chen, Tao and Li, Miqing
Keywords: Configuration tuning, multi-objectivization, performance optimization, search-based software engineering

Abstract

We present the basic information needed to download, run, and then interpret the instructions we provide, as requested in the ESEC/FSE 2021 Artifact Submission Guidelines. The artifact contains all the subjects, data, and a series of guidelines on how to use them.

DOI: 10.1145/3468264.3468555


Embedding app-library graph for neural third party library recommendation

Authors: Li, Bo and He, Qiang and Chen, Feifei and Xia, Xin and Li, Li and Grundy, John and Yang, Yun
Keywords: app-library graph, graph neural network, mobile app development, recommendation, third-party library

Abstract

The mobile app marketplace features fierce competition among mobile app developers, who need to develop and update their apps as soon as possible to gain a first-mover advantage. Third-party libraries (TPLs) offer developers an easier way to enhance their apps with new features. However, finding suitable candidates among the large number of fast-changing TPLs is a challenging problem. TPL recommendation is a promising solution, but unfortunately existing approaches suffer from low accuracy in recommendation results. To tackle this challenge, we propose GRec, a graph neural network (GNN) based approach, for recommending potentially useful TPLs for app development. GRec models mobile apps, TPLs, and their interactions into an app-library graph. It then distills app-library interaction information from the app-library graph to make more accurate TPL recommendations. To evaluate GRec’s performance, we conduct comprehensive experiments based on a large-scale real-world Android app dataset containing 31,432 Android apps, 752 distinct TPLs, and 537,011 app-library usage records. Our experimental results illustrate that GRec can significantly increase the prediction accuracy and diversify the prediction results compared with state-of-the-art methods. A user study performed with app developers also confirms GRec’s usefulness for real-world mobile app development.

DOI: 10.1145/3468264.3468552


Replication Package for Article: A Large-Scale Empirical Study on Java Library Migrations: Prevalence, Trends, and Rationales

Authors: He, Hao and He, Runzhi and Gu, Haiqiao and Zhou, Minghui
Keywords: empirical software engineering, evolution and maintenance, library migration, mining software repositories

Abstract

This is the replication package for our ESEC/FSE 2021 paper A Large-Scale Empirical Study on Java Library Migrations: Prevalence, Trends, and Rationales. It can be used to replicate all three research questions in the paper using our preprocessed and manually labeled data. Please refer to this GitHub repository (https://github.com/hehao98/LibraryMigration) or the git repository archive (gitrepo.zip) in this package for detailed documentation about how to use this replication package.

It consists of the following files:

cache.zip: This file contains some of the most important datasets used in this paper, including the GitHub repositories and Maven libraries used, the set of all dependency changes, and the migration graph. Data related to thematic analysis can be found in the git repository.

dbdata.tar.xz: This file contains the raw MongoDB data folder that will be used if you choose to install the required environment using Docker.

dbdump.zip: This file contains the MongoDB data dump which will be used if you choose to manually install the required environment.

gitrepo.zip: A git repository archive for the scripts, notebooks, and spreadsheets we used for this paper. Note that this archive may be somewhat older than the GitHub repository (https://github.com/hehao98/LibraryMigration). We recommend referring to the latest version at GitHub and only resorting to this archive if the GitHub repository becomes unavailable in the unforeseeable future.

We hope the provided scripts and dataset can be used to facilitate further research.

DOI: 10.1145/3468264.3468571


Learning-based extraction of first-order logic representations of API directives

Authors: Liu, Mingwei and Peng, Xin and Marcus, Andrian and Treude, Christoph and Bai, Xuefang and Lyu, Gang and Xie, Jiazhan and Zhang, Xiaoxin
Keywords: API Documentation, Directive, First Order Logic

Abstract

Developers often rely on API documentation to learn API directives, i.e., constraints and guidelines related to API usage. Failing to follow API directives may cause defects or improper implementations. Since there are no industry-wide standards on how to document API directives, they take many forms and are often hard to understand by developers or challenging to parse with tools. In this paper, we propose a learning-based approach for extracting first-order logic representations of API directives (FOL directives for short). The approach, called LEADFOL, uses a joint learning method to extract atomic formulas by identifying the predicates and arguments involved in directive sentences, and recognizes the logical relations between atomic formulas by parsing the sentence structures. It then parses the arguments and uses a learning-based method to link API references to their corresponding API elements. Finally, it groups the formulas of the same class or method together and transforms them into conjunctive normal form. Our evaluation shows that LEADFOL can accurately extract more FOL directives than a state-of-the-art approach and that the extracted FOL directives are useful in supporting code reviews.
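
The final step — transforming grouped formulas into conjunctive normal form — can be illustrated on a tiny propositional fragment. This sketch handles only binary and/or over string atoms (negation and quantifiers, which real FOL directives need, are omitted):

```python
def to_cnf(formula):
    """Convert a tiny propositional formula to CNF by distributing
    OR over AND. Formulas: atoms are strings; compound formulas are
    ('and', lhs, rhs) or ('or', lhs, rhs) tuples."""
    if isinstance(formula, str):
        return formula
    op, lhs, rhs = formula
    lhs, rhs = to_cnf(lhs), to_cnf(rhs)
    if op == 'and':
        return ('and', lhs, rhs)
    # op == 'or': distribute over any conjunction underneath
    if isinstance(lhs, tuple) and lhs[0] == 'and':
        return ('and', to_cnf(('or', lhs[1], rhs)), to_cnf(('or', lhs[2], rhs)))
    if isinstance(rhs, tuple) and rhs[0] == 'and':
        return ('and', to_cnf(('or', lhs, rhs[1])), to_cnf(('or', lhs, rhs[2])))
    return ('or', lhs, rhs)
```

For example, p ∨ (q ∧ r) normalizes to (p ∨ q) ∧ (p ∨ r).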

DOI: 10.1145/3468264.3468618


Replication Data for: DIFFBASE: A Differential Factbase for Effective Software Evolution Management

Authors: Wu, Xiuheng and Zhu, Chenguang and Li, Yi
Keywords: Program facts, Software evolution, Software maintenance

Abstract

Numerous tools and techniques have been developed to extract and analyze information from software development artifacts. Yet, there is a lack of effective methods to process, store, and exchange information among different analyses. DiffBase provides a uniform, exchangeable representation supporting efficient querying and manipulation, based on the existing concept of program facts. We consider program changes as first-class objects, which establish links between intra-version facts of single program snapshots and provide insights into how certain artifacts evolve over time via inter-version facts. DiffBase includes a series of differential fact extractors, and multiple software evolution management tasks have been implemented with DiffBase, demonstrating its usefulness and efficiency.

DOI: 10.1145/3468264.3468605


Would you like a quick peek? providing logging support to monitor data processing in big data applications

Authors: Wang, Zehao and Zhang, Haoxiang and Chen, Tse-Hsun (Peter) and Wang, Shaowei
Keywords: Apache Spark, Logging, Monitoring

Abstract

To analyze large-scale data efficiently, developers have created various big data processing frameworks (e.g., Apache Spark). These big data processing frameworks provide abstractions to developers so that they can focus on implementing the data analysis logic. In traditional software systems, developers leverage logging to monitor applications and record intermediate states to assist workload understanding and issue diagnosis. However, due to the abstraction and the peculiarity of big data frameworks, there is currently no effective monitoring approach for big data applications. In this paper, we first manually study 1,000 randomly sampled Spark-related questions on Stack Overflow to study their root causes and the types of information that, if recorded, can assist developers with monitoring and diagnosis. Then, we design an approach, DPLOG, which assists developers with monitoring Spark applications. DPLOG leverages statistical sampling to minimize performance overhead and provides intermediate information and hint/warning messages for each data processing step of a chained method pipeline. We evaluate DPLOG on six benchmarking programs and find that DPLOG has a relatively small overhead (i.e., less than a 10% increase in response time when processing 5GB of data) compared to not using it, and reduces the overhead by over 500% compared to the baseline. Our user study with 20 developers shows that DPLOG can reduce the time needed to debug big data applications by 63%, and the participants give DPLOG an average of 4.85/5 for its usefulness. The idea of DPLOG may be applied to other big data processing frameworks, and our study sheds light on future research opportunities in assisting developers with monitoring big data applications.
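
The statistical-sampling idea behind DPLOG can be sketched independently of Spark: record intermediate (input, output) pairs for only a sampled fraction of the items flowing through a processing step. All names below are illustrative, not DPLOG's API:

```python
import random

class SamplingLogger:
    """Log intermediate records of a data-processing step for only a
    sampled fraction of items, keeping instrumentation overhead low
    (a sketch of the idea; DPLOG's instrumentation is Spark-specific)."""

    def __init__(self, rate, seed=0):
        self.rate = rate            # fraction of items to peek at
        self.rng = random.Random(seed)
        self.records = []           # (step name, input, output) samples

    def step(self, name, items, fn):
        out = []
        for item in items:
            result = fn(item)
            if self.rng.random() < self.rate:  # sample a peek at this item
                self.records.append((name, item, result))
            out.append(result)
        return out
```

With a 10% rate over 1,000 items, roughly 100 intermediate records are kept while every item is still processed normally.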

DOI: 10.1145/3468264.3468613


Identifying bad software changes via multimodal anomaly detection for online service systems

Authors: Zhao, Nengwen and Chen, Junjie and Yu, Zhaoyang and Wang, Honglin and Li, Jiesong and Qiu, Bin and Xu, Hongyu and Zhang, Wenchi and Sui, Kaixin and Pei, Dan
Keywords: Anomaly Detection, Online Service Systems, Software Change

Abstract

In large-scale online service systems, software changes are inevitable and frequent. Because they import new code or configurations, changes are likely to incur incidents and degrade user experience. It is thus essential for engineers to identify bad software changes, so as to reduce the influence of incidents and improve system reliability. To better understand bad software changes, we perform the first empirical study based on large-scale real-world data from a large commercial bank. Our quantitative analyses indicate that about 50.4% of incidents are caused by bad changes, mainly because of code defects, configuration errors, resource contention, and software versions. Besides, our qualitative analyses show that the current practice of detecting bad software changes does not handle well the heterogeneous multi-source data involved in software changes. Based on the findings and motivation obtained from the empirical study, we propose a novel approach named SCWarn, which aims to identify bad changes and produce interpretable alerts accurately and in a timely manner. The key idea of SCWarn is to draw support from multimodal learning to identify anomalies in heterogeneous multi-source data. An extensive study on two datasets with various bad software changes demonstrates that our approach significantly outperforms all the compared approaches, achieving a 0.95 F1-score on average and reducing MTTD (mean time to detect) by 20.4%∼60.7%. In particular, we share some success stories and lessons learned from practical usage.

DOI: 10.1145/3468264.3468543


JMocker: An Automatic Refactoring Framework for Replacing Test-Production Inheritance by Mocking Mechanism

Authors: Wang, Xiao and Xiao, Lu and Yu, Tingting and Woepse, Anne and Wong, Sunny
Keywords: Software Refactoring, Software Testing

Abstract

JMocker is an Eclipse plugin for automatically identifying and refactoring the use of inheritance for mocking into uses of Mockito, a well-received mocking framework. The refactoring performed by JMocker can improve the quality of unit test cases in various respects, including their cohesion, conciseness, readability/understandability, and maintainability.

DOI: 10.1145/3468264.3468590


Implementation of the Detection Tool: ÐArcher

Authors: Zhang, Wuqi and Wei, Lili and Li, Shuqing and Liu, Yepang and Cheung, Shing-Chi
Keywords: dapp, ethereum, testing, testing-framework, testing-tools

Abstract

ÐArcher is an automated testing framework aiming to detect on-chain-off-chain synchronization bugs in decentralized applications (DApps). A detailed introduction to ÐArcher can be found in the README.md file inside the artifact.

DOI: 10.1145/3468264.3468546


iBatch: saving Ethereum fees via secure and cost-effective batching of smart-contract invocations

作者: Wang, Yibo and Zhang, Qi and Li, Kai and Tang, Yuzhe and Chen, Jiaqi and Luo, Xiapu and Chen, Ting
关键词: Blockchains, DeFi, cost effectiveness, replay attacks, smart contracts

Abstract

This paper presents iBatch, a middleware system running on top of an operational Ethereum network to enable secure batching of smart-contract invocations against an untrusted relay server off-chain. iBatch does so at a low overhead by validating the server’s batched invocations in smart contracts without additional states. The iBatch mechanism supports a variety of policies, ranging from conservative to aggressive batching, and can be configured adaptively to the current workloads. iBatch automatically rewrites smart contracts to integrate with legacy applications and support large-scale deployment. For cost evaluation, we develop a platform with fast and cost-accurate transaction replaying, build real transaction benchmarks on popular Ethereum applications, and build a functional prototype of iBatch on Ethereum. The evaluation results show that iBatch saves 14.6%-59.1% Gas cost per invocation with a moderate 2-minute delay and 19.06%-31.52% Ether cost per invocation with a delay of 0.26-1.66 blocks.

DOI: 10.1145/3468264.3468568


Replication Package for smartExpander

作者: Jiang, Yanjie and Liu, Hui and Zhang, Yuxia and Niu, Nan and Zhao, Yuhai and Zhang, Lu
关键词: Abbreviation, Cliques, Data Mining, Expansion, Software Quality

Abstract

SmartExpander is a tool that decides whether a given abbreviation needs to be expanded at all. The rationale of the approach is that abbreviations should not be expanded if the expansion would result in lengthy identifiers or if developers/maintainers can easily figure out the meaning of the abbreviations. Consequently, we design a sequence of heuristics according to this rationale to identify abbreviations that do not require expansion.

DOI: 10.1145/3468264.3468616


Validation on machine reading comprehension software without annotated labels: a property-based method

作者: Chen, Songqiang and Jin, Shuo and Xie, Xiaoyuan
关键词: language understanding capability, machine reading comprehension, metamorphic relation, property-based validation

Abstract

Machine Reading Comprehension (MRC) in Natural Language Processing has seen great progress recently. However, almost all current MRC software is validated with a reference-based method, which requires well-annotated labels for test cases and tests the software by checking the consistency between the labels and the outputs. Labeling test cases for MRC can be very costly due to their complexity, which makes reference-based validation hard to extend and keep sufficient. Furthermore, solely checking consistency and measuring an overall score may not be a sensible or flexible way to assess language understanding capability. In this paper, we propose a property-based validation method for MRC software based on Metamorphic Testing to supplement reference-based validation. It does not refer to labels and hence makes much more data available for testing. Besides, it validates MRC software against various linguistic properties, giving a specific and in-depth picture of the linguistic capabilities of MRC software. Comprehensive experimental results show that our method successfully reveals violations of the target linguistic properties without labels. Moreover, it reveals problems that have been concealed by traditional validation. Comparison along the properties provides deeper and more concrete insights into the different language understanding capabilities of MRC software.
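As a toy illustration of label-free, property-based checking, the sketch below applies one metamorphic relation: appending an irrelevant distractor sentence to the context should not change the answer. This is not the paper's tool; `toy_mrc` is a made-up keyword-overlap matcher standing in for a real MRC model.

```python
def toy_mrc(context, question):
    """Toy extractive 'MRC model': return the context sentence with the
    greatest word overlap with the question (a stand-in, not a real model)."""
    qwords = set(question.lower().split())
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(qwords & set(s.lower().split())))

def mr_distractor(context, question, distractor):
    """Metamorphic relation: appending an irrelevant sentence must not
    change the answer -- checkable without any labeled ground truth."""
    return toy_mrc(context, question) == toy_mrc(context + " " + distractor, question)

ctx = "Ada Lovelace wrote the first program. Babbage designed the engine."
ok = mr_distractor(ctx, "Who wrote the first program", "It rained that day.")
```

A violation of the relation (`ok` being false) would flag a capability problem even though no reference answer was ever annotated.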

DOI: 10.1145/3468264.3468569


FLEX: fixing flaky tests in machine learning projects by updating assertion bounds

作者: Dutta, Saikat and Shi, August and Misailovic, Sasa
关键词: Extreme Value Theory, Flaky tests, Machine Learning

Abstract

Many machine learning (ML) algorithms are inherently random – multiple executions using the same inputs may produce slightly different results each time. Randomness impacts how developers write tests that check for end-to-end quality of their implementations of these ML algorithms. In particular, selecting the proper thresholds for comparing obtained quality metrics with the reference results is a non-intuitive task, which may lead to flaky test executions. We present FLEX, the first tool for automatically fixing flaky tests due to algorithmic randomness in ML algorithms. FLEX fixes tests that use approximate assertions to compare actual and expected values that represent the quality of the outputs of ML algorithms. We present a technique for systematically identifying the acceptable bound between the actual and expected output quality that also minimizes flakiness. Our technique is based on the Peak Over Threshold method from statistical Extreme Value Theory, which estimates the tail distribution of the output values observed from several runs. Based on the tail distribution, FLEX updates the bound used in the test, or selects the number of test re-runs, based on a desired confidence level. We evaluate FLEX on a corpus of 35 tests collected from the latest versions of 21 ML projects. Overall, FLEX identifies and proposes a fix for 28 tests. We sent 19 pull requests, each fixing one test, to the developers. So far, 9 have been accepted by the developers.
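The Peak-Over-Threshold step can be sketched as follows. This is a simplified illustration using a method-of-moments fit of the generalized Pareto distribution, not FLEX's actual implementation; the function name `pot_bound` and all thresholds are invented for the example.

```python
import math
import random
import statistics

def pot_bound(samples, threshold_quantile=0.8, confidence=0.999):
    """Peak-Over-Threshold: fit a generalized Pareto distribution (GPD)
    to the exceedances over a high threshold (method-of-moments fit),
    then return the `confidence` quantile of the fitted tail."""
    xs = sorted(samples)
    n = len(xs)
    u = xs[int(threshold_quantile * n)]            # threshold
    exceed = [x - u for x in xs if x > u]          # exceedances over u
    m = statistics.mean(exceed)
    v = statistics.variance(exceed)
    xi = 0.5 * (1.0 - m * m / v)                   # GPD shape estimate
    sigma = 0.5 * m * (1.0 + m * m / v)            # GPD scale estimate
    zeta = len(exceed) / n                         # exceedance rate
    p = 1.0 - confidence
    if abs(xi) < 1e-9:                             # exponential-tail limit
        return u + sigma * math.log(zeta / p)
    return u + (sigma / xi) * ((zeta / p) ** xi - 1.0)

random.seed(0)
runs = [random.gauss(0.90, 0.01) for _ in range(500)]  # metric from 500 runs
bound = pot_bound(runs)   # candidate assertion bound for the quality metric
```

An assertion such as `assert metric <= bound` would then be flaky with probability roughly `1 - confidence` per run, under the fitted tail model.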

DOI: 10.1145/3468264.3468615


PFPSanitizer - A Parallel Shadow Execution Tool for Debugging Numerical Errors

作者: Chowdhary, Sangeeta and Nagarakatte, Santosh
关键词: FPSanitizer, numerical errors, parallel execution, PFPSanitizer

Abstract

This is the artifact for the FSE 2021 paper “Parallel Shadow Execution to Accelerate the Debugging of Numerical Errors”. It provides a link to the source code and step-by-step instructions to reproduce the performance graphs and the case study from the accepted paper. We also provide a test harness to evaluate the correctness of our tool, along with the scripts and instructions required to execute the different parts of the experiment automatically.

DOI: 10.1145/3468264.3468585


Exposing numerical bugs in deep learning via gradient back-propagation

作者: Yan, Ming and Chen, Junjie and Zhang, Xiangyu and Tan, Lin and Wang, Gan and Wang, Zan
关键词: Deep Learning Testing, Gradient Back-propagation, Numerical Bug, Search-based Software Testing

Abstract

Numerical computation is dominant in deep learning (DL) programs. Consequently, numerical bugs are one of the most prominent kinds of defects in DL programs. Numerical bugs can lead to exceptional values such as NaN (Not-a-Number) and INF (Infinity), which can be propagated and eventually cause crashes or invalid outputs. They occur when special inputs cause invalid parameter values at internal mathematical operations such as log(). In this paper, we propose the first dynamic technique, called GRIST, which automatically generates small inputs that expose numerical bugs in DL programs. GRIST piggybacks on the built-in gradient computation functionality of DL infrastructures. Our evaluation on 63 real-world DL programs shows that GRIST detects 78 bugs, including 56 unknown bugs. After we submitted them to the corresponding issue repositories, eight bugs were confirmed and three bugs were fixed. Moreover, GRIST saves 8.79X execution time to expose numerical bugs compared to running the original programs with their provided inputs. Compared to the state-of-the-art static technique DEBAR, DEBAR produces 12 false positives and misses 31 true bugs (30 of which GRIST finds), whereas GRIST misses only one known bug in those programs and produces no false positives. The results demonstrate the effectiveness of GRIST.
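The core idea of steering an input toward an invalid operand can be illustrated on a toy one-dimensional "program". This sketch is not GRIST: it uses a finite-difference gradient instead of a framework's autodiff, and a hand-written `model` in place of a real DL program.

```python
import math

def sigmoid(x):
    """Numerically standard sigmoid; underflows to exactly 0.0 for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def model(x):
    # Toy DL-style computation: log of an activation.
    # math.log raises ValueError once sigmoid(x) underflows to 0.0.
    return math.log(sigmoid(x))

def expose(x0, step=50.0, max_iters=200):
    """Move the input in the direction that shrinks log's operand,
    guided by the sign of a finite-difference gradient."""
    x = x0
    for _ in range(max_iters):
        try:
            model(x)
        except ValueError:           # log(0.0): exceptional value exposed
            return x
        h = 1e-3
        grad = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # d(operand)/dx
        x -= step if grad > 0 else -step                    # descend the operand
    return None

bad_input = expose(0.0)
```

Running the original program with a benign input never trips the bug; following the operand's gradient drives the input into the invalid region within a handful of steps.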

DOI: 10.1145/3468264.3468612


Metamorphic testing of Datalog engines

作者: Mansur, Muhammad Numair and Christakis, Maria and Wüstholz, Valentin
关键词: Datalog, fuzzing, metamorphic testing

Abstract

Datalog is a popular query language with applications in several domains. Like any complex piece of software, Datalog engines may contain bugs. The most critical ones manifest as incorrect results when evaluating queries—we refer to these as query bugs. Given the wide applicability of the language, query bugs may have detrimental consequences, for instance, by compromising the soundness of a program analysis that is implemented and formalized in Datalog. In this paper, we present the first metamorphic-testing approach for detecting query bugs in Datalog engines. We ran our tool on three mature engines and found 13 previously unknown query bugs, some of which are deep and revealed critical semantic issues.
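One concrete metamorphic relation for positive Datalog (illustrative only, not necessarily among the paper's relations) is monotonicity: adding a fact can only grow the set of derived tuples. A toy naive-evaluation engine makes the oracle checkable without knowing the expected query result:

```python
def eval_path(edges):
    """Naive bottom-up evaluation of the positive Datalog program:
         path(X,Y) :- edge(X,Y).
         path(X,Z) :- path(X,Y), edge(Y,Z).
    """
    path = set(edges)
    while True:
        new = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if new <= path:          # fixpoint reached
            return path
        path |= new

base = {(1, 2), (2, 3)}
out1 = eval_path(base)
out2 = eval_path(base | {(3, 4)})   # follow-up input: one extra fact
assert out1 <= out2                 # MR: derived tuples may only grow
```

An engine whose follow-up result drops a tuple present in the original result has a query bug, with no reference output needed.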

DOI: 10.1145/3468264.3468573


Replication package for FSE '21, Synthesis of Web Layouts from Examples

作者: Lukes, Dylan and Sarracino, John and Coleman, Cora and Peleg, Hila and Lerner, Sorin and Polikarpova, Nadia
关键词: cassowary, constraint-based layout, constraints, layout, linear constraints, program synthesis, synthesis

Abstract

Contains all code, experiments, and data, as well as instructions for installation, usage and replication of our experiments.

This package contains:

In ./implementation:

  • mockdown, Our main tool, written in Python.
  • mockdown-client, A JavaScript client for mockdown, intended for use in writing benchmarks, or integrating mockdown into web applications. Used by auto-mock.
  • auto-mock (placed in implementation/web), Evaluation for the web backend (RQ1-RQ3).
  • inferui-eval (placed in implementation/android), Evaluation for the Android backend (RQ4).
  • flightlessbird.js, A fork of the kiwi.js constraint solver, with some bug fixes and changes to facilitate adding multiple constraints at once. Used by mockdown-client.

In ./layouts, there is a variety of JSON files. These correspond to our scraped websites (input data).

In ./experiments/, there is the data and scripts for our experiments:

  • overall, CSV files and Excel spreadsheets for our RQ1 trials.
  • noise, CSV files and plotting scripts for our RQ2 trials. There are two subfolders, 3/ and 10/ which correspond to the 3 and 10 training examples.
  • scaling, a CSV file, Excel spreadsheet, and helper python script for RQ3.
  • android, a CSV file for RQ4.

DOI: 10.1145/3468264.3468533


Boosting coverage-based fault localization via graph-based representation learning

作者: Lou, Yiling and Zhu, Qihao and Dong, Jinhao and Li, Xia and Sun, Zeyu and Hao, Dan and Zhang, Lu and Zhang, Lingming
关键词: Fault Localization, Graph Neural Network, Representation Learning

Abstract

Coverage-based fault localization has been extensively studied in the literature due to its effectiveness and lightweightness for real-world systems. However, existing techniques often utilize coverage in an oversimplified way by abstracting detailed coverage into numbers of tests or boolean vectors, thus limiting their effectiveness in practice. In this work, we present a novel coverage-based fault localization technique, GRACE, which fully utilizes detailed coverage information with graph-based representation learning. Our intuition is that coverage can be regarded as connective relationships between tests and program entities, which can be inherently and integrally represented by a graph structure: with tests and program entities as nodes, while with coverage and code structures as edges. Therefore, we first propose a novel graph-based representation to reserve all detailed coverage information and fine-grained code structures into one graph. Then we leverage Gated Graph Neural Network to learn valuable features from the graph-based coverage representation and rank program entities in a listwise way. Our evaluation on the widely used benchmark Defects4J (V1.2.0) shows that GRACE significantly outperforms state-of-the-art coverage-based fault localization: GRACE localizes 195 bugs within Top-1 whereas the best compared technique can at most localize 166 bugs within Top-1. We further investigate the impact of each GRACE component and find that they all positively contribute to GRACE. In addition, our results also demonstrate that GRACE has learnt essential features from coverage, which are complementary to various information used in existing learning-based fault localization. Finally, we evaluate GRACE in the cross-project prediction scenario on extra 226 bugs from Defects4J (V2.0.0), and find that GRACE consistently outperforms state-of-the-art coverage-based techniques.
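The graph-based representation the abstract describes can be sketched in a few lines; the GNN ranking itself is omitted, and names like `coverage_graph` are invented for the example.

```python
def coverage_graph(coverage, outcomes):
    """Build a bipartite graph from detailed coverage.

    coverage[t] is the set of program entities executed by test t;
    outcomes[t] is True when test t passes. Nodes carry a type label,
    and each edge connects a test to an entity it covers."""
    nodes = {f"test:{t}": {"kind": "test", "passed": outcomes[t]}
             for t in coverage}
    edges = []
    for t, stmts in coverage.items():
        for s in stmts:
            nodes.setdefault(f"stmt:{s}", {"kind": "stmt"})
            edges.append((f"test:{t}", f"stmt:{s}"))
    return nodes, edges

cov = {"t1": {"s1", "s2"}, "t2": {"s2", "s3"}}   # hypothetical coverage matrix
ok = {"t1": True, "t2": False}
nodes, edges = coverage_graph(cov, ok)
```

Unlike a boolean spectrum vector, this form keeps every test-to-entity relationship, which is what a graph neural network can then learn from.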

DOI: 10.1145/3468264.3468580


SynGuar: Guaranteeing Generalization in Programming by Example (Artifact)

作者: Wang, Bo and Baluta, Teodora and Kolluri, Aashish and Saxena, Prateek
关键词: Program Synthesis, Programming by Example

Abstract

This is the artifact accompanying the paper SynGuar: Guaranteeing Generalization in Programming by Example accepted by the conference ESEC/FSE 2021. It is a framework for PBE synthesizers that guarantees to achieve low generalization error with high probability. It contains a tool named SynGuar that dynamically calculates how many additional examples suffice to theoretically guarantee generalization. It also contains two string program synthesizers StrPROSE and StrSTUN to show how SynGuar can be used in well-known program synthesis approaches such as the PROSE framework and STUN (synthesis through unification).

DOI: 10.1145/3468264.3468621


Code for Article: StateFormer: Fine-Grained Type Recovery from Binaries using Generative State Modeling

作者: Pei, Kexin and Guan, Jonas and Broughton, Matthew and Chen, Zhongtian and Yao, Songchen and Williams-King, David and Ummadisetty, Vikas and Yang, Junfeng and Ray, Baishakhi and Jana, Suman
关键词: Machine Learning for Program Analysis, Reverse Engineering, Transfer Learning, Type Inference

Abstract

StateFormer is a tool that aims to recover source-level type information from stripped binary executables based on transfer learning. Inspired by how human analysts reason about programs, we propose a pretraining task called Generative State Modeling (GSM) to teach an ML model the operational semantics of assembly code, and then transfer the learned knowledge to type inference. See our paper for details.

DOI: 10.1145/3468264.3468607


Empirical study of transformers for source code

作者: Chirkova, Nadezhda and Troshin, Sergey
关键词: code completion, function naming, neural networks, transformer, variable misuse detection

Abstract

Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i.e., it follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.

DOI: 10.1145/3468264.3468611


Explaining mispredictions of machine learning models using rule induction

作者: Cito, Jürgen
关键词: explainability, machine learning, rule induction

Abstract

While machine learning (ML) models play an increasingly prevalent role in many software engineering tasks, their prediction accuracy is often problematic. When these models do mispredict, it can be very difficult to isolate the cause. In this paper, we propose a technique that aims to facilitate the debugging process of trained statistical models. Given an ML model and a labeled data set, our method produces an interpretable characterization of the data on which the model performs particularly poorly. The output of our technique can be useful for understanding limitations of the training data or the model itself; it can also be useful for ensembling if there are multiple models with different strengths. We evaluate our approach through case studies and illustrate how it can be used to improve the accuracy of predictive models used for software engineering tasks within Facebook.

DOI: 10.1145/3468264.3468614


Generalizable and interpretable learning for configuration extrapolation

作者: Ding, Yi and Pervaiz, Ahsan and Carbin, Michael and Hoffmann, Henry
关键词: Configuration, generalizability, interpretability, machine learning

Abstract

Modern software applications are increasingly configurable, which puts a burden on users to tune these configurations for their target hardware and workloads. To help users, machine learning techniques can model the complex relationships between software configuration parameters and performance. While powerful, these learners have two major drawbacks: (1) they rarely incorporate prior knowledge and (2) they produce outputs that are not interpretable by users. These limitations make it difficult to (1) leverage information a user has already collected (e.g., tuning for new hardware using the best configurations from old hardware) and (2) gain insights into the learner’s behavior (e.g., understanding why the learner chose different configurations on different hardware or for different workloads). To address these issues, this paper presents two configuration optimization tools, GIL and GIL+, using the proposed generalizable and interpretable learning approaches. To incorporate prior knowledge, the proposed tools (1) start from known configurations, (2) iteratively construct a new linear model, (3) extrapolate better performance configurations from that model, and (4) repeat. Since the base learners are linear models, these tools are inherently interpretable. We enhance this property with a graphical representation of how they arrived at the highest performance configuration. We evaluate GIL and GIL+ by using them to configure Apache Spark workloads on different hardware platforms and find that, compared to prior work, GIL and GIL+ produce comparable, and sometimes even better performance configurations, but with interpretable results.

DOI: 10.1145/3468264.3468603


Replication Package of the Article: Lightweight Global and Local Contexts Guided Method Name Recommendation with Prior Knowledge

作者: Wang, Shangwen and Wen, Ming and Lin, Bo and Mao, Xiaoguang
关键词: Code embedding, Deep learning, Method name recommendation

Abstract

It contains the source code of Cognac and the detailed instructions on how to reproduce the study.

DOI: 10.1145/3468264.3468567


To read or to rotate? comparing the effects of technical reading training and spatial skills training on novice programming ability

作者: Endres, Madeline and Fansher, Madison and Shah, Priti and Weimer, Westley
关键词: CS1, Spatial Ability, Technical Reading, Transfer Training

Abstract

Understanding how to best support and train novice programmers is a critical component of producing better and more diverse software engineers. In this paper, we present the results of a controlled 11-week longitudinal study with 57 CS1 students comparing two skill-based interventions to improve programming performance. The first intervention involves spatial training, an established baseline known to be helpful in engineering contexts. The second intervention is a novel CS-focused technical reading training. In our reading training, we teach strategies for summarizing scientific papers and understanding scientific charts and figures; most of the covered readings were CS1-accessible portions of computer science research papers. For the spatial training, we use a standardized training curriculum previously found to improve programming skills by focusing on spatial ability (i.e., the ability to mentally manipulate objects). We first replicate findings that both reading ability and spatial ability correlate with programming success. Significantly, however, we find that those in our reading training exhibit larger programming ability gains than those in the standard spatial training (p = 0.02, f2=0.10). We also find that reading trained participants perform particularly well on programming problems that require tracing through code (p = 0.03, f2=0.10). Our results suggest that technical reading training could be beneficial for novice programmers. Finally, we discuss the implications of our results for future CS1 interventions, the possibility for non-programming based training to positively impact developers, and future directions for software engineering education research.

DOI: 10.1145/3468264.3468583


Connecting the dots: rethinking the relationship between code and prose writing with functional connectivity

作者: Karas, Zachary and Jahn, Andrew and Weimer, Westley and Huang, Yu
关键词: code writing, expertise, fMRI, functional connectivity

Abstract

Medical imaging studies of software engineering have risen in popularity and may reveal the neural underpinnings of coding activities. To date, however, all studies in computer science venues have treated brain regions independently and in isolation. Since most complex neural activity involves coordination among multiple regions, previous analyses may overlook neural behavior. We propose to apply functional connectivity analysis to medical imaging data from software engineering tasks. Informally, this analysis treats the brain as a graph, rather than a series of independent modules, and statistically infers relevant edges. We present a functional connectivity analysis of existing data, which elucidates the interconnections between code writing and prose writing, especially regarding higher mathematics and semantic processing. First, we found a significant link between Broca’s Area (language) and the Number Form Area (higher mathematics) for coding. This both refines previous interpretations that code writing and natural language are distinct from each other, and may also contribute to the understanding of the Number Form Area in the Psychology literature. Second, we identify an area with important functional connectivity for both prose writing and coding, unlike previous analyses that associated it with coding. This advances our neural understanding of coding and prose writing, and was only exposed by using functional connectivity analysis. Third, for coding, we find a strong functional connectivity result for a brain region involved in semantic processing for language, with implications for CS training. Finally, we find a neural relationship between coding and expertise, including a more grounded explanation than prior work.

DOI: 10.1145/3468264.3468579


LastPyMile Replication Package

作者: Vu, Duc-Ly and Massacci, Fabio and Pashchenko, Ivan and Plate, Henrik and Sabetta, Antonino
关键词: Open source software, PyPI, Python, software supply chain

Abstract

The artifact consists of several CSV files generated by LastPyMile and the existing security scanning tools (e.g., bandit and PyPI MalwareChecks) and a Jupyter notebook to reproduce the table of comparison between LastPyMile and the current tools (Table 11 in the paper).

DOI: 10.1145/3468264.3468592


A grounded theory of the role of coordination in software security patch management

作者: Dissanayake, Nesara and Zahedi, Mansooreh and Jayatilaka, Asangi and Babar, Muhammad Ali
关键词: coordination, grounded theory, socio-technical factors, software security patch management

Abstract

Several disastrous security attacks can be attributed to delays in patching software vulnerabilities. While researchers and practitioners have paid significant attention to automating the vulnerability identification and patch development activities of software security patch management, relatively little effort has been dedicated to gaining an in-depth understanding of the socio-technical aspects (e.g., coordination of the interdependent activities of the patching process and patching decisions) that may cause delays in applying security patches. We report on a Grounded Theory study of the role of coordination in security patch management. The reported theory consists of four inter-related dimensions, i.e., causes, breakdowns, constraints, and mechanisms. The theory explains the causes that define the need for coordination among interdependent software/hardware components and multiple stakeholders’ decisions, the constraints that can negatively impact coordination, the breakdowns in coordination, and the potential corrective measures. This study provides potentially useful insights for researchers and practitioners, who can carefully consider the needs of, and devise suitable solutions for, supporting the coordination of interdependencies involved in security patch management.

DOI: 10.1145/3468264.3468595


TaintStream: fine-grained taint tracking for big data platforms through dynamic code translation

作者: Yang, Chengxu and Li, Yuanchun and Xu, Mengwei and Chen, Zhenpeng and Liu, Yunxin and Huang, Gang and Liu, Xuanzhe
关键词: GDPR, Taint tracking, big data platform, privacy compliance

Abstract

Big data has become valuable property for enterprises and enabled various intelligent applications. Today, it is common to host data in big data platforms (e.g., Spark), where developers can submit scripts to process the original and intermediate data tables. Meanwhile, it is highly desirable to manage the data to comply with various privacy requirements. To enable flexible and automated privacy policy enforcement, we propose TaintStream, a fine-grained taint tracking framework for Spark-like big data platforms. TaintStream works by automatically injecting taint tracking logic into the data processing scripts, and the injected scripts are dynamically translated to maintain a taint tag for each cell during execution. The dynamic translation rules are carefully designed to guarantee non-interference with the original data operations. By defining different semantics of taint tags, TaintStream can enable various data management applications such as access control, data retention, and user data erasure. Our experiments on a self-crafted benchmark suite show that TaintStream is able to achieve accurate cell-level taint tracking with a precision of 93.0% and less than 15% overhead. We also demonstrate the usefulness of TaintStream through several real-world use cases of privacy policy enforcement.
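Cell-level taint tags can be sketched as values paired with tag sets that union through operations. This is a minimal stand-in for the idea, not TaintStream's Spark script translation; the class and function names are invented for the example.

```python
class Cell:
    """A cell value paired with a set of taint tags (e.g. privacy labels)."""
    def __init__(self, value, tags=()):
        self.value, self.tags = value, frozenset(tags)

    def __add__(self, other):
        # a derived value carries the union of its sources' tags
        return Cell(self.value + other.value, self.tags | other.tags)

def release(rows, col, allowed):
    """Policy enforcement: refuse to release a cell whose tags are not
    all covered by the caller's allowed set."""
    out = []
    for row in rows:
        cell = row[col]
        if not cell.tags <= allowed:
            raise PermissionError(f"tags {set(cell.tags) - allowed} not permitted")
        out.append(cell.value)
    return out

rows = [{"age": Cell(30, {"personal"}),
         "salary": Cell(5000, {"personal", "financial"})}]
total = rows[0]["age"] + rows[0]["salary"]   # tags propagate through the op
```

Retention and erasure policies follow the same pattern: once every derived cell knows which source tags it carries, enforcement reduces to a set check per cell.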

DOI: 10.1145/3468264.3468532


Demystifying “bad” error messages in data science libraries

作者: Tao, Yida and Chen, Zhihui and Liu, Yepang and Xuan, Jifeng and Xu, Zhiwu and Qin, Shengchao
关键词: Error message, data science, debugging aid, empirical study

Abstract

Error messages are critical starting points for debugging. Unfortunately, they seem to be notoriously cryptic, confusing, and uninformative. Yet, it still remains a mystery why error messages receive such bad reputations, especially given that they are merely very short pieces of natural language text. In this paper, we empirically demystify the causes and fixes of “bad” error messages, by qualitatively studying 201 Stack Overflow threads and 335 GitHub issues. We specifically focus on error messages encountered in data science development, which is an increasingly important but not well studied domain. We found that the causes of “bad” error messages are far more complicated than poor phrasing or flawed articulation of error message content. Many error messages are inherently and inevitably misleading or uninformative, since libraries do not know user intentions and cannot “see” external errors. Fixes to error-message-related issues mostly involve source code changes, while exclusive message content updates only take up a small portion. In addition, whether an error message is informative or helpful is not always clear-cut; even error messages that clearly pinpoint faults and resolutions can still cause confusion for certain users. These findings thus call for a more in-depth investigation on how error messages should be evaluated and improved in the future.

DOI: 10.1145/3468264.3468560


NIL: large-scale detection of large-variance clones

作者: Nakagawa, Tasuku and Higo, Yoshiki and Kusumoto, Shinji
关键词: Clone Detection, Large-Variance Clone, Scalability

Abstract

A code clone (in short, clone) is a code fragment that is identical or similar to other code fragments in source code. Clones generated by a large number of changes to copy-and-pasted code fragments are called large-variance (modifications are scattered) or large-gap (modifications are in one place) clones. It is difficult for general clone detection techniques to detect such clones and thus specialized techniques are necessary. In addition, with the rapid growth of software development, scalable clone detectors that can detect clones in large codebases are required. However, there are no existing techniques for quickly detecting large-variance or large-gap clones in large codebases. In this paper, we propose a scalable clone detection technique that can detect large-variance clones from large codebases and describe its implementation, called NIL. NIL is a token-based clone detector that efficiently identifies clone candidates using an N-gram representation of token sequences and an inverted index. Then, NIL verifies the clone candidates by measuring their similarity based on the longest common subsequence between their token sequences. We evaluate NIL in terms of large-variance clone detection accuracy, general Type-1, Type-2, and Type-3 clone detection accuracy, and scalability. Our experimental results show that NIL has higher accuracy in terms of large-variance clone detection, equivalent accuracy in terms of general clone detection, and the shortest execution time for inputs of various sizes (1–250 MLOC) compared to existing state-of-the-art tools.
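NIL's two-phase structure (an N-gram inverted index for candidate filtering, then LCS-based verification) can be sketched as follows. The thresholds and token data are invented for the example, and the real tool's filtering and similarity formulas may differ.

```python
from collections import defaultdict

def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lcs_len(a, b):
    """Longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def detect_clones(blocks, n=3, filter_thresh=0.1, verify_thresh=0.7):
    index = defaultdict(set)                 # inverted index: N-gram -> blocks
    for name, toks in blocks.items():
        for g in ngrams(toks, n):
            index[g].add(name)
    shared = defaultdict(int)                # phase 1: count shared N-grams
    for members in index.values():
        for a in members:
            for b in members:
                if a < b:
                    shared[(a, b)] += 1
    clones = []
    for (a, b), cnt in shared.items():
        if cnt / min(len(ngrams(blocks[a], n)), len(ngrams(blocks[b], n))) < filter_thresh:
            continue                         # too few shared N-grams
        sim = lcs_len(blocks[a], blocks[b]) / min(len(blocks[a]), len(blocks[b]))
        if sim >= verify_thresh:             # phase 2: LCS verification
            clones.append((a, b))
    return clones

blocks = {
    "f": "int a = 0 ; a += b ; return a ;".split(),
    "g": "int a = 0 ; a += b ; log ( ) ; return a ;".split(),
    "h": "while ( p ) p = p . next ;".split(),
}
clones = detect_clones(blocks)
```

The index keeps phase 1 cheap (only pairs sharing at least one N-gram are ever considered), while the subsequence metric in phase 2 tolerates the scattered edits that characterize large-variance clones.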

DOI: 10.1145/3468264.3468564


ReqRacer: Dynamic framework for detecting and exposing server-side request races in database-backed web applications

作者: Qiu, Zhengyi and Shao, Shudi and Zhao, Qi and Jin, Guoliang
关键词: characteristic study, happens-before relationships, race detection, web-application request races

Abstract

This artifact includes scripts to build the PHP-Apache-MySQL stack, the ReqRacer source code for generating request logs, query logs, and token logs during the recording phase, off-line analysis scripts for detecting potential request races, and scripts to replay racing requests and check their effects. The artifact also includes the results of our web-application race bug characteristic study, detailed steps for manually reproducing 12 race bugs in a browser, and links to a demo video that helps people quickly understand how to use ReqRacer after installation.

DOI: 10.1145/3468264.3468594


Detecting and localizing keyboard accessibility failures in web applications

作者: Chiou, Paul T. and Alotaibi, Ali S. and Halfond, William G. J.
关键词: Keyboard Navigation, Software Testing, WCAG, Web Accessibility

Abstract

The keyboard is the most universally supported input method operable by people with disabilities. Yet, many popular websites lack keyboard-accessible mechanisms, which can cause failures that make the website unusable. In this paper, we present a novel approach for automatically detecting and localizing keyboard accessibility failures in web applications. Our extensive evaluation of our technique on real-world web pages showed that it was able to detect keyboard failures with high precision and recall and to accurately identify the underlying elements in the web pages that led to the observed problems.

DOI: 10.1145/3468264.3468581


Swarmbug: debugging configuration bugs in swarm robotics

作者: Jung, Chijung and Ahad, Ali and Jung, Jinho and Elbaum, Sebastian and Kwon, Yonghwi
关键词: configuration bug, debugging, swarm robotics

Abstract

Swarm robotics collectively solve problems that are challenging for individual robots, from environmental monitoring to entertainment. The algorithms enabling swarms allow individual robots of the swarm to plan, share, and coordinate their trajectories and tasks to achieve a common goal. Such algorithms rely on a large number of configurable parameters that can be tailored to target particular scenarios. This large configuration space, the complexity of the algorithms, and the dependencies on the robots’ setup and performance make debugging and fixing swarm configuration bugs extremely challenging. This paper proposes Swarmbug, a swarm debugging system that automatically diagnoses and fixes buggy behaviors caused by misconfiguration. The essence of Swarmbug is a novel concept called the degree of causal contribution (Dcc), which abstracts the impact of environment configurations (e.g., obstacles) on the drones in a swarm via behavior causal analysis. Swarmbug automatically generates, validates, and ranks fixes for configuration bugs. We evaluate Swarmbug on four diverse swarm algorithms. Swarmbug successfully fixes four configuration bugs in the evaluated algorithms, showing that it is generic and effective. We also conduct a real-world experiment with physical drones to show that Swarmbug’s fix is effective in the real world.

DOI: 10.1145/3468264.3468601


Artifact for article: Probabilistic Delta Debugging

作者: Wang, Guancheng and Shen, Ruobing and Chen, Junjie and Xiong, Yingfei and Zhang, Lu
关键词: Delta Debugging, Probabilistic Model

Abstract

The artifact provides a package of the tools (ProbDD) and benchmarks for the evaluation in our original paper (Probabilistic Delta Debugging). In our paper, we propose a delta debugging algorithm named ProbDD. Compared with the traditional method (i.e., ddmin), ProbDD models the probability of each element being selected in the optimal subsequence by building a probabilistic model to guide tests and updating the model based on the test results. By evaluating ProbDD on two representative approaches (i.e., HDD and CHISEL), our paper shows that replacing ddmin with ProbDD improves the existing approaches in terms of effectiveness and efficiency. The artifact has two purposes. The first is to reproduce the main results in our original paper. The baseline in our paper is the ddmin algorithm, which is integrated into HDD and CHISEL (i.e., the tools used in our paper). For the comparison between our algorithm (i.e., ProbDD) and ddmin, we replace the ddmin module with ProbDD in HDD and CHISEL, respectively. In our paper, we used two benchmarks to evaluate ProbDD, i.e., Trees and C Programs. All the mentioned tools and benchmarks are contained in this artifact. We also provide a Docker file for convenient reproduction. Users need to install Docker first and then get into the container by running the Docker file. More details can be found in the README file of the artifact. The second purpose is to provide an implementation of ProbDD that can be used for delta debugging tasks beyond the evaluation dataset. When a set of elements and a test function are provided, it reduces the elements to a smaller set. More details can be found in the README file of the artifact.
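
For context, the ddmin baseline named above can be sketched as follows (a simplified Python sketch of the classical algorithm, not the artifact's code; ProbDD instead samples subsets from a probabilistic model that is updated after every test):

```python
def ddmin(items, test):
    """Classical ddmin: repeatedly try deleting chunks of the input while
    `test` (True = behaviour of interest still present) keeps passing."""
    assert test(items)          # the full input must exhibit the behaviour
    n = 2                       # current partition granularity
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        reduced = False
        for start in range(0, len(items), chunk):
            # candidate = input with one chunk deleted (test the complement)
            candidate = items[:start] + items[start + chunk:]
            if test(candidate):
                items = candidate           # deletion succeeded: commit it
                n = max(n - 1, 2)           # coarsen slightly and restart
                reduced = True
                break
        if not reduced:
            if n >= len(items):             # granularity exhausted: 1-minimal
                break
            n = min(len(items), n * 2)      # refine the partition
    return items
```

For example, `ddmin(list(range(8)), lambda s: 3 in s and 6 in s)` reduces the input to `[3, 6]`.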

DOI: 10.1145/3468264.3468625


Artifact from “Finding Broken Linux Configuration Specifications by Statically Analyzing the Kconfig Language”

作者: Oh, Jeho and Yıldıran, Necip Fazıl and Braha, Julian and Gazzillo, Paul
关键词: formal verification, Kconfig, software configuration, static analysis

Abstract

Artifact from “Finding Broken Linux Configuration Specifications by Statically Analyzing the Kconfig Language”

DOI: 10.1145/3468264.3468578


Source code package for ‘Semantic Bug Seeding: A Learning-Based Approach for Creating Realistic Bugs’

作者: Patra, Jibesh and Pradel, Michael
关键词: bug injection, bugs, dataset, machine learning, token embeddings

Abstract

The package contains source code and documentation that may be used to run experiments mentioned in the paper.

DOI: 10.1145/3468264.3468623


Characterizing search activities on stack overflow

作者: Liu, Jiakun and Baltes, Sebastian and Treude, Christoph and Lo, David and Zhang, Yun and Xia, Xin
关键词: Data Mining, Query Logs, Query Reformulation, Stack Overflow

Abstract

To solve programming issues, developers commonly search on Stack Overflow to seek potential solutions. However, there is a gap between the knowledge developers are interested in and the knowledge they are able to retrieve using search engines. To help developers efficiently retrieve relevant knowledge on Stack Overflow, prior studies proposed several techniques to reformulate queries and generate summarized answers. However, few studies performed a large-scale analysis using real-world search logs. In this paper, we characterize how developers search on Stack Overflow using such logs. By doing so, we identify the challenges developers face when searching on Stack Overflow and seek opportunities for the platform and researchers to help developers efficiently retrieve knowledge. To characterize search activities on Stack Overflow, we use search log data based on requests to Stack Overflow’s web servers. We find that the most common search activity is reformulating the immediately preceding queries. Related work looked into query reformulations when using generic search engines and found 13 types of query reformulation strategies. Compared to their results, we observe that 71.78% of the reformulations can be fitted into those reformulation strategies. In terms of how queries are structured, 17.41% of the search sessions only search for fragments of source code artifacts (e.g., class and method names) without specifying the names of programming languages, libraries, or frameworks. Based on our findings, we provide actionable suggestions for Stack Overflow moderators and outline directions for future research. For example, we encourage Stack Overflow to set up a database that includes the relations between all computer programming terminologies shared on Stack Overflow, e.g., method name, data structure name, design pattern, and IDE name. By doing so, Stack Overflow could improve the performance of search engines by considering related programming terminologies at different levels of granularity.

DOI: 10.1145/3468264.3468582


Authorship attribution of source code: a language-agnostic approach and applicability in software engineering

作者: Bogomolov, Egor and Kovalenko, Vladimir and Rebryk, Yurii and Bacchelli, Alberto and Bryksin, Timofey
关键词: Copyrights, Machine learning, Methods of data collection, Security, Software maintenance, Software process

Abstract

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use for software engineering.

DOI: 10.1145/3468264.3468606


Probing model signal-awareness via prediction-preserving input minimization

作者: Suneja, Sahil and Zheng, Yunhui and Zhuang, Yufan and Laredo, Jim A. and Morari, Alessandro
关键词: machine learning, model signal-awareness, signal-aware recall

Abstract

This work explores the signal awareness of AI models for source code understanding. Using a software vulnerability detection use case, we evaluate the models’ ability to capture the correct vulnerability signals to produce their predictions. Our prediction-preserving input minimization (P2IM) approach systematically reduces the original source code to a minimal snippet that a model needs to maintain its prediction. The model’s reliance on incorrect signals is then uncovered when the vulnerability in the original code is missing in the minimal snippet, both of which the model however predicts as being vulnerable. We measure the signal awareness of models using a new metric we propose – Signal-aware Recall (SAR). We apply P2IM on three different neural network architectures across multiple datasets. The results show a sharp drop in the model’s Recall from the high 90s to sub-60s with the new metric, highlighting that the models are presumably picking up a lot of noise or dataset nuances while learning their vulnerability detection logic. Although the drop in model performance may be perceived as an adversarial attack, this is not P2IM’s objective. The idea is rather to uncover the signal-awareness of a black-box model in a data-driven manner via controlled queries. SAR’s purpose is to measure the impact of task-agnostic model training, and not to suggest a shortcoming in the Recall metric. The expectation, in fact, is for SAR to match Recall in the ideal scenario where the model truly captures task-specific signals.
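
The SAR idea can be illustrated with a toy sketch (all helpers here, `greedy_minimize` and the token-set "code", are hypothetical stand-ins for P2IM and real programs, not the paper's implementation):

```python
def signal_aware_recall(samples, model, minimize, contains_signal):
    """SAR over labeled (code, vulnerable) samples: like Recall, but a true
    positive counts only if its minimal snippet still contains the signal."""
    aware, positives = 0, 0
    for code, vulnerable in samples:
        if not vulnerable:
            continue
        positives += 1
        if model(code):                        # true positive
            snippet = minimize(code, model)    # P2IM-style reduction
            if contains_signal(snippet):
                aware += 1                     # predicted for the right reason
    return aware / positives if positives else 0.0

def greedy_minimize(code, model):
    """Drop tokens one by one while the model's prediction is preserved."""
    current = set(code)
    for token in list(current):
        if model(current - {token}):
            current -= {token}
    return current
```

On a toy detector that fires on either the real signal ("BUG") or a dataset artefact ("noise"), plain Recall is 1.0 while SAR drops to 0.5, exposing the reliance on noise.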

DOI: 10.1145/3468264.3468545


Generating efficient solvers from constraint models

作者: Lin, Shu and Meng, Na and Li, Wenxin
关键词: Combinatorial problems (CP), automated DP optimization, constraint solvers, static analysis of problem properties

Abstract

Combinatorial problems (CPs) arise in many areas, and people use constraint solvers to automatically solve these problems. However, the state-of-the-art constraint solvers (e.g., Gecode and Chuffed) have overly complicated software architectures; they compute solutions inefficiently. This paper presents a novel, model-driven approach, SoGen, to synthesize efficient problem-specific solvers from constraint models. Namely, when users model a CP with our domain-specific language PDL (short for Problem Description Language), SoGen automatically analyzes various properties of the problem (e.g., search space, value boundaries, function monotonicity, and overlapping subproblems), synthesizes an efficient solver algorithm based on those properties, and generates a C program as the problem solver. PDL is unique because it can create solvers that resolve constraints via dynamic programming (DP) search. For evaluation, we compared the solvers generated by SoGen with two state-of-the-art constraint solvers, Gecode and Chuffed. SoGen’s solvers resolved constraints more efficiently; they achieved up to 6,058x speedup over Gecode and up to 31,300x speedup over Chuffed. Additionally, we experimented with both SoGen and the state-of-the-art solver generator Dominion. We found that SoGen generates solvers faster and that the produced solvers are more efficient.

DOI: 10.1145/3468264.3468566


Replication Package for Article: “A Comprehensive Study of Deep Learning Compiler Bugs”

作者: Shen, Qingchao and Ma, Haoyang and Chen, Junjie and Tian, Yongqiang and Cheung, Shing-Chi and Chen, Xiang
关键词: DL Compiler Bugs, TVMFuzz

Abstract

This artifact has two components: the labeled dataset from our empirical study and our bug detection tool TVMFuzz. The folder named dataset includes the basic information of 603 bugs collected from GitHub by the authors. The folder named TVMfuzz contains a tool we designed to fuzz TVM, one of the most widely used deep learning compilers. More details can be found in README.md or at https://github.com/ShenQingchao/DLCstudy

DOI: 10.1145/3468264.3468591


Replication Package for “Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline”

作者: Biswas, Sumon and Rajan, Hridesh
关键词: fairness, machine learning, models, pipeline, preprocessing

Abstract

The artifact contains the benchmark, source code and data used in our ESEC/FSE 2021 paper on “Fair Preprocessing”. The benchmark can be used by other researchers and practitioners to evaluate the fairness of real-world machine learning (ML) pipelines collected from Kaggle. In addition, we released our implementation of the novel metrics proposed to measure component level fairness in the pipelines. The artifact also contains five popular datasets used in fairness research.

DOI: 10.1145/3468264.3468536


Replication package for article: Fairea: A Model Behaviour Mutation Approach to Benchmarking Bias Mitigation Methods

作者: Hort, Max and Zhang, Jie M. and Sarro, Federica and Harman, Mark
关键词: software fairness

Abstract

This on-line appendix is supplementary to the paper entitled “Fairea: A Model Behaviour Mutation Approach to Benchmarking Bias Mitigation Methods”, which has been accepted at FSE’21. It contains the data used in the study, raw results, Python code for the proposed approach, and scripts to replicate our experiments.

DOI: 10.1145/3468264.3468565


Library and Demo for Article: Feature Trace Recording

作者: Bittner, Paul Maximilian and Schultheiß, Alexander
关键词: clone-and-own, feature location, feature traceability, software evolution, software product lines, variability mining

Abstract

The artefact mainly consists of a library written in the Haskell language that implements feature trace recording. The library is accompanied by a demo application that uses the library to reproduce our motivating example (Alice and Bob using feature trace recording in Section 2 of our paper) as well as examples of the edit patterns we used to evaluate feature trace recording (Section 5).

DOI: 10.1145/3468264.3468531


Data and script for the paper “A Longitudinal Analysis of Bloated Java Dependencies”

作者: Soto-Valero, César
关键词: Dependencies, Java, Software bloat

Abstract

This repository contains the data and script for the paper “A Longitudinal Analysis of Bloated Java Dependencies”

Repository structure:

  • dataset
    • projects.csv # list of the 500 projects used in the paper
    • commits.csv # list of the commits that are analyzed
    • project_dependabot.json # Dependabot commits for each project
    • project_releases.json # commits associated with a release for each project
  • dependency_usage_tree
    • depclean.json # the dependency usage tree extracted by DepClean
    • compile.log.zip # Maven compilation log
    • depClean.log.zip # DepClean log
  • script
    • create_dataset.js # create projects.csv and commits.csv based on project_releases.json and project_dependabot.json
    • read_dependency_usage_tree.js # extract the information from dependency_usage_tree and generate a csv file
    • analysis.py # read dependency_usage_tree.csv and generate the macros and tables for the paper

DOI: 10.1145/3468264.3468589


CSO Dataset Analysis for Article: XAI Tools in the Public Sector: A Case Study on Predicting Combined Sewer Overflows

作者: Maltbie, Nicholas and Niu, Nan and Van Doren, Matthew and Johnson, Reese
关键词: AI, Explainability, Goal Question Metric, LSTM, Machine Learning, Requirements Engineering

Abstract

This repository is provided as supplementary material for the paper “XAI Tools in the Public Sector: A Case Study on Predicting Combined Sewer Overflows” by Nicholas Maltbie, Nan Niu, Reese Johnson, and Matthew VanDoren.

These are the notes for the CSO case study: how the data is prepared, how the ML models are tuned and created, and the final interpretability analysis.

This repository contains instructions on how to use the code required to create models for the dataset, how to apply these models to a sample dataset, and how to gather explainability results for our research.

DOI: 10.1145/3468264.3468547


How disabled tests manifest in test maintainability challenges?

作者: Kim, Dong Jae and Yang, Bo and Yang, Jinqiu and Chen, Tse-Hsun (Peter)
关键词: Technical Debt, Test Disabling, Test Maintenance, Test Smell

Abstract

Software testing is an essential software quality assurance practice. Testing helps expose faults earlier, allowing developers to repair the code and reduce future maintenance costs. However, repairing (i.e., making failing tests pass) may not always be done immediately. Bugs may require multiple rounds of repairs and even remain unfixed due to the difficulty of bug-fixing tasks. To help test maintenance, along with code comments, the majority of testing frameworks (e.g., JUnit and TestNG) have also introduced annotations such as @Ignore to disable failing tests temporarily. Although disabling tests may help alleviate maintenance difficulties, they may also introduce technical debt. With the faster release of applications in modern software development, disabling tests may become the salvation for many developers to meet project deliverables. In the end, disabled tests may become outdated and a source of technical debt, harming long-term maintenance. Despite its harmful implications, there is little empirical research evidence on the prevalence, evolution, and maintenance of disabling tests in practice. To fill this gap, we perform the first empirical study on test disabling practice. We develop a tool to mine 122K commits and detect 3,111 changes that disable tests from 15 open-source Java systems. Our main findings are: (1) Test disabling changes are 19% more common than regular test refactorings, such as renames and type changes. (2) Our life-cycle analysis shows that 41% of disabled tests are never brought back to evaluate software quality, and most disabled tests stay disabled for several years. (3) We unveil the motivations behind test disabling practice and the associated technical debt by manually studying evolutions of 349 unique disabled tests, achieving a 95% confidence level and a 5% confidence interval. Finally, we present some actionable implications for researchers and developers.

DOI: 10.1145/3468264.3468609


Sustainability forecasting for Apache incubator projects

作者: Yin, Likang and Chen, Zhuangzhi and Xuan, Qi and Filkov, Vladimir
关键词: Apache Incubator, OSS Sustainability, Sociotechnical System

Abstract

Although OSS development is very popular, ultimately more than 80% of OSS projects fail. Identifying the factors associated with OSS success can help in devising interventions when a project takes a downturn. OSS success has been studied from a variety of angles, more recently in empirical studies of large numbers of diverse projects, using proxies for sustainability, e.g., internal metrics related to productivity and external ones, related to community popularity. The internal socio-technical structure of projects has also been shown important, especially their dynamics. This points to another angle on evaluating software success, from the perspective of self-sustaining and self-governing communities. To uncover the dynamics of how a project at a nascent development stage gradually evolves into a sustainable one, here we apply a socio-technical network modeling perspective to a dataset of Apache Software Foundation Incubator (ASFI), sustainability-labeled projects. To identify and validate the determinants of sustainability, we undertake a mix of quantitative and qualitative studies of ASFI projects’ socio-technical network trajectories. We develop interpretable models which can forecast a project becoming sustainable with 93+% accuracy, within 8 months of incubation start. Based on the interpretable models we describe a strategy for real-time monitoring and suggesting actions, which can be used by projects to correct their sustainability trajectories.

DOI: 10.1145/3468264.3468563


Graph-based seed object synthesis for search-based unit testing

作者: Lin, Yun and Ong, You Sheng and Sun, Jun and Fraser, Gordon and Dong, Jin Song
关键词: code synthesis, object oriented, search-based, software testing

Abstract

Search-based software testing (SBST) generates tests using search algorithms guided by measurements gauging how far a test case is away from exercising a coverage goal. The effectiveness of SBST largely depends on the continuity and monotonicity of the fitness landscape decided by these measurements and the search operators. Unfortunately, the fitness landscape is challenging when the function under test takes object inputs, as classical measurements hardly provide guidance for constructing legitimate object inputs. To overcome this problem, we propose test seeds, i.e., test code skeletons of legitimate objects which enable the use of classical measurements. Given a target branch in a function under test, we first statically analyze the function to build an object construction graph that captures the relation between the operands of the target method and the states of their relevant object inputs. Based on the graph, we synthesize test template code where each “slot” is a mutation point for the search algorithm. This approach can be seamlessly integrated with existing SBST algorithms, and we implemented it as EvoObj on top of EvoSuite. Our experiments show that EvoObj outperforms EvoSuite with statistical significance on 2750 methods over 103 open-source Java projects using state-of-the-art SBST algorithms.

DOI: 10.1145/3468264.3468619


LS-sampling: an effective local search based sampling approach for achieving high t-wise coverage

作者: Luo, Chuan and Sun, Binqi and Qiao, Bo and Chen, Junjie and Zhang, Hongyu and Lin, Jinkun and Lin, Qingwei and Zhang, Dongmei
关键词: Combinatorial Interaction Testing, Local Search, Sampling

Abstract

There has been a rapidly increasing demand for developing highly configurable software systems, which urgently calls for effective testing methods. In practice, t-wise coverage has been widely recognized as a useful metric to evaluate the quality of a test suite for testing highly configurable software systems, and achieving high t-wise coverage is important for ensuring test adequacy. However, state-of-the-art methods usually take a fairly long time to generate large test suites for high pairwise coverage (i.e., 2-wise coverage), which leads to ineffective and inefficient testing of highly configurable software systems. In this paper, we propose a novel local search based sampling approach dubbed LS-Sampling for achieving high t-wise coverage. Extensive experiments on a large number of public benchmarks, which are collected from real-world, highly configurable software systems, show that LS-Sampling achieves higher 2-wise and 3-wise coverage than the current state of the art. LS-Sampling is effective, since on average it achieves a 2-wise coverage of 99.64% and a 3-wise coverage of 97.87% through generating a small test suite consisting of only 100 test cases (90% smaller than the test suites generated by its state-of-the-art competitors). Furthermore, LS-Sampling is efficient, since it only requires an average execution time of less than one minute to generate a test suite with high 2-wise and 3-wise coverage.
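
As a reference point for the metric, the 2-wise coverage of a test suite over boolean options can be computed as in this sketch (simplified and our own, not LS-Sampling's code: every value pair is assumed valid, whereas real configurable systems have constraints that exclude some pairs):

```python
from itertools import combinations

def pairwise_coverage(tests, n_options):
    """Fraction of option-value pairs covered by a boolean test suite.
    Each test is a tuple of n_options 0/1 values; all 4 value
    combinations per option pair are assumed valid here."""
    covered = set()
    for t in tests:
        for i, j in combinations(range(n_options), 2):
            covered.add((i, j, t[i], t[j]))   # this pair of values is hit
    total_pairs = n_options * (n_options - 1) // 2
    return len(covered) / (total_pairs * 4)   # 4 value combos per pair
```

For example, a suite of the two tests `(0, 0, 0)` and `(1, 1, 1)` over three options covers 6 of the 12 value pairs, i.e., 50%.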

DOI: 10.1145/3468264.3468622


Replication Package for Article: GLIB: Towards Automated Test Oracle for Graphically-Rich Applications

作者: Chen, Ke and Li, Yufei and Chen, Yingfeng and Fan, Changjie and Hu, Zhipeng and Yang, Wei
关键词: Python, PyTorch

Abstract

This artifact is the implementation for game UI glitch detection, users could either use our pre-trained model to directly detect game UI bugs or train from scratch. It also contains evaluation and salience map error localization.

DOI: 10.1145/3468264.3468586


Reassessing automatic evaluation metrics for code summarization tasks

作者: Roy, Devjeet and Fakhoury, Sarah and Arnaoudova, Venera
关键词: automatic evaluation metrics, code summarization, machine translation

Abstract

In recent years, research in the domain of source code summarization has adopted data-driven techniques pioneered in machine translation (MT). Automatic evaluation metrics such as BLEU, METEOR, and ROUGE, are fundamental to the evaluation of MT systems and have been adopted as proxies of human evaluation in the code summarization domain. However, the extent to which automatic metrics agree with the gold standard of human evaluation has not been evaluated on code summarization tasks. Despite this, marginal improvements in metric scores are often used to discriminate between the performance of competing summarization models. In this paper, we present a critical exploration of the applicability and interpretation of automatic metrics as evaluation techniques for code summarization tasks. We conduct an empirical study with 226 human annotators to assess the degree to which automatic metrics reflect human evaluation. Results indicate that metric improvements of less than 2 points do not guarantee systematic improvements in summarization quality, and are unreliable as proxies of human evaluation. When the difference between metric scores for two summarization approaches increases but remains within 5 points, some metrics such as METEOR and chrF become highly reliable proxies, whereas others, such as corpus BLEU, remain unreliable. Based on these findings, we make several recommendations for the use of automatic metrics to discriminate model performance in code summarization.

DOI: 10.1145/3468264.3468588


Toward efficient interactions between Python and native libraries

作者: Tan, Jialiang and Chen, Yu and Liu, Zhenming and Ren, Bin and Song, Shuaiwen Leon and Shen, Xipeng and Liu, Xu
关键词: PMU, Python, debug register, profiling

Abstract

Python has become a popular programming language because of its excellent programmability. Many modern software packages utilize Python for high-level algorithm design and depend on native libraries written in C/C++/Fortran for efficient computation kernels. Interaction between Python code and native libraries introduces performance losses because of the abstraction lying on the boundary of Python and native libraries. On the one side, Python code, typically run with interpretation, is disjoint from its execution behavior. On the other side, native libraries do not include program semantics to understand algorithm defects. To understand the interaction inefficiencies, we extensively study a large collection of Python software packages and categorize them according to the root causes of inefficiencies. We extract two inefficiency patterns that are common in interaction inefficiencies. Based on these patterns, we develop PieProf, a lightweight profiler, to pinpoint interaction inefficiencies in Python applications. The principle of PieProf is to measure the inefficiencies in the native execution and associate inefficiencies with high-level Python code to provide a holistic view. Guided by PieProf, we optimize 17 real-world applications, yielding speedups of up to 6.3×.

DOI: 10.1145/3468264.3468541


Accelerating JavaScript Static Analysis via Dynamic Shortcuts (Artifact Evaluation)

作者: Park, Joonyoung and Park, Jihyeok and Youn, Dongjun and Ryu, Sukyoung
关键词: dynamic analysis, dynamic shortcut, JavaScript, sealed execution, static analysis

Abstract

We present dynamic shortcuts, a new technique to flexibly switch between abstract and concrete execution during JavaScript static analysis in a sound way. SAFEDS is the actual instance of dynamic shortcuts (DS), based on Jalangi2, built to accelerate the static analyzer SAFE. It can significantly improve analysis performance and precision by using highly optimized commercial JavaScript engines (V8 in Node.js in our setting) and lessens the modeling effort for opaque code. We submit our artifact for the Reusable and Artifacts Available badges. Our artifact provides the reproducible experimental environment, the full results of the experiments presented in the paper, and the commands for SAFEDS to analyze a new input program. A user can reproduce the experiments presented in the paper, i.e., the comparison of the analysis performance of SAFE and SAFEDS on the Lodash4 tests. There are script files to juxtapose the experimental results for each RQ with the numbers in the paper. The README file in the root directory describes the above in detail. This package is forked from SAFE and imports Jalangi2 as a git submodule. The package is released under the BSD license. We added the option “ds” to the original SAFE to trigger dynamic shortcuts. When the option is turned on, SAFEDS communicates with the Node.js server in the dynamic-shortcut directory, and the server runs Jalangi2 for dynamic analysis of functions in the target program on the concrete engine. The requirements of SAFEDS are inherited from SAFE and Jalangi2 and specified in the REQUIREMENTS file. The INSTALL file guides the initialization of the submodule, SAFE, and Jalangi2.

DOI: 10.1145/3468264.3468556


Skeletal approximation enumeration for SMT solver testing

作者: Yao, Peisen and Huang, Heqing and Tang, Wensheng and Shi, Qingkai and Wu, Rongxin and Zhang, Charles
关键词: SMT solver testing, metamorphic testing, mutation-based testing

Abstract

Ensuring the quality of SMT solvers is critical due to their broad spectrum of applications in academia and industry, such as symbolic execution and program verification. Existing approaches to testing SMT solvers are either too costly or have difficulty generalizing to different solvers and theories, due to the test oracle problem. To complement existing approaches and overcome their weaknesses, this paper introduces skeletal approximation enumeration (SAE), a novel lightweight and general testing technique for all first-order theories. To demonstrate its practical utility, we have applied the SAE technique to test Z3 and CVC4, two comprehensively tested, state-of-the-art SMT solvers. By the time of writing, our approach had found 71 confirmed bugs in Z3 and CVC4, 55 of which had already been fixed.

DOI: 10.1145/3468264.3468540


Boosting Static Analysis Accuracy With Instrumented Test Executions (Paper Artifact)

作者: Chen, Tianyi and Heo, Kihong and Raghothaman, Mukund
关键词: alarm ranking, Bayesian inference, belief networks, dynamic analysis, Static analysis

Abstract

Artifact associated with the paper “Boosting Static Analysis Accuracy with Instrumented Test Executions”, recently accepted to FSE 2021.

DOI: 10.1145/3468264.3468626


Replication Package for Paper: Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis

作者: Luo, Yicheng and Filieri, Antonio and Zhou, Yuan
关键词: importance sampling, JAX, MCMC, probabilistic programming, symbolic execution

Abstract

Replication package and reference implementation for

Yicheng Luo, Antonio Filieri, and Yuan Zhou. Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021). arXiv:2010.05050.

DOI: 10.1145/3468264.3468593


IDE support for cloud-based static analyses

作者: Luo, Linghui and Schäf, Martin
关键词: IDE integration, SAST tools, cloud service, security testing, static analysis

Abstract

Integrating static analyses into continuous integration (CI) or continuous delivery (CD) has become the best practice for assuring code quality and security. Static Application Security Testing (SAST) tools fit well into CI/CD, because CI/CD allows time for deep static analyses on large code bases and prevents vulnerabilities in the early stages of the development lifecycle. In CI/CD, the SAST tools usually run in the cloud and provide findings via a web interface. Recent studies show that developers prefer seeing the findings of these tools directly in their IDEs. Most tools with IDE integration run lightweight static analyses and can give feedback at coding time, but SAST tools used in CI/CD take longer to run and usually are not able to do so. Can developers interact directly with a cloud-based SAST tool that is typically used in CI/CD through their IDE? We investigated if such a mechanism can integrate cloud-based SAST tools better into developers’ workflows than web-based solutions. We interviewed developers to understand their expectations from an IDE solution. Guided by these interviews, we implemented an IDE prototype for an existing cloud-based SAST tool. With a usability test using this prototype, we found that the IDE solution promoted more frequent tool interactions. In particular, developers performed code scans three times more often. This indicates better integration of the cloud-based SAST tool into developers’ workflows. Furthermore, while our study did not show statistically significant improvement in developers’ code-fixing performance, it did show a promising reduction in time for fixing vulnerable code.

DOI: 10.1145/3468264.3468535


Replication package for the paper: A Bounded Symbolic-Size Model for Symbolic Execution

Authors: Trabish, David and Itzhaky, Shachar and Rinetzky, Noam
Keywords: Symbolic Execution

Abstract

The package contains:

- Source code for our tool (including all the required compiled binaries)
- Benchmarks used in the experiments
- Scripts for running the experiments

DOI: 10.1145/3468264.3468596


Efficient Module-Level Dynamic Analysis for Dynamic Languages with Module Recontextualization (Lya Artifact)

Authors: Vasilakis, Nikos and Ntousakis, Grigoris and Heller, Veit and Rinard, Martin C.
Keywords: Analysis, Dynamic, Instrumentation, Performance, Recontextualization, Runtime, Security

Abstract

Lya uses a novel set of module transformation techniques, collectively termed module recontextualization, to bolt a high-performance analysis and instrumentation infrastructure onto a conventional production runtime. Lya achieves high performance by analyzing code at a coarser-than-usual granularity, meaning that Lya’s analyses operate at a lower resolution than conventional analysis frameworks but with significantly better performance, enabling always-on operation in production environments. Such coarse-grained, high-performance analyses have been shown to infer useful information about the execution of multi-library programs. Examples include identifying security vulnerabilities, highlighting performance bottlenecks, and applying corrective actions.
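Lya itself targets a JavaScript runtime, but the core idea of bolting analysis onto module boundaries can be sketched in a few lines of Python: wrap every public function a module exports in a proxy that fires an analysis hook once per cross-module call. This is an analogy under stated assumptions, not Lya's implementation, and the hook and names below are hypothetical:

```python
import functools
import math
import types

def recontextualize(module, on_call):
    """Return a proxy module whose public functions invoke an analysis hook
    once per call, i.e. instrumentation at module-boundary granularity."""
    proxy = types.ModuleType(module.__name__ + "_traced")
    for name in dir(module):
        attr = getattr(module, name)
        if callable(attr) and not name.startswith("_"):
            def make_wrapper(fn, fname):
                @functools.wraps(fn)
                def wrapper(*args, **kwargs):
                    on_call(module.__name__, fname)  # coarse-grained analysis hook
                    return fn(*args, **kwargs)
                return wrapper
            setattr(proxy, name, make_wrapper(attr, name))
        else:
            setattr(proxy, name, attr)  # constants pass through untouched
    return proxy

calls = {}
traced_math = recontextualize(
    math, lambda mod, fn: calls.update({(mod, fn): calls.get((mod, fn), 0) + 1}))
print(traced_math.sqrt(16.0))   # behaves exactly like math.sqrt
print(calls[("math", "sqrt")])  # but the analysis observed the boundary crossing
```

Because instrumentation happens only where one module calls into another, the per-instruction overhead of conventional instrumentation frameworks is avoided, which is the trade-off the abstract describes.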

DOI: 10.1145/3468264.3468574


Mono2Micro: a practical and effective tool for decomposing monolithic Java applications to microservices

Authors: Kalia, Anup K. and Xiao, Jin and Krishna, Rahul and Sinha, Saurabh and Vukovic, Maja and Banerjee, Debasish
Keywords: clustering, dynamic analysis, microservices

Abstract

In migrating production workloads to the cloud, enterprises often face the daunting task of evolving monolithic applications toward a microservice architecture. At IBM, we developed a tool called Mono2Micro to assist with this challenging task. Mono2Micro performs spatio-temporal decomposition, leveraging well-defined business use cases and runtime call relations to create functionally cohesive partitioning of application classes. Our preliminary evaluation of Mono2Micro showed promising results. How well does Mono2Micro perform against other decomposition techniques, and how do practitioners perceive the tool? This paper describes the technical foundations of Mono2Micro and presents results to answer these two questions. To answer the first question, we evaluated Mono2Micro against four existing techniques on a set of open-source and proprietary Java applications, using different metrics to assess the quality of decomposition and the tool’s efficiency. Our results show that Mono2Micro significantly outperforms state-of-the-art baselines in specific metrics well-defined for the problem domain. To answer the second question, we conducted a survey of twenty-one practitioners in various industry roles who have used Mono2Micro. This study highlights several benefits of the tool, interesting practitioner perceptions, and scope for further improvements. Overall, these results show that Mono2Micro can provide a valuable aid to practitioners in creating functionally cohesive and explainable microservice decompositions.

DOI: 10.1145/3468264.3473915


Data-driven test selection at scale

Authors: Mehta, Sonu and Farmahinifarahani, Farima and Bhagwan, Ranjita and Guptha, Suraj and Jafari, Sina and Kumar, Rahul and Saini, Vaibhav and Santhiar, Anirudh
Keywords: continuous integration, statistical models, test selection

Abstract

Large-scale services depend on Continuous Integration/Continuous Deployment (CI/CD) processes to maintain their agility and code quality. Change-based testing plays an important role in finding bugs, but testing after every change is prohibitively expensive at a scale where thousands of changes are committed every hour. Test selection models deal with this issue by running a subset of tests for every change. In this paper, we present a generic, language-agnostic and lightweight statistical model for test selection. Unlike existing techniques, the proposed model does not require complex feature extraction techniques. Consequently, it scales to hundreds of repositories of varying characteristics while capturing more than 99% of buggy pull requests. Additionally, to better evaluate test selection models, we propose application-specific metrics that capture both a reduction in resource cost and a reduction in pull-request turn-around time. By evaluating our model on 22 large repositories at Microsoft, we find that we can save 15%–30% of compute time while still reporting back more than 99% of buggy pull requests.
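The abstract does not spell out the model, but a deliberately naive statistical test selector conveys the flavor: estimate, from historical pull requests, how often each test failed when a given file changed, and at selection time pick every test whose estimated failure probability for the current change exceeds a threshold. This is a hypothetical sketch for illustration, not the paper's model:

```python
from collections import defaultdict

class TestSelector:
    """Toy statistical test selection: per (file, test) co-failure counts,
    thresholded at selection time. Illustrative only."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.changed = defaultdict(int)   # file -> number of times it was changed
        self.failed = defaultdict(int)    # (file, test) -> co-failure count

    def record(self, changed_files, failed_tests):
        """Feed one historical pull request into the model."""
        for f in changed_files:
            self.changed[f] += 1
            for t in failed_tests:
                self.failed[(f, t)] += 1

    def select(self, changed_files, all_tests):
        """Select tests likely to fail for this change (max over changed files)."""
        picked = set()
        for t in all_tests:
            p = max((self.failed[(f, t)] / self.changed[f]
                     for f in changed_files if self.changed[f]), default=0.0)
            if p >= self.threshold:
                picked.add(t)
        return picked

sel = TestSelector(threshold=0.5)
sel.record(["parser.c"], ["test_parse"])  # a parser change once broke test_parse
sel.record(["parser.c"], [])              # another parser change, no failures
sel.record(["ui.py"], ["test_render"])
print(sel.select(["parser.c"], ["test_parse", "test_render"]))
```

A production model additionally needs smoothing for rarely changed files and the turn-around-time metrics the paper proposes; none of that is attempted here.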

DOI: 10.1145/3468264.3473916


Effective low capacity status prediction for cloud systems

Authors: Dong, Hang and Qin, Si and Xu, Yong and Qiao, Bo and Zhou, Shandan and Yang, Xian and Luo, Chuan and Zhao, Pu and Lin, Qingwei and Zhang, Hongyu and Abuduweili, Abulikemu and Ramanujan, Sanjay and Subramanian, Karthikeyan and Zhou, Andrew and Rajmohan, Saravanakumar and Zhang, Dongmei and Moscibroda, Thomas
Keywords: capacity prediction, cloud computing, feature embedding, software reliability

Abstract

In cloud systems, accurate capacity planning is very important for cloud providers to improve service availability. Traditional methods that simply predict when the available resources will be exhausted are not effective, due to customer demand fragmentation and platform allocation constraints. In this paper, we propose a novel prediction approach that proactively predicts the level of resource allocation failures from the perspective of low capacity status (LCS). By jointly considering data from different sources, in both time-series and static form, the proposed approach can make accurate LCS predictions in a complex and dynamic cloud environment, and thereby improve the service availability of cloud systems. The proposed approach is evaluated on real-world datasets collected from a large-scale public cloud platform, and the results confirm its effectiveness.

DOI: 10.1145/3468264.3473917


Automated code transformation for context propagation in Go

Authors: Welc, Adam
Keywords: Go, automated code transformation, context propagation, services

Abstract

Microservices architecture, which is increasingly being adopted by large technology companies, can accelerate development and deployment of backend code by partitioning a monolithic infrastructure into independent components. At the same time, microservices often compose into massively distributed systems, where analyzing the behavior of an individual service may not be enough to diagnose a performance regression or find a point of failure. Instead, a more global view of the entire computation may be required, where some form of global context is used to trace relevant information flowing through the system. In the Go language, the recommended method of propagating this tracing context through a service’s code is to pass it as the first parameter to all functions on call paths where the context is used. This kind of code transformation, in addition to modifying function calls and function signatures, may involve modifications to other language constructs – performing it manually can be very tedious, particularly for large existing services. In this paper we describe an automated code transformation tool supporting this style of context propagation. We describe the design and implementation of the tool and, based on a case study using real production services, demonstrate that the tool can eliminate, on average, 94% of the manual effort required to propagate tracing context through the code of a given service.
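A full Go rewriter is beyond a sketch, but the shape of the transformation (adding a leading context parameter to every function on an affected call path and threading it through call sites) can be illustrated with Python's ast module. The function names are hypothetical, and this is only an analogy to the tool described above:

```python
import ast

# Functions already known to take a context as their first parameter
# (hypothetical names; in Go this would be a context.Context argument).
CTX_FUNCS = {"fetch_user"}

class CtxPropagator(ast.NodeTransformer):
    """Add a leading 'ctx' parameter to any function that calls a
    context-taking function, and pass ctx through at each call site."""
    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # rewrite call sites in the body first
        calls_ctx = any(
            isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            and n.func.id in CTX_FUNCS
            for n in ast.walk(node))
        if calls_ctx and node.name not in CTX_FUNCS:
            node.args.args.insert(0, ast.arg(arg="ctx"))
            CTX_FUNCS.add(node.name)  # callers of this function now need ctx too
        return node

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id in CTX_FUNCS:
            node.args.insert(0, ast.Name(id="ctx", ctx=ast.Load()))
        return node

src = """
def fetch_user(ctx, uid):
    return uid

def handler(uid):
    return fetch_user(uid)
"""
tree = CtxPropagator().visit(ast.parse(src))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))
```

A real tool must also handle methods, higher-order calls, and iterate to a fixed point over the call graph; the single pass above only works because the callee happens to be defined before its caller.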

DOI: 10.1145/3468264.3473918


Onion: identifying incident-indicating logs for cloud systems

Authors: Zhang, Xu and Xu, Yong and Qin, Si and He, Shilin and Qiao, Bo and Li, Ze and Zhang, Hongyu and Li, Xukun and Dang, Yingnong and Lin, Qingwei and Chintalapati, Murali and Rajmohan, Saravanakumar and Zhang, Dongmei
Keywords: Cloud computing, Fault diagnosis, Incident-indicating logs identification, Log analysis

Abstract

In cloud systems, incidents affect the availability of services and require quick mitigation actions. Once an incident occurs, operators and developers often examine logs to perform fault diagnosis. However, the large volume of diverse logs and the overwhelming details in log data make the manual diagnosis process time-consuming and error-prone. In this paper, we propose Onion, an automatic solution for precisely and efficiently locating incident-indicating logs, which can provide useful clues for diagnosing the incidents. We first point out three criteria for localizing incident-indicating logs, i.e., Consistency, Impact, and Bilateral-Difference. Then we propose a novel agglomeration of logs, called log clique, based on which these criteria are satisfied. To obtain log cliques, we develop an incident-aware log representation and a progressive log clustering technique. Contrast analysis is then performed on the cliques to identify the incident-indicating logs. We have evaluated Onion using well-labeled log datasets. Onion achieves an average F1-score of 0.95 and can process millions of logs in only a few minutes, demonstrating its effectiveness and efficiency. Onion has also been successfully applied to the cloud system of Microsoft. Its practicability has been confirmed through the quantitative and qualitative analysis of the real incident cases.

DOI: 10.1145/3468264.3473919


Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study

Authors: Ayerdi, Jon and Terragni, Valerio and Arrieta, Aitor and Tonella, Paolo and Sagardui, Goiuria and Arratibel, Maite
Keywords: cyber physical systems, evolutionary algorithm, genetic programming, metamorphic testing, mutation testing, oracle generation, oracle improvement, quality of service

Abstract

One of the major challenges in the verification of complex industrial Cyber-Physical Systems is the difficulty of determining whether a particular system output or behaviour is correct or not, the so-called test oracle problem. Metamorphic testing alleviates the oracle problem by reasoning on the relations that are expected to hold among multiple executions of the system under test, which are known as Metamorphic Relations (MRs). However, the development of effective MRs is often challenging and requires the involvement of domain experts. In this paper, we present a case study aiming at automating this process. To this end, we implemented GAssertMRs, a tool to automatically generate MRs with genetic programming. We assess the cost-effectiveness of this tool in the context of an industrial case study from the elevation domain. Our experimental results show that in most cases GAssertMRs outperforms the other baselines, including manually generated MRs developed with the help of domain experts. We then describe the lessons learned from our experiments and we outline the future work for the adoption of this technique by industrial practitioners.
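To make the idea of an MR concrete: a metamorphic relation replaces a ground-truth oracle with a relation that must hold between the outputs of an original ("source") input and a transformed ("follow-up") input. A minimal harness, using a median function as a stand-in system under test (not one of the paper's elevator MRs, which GAssertMRs evolves automatically):

```python
import random

def mr_holds(sut, transform, relate, inputs):
    """Check a metamorphic relation over a set of source inputs:
    relate(sut(x), sut(transform(x))) must hold for every x."""
    return all(relate(sut(x), sut(transform(x))) for x in inputs)

# System under test: a median function (hypothetical SUT, no exact oracle used).
def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

rng = random.Random(0)
cases = [[rng.randint(-100, 100) for _ in range(9)] for _ in range(50)]

# MR1: shuffling the input must not change the median.
shuffled = lambda xs: rng.sample(xs, len(xs))
# MR2: adding a constant to every element shifts the median by that constant.
shift = lambda xs: [x + 10 for x in xs]

print(mr_holds(median, shuffled, lambda a, b: a == b, cases))
print(mr_holds(median, shift, lambda a, b: b == a + 10, cases))
```

A faulty implementation would typically violate at least one such relation on random inputs, which is how MRs detect bugs without an exact expected output.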

DOI: 10.1145/3468264.3473920


Domain adaptation for an automated classification of deontic modalities in software engineering contracts

Authors: Joshi, Vivek and Anish, Preethu Rose and Ghaisas, Smita
Keywords: BERT, BiLSTM, Business Contract, Deep Learning Models, Deontic Modality, Domain Adaptation, Regulation

Abstract

Contracts are agreements between parties engaging in economic transactions. They specify deontic modalities that the signatories should be held responsible for and state the penalties or actions to be taken if the stated agreements are not met. Additionally, contracts have also been known to be a source of Software Engineering (SE) requirements. Identifying the deontic modalities in contracts can therefore add value to the Requirements Engineering (RE) phase of SE. The complex and ambiguous language of contracts makes it difficult and time-consuming to identify the deontic modalities (obligations, permissions, prohibitions) embedded in the text. State-of-the-art neural network models are effective for text classification; however, they require substantial amounts of training data. The availability of contracts data is sparse owing to the confidentiality concerns of customers. In this paper, we leverage the linguistic and taxonomical similarities between regulations (available abundantly in the public domain) and contracts to demonstrate that it is possible to use regulations as training data for classifying deontic modalities in real-life contracts. We discuss the results of a range of experiments, from the use of a rule-based approach to Bidirectional Encoder Representations from Transformers (BERT), for automating the classification of deontic modalities. With BERT, we obtained an average precision and recall of 90% and 89.66% respectively.
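As a concrete illustration of the rule-based end of that spectrum, a few modal-verb patterns already separate the three modalities on simple clauses. The patterns below are illustrative guesses at such rules, not the authors' rule set:

```python
import re

# Order matters: "shall not" must be tried before the bare "shall" rule.
RULES = [
    ("prohibition", re.compile(r"\b(shall not|must not|may not|is prohibited from)\b", re.I)),
    ("obligation",  re.compile(r"\b(shall|must|is required to|is obligated to)\b", re.I)),
    ("permission",  re.compile(r"\b(may|is permitted to|is entitled to)\b", re.I)),
]

def classify(clause):
    """Return the first deontic modality whose pattern matches, else 'none'."""
    for label, pattern in RULES:
        if pattern.search(clause):
            return label
    return "none"

print(classify("The Supplier shall deliver the goods within 30 days."))
print(classify("The Customer may terminate this agreement at any time."))
print(classify("The Supplier shall not disclose confidential data."))
```

Such rules break down on the ambiguous phrasing real contracts use, which is exactly the gap the BERT-based domain-adaptation approach in the paper addresses.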

DOI: 10.1145/3468264.3473921


How can manual testing processes be optimized? developer survey, optimization guidelines, and case studies

Authors: Haas, Roman and Elsner, Daniel and Juergens, Elmar and Pretschner, Alexander and Apel, Sven
Keywords: Software testing, manual testing, test optimization

Abstract

Manual software testing is tedious and costly as it involves significant human effort. Yet, it is still widely applied in industry and will be in the foreseeable future. Although there is arguably a great need for optimization of manual testing processes, research focuses mostly on optimization techniques for automated tests. Accordingly, there is no precise understanding of the practices and processes of manual testing in industry, nor of the pitfalls and untapped optimization potential. To shed light on this issue, we conducted a survey among 38 testing professionals from 16 companies to investigate their manual testing processes and to identify potential for optimization. We synthesize guidelines for when optimization techniques from automated testing can be applied to manual testing. By means of case studies on two industrial software projects, we show that fault detection likelihood, test feedback time, and test creation effort can be improved when following our guidelines.

DOI: 10.1145/3468264.3473922


Turnover-induced knowledge loss in practice

Authors: Robillard, Martin P.
Keywords: Knowledge loss, documentation, knowledge management, knowledge sharing

Abstract

When contributors to a software project leave, the knowledge they hold may become lost, thus impacting code quality and team productivity. Although well-known strategies can be used to mitigate knowledge loss, these strategies have to be tailored to their target context to be effective. To help software development organizations mitigate turnover-induced knowledge loss, we sought to better understand the different contexts in which developers experience this knowledge loss, and the resulting implications. We conducted qualitative interviews with 27 professional developers and managers from three different companies that provide software products and services. Leveraging the experience of these practitioners, we contribute a framework for characterizing turnover-induced knowledge loss and descriptions of the implications of knowledge loss, synthesized into 20 observations. These observations about knowledge loss in practice are organized into four themes, validated by the participants, and discussed within the context of the research literature in software engineering.

DOI: 10.1145/3468264.3473923


One thousand and one stories: a large-scale survey of software refactoring

Authors: Golubev, Yaroslav and Kurbatova, Zarina and AlOmar, Eman Abdullah and Bryksin, Timofey and Mkaouer, Mohamed Wiem
Keywords: IDE Refactoring Features, Refactorings, Software Maintenance

Abstract

Despite the availability of refactoring as a feature in popular IDEs, recent studies revealed that developers are reluctant to use these features and still prefer to refactor their code manually. At JetBrains, our goal is to fully support refactoring features in IntelliJ-based IDEs and improve their adoption in practice. Therefore, we start by raising the following main questions. How exactly do people refactor code? What refactorings are the most popular? Why do some developers tend not to use convenient IDE refactoring tools? In this paper, we investigate the raised questions through the design and implementation of a survey targeting 1,183 users of IntelliJ-based IDEs. Our quantitative and qualitative analysis of the survey results shows that almost two-thirds of developers spend more than one hour in a single session refactoring their code; that refactoring types vary greatly in popularity; and that a lot of developers would like to know more about IDE refactoring features but lack the means to do so. These results serve us internally to support the next generation of refactoring features, and can also help our research community establish new directions in refactoring usability research.

DOI: 10.1145/3468264.3473924


A comprehensive study on learning-based PE malware family classification methods

Authors: Ma, Yixuan and Liu, Shuang and Jiang, Jiajun and Chen, Guanhong and Li, Keqiu
Keywords: Concept Drift, Deep Learning, Malware Classification

Abstract

Driven by the high profit, Portable Executable (PE) malware has been consistently evolving in terms of both volume and sophistication. PE malware family classification has gained great attention and a large number of approaches have been proposed. With the rapid development of machine learning techniques and the exciting results they achieved on various tasks, machine learning algorithms have also gained popularity in the PE malware family classification task. Three mainstream classes of learning-based approaches, as categorized by the input format the methods take, are image-based, binary-based and disassembly-based approaches. Although a large number of approaches have been published, there are no consistent comparisons of these approaches, especially from the practical industry adoption perspective. Moreover, there is no comparison in the scenario of concept drift, which is a reality for the malware classification task due to the fast-evolving nature of malware. In this work, we conduct a thorough empirical study of learning-based PE malware classification approaches on 4 different datasets under consistent experiment settings. Based on the experiment results and an interview with our industry partners, we find that (1) there is no individual class of methods that significantly outperforms the others; (2) all classes of methods show performance degradation under concept drift (an average F1-score drop of 32.23%); and (3) long prediction times and high memory consumption hinder existing approaches from being adopted for industry usage.

DOI: 10.1145/3468264.3473925


Infiltrating security into development: exploring the world’s largest software security study

Authors: Weir, Charles and Migues, Sammy and Ware, Mike and Williams, Laurie
Keywords: DevSecOps, Developer centered security, SDLC, Secure software development lifecycle, Software engineering, Software security, Software security group

Abstract

Recent years have seen rapid increases in cybercrime. The use of effective software security activities plays an important part in preventing the harm involved. Objective research on industry use of software security practices is needed to help development teams, academic researchers, and educators to focus their activities. Since 2008, a team of researchers, including two of the authors, has been gathering objective data on the use of 121 software security activities. The Building Security In Maturity Model (BSIMM) study explores the activity use of 675,000 software developers, in companies including some of the world’s largest and most security-focused. Our analysis of the study data shows little consistent growth in security activity adoption industry-wide until 2015. Since then, the data shows a strong increasing trend, along with the adoption of new activities to support cloud-based deployment, an emphasis on component security, and a reduction in security professionals’ policing role. Exploring patterns of adoption, activities related to detecting and responding to vulnerabilities are adopted marginally earlier than activities related to preventing vulnerabilities; and activities related to particular job roles tend to be used together. We also found that 12 developer security activities are adopted early, together, and notably more often than any others. From these results, we offer recommendations for software and security engineers, and corresponding education and research suggestions for academia. These recommendations offer a strong contribution to improving security in development teams in the future.

DOI: 10.1145/3468264.3473926


Data-driven extract method recommendations: a study at ING

Authors: van der Leij, David and Binda, Jasper and van Dalen, Robbert and Vallen, Pieter and Luo, Yaping and Aniche, Maurício
Keywords: Machine Learning for Software Engineering, Software Engineering, Software Refactoring

Abstract

The sound identification of refactoring opportunities is still an open problem in software engineering. Recent studies have shown the effectiveness of machine learning models in recommending methods that should undergo different refactoring operations. In this work, we experiment with such approaches to identify methods that should undergo an Extract Method refactoring, in the context of ING, a large financial organization. More specifically, we (i) compare the code metrics distributions, which are used as features by the models, between open-source and ING systems, (ii) measure the accuracy of different machine learning models in recommending Extract Method refactorings, (iii) compare the recommendations given by the models with the opinions of ING experts. Our results show that the feature distributions of ING systems and open-source systems are somewhat different, that machine learning models can recommend Extract Method refactorings with high accuracy, and that experts tend to agree with most of the recommendations of the model.

DOI: 10.1145/3468264.3473927


Duplicated code pattern mining in visual programming languages

Authors: Terra-Neves, Miguel and Nadkarni, João
Keywords: duplicated code, maximum common sub-graph, maximum satisfiability, visual programming

Abstract

Visual Programming Languages (VPLs), coupled with the high-level abstractions that are commonplace in visual programming environments, enable users with less technical knowledge to become proficient programmers. However, the lower skill floor required by VPLs also entails that programmers are more likely to not adhere to best practices of software development, producing systems with high technical debt, and thus poor maintainability. Duplicated code is one important example of such technical debt. In fact, we observed that the amount of duplication in the OutSystems VPL code bases can reach as high as 39%. Duplicated code detection in text-based programming languages is still an active area of research with important implications regarding software maintainability and evolution. However, to the best of our knowledge, the literature on duplicated code detection for VPLs is very limited. We propose a novel and scalable duplicated code pattern mining algorithm that leverages the visual structure of VPLs in order to not only detect duplicated code, but also highlight duplicated code patterns that explain the reported duplication. The performance of the proposed approach is evaluated on a wide range of real-world mobile and web applications developed using OutSystems.

DOI: 10.1145/3468264.3473928


Making smart contract development more secure and easier

Authors: Ren, Meng and Ma, Fuchen and Yin, Zijing and Fu, Ying and Li, Huizhong and Chang, Wanli and Jiang, Yu
Keywords: Domain-specific Reinforcement, Integrated Testing, Smart Contract Development

Abstract

With the rapid development of distributed applications, smart contracts have attracted more and more attention from developers. However, developers and domain experts have different levels of familiarity with specific programming languages such as Solidity, and vulnerabilities hidden in the code can be exploited and result in huge financial losses. Existing auxiliary tools lack security considerations: most of them only provide word completion based on fuzzy search and detection services for limited types of vulnerabilities, which wastes developer effort during coding and leaves potential vulnerability threats after deployment. In this work, we propose an integrated framework that enhances security in the two stages of recommendation and validation, helping developers implement more secure contracts more quickly. First, we reinforce original smart contracts with general patch patterns and secure programming standards for training, and design a real-time code suggestion algorithm to predict secure words for selection. Then, we integrate multiple widely used testing tools to provide validation services. For evaluation, we collected 47,398 real-world contracts, and the results show that our framework outperforms existing platforms and tools, improving the average word suggestion accuracy by 30%–60% and helping detect about 25%–61% more vulnerabilities. In most cases, our framework can correctly predict the next word within its top ten candidates with a probability of 82%–97%. Compared with professional vulnerability mining tools, it can find more vulnerabilities and provide targeted modification suggestions without tedious configuration. Currently, this framework is used as the official development tool of WeBank and is integrated as the recommended platform by the FISCO-BCOS community.

DOI: 10.1145/3468264.3473929


Quantifying no-fault-found test failures to prioritize inspection of flaky tests at Ericsson

Authors: Rehman, Maaz Hafeez Ur and Rigby, Peter C.
Keywords: Flaky tests, No Fault Found, Software Testing, Statistical Modeling

Abstract

A test fails and despite an investigation by a developer there is no fault found (NFF). Large software systems are often released with known failing and flaky tests. In this work, we quantify how often a test fails and does not find a fault. We conduct a case study on 9.9 million test runs of 10k tests across four releases of a large project at Ericsson. For each test, we mine the rate of NFF test failures over total runs for each release, i.e. NFFRate. We compare the current level of test failure with the number of NFF failures during the stabilization period of the prior release, i.e. StableNFFRate. Using the binomial distribution, we are able to determine which tests exhibit a statistically larger number of failures relative to their expected StableNFFRate. These unstable tests need to be prioritized for re-run and potentially investigated to determine if there is a fault or if the test needs to be fixed or modified. Our work has had an impact on Ericsson’s testing practices, with testers using the NFFRate to determine which tests are the “flakiest” and need to be fixed or moved into an earlier, virtualized unit test stage. Testers also use our tool and technique to prioritize the statistically unstable test failures for investigation and to examine long-term trends of test failures that may indicate a fault.
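The binomial check can be sketched directly: under the null hypothesis that a test still fails at its StableNFFRate, the probability of seeing at least the observed number of failures is a binomial tail, and a small tail probability marks the test as statistically unstable. The counts and threshold below are hypothetical:

```python
from math import comb

def binom_sf(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p): the chance of at least k failures
    in n runs if the test still failed at its stable rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def is_unstable(failures, runs, stable_nff_rate, alpha=0.01):
    """Flag a test whose current failure count is statistically larger than
    what its StableNFFRate from the prior release would predict."""
    return binom_sf(failures, runs, stable_nff_rate) < alpha

# Hypothetical test: stable NFF rate of 1%, now failing 12 times in 200 runs.
print(is_unstable(12, 200, 0.01))  # far above the expected ~2 failures
print(is_unstable(3, 200, 0.01))   # within what the stable rate predicts
```

The first test would be prioritized for inspection; the second is consistent with its historical flakiness and can be left alone.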

DOI: 10.1145/3468264.3473930


When life gives you oranges: detecting and diagnosing intermittent job failures at Mozilla

Authors: Lampel, Johannes and Just, Sascha and Apel, Sven and Zeller, Andreas
Keywords: Software testing, continuous integration, flaky tests, intermittent failures, machine learning

Abstract

Continuous delivery of cloud systems requires constant running of jobs (build processes, tests, etc.). One issue that plagues this continuous integration (CI) process is intermittent failures: non-deterministic false alarms that do not result from a bug in the software or job specification, but rather from issues in the underlying infrastructure. At Mozilla, such intermittent failures are called oranges, a reference to the color of the build status indicator. Because such intermittent failures disrupt CI, they erode the developers’ trust in the jobs. We present a novel approach that automatically classifies failing jobs to determine whether job execution failures arise from an actual software bug or were caused by flakiness in the job (e.g., a test) or the underlying infrastructure. For this purpose, we train classification models using job telemetry data to diagnose failure patterns involving features such as runtime, CPU load, operating system version, or specific platform, with high precision. In an evaluation on a set of Mozilla CI jobs, our approach achieves precision scores of 73%, on average, across all data sets, with some test suites achieving precision scores good enough for fully automated classification (i.e., precision scores of up to 100%), and recall scores of 82% on average (up to 94%).

DOI: 10.1145/3468264.3473931


FuzzBench: an open fuzzer benchmarking platform and service

Authors: Metzman, Jonathan and Szekeres, László
Keywords: benchmarking, fuzz testing, fuzzing, software security, testing

Abstract

Fuzzing is a key tool used to reduce bugs in production software. At Google, fuzzing has uncovered tens of thousands of bugs. Fuzzing is also a popular subject of academic research. In 2020 alone, over 120 papers were published on the topic of improving, developing, and evaluating fuzzers and fuzzing techniques. Yet, proper evaluation of fuzzing techniques remains elusive. The community has struggled to converge on methodology and standard tools for fuzzer evaluation. To address this problem, we introduce FuzzBench, an open-source turnkey platform and free service for evaluating fuzzers. It aims to be easy to use, fast, and reliable, and to provide reproducible experiments. Since its release in March 2020, FuzzBench has been widely used in both industry and academia, carrying out more than 150 experiments for external users. It has been used by several published and in-progress papers from academic groups, and has had real impact on the most widely used fuzzing tools in industry. The presented case studies suggest that FuzzBench is on its way to becoming a standard fuzzer benchmarking platform.

DOI: 10.1145/3468264.3473932


An empirical investigation of practical log anomaly detection for online service systems

Authors: Zhao, Nengwen and Wang, Honglin and Li, Zeyan and Peng, Xiao and Wang, Gang and Pan, Zhu and Wu, Yong and Feng, Zhen and Wen, Xidao and Zhang, Wenchi and Sui, Kaixin and Pei, Dan
Keywords: Log Anomaly Detection, Online Service Systems, Practical Challenges

Abstract

Log data is an essential and valuable resource of online service systems, which records detailed information about system running status and user behavior. Log anomaly detection is vital for service reliability engineering and has been extensively studied. However, we find that existing approaches suffer from several limitations when deploying them into practice, including 1) inability to deal with various logs and complex log abnormal patterns; 2) poor interpretability; 3) lack of domain knowledge. To help understand these practical challenges and investigate the practical performance of existing work quantitatively, we conduct the first empirical study and an experimental study based on large-scale real-world data. We find that logs with rich information indeed exhibit diverse abnormal patterns (e.g., keywords, template count, template sequence, variable value, and variable distribution). However, existing approaches fail to tackle such complex abnormal patterns, producing unsatisfactory performance. Motivated by the obtained findings, we propose a generic log anomaly detection system named LogAD based on ensemble learning, which integrates multiple anomaly detection approaches and domain knowledge, so as to handle complex situations in practice. LogAD achieves an average F1-score of 0.83, outperforming all baselines. Besides, we also share some success cases and lessons learned during our study. To the best of our knowledge, we are the first to deeply investigate practical log anomaly detection in the real world. Our work helps practitioners and researchers apply log anomaly detection in practice to enhance service reliability.
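One of the abnormal patterns listed above, template count, lends itself to a very small detector: flag a time window whose count for a given log template deviates far from the historical mean. This is an illustrative sketch of a single ensemble member under assumed data, not LogAD itself:

```python
from statistics import mean, stdev

def anomalous(history, current, k=3.0):
    """Flag a window whose template count deviates from the historical mean
    by more than k standard deviations (a simple z-score style rule)."""
    mu, sigma = mean(history), stdev(history)
    return abs(current - mu) > k * max(sigma, 1e-9)

# Hypothetical per-minute counts of one log template during a stable period:
history = [3, 4, 2, 5, 3, 4, 3, 4, 3]
print(anomalous(history, 90))  # a sudden burst of this template
print(anomalous(history, 4))   # an ordinary window
```

A real system like the one described must combine many such detectors (keywords, template sequences, variable distributions) and weigh them with domain knowledge, since no single pattern covers the diversity of incidents.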

DOI: 10.1145/3468264.3473933


RAPID: checking API usage for the cloud in the cloud

作者: Emmi, Michael and Hadarean, Liana and Jhala, Ranjit and Pike, Lee and Rosner, Nicolás
关键词: API usage checking, software security, static analysis in the cloud

Abstract

We present RAPID, an industrial-strength analysis developed at AWS that aims to help developers by providing automatic, fast, and actionable feedback about correct usage of cloud-service APIs. RAPID’s design is based on the insight that cloud-service APIs are structured around short-lived request and response objects whose usage patterns can be specified as value-dependent type-state automata and verified by combining local type-state with global value-flow analyses. We describe various challenges that arose in deploying RAPID at scale. Finally, we present an evaluation that validates our design choices and deployment heuristics, and shows that RAPID is able to quickly and precisely report a wide variety of useful API misuse violations in large, industrial-strength code bases.
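
The type-state idea can be sketched as a small automaton over request/response API events (the states, events, and transitions below are invented; RAPID's real specifications are value-dependent and far richer):

```python
# Toy typestate automaton for a request object: any event with no transition
# from the current state is a misuse (e.g. reading a response before sending).
MISUSE = "misuse"

TRANSITIONS = {
    ("created", "set_param"): "created",
    ("created", "send"): "sent",
    ("sent", "read_response"): "done",
}

def check(events):
    # Walk the automaton over the observed API-call sequence.
    state = "created"
    for e in events:
        state = TRANSITIONS.get((state, e), MISUSE)
        if state == MISUSE:
            return MISUSE
    return state

print(check(["set_param", "send", "read_response"]))  # done
print(check(["read_response"]))                       # misuse
```

A real checker would additionally track which values flow into each object (the "value-dependent" part), which is where the global value-flow analysis comes in.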

DOI: 10.1145/3468264.3473934


An empirical study of GUI widget detection for industrial mobile games

作者: Ye, Jiaming and Chen, Ke and Xie, Xiaofei and Ma, Lei and Huang, Ruochen and Chen, Yingfeng and Xue, Yinxing and Zhao, Jianjun
关键词: Deep Learning, GUI Detection, Game Testing

Abstract

With the widespread adoption of smartphones in our daily life, mobile games have experienced increasing demand over the past years. Meanwhile, the quality of mobile games has been drawing more and more attention, as it can greatly affect the player experience. For better quality assurance, general-purpose testing has been extensively studied for mobile apps. However, due to the unique characteristics of mobile games, existing mobile testing techniques may not be directly suitable and applicable. To better understand the challenges in mobile game testing, in this paper we first take an early step towards understanding the challenges and pain points of the mobile game testing process at our industrial partner NetEase Games. Specifically, we conduct a survey of the mobile test development team at NetEase Games via both scrum interviews and questionnaires. We find that accurate and effective GUI widget detection for mobile games could be the pillar to boost the automation of mobile game testing and other downstream analysis tasks in practice. We then perform comparative studies to investigate the effectiveness of state-of-the-art general-purpose mobile app GUI widget detection methods in the context of mobile games. To this end, we also develop a technique to automatically collect GUI widget region information of industrial mobile games, equipped with a heuristic-based data cleaning method for quality refinement of the labeling results. Our evaluation shows that: (1) existing GUI widget detection methods for general-purpose mobile apps do not perform well on industrial mobile games; (2) mobile games exhibit obvious differences from general-purpose mobile apps from the perspective of GUI widgets. Our further in-depth analysis reveals that the high diversity and density of mobile game GUI widgets could be the major reasons posing challenges for existing methods, which calls for new research methods and better industry practices. To enable further research along this line, we construct the very first GUI widget detection benchmark specially designed for mobile games, incorporating both our collected dataset and the state-of-the-art widget detection methods for mobile apps, which could also be the basis for further study of many downstream quality assurance tasks (e.g., testing and analysis) for mobile games.

DOI: 10.1145/3468264.3473935


Intelligent container reallocation at Microsoft 365

作者: Qiao, Bo and Yang, Fangkai and Luo, Chuan and Wang, Yanan and Li, Johnny and Lin, Qingwei and Zhang, Hongyu and Datta, Mohit and Zhou, Andrew and Moscibroda, Thomas and Rajmohan, Saravanakumar and Zhang, Dongmei
关键词: Container reallocation, local search optimization, workload balance

Abstract

The use of containers in microservices has gained popularity as it facilitates agile development, resource governance, and software maintenance. Container reallocation aims to achieve workload balance by reallocating containers over physical machines, and it affects the overall performance of microservice-based systems. However, container scheduling and reallocation remain an open issue due to their complexity in real-world scenarios. In this paper, we propose a novel Multi-Phase Local Search (MPLS) algorithm to optimize container reallocation. The experimental results show that our optimization algorithm outperforms state-of-the-art methods. In practice, it has been successfully applied to the Microsoft 365 system to mitigate hotspot machines and balance workloads across the entire system.
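
The core "move a container to reduce imbalance" step of local search can be sketched as follows (a single-phase toy; the paper's MPLS algorithm is multi-phase and handles real-world constraints):

```python
def rebalance_step(machines):
    # machines: {machine name: list of container loads}. Try one local-search
    # move: shift the smallest container off the hottest machine onto the
    # coolest one, but only if that shrinks the load gap.
    load = {m: sum(cs) for m, cs in machines.items()}
    hot = max(load, key=load.get)
    cool = min(load, key=load.get)
    c = min(machines[hot])            # smallest container: least disruptive
    if load[cool] + c < load[hot]:
        machines[hot].remove(c)
        machines[cool].append(c)
        return True                   # an improving move was applied
    return False                      # local optimum reached

cluster = {"m1": [8, 2], "m2": [1]}
while rebalance_step(cluster):        # iterate until no move improves balance
    pass
print({m: sum(cs) for m, cs in cluster.items()})  # {'m1': 8, 'm2': 3}
```

Real schedulers add hard constraints (capacity, affinity, migration cost), which is exactly what makes the problem open and motivates multi-phase search.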

DOI: 10.1145/3468264.3473936


Organizational implications of agile adoption: a case study from the public sector

作者: Mohagheghi, Parastoo and Lassenius, Casper
关键词: Large-scale agile software development, agile adoption, agile organization, team autonomy

Abstract

While agile software development is increasingly adopted in large organizations, there is still a lack of studies on how traditionally organized enterprises adopt and scale agile forms of organization. This industrial multiple embedded case study explores how the organizational model of a large public sector entity evolved over four years to support the adoption of agile software development methods. Data was collected through semi-structured interviews and document analysis. We describe the change in three phases: pre-transformation, initial transformation, and maturing. Changes in three subcases of organizational units are further described in detail. Moving from an outsourced project-based way of working with separate business, IT, and vendor organizations, the new organizational design emphasizes internal development capability, cross-functional autonomous teams organized around products and grouped in product areas, and continuous delivery. Starting from the IT department, the transformation expanded to the whole organization, and went beyond software development to finance and leadership. We describe the target and intermediate organizations employed when adopting agile development methods for the whole organization and three organizational units responsible for different services. Defining suitable product boundaries, achieving alignment across teams, enhancing the competence of product owners, the coexistence of old and new types of systems, processes, and structures, and balancing the teams’ need for autonomy with the organizational needs for coordination and control are remaining challenges.

DOI: 10.1145/3468264.3473937


Towards immersive software archaeology: regaining legacy systems’ design knowledge via interactive exploration in virtual reality

作者: Hoff, Adrian and Nieke, Michael and Seidl, Christoph
关键词: Legacy Software, Software Archaeology, Software Engineering, Software Re-Engineering, Software Visualization, Virtual Reality

Abstract

Many of today’s software systems will become the legacy systems of tomorrow, comprised of outdated technology and inaccurate design documents. Preparing for their eventual re-engineering requires engineers to regain lost design knowledge and discover re-engineering opportunities. While tools and visualizations exist, comprehending an unfamiliar code base remains challenging. Hence, software archaeology suffers from a considerable entry barrier as it requires expert knowledge, significant diligence, tenacity, and stamina. In this paper, we propose a paradigm shift in how legacy systems’ design knowledge can be regained by presenting our vision for an immersive explorable software visualization in virtual reality (VR). We propose innovative concepts leveraging benefits of VR for a) immersion in an exoteric visualization metaphor, b) effective navigation and orientation, c) guiding exploration, and d) maintaining a link to the implementation. By enabling immersive and playful legacy system exploration, we strive for lowering the entry barrier, fostering long-term engagement, strengthening mental-model building, and improving knowledge retention in an effort to ease coping with the increased number of tomorrow’s legacy systems.

DOI: 10.1145/3468264.3473128


Replication Package for Article: Reducing the Search Space of Bug Inducing Commits using Failure Coverage

作者: An, Gabin and Yoo, Shin
关键词: Bug Inducing Commit, Code Coverage, Software Debugging, Test Coverage

Abstract

This artifact contains a replication package accompanying the paper “Reducing the Search Space of Bug Inducing Commits using Failure Coverage”. It contains the full experimental results and provides a docker environment in which one can easily replicate the whole experiment described in the paper. The detailed guide to replication is provided in the artifact’s README.md file. It also provides a script for analyzing the experimental results to support the reproduction of all result figures in the paper.

DOI: 10.1145/3468264.3473129


The gas triangle and its challenges to the development of blockchain-powered applications

作者: Oliva, Gustavo A. and Hassan, Ahmed E.
关键词: Decentralized Applications, Ethereum, cost-effective, gas system

Abstract

Ethereum is the most popular blockchain platform for the development of blockchain-powered applications (a.k.a. DApps). Developing a DApp involves translating requests captured in the frontend of an application into contract transactions. However, transactions need to be paid for. Ethereum employs the gas system to charge transaction fees. The gas system has three key components, namely gas price, gas usage, and gas limit. We refer to these components and their interplay as the gas triangle. In this paper, we claim that the inherently complex gas triangle should not be exposed to end-users. We conduct two studies that provide empirical evidence to support our claim. In light of our results, we provide a list of recommendations to novice end-users. We conclude the paper with a list of research challenges that need to be tackled in order to support the development of next-generation DApps that completely hide the gas triangle from end-users.
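
For intuition, the interplay of the three components can be sketched with simplified semantics (fee = gas used × gas price, and a transaction that would exceed its gas limit reverts; 21,000 gas is the base cost of a plain Ether transfer):

```python
def transaction_fee(gas_used, gas_price_gwei, gas_limit):
    # Simplified gas-triangle model: the limit caps usage, and the fee is
    # usage times the offered price (real Ethereum fee rules are richer).
    if gas_used > gas_limit:
        raise ValueError("out of gas: transaction reverts")
    return gas_used * gas_price_gwei  # fee in gwei

print(transaction_fee(21_000, 30, 21_000))  # 630000 gwei for a plain transfer
```

Even this stripped-down model shows why the triangle confuses end-users: picking the price and limit requires estimating usage, which depends on contract internals the user never sees.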

DOI: 10.1145/3468264.3473130


Selecting test inputs for DNNs using differential testing with subspecialized model instances

作者: Ma, Yu-Seung and Yoo, Shin and Kim, Taeho
关键词: Differential Testing, Machine Learning, Test Oracle

Abstract

Testing of Deep Learning (DL) models is difficult due to the lack of an automated test oracle and the high cost of human labelling. Differential testing has been used as a surrogate oracle, but there is no systematic guide on how to choose the reference model to use for differential testing. We propose a novel differential testing approach based on subspecialized models, i.e., models that are trained on sliced training data only (hence specialized for the slice). A preliminary evaluation of our approach with a CNN-based EMNIST image classifier shows that it can achieve a higher error detection rate with selected inputs compared to using the more advanced ResNet and LeNet as the reference model for differential testing. Our approach also outperforms N-version testing, i.e., the use of the same DL model architecture trained separately but using the same data.
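
The selection idea can be sketched schematically: inputs on which the model under test and the slice's subspecialized reference disagree are flagged for labelling (the "models" below are stand-in callables, not trained DNNs):

```python
def select_suspicious(inputs, model, specialists, slice_of):
    # specialists: {slice_id: reference model trained only on that slice}
    suspicious = []
    for x in inputs:
        reference = specialists[slice_of(x)]
        if model(x) != reference(x):  # disagreement is the surrogate oracle
            suspicious.append(x)
    return suspicious

model = lambda x: x % 3                   # stand-in for the model under test
specialists = {0: lambda x: x % 3,        # agrees with the model on slice 0
               1: lambda x: (x + 1) % 3}  # diverges on slice 1
slice_of = lambda x: x % 2                # even/odd data slices
print(select_suspicious(range(6), model, specialists, slice_of))  # [1, 3, 5]
```

Because each specialist only ever sees its own slice, its disagreements concentrate on inputs near that slice's decision behavior, which is what makes the selected inputs more error-revealing than a generic reference model's.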

DOI: 10.1145/3468264.3473131


作者: Baskararajah, Janusan and Zhang, Lei and Miranskyy, Andriy
关键词: software engineering trends, term interrelations, word embeddings

Abstract

The Software Engineering (SE) community is prolific, making it challenging for experts to keep up with the flood of new papers and for neophytes to enter the field. Therefore, we posit that the community may benefit from a tool that extracts terms and their interrelations from the SE community’s text corpus and shows terms’ trends. In this paper, we build a prototype tool using the word embedding technique. We train the embeddings on the SE Body of Knowledge handbook and 15,233 research papers’ titles and abstracts. We also create test cases necessary for validating the training of the embeddings. We provide representative examples showing that the embeddings may aid in summarizing terms and uncovering trends in the knowledge base.
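
Term interrelations from trained embeddings are typically queried via cosine similarity between term vectors, as in this toy sketch (the 3-dimensional vectors are invented values, not the tool's trained embeddings):

```python
def cosine(u, v):
    # Cosine similarity: angle between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

vectors = {"testing": [0.9, 0.1, 0.0],   # invented toy embeddings
           "debugging": [0.8, 0.2, 0.1],
           "agile": [0.0, 0.1, 0.9]}

def nearest(term):
    # The most related term is the one with the highest cosine similarity.
    others = [t for t in vectors if t != term]
    return max(others, key=lambda t: cosine(vectors[term], vectors[t]))

print(nearest("testing"))  # debugging
```

Tracking how such neighborhoods shift across publication years is one way an embedding-based tool can surface trends.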

DOI: 10.1145/3468264.3473132


Software robustness: a survey, a theory, and prospects

作者: Petke, Justyna and Clark, David and Langdon, William B.
关键词: Anti-fragile, Correctness Attraction, Failed Disruption Propagation, Genetic Improvement, Information Theory, Software Robustness

Abstract

If a software execution is disrupted, witnessing the execution at a later point may see evidence of the disruption or not. If not, we say the disruption failed to propagate. One name for this phenomenon is software robustness, but it appears in different contexts in software engineering under different names. Contexts include testing, security, reliability, and automated code improvement or repair. Names include coincidental correctness, correctness attraction, and transient error reliability. As witnessed, it is a dynamic phenomenon, but any explanation with predictive power must necessarily take a static view. As a dynamic/static phenomenon it is convenient to take a statistical view of it, which we do by way of information theory. We theorise that for failed disruption propagation to occur, a necessary condition is that the code region where the disruption occurs is composed with or succeeded by a subsequent code region that suffers entropy loss over all executions. The higher the entropy loss, the higher the likelihood that disruption in the first region fails to propagate to the downstream observation point. We survey different research silos that address this phenomenon and explain how the theory might be exploited in software engineering.
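
The entropy-loss condition can be made concrete with a small computation: a code region that maps many inputs to few outputs loses Shannon entropy, while a bijective region loses none (a toy model, not the paper's formal treatment):

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy (in bits) of the empirical distribution of `values`.
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

inputs = list(range(8))                     # 3 bits of input entropy
lossy = [x % 2 for x in inputs]             # region collapses 8 values to 2
lossless = [x + 1 for x in inputs]          # bijective region: nothing lost

print(entropy(inputs) - entropy(lossy))     # 2.0 bits lost
print(entropy(inputs) - entropy(lossless))  # 0.0 bits lost
```

A disrupted state flowing through the lossy region has a good chance of being mapped to the same output as the undisturbed state, which is exactly failed disruption propagation.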

DOI: 10.1145/3468264.3473133


Towards automating code review at scale

作者: Hellendoorn, Vincent J. and Tsay, Jason and Mukherjee, Manisha and Hirzel, Martin
关键词: code review, neural networks

Abstract

As neural methods are increasingly used to support and automate software development tasks, code review is a natural next target. Yet, training models to imitate developers based on past code reviews is far from straightforward: reviews found in open-source projects vary greatly in quality, phrasing, and depth depending on the reviewer. In addition, changesets are often large, stretching the capacity of current neural models. Recent work reported modest success at predicting review resolutions, but largely side-stepped the above issues by focusing on small inputs where comments were already known to occur. This work examines the vision and challenges of automating code review at realistic scale. We collect hundreds of thousands of changesets across hundreds of projects that routinely conduct code review, many of which change thousands of tokens. We focus on predicting just the locations of comments, which are quite rare. By analyzing model performance and dataset statistics, we show that even this task is already challenging, in no small part because of tremendous variation and (apparent) randomness in code reviews. Our findings give rise to a research agenda for realistically and impactfully automating code review.

DOI: 10.1145/3468264.3473134


Learning type annotation: is big data enough?

作者: Jesse, Kevin and Devanbu, Premkumar T. and Ahmed, Toufique
关键词: Type inference, TypeScript, deep learning, transfer learning

Abstract

TypeScript is a widely used optionally-typed language where developers can adopt “pay as you go” typing: they can add types as desired, and benefit from static typing. The “type annotation tax”, or manual effort required to annotate new or existing TypeScript, can be reduced by a variety of automatic methods. Probabilistic machine-learning (ML) approaches work quite well. ML approaches use different inductive biases, ranging from simple token sequences to complex graph neural network (GNN) models capturing syntax and semantic relations. More sophisticated inductive biases are hand-engineered to exploit the formal nature of software. Rather than deploying fancy inductive biases for code, can we just use “big data” to learn natural patterns relevant to typing? We find evidence suggesting that this is the case. We present TypeBert, demonstrating that even with the simple token-sequence inductive bias used in BERT-style models and enough data, the type-annotation performance of the most sophisticated models can be surpassed.

DOI: 10.1145/3468264.3473135


New visions on metamorphic testing after a quarter of a century of inception

作者: Chen, Tsong Yueh and Tse, T. H.
关键词: Debugging, Metamorphic relation, Metamorphic testing, Proving, Reliable test set, Test oracle, Testing

Abstract

Metamorphic testing (MT) was introduced about a quarter of a century ago. It is increasingly being accepted by researchers and the industry as a useful testing technique. The studies, research results, applications, and extensions of MT have given us many insights and visions for its future. Our visions include: MRs will be a practical means to top up test case generation techniques, beyond the alleviation of the test oracle problem; MT will not only be a standalone technique, but conveniently integrated with other methods; MT and MRs will evolve beyond software testing, or even beyond verification; MRs may be anything that you can imagine, beyond the necessary properties of algorithms; MT research will be beyond empirical studies and move toward a theoretical foundation; MT will not only bring new concepts to software testing but also new concepts to other disciplines; MRs will alleviate the reliable test set problem beyond traditional approaches. These visions may help researchers explore the challenges and opportunities for MT in the next decade.
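
For readers new to MT, the core idea is that a metamorphic relation checks a necessary property of the program without knowing expected outputs (the sine relation below is a standard textbook MR, not one from this paper):

```python
import math
import random

def check_mr(f, x, tol=1e-9):
    # Metamorphic relation for sine: f(x) and f(pi - x) must agree,
    # even though we never state what the "correct" value of f(x) is.
    return abs(f(x) - f(math.pi - x)) < tol

random.seed(0)
results = [check_mr(math.sin, random.uniform(0, math.pi)) for _ in range(100)]
print(all(results))  # True for a correct sine implementation
```

This sidesteps the test oracle problem: any follow-up input can be generated mechanically from a source input, so MRs double as a test case generation technique, one of the visions above.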

DOI: 10.1145/3468264.3473136


Health of smart ecosystems

作者: El Moussa, Noura and Molinelli, Davide and Pezzè
关键词: ecosystem health, smart ecosystems, software verification, systems of systems, ultra-large systems

Abstract

Software is a core component of smart ecosystems: large ‘system communities’ that emerge from the composition of autonomous, independent, and highly heterogeneous systems, such as smart cities, smart grids, and smart buildings. The systems that comprise smart ecosystems are not centrally owned, and mutually interact both explicitly and implicitly, leading to unavoidable contradictions and failures. The distinctive characteristics of smart ecosystems challenge software engineers with problems never addressed before. In this paper we discuss the big challenge of defining a new concept of ‘dependability’ and new approaches to reveal smart ecosystem failures.

DOI: 10.1145/3468264.3473137


LLSC: a parallel symbolic execution compiler for LLVM IR

作者: Wei, Guannan and Tan, Shangyin and Bračevac, Oliver
关键词: compilation, program testing, staging, symbolic execution

Abstract

We present LLSC, a prototype compiler for nondeterministic parallel symbolic execution of the LLVM intermediate representation (IR). Given an LLVM IR program, LLSC generates code preserving the symbolic execution semantics and orchestrating solver invocations. The generated code runs efficiently, since the code has eliminated the interpretation overhead and explores multiple paths in parallel. To the best of our knowledge, LLSC is the first compiler for fork-based symbolic execution semantics that can generate parallel execution code. In this demonstration paper, we present the current development and preliminary evaluation of LLSC. The principle behind LLSC is to automatically specialize a symbolic interpreter via the 1st Futamura projection, a fundamental connection between interpreters and compilers. The symbolic interpreter is written in an expressive high-level language equipped with a multi-stage programming facility. We demonstrate the run time performance through a set of benchmark programs, showing that LLSC outperforms interpretation-based symbolic execution engines in significant ways.
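
The first Futamura projection can be illustrated in miniature: specializing an interpreter to a fixed program yields compiled code for that program (a toy stack-machine "compiler"; LLSC does the analogous staging for symbolic execution of LLVM IR):

```python
def compile_program(prog):
    # Each stack-machine instruction becomes a residual Python statement,
    # so the generated function carries no interpretation overhead.
    lines = ["def run(x):", "    stack = [x]"]
    for op, *args in prog:
        if op == "push":
            lines.append(f"    stack.append({args[0]})")
        elif op == "add":
            lines.append("    stack.append(stack.pop() + stack.pop())")
    lines.append("    return stack.pop()")
    namespace = {}
    exec("\n".join(lines), namespace)  # materialize the residual program
    return namespace["run"]

plus_one = compile_program([("push", 1), ("add",)])
print(plus_one(41))  # 42
```

The instruction dispatch (the `if op == ...` chain) runs once at compile time rather than on every execution, which is precisely the overhead elimination the paper attributes to staging.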

DOI: 10.1145/3468264.3473108


OwlEyes-online: a fully automated platform for detecting and localizing UI display issues

作者: Su, Yuhui and Liu, Zhe and Chen, Chunyang and Wang, Junjie and Wang, Qing
关键词: Deep learning, Issue detection, Mobile app, UI display, UI testing

Abstract

Graphical User Interfaces (GUIs) provide visual bridges between software apps and end users. However, due to software or hardware compatibility issues, UI display issues such as text overlap, blurred screens, and missing images often occur during GUI rendering on different devices. Because these UI display issues can be found directly by human eyes, in this paper we implement an online UI display issue detection tool, OwlEyes-Online, which provides a simple and easy-to-use platform for users to realize the automatic detection and localization of UI display issues. OwlEyes-Online can automatically run the app, collect its screenshots and XML files, and then detect the existence of issues by analyzing the screenshots. In addition, OwlEyes-Online can also localize the detailed area of the issue in the given screenshots to further guide developers. Finally, OwlEyes-Online automatically generates test reports with the UI display issues detected in app screenshots and sends them to users. OwlEyes-Online was evaluated and shown to accurately detect UI display issues. Tool Link: http://www.owleyes.online:7476 Github Link: https://github.com/franklinbill/owleyes Demo Video Link: https://youtu.be/002nHZBxtCY

DOI: 10.1145/3468264.3473109


Exploit those code reviews! bigger data for deeper learning

作者: Heumüller, Robert
关键词: code review, datasets, deep learning, source code

Abstract

Modern code review (MCR) processes are prevalent in most organizations that develop software due to benefits in quality assurance and knowledge transfer. With the rise of collaborative software development platforms like GitHub and Bitbucket, today, millions of projects share not only their code but also their review data. Although researchers have tried to exploit this data for more than a decade, most of that knowledge remains a buried treasure. A crucial catalyst for many advances in deep learning, however, is the accessibility of large-scale standard datasets for different learning tasks. This paper presents the ETCR (Exploit Those Code Reviews!) infrastructure for mining MCR datasets from any GitHub project practicing pull-request-based development. We demonstrate its effectiveness with ETCR-Elasticsearch, a dataset of >231k review comments for >47k Java file revisions in >40k pull-requests from the Elasticsearch project. ETCR is designed with the challenges of deep learning in mind. Compared to previous datasets, ETCR datasets include all information for linking review comments to nodes in the respective program’s Abstract Syntax Tree.
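
Linking a review comment to an AST node amounts to finding the innermost node covering the commented line, sketched here with Python's ast module (ETCR targets Java; this is only a language-agnostic illustration):

```python
import ast

def node_at_line(source, lineno):
    # Return the type of the innermost AST node whose line span covers
    # `lineno` -- a plausible anchor for a review comment on that line.
    tree = ast.parse(source)
    best = None
    for node in ast.walk(tree):
        start = getattr(node, "lineno", None)
        end = getattr(node, "end_lineno", None)
        if start is None or end is None or not (start <= lineno <= end):
            continue
        if best is None or (end - start) < (best.end_lineno - best.lineno):
            best = node
    return type(best).__name__

code = "def f(x):\n    return x + 1\n"
print(node_at_line(code, 2))  # Return
```

Anchoring comments to AST nodes rather than raw line numbers is what keeps the link stable enough for structure-aware deep learning models.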

DOI: 10.1145/3468264.3473110


BRAID: an API recommender supporting implicit user feedback

作者: Zhou, Yu and Jin, Haonan and Yang, Xinying and Chen, Taolue and Narasimhan, Krishna and Gall, Harald C.
关键词: API recommendation, Active learning, Learning to rank, Natural language processing

Abstract

Efficient application programming interface (API) recommendation is one of the most desired features of modern integrated development environments. A multitude of API recommendation approaches have been proposed. However, most of the currently available API recommenders do not support the effective integration of user feedback into the recommendation loop. In this paper, we present BRAID (Boosting RecommendAtion with Implicit FeeDback), a tool which leverages user feedback, and employs learning-to-rank and active learning techniques to boost recommendation performance. The implementation is based on the VSCode plugin architecture, which provides an integrated user interface. Essentially, BRAID is a general framework which can accommodate existing query-based API recommendation approaches as components. Comparative experiments with strong baselines demonstrate the efficacy of the tool. A video demonstrating the usage of BRAID can be found at https://youtu.be/naD0guvl8sE.

DOI: 10.1145/3468264.3473111


KGAMD: an API-misuse detector driven by fine-grained API-constraint knowledge graph

作者: Ren, Xiaoxue and Ye, Xinyuan and Xing, Zhenchang and Xia, Xin and Xu, Xiwei and Zhu, Liming and Sun, Jianling
关键词: API Misuse, Java Documentation, Knowledge Graph

Abstract

Application Programming Interfaces (APIs) typically come with usage constraints. The violations of these constraints (i.e. API misuses) can cause significant problems in software development. Existing methods mine frequent API usage patterns from codebases to detect API misuses. They make a naive assumption that API usage that deviates from the most-frequent API usage is a misuse. However, there is a big knowledge gap between API usage patterns and API usage constraints in terms of comprehensiveness, explainability and best practices. Inspired by this, we propose a novel approach named KGAMD (API-Misuse Detector Driven by Fine-Grained API-Constraint Knowledge Graph) that detects API misuses directly against the API constraint knowledge, rather than API usage patterns. We first construct a novel API-constraint knowledge graph from API reference documentation with open information extraction methods. This knowledge graph explicitly models two types of API-constraint relations (call-order and condition-checking) and enriches return and throw relations with return conditions and exception triggers. Then, we develop the KGAMD tool that utilizes the knowledge graph to detect API misuses. There are three types of frequent API misuses we can detect: missing calls, missing condition checking, and missing exception handling, while existing detectors mostly focus only on missing calls. Our quantitative evaluation and user study demonstrate that KGAMD is promising in helping developers avoid and debug API misuses. Demo Video: https://www.youtube.com/watch?v=TN4LtHJ-494 IntelliJ plug-in: https://github.com/goodchar/KGAMD
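
A call-order constraint check can be sketched as follows (the constraint encoding and API names are invented for illustration, not KGAMD's knowledge-graph representation):

```python
def violates_call_order(calls, before, after):
    # Call-order constraint: every occurrence of `after` must be preceded
    # by an unmatched occurrence of `before` (e.g. acquire before release).
    pending = 0
    for c in calls:
        if c == before:
            pending += 1
        elif c == after:
            if pending == 0:
                return True   # misuse: `after` with no matching `before`
            pending -= 1
    return False

print(violates_call_order(["acquire", "release"], "acquire", "release"))  # False
print(violates_call_order(["release"], "acquire", "release"))             # True
```

Checking code against such documented constraints, rather than against the statistically most frequent usage, is what lets a detector flag rare-but-correct usages as fine and frequent-but-wrong usages as misuses.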

DOI: 10.1145/3468264.3473112


Sangrahaka: a tool for annotating and querying knowledge graphs

作者: Terdalkar, Hrishikesh and Bhattacharya, Arnab
关键词: Annotation Tool, Knowledge Graph, Querying Tool

Abstract

We present a web-based tool Sangrahaka for annotating entities and relationships from text corpora towards construction of a knowledge graph and subsequent querying using templatized natural language questions. The application is language and corpus agnostic, but can be tuned for specific needs of a language or a corpus. The application is freely available for download and installation. Besides having a user-friendly interface, it is fast, supports customization, and is fault tolerant on both client and server side. It outperforms other annotation tools in an objective evaluation metric. The framework has been successfully used in two annotation tasks. The code is available from https://github.com/hrishikeshrt/sangrahaka.

DOI: 10.1145/3468264.3473113


Code2Que: a tool for improving question titles from mined code snippets in stack overflow

作者: Gao, Zhipeng and Xia, Xin and Lo, David and Grundy, John and Li, Yuan-Fang
关键词: Deep Learning, Question Quality, Seq2Seq Model, Stack Overflow

Abstract

Stack Overflow is one of the most popular technical Q&A sites used by software developers. Seeking help from Stack Overflow has become an essential part of software developers’ daily work for solving programming-related questions. Although the Stack Overflow community has provided quality assurance guidelines to help users write better questions, we observed that a significant number of questions submitted to Stack Overflow are of low quality. In this paper, we introduce a new web-based tool, Code2Que, which can help developers in writing higher quality questions for a given code snippet. Code2Que consists of two main stages: offline learning and online recommendation. In the offline learning phase, we first collect a set of good quality ⟨code snippet, question⟩ pairs as training samples. We then train our model on these training samples via a deep sequence-to-sequence approach, enhanced with an attention mechanism, a copy mechanism and a coverage mechanism. In the online recommendation phase, for a given code snippet, we use the offline trained model to generate question titles to assist less experienced developers in writing questions more effectively. To evaluate Code2Que, we first sampled 50 low quality ⟨code snippet, question⟩ pairs from the Python and Java datasets on Stack Overflow. Then we conducted a user study to evaluate the question titles generated by our approach as compared to human-written ones using three metrics: Clearness, Fitness and Willingness to Respond. Our experimental results show that for a large number of low-quality questions in Stack Overflow, Code2Que can improve the question titles in terms of Clearness, Fitness and Willingness measures.

DOI: 10.1145/3468264.3473114


BF-detector: an automated tool for CI build failure detection

作者: Saidani, Islem and Ouni, Ali and Chouchen, Moataz and Mkaouer, Mohamed Wiem
关键词: Build Prediction, Continuous Integration, Machine Learning, Multi-Objective Optimization, Search-Based Software Engineering

Abstract

Continuous Integration (CI) aims at supporting developers in integrating code changes quickly through automated building. However, there is a consensus that CI build failure is a major barrier that developers face, which prevents them from proceeding further with development. In this paper, we introduce BF-Detector, an automated tool to detect CI build failure. Based on an adaptation of the Non-dominated Sorting Genetic Algorithm (NSGA-II), our tool aims at finding the best prediction rules based on two conflicting objective functions to deal with both minority and majority classes. We evaluated the effectiveness of our tool on a benchmark of 56,019 CI builds. The results reveal that our technique outperforms state-of-the-art approaches by providing a better balance between both failed and passed builds. The BF-Detector tool is publicly available, with a demo video, at: https://github.com/stilab-ets/BF-Detector

DOI: 10.1145/3468264.3473115


AlloyFL: a fault localization framework for Alloy

作者: Khan, Tanvir Ahmed and Sullivan, Allison and Wang, Kaiyuan
关键词: Alloy, Declarative programming, Fault localization

Abstract

Declarative models help improve the reliability of software systems: models can be used to convey requirements, analyze system designs and verify implementation properties. Alloy is a commonly used modeling language. A key strength of Alloy is the Analyzer, Alloy’s integrated development environment (IDE), which allows users to write and execute models by leveraging a fully automatic SAT based analysis engine. Unfortunately, writing correct constraints of complex properties is difficult. To help users identify fault locations, AlloyFL is a fault localization technique that takes as input a faulty Alloy model and a fault-revealing test suite. As output, AlloyFL returns a ranked list of locations from most to least suspicious. This paper describes our Java implementation of AlloyFL as an extension to the Analyzer. Our experimental results show AlloyFL is capable of detecting the location of real world faults and works in the presence of multiple faulty locations. The demo video for AlloyFL can be found at https://youtu.be/ZwgP58Nsbx8.
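
As a generic illustration of suspiciousness ranking (not AlloyFL's actual scoring, which is described in the paper), the classic spectrum-based Ochiai score ranks locations by how strongly they correlate with failing tests:

```python
from math import sqrt

def ochiai(failed_cov, passed_cov, total_failed):
    # failed_cov / passed_cov: failing / passing tests covering the location.
    denom = sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

coverage = {            # location -> (#failing, #passing) covering it; invented
    "fact A": (2, 0),
    "pred B": (2, 3),
    "fun C":  (0, 4),
}
ranked = sorted(coverage, key=lambda l: -ochiai(*coverage[l], total_failed=2))
print(ranked)  # most suspicious first: ['fact A', 'pred B', 'fun C']
```

The output of any such formula is exactly the artifact AlloyFL presents to the user: a ranked list of model locations from most to least suspicious.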

DOI: 10.1145/3468264.3473116


BiasRV: uncovering biased sentiment predictions at runtime

作者: Yang, Zhou and Asyrofi, Muhammad Hilmi and Lo, David
关键词: Ethical AI, Fairness, Runtime Verification, Sentiment Analysis

Abstract

Sentiment analysis (SA) systems, though widely applied in many domains, have been demonstrated to produce biased results. Some research has been done on automatically generating test cases to reveal unfairness in SA systems, but the community still lacks tools that can monitor and uncover biased predictions at runtime. This paper fills this gap by proposing BiasRV, the first tool to raise an alarm when a deployed SA system makes a biased prediction on a given input text. To implement this feature, BiasRV dynamically extracts a template from an input text and generates gender-discriminatory mutants (semantically-equivalent texts that only differ in gender information) from the template. Based on popular metrics used to evaluate the overall fairness of an SA system, we define the distributional fairness property for an individual prediction of an SA system. This property specifies a requirement that for one piece of text, mutants from different gender classes should be treated similarly. Verifying the distributional fairness property imposes significant overhead on the running system. To run more efficiently, BiasRV adopts a two-step heuristic: (1) sampling several mutants from each gender and checking whether the system predicts the same sentiment for them, and (2) checking distributional fairness only when the sampled mutants have conflicting results. Experiments show that, compared to directly checking the distributional fairness property for each input text, our two-step heuristic decreases the overhead of analyzing mutants by 73.81% while missing only 6.7% of biased predictions. Besides, BiasRV can be used conveniently without knowing the implementation of SA systems. Future researchers can easily extend BiasRV to detect more types of bias, e.g., race and occupation. The demo video for BiasRV can be viewed at https://youtu.be/WPe4Ml77d3U and the source code can be found at https://github.com/soarsmu/BiasRV.
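
Step (1) of the heuristic can be sketched as follows (the word-swap list and the classifier stub are invented for illustration):

```python
# Gender-swap table for generating mutants; a real tool's templates and
# vocabulary would be far more complete.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}

def mutate(text):
    # Produce a gender-swapped mutant that differs only in gender words.
    return " ".join(SWAPS.get(w, w) for w in text.split())

def sampled_mutants_conflict(text, predict):
    # Cheap first step: only if the sampled mutants disagree does the full
    # (expensive) distributional fairness check need to run.
    return predict(text) != predict(mutate(text))

biased = lambda t: "negative" if "she" in t.split() else "positive"  # stub SA
print(sampled_mutants_conflict("she is a doctor", biased))     # True: biased
print(sampled_mutants_conflict("the movie is great", biased))  # False
```

Since the black-box classifier is only ever called on texts, the same wrapper works for any deployed SA system without access to its implementation.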

DOI: 10.1145/3468264.3473117


作者: Pathmabandu, Chehara and Grundy, John and Chhetri, Mohan Baruwal and Baig, Zubair
关键词: Compliance, Informed consent, IoT, Privacy Preservation, Privacy Rights, Privacy policies, Smart Buildings, Smart office, awareness

Abstract

Smart buildings can reveal highly sensitive insights about their inhabitants and expose them to new privacy threats and vulnerabilities. Yet, convenience overrides privacy concerns and most people remain ignorant about this issue. We propose a novel Informed Consent Management Engine (ICME) that aims to: (a) increase users’ awareness about privacy issues and data collection practices in their smart building environments, (b) provide fine-grained visibility into privacy conformance and infringement by these devices, (c) recommend and visualise corrective user actions through “digital nudging”, and (d) support the monitoring and management of personal data disclosure in a shared space. We present a reference architecture for ICME that can be used by software engineers to implement diverse end-user consent management solutions for smart buildings. We also provide a proof-of-concept prototype to demonstrate how the ICME approach works in a shared smart workplace. Demo: https://youtu.be/5y6CdyWAdgY

DOI: 10.1145/3468264.3473118


StackEmo: towards enhancing user experience by augmenting stack overflow with emojis

作者: Venigalla, Akhila Sri Manasa and Chimalakonda, Sridhar
关键词: emojis, emotion analysis, latent dirichlet allocation, stack overflow

Abstract

Many novice programmers visit Stack Overflow for purposes that include posing questions and finding answers for issues they come across in the process of programming. Many questions on Stack Overflow have more than one correct answer, and these answers are accompanied by a number of comments from users. Comments help developers identify the answer that best fits their purpose. However, it is difficult to navigate through all the comments to select an answer. Adding relevant visual cues to comments could help developers prioritize the comments to be read. Comments generally convey the sentiments of users, which, when depicted visually, could motivate users to read through the comments and also help them prioritize the comments. However, the sentiment of comments is not explicitly depicted on the current Stack Overflow platform. While there exist many tools that augment or annotate the Stack Overflow platform for developers, we are not aware of tools that annotate posts with visual representations of sentiment. In this paper, we propose StackEmo as a Google Chrome plugin to augment comments on Stack Overflow with emojis, based on the sentiment of the comments posted. We evaluated StackEmo through a Likert-scale-based user survey with 30 university students to understand user perception of StackEmo. The results of the survey provided us with insights for improving StackEmo, with 83% of the participants willing to recommend the plugin to their peers. The source code and tool are available for download on GitHub at: https://github.com/rishalab/StackEmo, and the demo can be found on YouTube: https://youtu.be/BCFlqvMhTMA.
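The core augmentation step amounts to mapping each comment's predicted sentiment class to an emoji. A minimal sketch, assuming a three-class sentiment model and a pluggable `classify` hook (the mapping and names are illustrative, not StackEmo's actual code):

```python
# Assumed emoji mapping for the three usual sentiment classes.
EMOJI = {"positive": "😄", "negative": "😟", "neutral": "😐"}

def annotate(comment, classify):
    # Prefix the comment with an emoji for its sentiment class, so
    # readers can prioritize which comments to read at a glance.
    return f"{EMOJI[classify(comment)]} {comment}"
```

In the real plugin the `classify` hook would be an emotion-analysis model; here any callable returning one of the three class names works.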

DOI: 10.1145/3468264.3473119


AC²: towards understanding architectural changes in Python projects

作者: Rao, A. Eashaan and Vagavolu, Dheeraj and Chimalakonda, Sridhar
关键词: AC2, call graphs, collaboration graphs, python, software architecture, visualization tool

Abstract

Open source projects are adopting faster release cycles that reflect various changes in the software. Therefore, comprehending the effects of these changes as the software architecture evolves over multiple releases becomes necessary. However, it is challenging to keep the architecture in check while adding new changes for every release. To this end, we propose a visualization tool called AC2, which allows users to examine the alterations in the architecture at both higher and lower levels of abstraction for Python projects. AC2 uses call graphs and collaboration graphs to show the interaction between different architectural components. The tool provides four different views of the architectural changes. Users can examine two releases at a time to comprehend the architectural changes between them. AC2 can support maintainers and developers in observing changes in the project and their influence on the architecture, allowing them to examine its increasing complexity over many releases at the component level. AC2 can be downloaded from https://github.com/rishalab/AC2 and the demo can be seen at https://www.youtube.com/watch?v=GNrJfZ0RCVI.

DOI: 10.1145/3468264.3473120


csDetector: an open source tool for community smells detection

作者: Almarimi, Nuri and Ouni, Ali and Chouchen, Moataz and Mkaouer, Mohamed Wiem
关键词: Detection Tool, Software projects, community smells

Abstract

Community smells represent symptoms of sub-optimal organizational and social issues within software development communities that often lead to additional project costs and reduced software quality. Previous research identified a variety of community smells that are connected to sub-optimal patterns under different perspectives of organizational-social structures in the software development community. To detect community smells and understand the characteristics of such organizational-social structures in a project, we propose csDetector, an open source tool that is able to automatically detect community smells within a project and provide relevant socio-technical metrics. csDetector uses a machine learning-based detection approach that learns from various existing bad community development practices to provide automated support in detecting related community smells. We evaluate the effectiveness of csDetector on a benchmark of 143 open source projects from GitHub. Our results show that the csDetector tool can detect ten commonly occurring community smells in open source projects with an average F1 score of 84%. csDetector is publicly available, with a demo video, at: https://github.com/Nuri22/csDetector.
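To give a flavor of the socio-technical metrics involved, one classic community smell, the "organizational silo", can be approximated as disconnected components in the developer collaboration graph. This is a toy sketch under assumed definitions, not csDetector's actual detector or metric set:

```python
def collaboration_components(devs, edges):
    # Union-find over the collaboration graph; more than one connected
    # component suggests sub-communities that never communicate,
    # a symptom of the organizational-silo smell.
    parent = {d: d for d in devs}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(d) for d in devs})
```

A real detector would combine many such metrics (communication vs. collaboration graphs, churn, timezone data) as features for the learned model.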

DOI: 10.1145/3468264.3473121


CrossVul: a cross-language vulnerability dataset with commit data

作者: Nikitopoulos, Georgios and Dritsa, Konstantina and Louridas, Panos and Mitropoulos, Dimitris
关键词: Dataset, commit messages, security patches, vulnerabilities

Abstract

Examining the characteristics of software vulnerabilities and the code that contains them can lead to the development of more secure software. We present a dataset (∼1.4 GB) containing vulnerable source code files together with the corresponding, patched versions. Contrary to other existing vulnerability datasets, ours includes vulnerable files written in more than 40 programming languages. Each file is associated with (1) a Common Vulnerabilities and Exposures identifier (CVE ID) and (2) the repository it came from. Further, our dataset can serve as the basis for machine learning applications that identify defects, as we show in specific examples. We also present a supporting dataset that contains commit messages derived from Git commits that serve as security patches. This dataset can be used to train ML models that, in turn, can detect security patch commits, as we highlight in a specific use case.

DOI: 10.1145/3468264.3473122


Slicer4J

作者: Ahmed, Khaled and Lis, Mieszko and Rubin, Julia
关键词: dynamic slicing, Java, Program analysis

Abstract

This is the implementation of Slicer4J, an accurate, low-overhead dynamic slicer for Java programs. Slicer4J automatically generates a backward dynamic slice from a user-selected executed statement and the variables used in that statement (the slicing criterion). Slicer4J relies on Soot, which currently supports instrumenting programs compiled with up to Java 9.
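For intuition, a backward dynamic slice over an execution trace can be computed roughly like this. This is a self-contained toy, not Slicer4J's Soot-based implementation; the trace format (statement id, defined variables, used variables) is an assumption:

```python
def backward_slice(trace, criterion_index, criterion_vars):
    # trace: list of (stmt_id, defs, uses) in execution order.
    # Walk backwards from the criterion, collecting statements that
    # define variables the criterion (transitively) depends on.
    relevant, slice_ids = set(criterion_vars), set()
    for stmt, defs, uses in reversed(trace[:criterion_index + 1]):
        if defs & relevant:
            slice_ids.add(stmt)
            # Defined variables are resolved; their uses become relevant.
            relevant = (relevant - defs) | uses
    return slice_ids
```

Real dynamic slicers additionally track control dependences and heap aliasing, which this sketch omits.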

DOI: 10.1145/3468264.3473123


CrossASR++: a modular differential testing framework for automatic speech recognition

作者: Asyrofi, Muhammad Hilmi and Yang, Zhou and Lo, David
关键词: Automatic Speech Recognition, Cross-Referencing, Test Case Generation, Text-to-Speech

Abstract

Developers need to perform adequate testing to ensure the quality of Automatic Speech Recognition (ASR) systems. However, manually collecting required test cases is tedious and time-consuming. Our recent work proposes CrossASR, a differential testing method for ASR systems. This method first utilizes Text-to-Speech (TTS) to generate audios from texts automatically and then feeds these audios into different ASR systems for cross-referencing to uncover failed test cases. It also leverages a failure estimator to find failing test cases more efficiently. Such a method is inherently self-improvable: its performance can increase by leveraging more advanced TTS and ASR systems. So, in this accompanying tool demo paper, we further engineer CrossASR and propose CrossASR++, an easy-to-use ASR testing tool that can be conveniently extended to incorporate different TTS and ASR systems as well as failure estimators. We also make CrossASR++ chunk texts from a given corpus dynamically and enable the estimator to work in a more effective and flexible way. We demonstrate that the new features can help CrossASR++ discover more failed test cases. Using the same TTS and ASR systems, CrossASR++ can uncover 26.2% more failed test cases for 4 ASRs than the original tool. Moreover, by simply adding one more ASR for cross-referencing, we can increase the number of failed test cases uncovered for each of the 4 ASR systems by 25.07%, 39.63%, 20.95% and 8.17%, respectively. We also extend CrossASR++ with 5 additional failure estimators. Compared to the worst estimator, the best one can discover 10.41% more failed test cases within the same amount of time. The demo video for CrossASR++ can be viewed at https://youtu.be/ddRk-f0QV-g and the source code can be found at https://github.com/soarsmu/CrossASRplus.
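The cross-referencing idea reduces to the following sketch, with stub TTS/ASR callables standing in for real systems (function names and return shapes are assumptions, not the tool's actual API):

```python
def cross_reference(text, asrs, tts=lambda t: t):
    # tts is a stub here (identity); in the real tool it synthesizes audio.
    audio = tts(text)
    transcripts = {name: asr(audio) for name, asr in asrs.items()}
    correct = {n for n, tr in transcripts.items() if tr == text}
    # Only count a mismatch as a failed test case when at least one other
    # ASR transcribed the audio correctly -- this rules out cases where
    # the synthesized audio itself was unintelligible.
    return set() if not correct else {n for n in asrs if n not in correct}
```

When no ASR agrees with the original text, the test case is indeterminable and discarded rather than blamed on any system.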

DOI: 10.1145/3468264.3473124


Frontmatter dataset

作者: Kuznetsov, Konstantin and Fu, Chen and Gao, Song and Jansen, David N. and Zhang, Lijun and Zeller, Andreas
关键词: Android, App mining, app stores, static analysis, user interfaces

Abstract

This artifact represents the Frontmatter dataset, containing UI models (UI hierarchy and API calls) for about 160,000 Android applications. It was obtained with the help of the Frontmatter tool, a static analysis framework that automatically mines both user interface models and behavior of Android apps at a large scale with high precision. For apps that could not be processed, the reason is given. The applications were downloaded from the AndroZoo repository.

DOI: 10.1145/3468264.3473125


GenSys: a scalable fixed-point engine for maximal controller synthesis over infinite state spaces

作者: Samuel, Stanly and D’Souza, Deepak and Komondoor, Raghavan
关键词: constraint programming, fixed-points, logic, reactive synthesis

Abstract

The synthesis of maximally-permissive controllers in infinite-state systems has many practical applications. Such controllers directly correspond to maximal winning strategies in logically specified infinite-state two-player games. In this paper, we introduce GenSys, a fixed-point engine for computing maximal winning strategies for players in infinite-state safety games. A key feature of GenSys is that it leverages the capabilities of existing off-the-shelf solvers to implement its fixed-point engine. GenSys outperforms state-of-the-art tools in this space by a significant margin. Our tool has solved some of the challenging problems in this space, is scalable, and synthesizes compact controllers that are comparatively small in size and easier to comprehend. GenSys is freely available under an open-source license.
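The underlying fixed-point computation is easiest to see on a finite safety game. GenSys itself works symbolically over infinite state spaces via off-the-shelf solvers; this finite, one-player simplification (the controller picks every successor) is only an illustration of the greatest-fixed-point idea:

```python
def winning_region(safe, moves):
    # Greatest fixed point: start from all safe states and repeatedly
    # discard states from which no move keeps the play inside the
    # current set. (Real safety games alternate controller and
    # environment moves; this sketch gives the controller full control.)
    W = set(safe)
    while True:
        shrunk = {s for s in W if any(t in W for t in moves[s])}
        if shrunk == W:
            return W  # the maximal winning region
        W = shrunk
```

Every state kept at the fixed point admits a strategy that stays safe forever, and no state outside it does, which is exactly what makes the synthesized controller maximally permissive.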

DOI: 10.1145/3468264.3473126


Analysis of specifications of multiparty sessions with dcj-lint

作者: Horlings, Erik and Jongmans, Sung-Shik
关键词: Clojure, communication protocols, model checking, specifications

Abstract

Multiparty session types constitute a method to automatically detect violations of protocol implementations relative to specifications. But, when a violation is detected, does it symptomise a bug in the implementation or in the specification? This paper presents dcj-lint: an analysis tool to detect bugs in protocol specifications, based on multiparty session types. By leveraging a custom-built temporal logic model checker, dcj-lint can be used to efficiently perform: (1) generic sanity checks, and (2) protocol-specific property analyses. In our benchmarks, dcj-lint outperforms an existing state-of-the-art model checker (up to 61x faster).

DOI: 10.1145/3468264.3473127


Documenting evidence of a reuse of ‘a systematic study of the class imbalance problem in convolutional neural networks’

作者: Yedida, Rahul and Menzies, Tim
关键词: defect prediction, oversampling, replication, reuse

Abstract

We report here the reuse of oversampling, and modifications to the basic approach, in a recent TSE '21 paper by Yedida & Menzies. The method reused is the oversampling technique studied by Buda et al. These methods were studied in the SE domain (specifically, for defect prediction) and extended by Yedida & Menzies.

DOI: 10.1145/3468264.3477212


Documenting evidence of a reuse of ‘on the number of linear regions of deep neural networks’

作者: Yedida, Rahul and Menzies, Tim
关键词: deep learning, defect prediction, replication, reuse

Abstract

We report here the reuse of theoretical insights from the deep learning literature in a recent TSE '21 paper by Yedida & Menzies. The artifact replicated is the lower bound on the number of piecewise linear regions in the decision boundary of a feedforward neural network with ReLU activations, as studied by Montufar et al. We document the reuse of Theorem 4 from Montufar et al. by Yedida & Menzies.

DOI: 10.1145/3468264.3477213


Documenting evidence of a reuse of ‘a systematic literature review of techniques and metrics to reduce the cost of mutation testing’

作者: Lustosa, Andre and Menzies, Tim
关键词: mutation testing, reproduction, reuse, systematic literature review

Abstract

This submission is a report on the reuse of Pizzoleto et al.'s Systematic Literature Review by Guizzo et al.

DOI: 10.1145/3468264.3477214


Documenting evidence of a reuse of ‘RefactoringMiner 2.0’

作者: Lustosa, Andre and Menzies, Tim
关键词: bug introduction, mining software repositories, refactoring, reuse

Abstract

This submission is a report on the reuse of Tsantalis et al.'s Refactoring Miner (RMiner) package by Penta et al.

DOI: 10.1145/3468264.3477215


Documenting evidence of a reuse of ‘what is a feature? a qualitative study of features in industrial software product lines’

作者: Peng, Kewen and Menzies, Tim
关键词: Software analytics, Software configuration, Software product lines

Abstract

We report here the following example of reuse. The original paper is a prior work about features in product lines by Berger et al. The paper “Dimensions of software configuration: on the configuration context in modern software development” by Siegmund et al. reused definitions and theories about configuration features in the original paper.

DOI: 10.1145/3468264.3477216


Documenting evidence of a reuse of ‘“why should I trust you?”: explaining the predictions of any classifier’

作者: Peng, Kewen and Menzies, Tim
关键词: Actionable analysis, Software analytics

Abstract

We report here the following example of reuse. LIME is a local instance-based explanation generation framework that was originally proposed by Ribeiro et al. in their paper “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier”. The framework was reused by Peng et al. in their paper “Defect Reduction Planning (using TimeLIME)”. The paper used the original implementation of LIME as one of the core components in the proposed framework.

DOI: 10.1145/3468264.3477217


Documenting evidence of a replication of ‘populating a release history database from version control and bug tracking systems’

作者: Yang, Xueqi and Menzies, Tim
关键词: bug fixing, replication, reuse, text tagging

Abstract

We report here the use of a keyword-based and regular expression-based approach to identify bug-fixing commits by linking commit messages and issue tracker data in a recent FSE '20 paper by Penta et al. in their paper “On the Relationship between Refactoring Actions and Bugs: A Differentiated Replication”. The approach replicated is a keyword-based and regular expression-based approach as studied by Fischer et al.

DOI: 10.1145/3468264.3477218


Documenting evidence of a replication of ‘analyze this! 145 questions for data scientists in software engineering’

作者: Yang, Xueqi and Menzies, Tim
关键词: data science, replication, reuse, software analysis

Abstract

We report here the use of the 145 software engineering questions for data scientists presented in the Microsoft study in a recent FSE '20 paper by Huijgens et al. The study by Begel et al. was replicated by Huijgens et al.

DOI: 10.1145/3468264.3477219


Documenting evidence of a reproduction of ‘is there a “golden” feature set for static warning identification? — an experimental evaluation’

作者: Yang, Xueqi and Menzies, Tim
关键词: deep learning, reproduction, reuse, static analysis

Abstract

We report here the use of the static analysis dataset generated by FindBugs in a recent EMSE '21 paper by Yang et al. The artifact reproduced is supervised models to perform static analysis based on a golden feature set as studied by Wang et al.

DOI: 10.1145/3468264.3477220


Replication Package for A Replication of ‘DeepBugs: A Learning Approach to Name-based Bug Detection’

作者: Winkler, Jordan and Agarwal, Abhimanyu and Tung, Caleb and Ugalde, Dario Rios and Jung, Young Jin and Davis, James C.
关键词: deep learning, Defect detection, machine learning, replication

Abstract

Replication process, source code, and longer technical report for this paper.

DOI: 10.1145/3468264.3477221


Investigating documented information for accurate effort estimation in agile software development

作者: Pasuksmit, Jirat
关键词: Agile Software Development, Effort Estimation, Story Points

Abstract

An Agile development team estimates the effort of a work item to plan a sprint (i.e., an iteration in Scrum). Hence, reliable effort estimation helps the team create a reliable sprint plan. Prior studies proposed automated approaches to help the team estimate effort accurately. However, the effort estimated by previously proposed approaches may become inaccurate when the related information changes. In particular, when the estimated effort changes after the sprint has been planned, the sprint plan may be invalidated and the team might waste the time and effort spent in planning. This thesis aims to help Agile development teams improve the stability of effort estimation (in Story Points) while aligning with the just-in-time practice of Agile. Hence, we first conduct empirical studies using mixed-methods approaches to investigate the potential impact of the instability of Story Points. To help an Agile team achieve reliable effort estimation with optimal effort, we will develop approaches to predict future Story Points changes and future information changes, helping the team cope with uncertainty when finalizing the sprint plan.

DOI: 10.1145/3468264.3473106


Security guarantees for automated software testing

作者: Liyanage, Danushka
关键词: estimation, fuzzing, probability, software testing, statistics

Abstract

Before making important decisions about an ongoing fuzzing campaign, we believe an engineer may want to know: (i) the achieved level of confidence about the program’s correctness (residual risk), (ii) the expected increase in confidence about the program’s correctness if we invest more time for the current campaign (cost-benefit trade-off), and (iii) the total number of bugs that the fuzzer can find in the limit (effectiveness). The ability to accurately estimate the above quantities through observed data of the fuzzing campaign allows engineers to make required decisions with quantifiable accuracy. Currently, there are popular data-driven approaches to provide such quantitative guidance on decision making for white- and blackbox fuzzing campaigns. However, none of these prevailing techniques can guarantee unbiased estimation of residual risk, cost-benefit trade-off, or effectiveness for greybox fuzzing – the most popular automated software vulnerability discovery technique to date. Greybox fuzzers introduce an adaptive bias to existing estimators that needs to be corrected during the quantitative analysis to make accurate decisions about the campaign. In this thesis, our primary objective is to develop a rich statistical framework that supports quantitative decision-making for greybox fuzzing campaigns. We leverage this framework to introduce appropriate bias correction strategies to existing estimators and propose novel estimators that account for adaptive bias in greybox fuzzing.
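A classical starting point for such estimators, used in prior statistical work on fuzzing (the thesis develops bias corrections beyond it, so this is background rather than the proposed method), is the Good–Turing missing-mass estimate: the probability that the next input exercises a yet-unseen behavior is roughly f1/n, where f1 counts behaviors observed exactly once among n inputs:

```python
from collections import Counter

def residual_risk(observations):
    # Good-Turing missing-mass estimate: the fraction of inputs whose
    # observed species (path, branch, crash signature, ...) has been
    # seen exactly once so far approximates the probability that the
    # next input reveals something new.
    n = len(observations)
    f1 = sum(1 for c in Counter(observations).values() if c == 1)
    return f1 / n
```

When many behaviors are still singletons, residual risk is high and the campaign is worth continuing; as singletons vanish, the estimate approaches zero. Greybox fuzzers bias the sampling adaptively, which is exactly why this naive estimator needs the corrections the thesis targets.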

DOI: 10.1145/3468264.3473097


Unveiling multiple facets of design degradation in modern code review

作者: Uchôa, Anderson
关键词: Design Degradation, Influential Aspects, Modern Code Review

Abstract

Software design is a key concern in code review, through which developers actively discuss and improve each code change. Nevertheless, code review is predominantly a cooperative task influenced by both technical and social aspects. Consequently, these aspects can play a key role in how software design degrades, as well as contribute to accelerating or reversing degradation during the review of each single code change. However, there is little understanding of how such social and technical aspects relate to either the reduction or the increase of design degradation as the project evolves. Consequently, the scarce knowledge on this topic helps little in properly guiding developers along design-driven code reviews. Our goal in this doctoral research is three-fold: (1) to characterize the impact of code review and its practices on design degradation over time; (2) to understand the contribution of technical and social aspects to design degradation; and (3) to propose a conceptual framework to support design decision-making during code review. Our preliminary results show that the majority of code reviews had little to no design degradation impact, and that technical and social aspects contribute to distinguishing and predicting design-impactful changes.

DOI: 10.1145/3468264.3473099


Freeing hybrid distributed AI training configuration

作者: Wang, Haoran
关键词: AI Framework, Distributed Deep Learning, Performance Analysis, Systematic Partitioning

Abstract

Deep neural networks (DNNs) have become the leading technology for realizing Artificial Intelligence (AI). As DNN models become larger and more complex, so do datasets. Being able to efficiently train DNNs in parallel has become a crucial need. Data Parallelism (DP) is the most widely used solution today to accelerate DNN training, but it can be inefficient when processing DNNs with large-size parameters. Hybrid Parallelism (HP), which applies different parallel strategies to different parts of a DNN, is more efficient but requires advanced configuration. Not all AI researchers are experts in parallel computing, so automating the configuration of HP strategies is very desirable for all AI frameworks. We propose a parallel semantics analysis method, which can analyze the trade-offs among different kinds of parallelism and systematically choose HP strategies with good training time performance. We experimentally demonstrated a 260% speedup when applying our method compared to using a conventional DP approach. With our proposal, AI researchers would be able to focus more on AI algorithm research without being distracted by parallel analysis and engineering concerns.

DOI: 10.1145/3468264.3473104


Towards an approach for resource-driven adaptation

作者: Akiki, Paul A.
关键词: Resource-driven Adaptive System, Self-Adaptive System, Task Prioritisation

Abstract

Resource-driven systems have tasks that are bound by limited resources. These systems must adapt tasks that cannot gain access to sufficient resources. This dissertation proposes a new resource-driven adaptation approach, which aims to support (1) task prioritisation using multiple criteria such as the time of day at which a task should be executed, the role of the involved users, and selection of the least costly adaptation types; (2) collaboration between a human user and a software tool for preparing adapted task behaviour to be used when resources are substituted; and (3) resource extensibility and heterogeneity. The proposed approach is being implemented and will be evaluated with scenarios from enterprise applications.

DOI: 10.1145/3468264.3473098


Deployment coordination for cross-functional DevOps teams

作者: Sokolowski, Daniel
关键词: Cloud, DevOps, Infrastructure as Code, Resource Orchestration

Abstract

Software stability and reliability are the core concerns of DevOps. They are improved by tightening the collaboration between developers and operators in cross-functional teams on the one hand and by automating operations through continuous integration (CI) and infrastructure as code (IaC) on the other hand. Ideally, teams in DevOps are fully independent. Still, their applications often depend on each other in practice, requiring them to coordinate their deployment through centralization or manual coordination. With this work, we propose and implement the novel IaC solution µs ([mjuːz], “muse”), which automates deployment coordination in a decentralized fashion. µs is the first approach that is compatible with the DevOps goals as it enables truly independent operations of the DevOps teams. We define our research problem through a questionnaire survey with IT professionals and evaluate the solution by comparing it to other modern IaC approaches, assessing its performance, and applying it to existing IaC programs.

DOI: 10.1145/3468264.3473101


Lightweight verification via specialized typecheckers

作者: Kellogg, Martin
关键词: Pluggable types, accumulation analysis, lightweight verification

Abstract

Testing and other unsound analyses are developer-friendly but cannot give guarantees that programs are free of bugs. Verification and other extant sound approaches can give guarantees but often require too much effort for everyday developers. In this work, we describe our efforts to make verification more accessible for developers by using specialized pluggable typecheckers—a relatively accessible verification technology—to solve complex problems that previously required more complex and harder-to-use verification approaches.

DOI: 10.1145/3468264.3473105


Multi-location cryptographic code repair with neural-network-based methodologies

作者: Xiao, Ya
关键词: code embedding, code suggestion, cryptographic API misuse, neural networks

Abstract

Java cryptographic API libraries are error-prone and result in vulnerabilities. Fixing these vulnerabilities often requires security expertise and extra consideration for cryptographic consistency at multiple code locations. My Ph.D. research aims to help developers with a multi-location cryptographic code repair system. The proposed method relies on a precise static analysis for cryptographic code and a neural-network-based secure code generation solution. We focus on designing neural network techniques guided by program analysis to learn from secure code and give accurate suggestions. First, we conducted a comprehensive measurement study to compare cryptographic API embeddings guided by different program analysis strategies. Then, we identified two previously unreported programming-language-specific challenges: differentiating functionally similar APIs and capturing low-frequency code patterns. We address them with a specialized multi-path code suggestion architecture and a novel low-frequency enhanced sequence learning technique. Existing results show that our approach achieves significant improvements in top-1 accuracy compared with the state of the art. Our next step is cryptographic-consistency localization, which enables our multi-location code repair. We publish our data and code as a large Java cryptographic code dataset.

DOI: 10.1145/3468264.3473102


Improving the effectiveness of peer code review in identifying security defects

作者: Paul, Rajshakhar
关键词: code review, security, software development, vulnerability

Abstract

Prior studies found peer code review useful in identifying security defects. That is why most commercial and open-source software (OSS) projects have embraced peer code review and mandated its use in the software development life cycle. However, despite mandatory peer code review practices, many security-critical OSS projects such as Chromium, Mozilla, and Qt report a high number of post-release vulnerabilities to the Common Vulnerabilities and Exposures (CVE) database. Practitioners may wonder if there is a missing piece in the puzzle that leads code reviews to miss those security defects. Therefore, the primary objective of this dissertation is to improve the effectiveness of peer code review in identifying security defects. To meet this goal, I plan to empirically investigate: (i) why security defects escape code reviews, (ii) what challenges developers face in conducting effective security code reviews, (iii) how to build an effective security code review strategy, and (iv) how to make effective use of security experts during code reviews.

DOI: 10.1145/3468264.3473107


Reducing cost in continuous integration with a collection of build selection approaches

作者: Jin, Xianhao
关键词: build strategies, continuous integration, maintenance cost

Abstract

Continuous integration (CI) is a widely used practice in modern software engineering. Unfortunately, it is also an expensive practice: Google and Mozilla estimate the cost of their CI systems in the millions of dollars. To reduce CI computation cost, I propose the strategy of build selection: selectively execute the builds whose outcomes are likely failing and skip the likely passing builds to save cost. In my research, I first designed SmartBuildSkip, a build selection approach that can automatically skip unfruitful builds in CI. Next, I evaluated SmartBuildSkip against all CI-improving approaches to understand the strengths and weaknesses of existing approaches and inform future technique design. Then I proposed PreciseBuildSkip, a build selection approach that maximizes the safety of skipping builds in CI. I also combined existing approaches both within and across granularities into a new hybrid build selection approach, HybridBuildSkip, to save builds. Finally, I plan to conduct a human study to understand how to increase developers' trust in build selection approaches.
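To give a flavor of build selection, here is a simplified selection loop inspired by the "failure chain" observation (once a build fails, subsequent builds often keep failing, so they are always executed until the project goes green again). It is not SmartBuildSkip's actual predictor or feature set, and the build dict shape is an assumption; the ground-truth `outcome` field is only used to simulate what executing a build would reveal:

```python
def select_builds(builds, predict_fail):
    # Run a build when (a) we are inside a failure chain, or (b) the
    # predictor flags it as likely to fail; otherwise skip it for cost.
    executed, in_failure_chain = [], False
    for b in builds:
        if in_failure_chain or predict_fail(b):
            executed.append(b)
            # Only executed builds reveal their outcome in real CI.
            in_failure_chain = (b["outcome"] == "fail")
    return executed
```

The trade-off is visible in the loop: a weak predictor skips builds that would have failed (missed failures), while running every post-failure build bounds how long a missed failure chain can go unnoticed.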

DOI: 10.1145/3468264.3473103


A live environment for inspection and refactoring of software systems

作者: Fernandes, Sara
关键词: code quality, code smells, liveness, refactoring, visualization

Abstract

Refactoring helps improve the design of software systems, making them more readable, maintainable, cleaner, and easier to extend. Most existing tools in this area allow developers to select and execute the best refactoring techniques for a particular programming context. However, they are not interactive or immediate enough, providing a poor programming experience. To fill this gap, we combine the notion of liveness with refactoring methods. Live Refactoring makes it possible to know continuously, while programming, which blocks of code we should refactor and why they were classified as problematic. Therefore, it shortens the time needed to create high-quality systems, thanks to early and continuous refactoring feedback, support, and guidance. This paper presents our research project on a live refactoring environment. This environment is focused on a refactoring tool that aims to explore the concept of Live Refactoring and its main components: recommendation, visualization, and application.

DOI: 10.1145/3468264.3473100


PorkFuzz: testing stateful software-defined network applications with property graphs

Authors: Shou, Chaofan
Keywords: coverage-guided SDN fuzzing, network validation

Abstract

This paper proposes a state-aware fuzzing framework for testing software-defined network (SDN) applications. It leverages a property graph to store fuzzing results, so application developers can easily express oracles in a graph query language to test their applications. The graph representation also allows the oracles to analyze the fuzzing results efficiently.
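The property-graph idea can be sketched in a few lines. This is an illustrative stand-in (the node labels, properties, and oracle below are hypothetical, and PorkFuzz uses a real graph query language rather than Python predicates): fuzzing events and resulting application states become nodes, and an oracle is a query over the edges between them.

```python
# Illustrative sketch: fuzzing results stored as a tiny property graph,
# with an oracle expressed as a query over it. All names are hypothetical.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node id -> property dict
        self.edges = []   # (src, label, dst) triples

    def add_node(self, nid, **props):
        self.nodes[nid] = props

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def match(self, label, pred):
        """Return (src, dst) pairs of `label` edges whose destination
        node satisfies `pred` -- a toy analogue of a graph query."""
        return [(s, d) for s, l, d in self.edges
                if l == label and pred(self.nodes[d])]

g = PropertyGraph()
g.add_node("pkt1", type="packet_in", switch="s1")
g.add_node("state1", type="app_state", flow_rules=0)
g.add_edge("pkt1", "leads_to", "state1")

# Oracle: a packet-in event should never leave the app with zero flow rules.
violations = g.match("leads_to", lambda n: n.get("flow_rules") == 0)
print(violations)  # [('pkt1', 'state1')]
```

Storing results as a graph means each new oracle is just another query; no re-fuzzing is needed to check a new property.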

DOI: 10.1145/3468264.3473487


A qualitative study of cleaning in Jupyter notebooks

Authors: Dong, Helen
Keywords: Empirical Study, Mining Repositories, Software Engineering for AI

Abstract

Data scientists commonly use computational notebooks because they provide a good environment for experimenting with multiple models. However, once the code is complete and the ideal model found, the data scientist must dedicate time to cleaning up the code so that others can understand it. In this paper, we perform a qualitative study of how data scientists clean their code, with the aim of informing a tool that automates this process. Our end goal is for tool builders to address the gaps we identify and provide additional aid to data scientists, who can then focus on their actual work rather than routine and tedious cleaning duties.

DOI: 10.1145/3468264.3473490


Automated generation of realistic test inputs for web APIs

Authors: Alonso, Juan C.
Keywords: RESTful API, Web of Data, automated testing, knowledge base, realistic test input

Abstract

Testing web APIs automatically requires generating input data values such as addresses, coordinates, or country codes. Generating meaningful values for these types of parameters at random is rarely feasible, which poses a major obstacle for current test case generation approaches. In this paper, we present ARTE, the first semantic-based approach for the Automated generation of Realistic TEst inputs for web APIs. Specifically, ARTE leverages the specification of the API under test to search knowledge bases such as DBpedia for meaningful test inputs for the API parameters. Our approach has been integrated into RESTest, a state-of-the-art tool for API testing, achieving an unprecedented level of automation that allows it to generate up to 100% more valid API calls than existing fuzzing techniques (30% more on average). Evaluation results on a set of 26 real-world APIs show that ARTE can generate realistic inputs for 7 out of every 10 parameters, outperforming related approaches.
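The general idea can be sketched as follows. This is a hedged approximation: ARTE's real pipeline issues SPARQL queries against DBpedia based on the API specification, whereas here the knowledge base is a hard-coded stub and all parameter names are hypothetical.

```python
# Sketch of semantic input generation: map each spec parameter name to
# realistic candidate values from a knowledge base. MOCK_KB stands in for
# a real query against DBpedia; its contents are invented for illustration.

MOCK_KB = {
    "country_code": ["US", "FR", "JP"],
    "city": ["Paris", "Tokyo", "Seville"],
}

def realistic_inputs(spec_params):
    """Return candidate realistic values per parameter, if the knowledge
    base knows the concept; unknown parameters get an empty list."""
    return {p: MOCK_KB.get(p, []) for p in spec_params}

inputs = realistic_inputs(["country_code", "city", "opaque_token"])
print(inputs["city"])          # ['Paris', 'Tokyo', 'Seville']
print(inputs["opaque_token"])  # [] -- caller falls back to random values
```

A random generator has essentially no chance of producing a valid `country_code`, which is why semantically grounded values raise the proportion of valid API calls.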

DOI: 10.1145/3468264.3473491


Contextualizing toxicity in open source: a qualitative study

Authors: Cohen, Sophie
Keywords: Open source, classifier, sentiment analysis, toxicity

Abstract

In this paper, we study toxic interactions in the issue discussions of open-source communities. Our goal is to qualitatively understand how toxicity impacts open-source communities such as those on GitHub. We are motivated by users' complaints about toxicity, which leads to burnout and disengagement from the platform. We collect a substantial sample of toxic interactions and qualitatively analyze their characteristics to ground future discussions and intervention design.

DOI: 10.1145/3468264.3473492


Accelerating redundancy-based program repair via code representation learning and adaptive patch filtering

Authors: Yang, Chen
Keywords: automated program repair, patch filtering, representation learning

Abstract

Automated program repair (APR) has attracted extensive attention, and many APR techniques have been proposed recently, among which redundancy-based techniques have achieved great success. However, they still suffer from an efficiency issue caused mainly by inaccurate measurement of code similarity, which can produce meaningless patches that hinder the generation and validation of correct patches. To address this issue, we propose AccPR, a novel method that leverages code representation learning to measure code similarity and employs adaptive patch filtering to accelerate redundancy-based APR. We have implemented a prototype of AccPR and integrated it with a state-of-the-art APR tool, SimFix, improving efficiency by 47.85% on average, which indicates that AccPR is promising.
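The core mechanism can be sketched with a toy similarity measure. This is not AccPR's learned representation (it learns embeddings; the bag-of-tokens vector and the 0.5 threshold below are illustrative assumptions), but it shows how similarity-ranked donor snippets plus a filter cut down the patch space:

```python
# Sketch of similarity-driven patch filtering: rank donor snippets by
# similarity to the buggy code and discard implausible ones. The
# bag-of-tokens embedding and threshold are toy stand-ins for AccPR's
# learned code representation and adaptive filter.

import math
from collections import Counter

def embed(code):
    """Toy bag-of-tokens vector; AccPR instead learns a representation."""
    return Counter(code.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

buggy = "if x > 0 : return x"
donors = ["if x >= 0 : return x", "while y : y = y - 1"]

sims = {d: cosine(embed(buggy), embed(d)) for d in donors}
plausible = [d for d, s in sorted(sims.items(), key=lambda kv: -kv[1])
             if s > 0.5]
print(plausible)  # only the similar `if` donor survives the filter
```

Fewer, better-ranked candidates mean fewer meaningless patches reach the expensive validation step, which is where the efficiency gain comes from.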

DOI: 10.1145/3468264.3473496


SMT solver testing with type and grammar based mutation

Authors: Park, Jiwon
Keywords: SMT solver, fuzzing, test case generation

Abstract

Satisfiability Modulo Theories (SMT) solvers are at the core of many software advances, such as program analysis and verification, that are highly safety-critical. Hence, to ensure the correctness of the solvers, there have been multiple fuzzing campaigns targeting different logics since 2009. In this paper, we propose generative type-aware mutation, a generalization of type-aware operator mutation. We have realized generative type-aware mutation and, within five months, reported 158 bugs in Z3 and CVC4, including bugs in versions released as early as 2016.

DOI: 10.1145/3468264.3473497


Overcoming metric diversity in meta-analysis for software engineering: proposed approach and a case study on its usage on the effects of software reuse

Authors: Daniakin, Kirill
Keywords: meta-analysis, software engineering, software reuse

Abstract

This work addresses the problem of metric diversity in meta-analysis for Software Engineering by clustering studies using input-output tables and by vote-counting. Diversity arises when researchers measuring the same phenomena use different, typically incomparable, metrics, making a direct analysis of effects and their sizes impossible. Additionally, this work discusses an application of the proposed approach to the case of Software Reuse.
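Vote-counting sidesteps metric diversity because the direction of an effect is comparable even when effect sizes are not. A minimal sketch, with entirely hypothetical study data:

```python
# Sketch of vote-counting across metric-diverse studies: each study
# contributes only the direction of its reported effect. The studies
# below are invented for illustration.

from collections import Counter

studies = [
    {"metric": "defect density",        "direction": "positive"},
    {"metric": "productivity",          "direction": "positive"},
    {"metric": "cyclomatic complexity", "direction": "neutral"},
    {"metric": "time to market",        "direction": "positive"},
]

votes = Counter(s["direction"] for s in studies)
print(votes.most_common(1)[0])  # ('positive', 3)
```

The cost of this comparability is information loss: vote-counting discards effect magnitudes, which is why the approach pairs it with clustering by input-output tables.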

DOI: 10.1145/3468264.3473488


A general approach to modeling Java framework behaviors

Authors: Luo, Linghui
Keywords: Java, call graph, framework modeling, static analysis, taint analysis

Abstract

Interprocedural static analysis tools, such as security analyses, need good call graphs, which are challenging to scale for framework-based applications. Most tools therefore model frameworks rather than analyze them. These models are manually crafted to capture the framework semantics crucial for a particular analysis, and are inherently incomplete. We propose a general approach to modeling Java frameworks. It is not limited to any particular framework or analysis tool and is therefore highly reusable. While a generic approximation can be noisy, we show that our carefully constructed one performs well. Experiments on Android with a client taint analysis show that our approach produces more complete call graphs than the original analysis. As a result, the client analysis works better: both precision (from 0.83 to 0.86) and recall (from 0.20 to 0.31) are improved.
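To see why modeling works at all, consider the common "synthetic harness" idea: instead of analyzing framework internals, the analysis is given stub code that conservatively invokes application callbacks. The sketch below is a generic illustration of that idea in Python (the paper targets Java and does not necessarily build its models this way; all names are hypothetical):

```python
# Sketch of the framework-modeling idea: a synthetic entry point that
# over-approximates how a framework drives application callbacks, giving
# a call-graph builder edges into application code. Names are invented.

class Activity:  # application class, normally driven by the framework
    def __init__(self):
        self.called = []

    def onCreate(self):
        self.called.append("onCreate")

    def onResume(self):
        self.called.append("onResume")

def synthetic_main(app):
    """Conservatively invoke every lifecycle callback the app defines."""
    for cb in ("onCreate", "onStart", "onResume", "onPause", "onDestroy"):
        method = getattr(app, cb, None)
        if callable(method):
            method()

app = Activity()
synthetic_main(app)  # analysis entry point instead of the real framework
print(app.called)    # ['onCreate', 'onResume']
```

The "noise" the abstract mentions is visible here: the harness invokes callbacks the framework might never actually trigger, trading some precision for a complete set of call edges.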

DOI: 10.1145/3468264.3473489


Discovering repetitive code changes in ML systems

Authors: Dilhara, Malinda
Keywords: Code change patterns, Empirical analysis, Machine learning

Abstract

As in other software systems, ML software systems evolve with many repetitive changes. Although some research and tooling for repetitive code changes exists for Java and other languages, such tools are lacking for Python. Given the significant rise of ML software development, and the fact that many ML developers are not professionally trained developers, the lack of software evolution tools for ML code is even more critical. To bring the ML developers' toolset up to date, we implemented an approach that adapts and reuses the vast ecosystem of Java static analysis tools for Python. Using this approach, we adapted two software evolution tools, RefactoringMiner and CPATMiner, to Python. With these tools, we conducted the first and most fine-grained study of code change patterns in 59 ML systems and surveyed 253 developers. We recommend empirically justified, actionable opportunities for tool builders and release the tools for researchers.

DOI: 10.1145/3468264.3473493


Does reusing pre-trained NLP model propagate bugs?

Authors: Chakraborty, Mohna
Keywords: BERT, Bug, Deep Learning, NLP, Reuse

Abstract

In this digital era, textual content has become a seemingly ubiquitous part of our lives. Natural Language Processing (NLP) empowers machines to comprehend the intricacies of textual data and eases human-computer interaction. Advances in language modeling and continual learning, together with the availability of large amounts of linguistic data and large-scale computational power, have made it feasible to train models for downstream text analysis tasks, including safety-critical ones in, e.g., medicine and airlines. Compared to other deep learning (DL) models, NLP-based models are widely reused for various tasks. However, reusing a pre-trained model in a new setting is still a complex task due to limitations of the training dataset, model structure, specification, usage, etc. With this motivation, we study BERT, a widely used language model (LM), from the perspective of reuse in client code. We mined 80 Stack Overflow posts related to BERT and found 4 types of bugs in clients' code. Our results show that 13.75% are fairness bugs, 28.75% are parameter bugs, 15% are token bugs, and 16.25% are version-related bugs.

DOI: 10.1145/3468264.3473494


Mitigating security attacks in kubernetes manifests for security best practices violation

Authors: Shamim, Shazibul Islam
Keywords: attack mitigation, attacks, cloud computing, compromised user, configuration, container, denial of service, devops, devsecops, dos, kubernetes, manifests, misconfiguration, secure software engineering, security, security policies, security practices, software security

Abstract

Kubernetes is an open-source software system that helps practitioners automatically deploy, scale, and manage containerized applications. Information technology (IT) organizations, such as IBM, Spotify, and Capital One, use Kubernetes to manage their containers and have reported benefits in the deployment process. However, recent security breaches and survey results among practitioners suggest that Kubernetes deployments can be vulnerable to attacks due to misconfiguration and failure to follow security best practices. This research explores how malicious users can mount security exploits through violations of Kubernetes security best practices. We explore how attacks, such as denial-of-service attacks, can be conducted against one such violation in Kubernetes manifests. In addition, we investigate potential exploits in the Kubernetes cluster in order to propose mitigation strategies for securing it.
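One widely documented best-practice violation of the kind discussed above is a container with no resource limits, which a compromised workload can abuse to starve the node (a denial-of-service vector). A minimal illustrative checker (not the paper's tool; the manifest fragment is hypothetical and assumed already parsed from YAML):

```python
# Illustrative sketch: flag containers in a Kubernetes manifest that lack
# resource limits, a common misconfiguration enabling denial of service.
# The Deployment fragment below is hypothetical.

manifest = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "nginx"},  # no limits: DoS risk
        {"name": "api", "image": "api:1.0",
         "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
    ]}}},
}

def missing_limits(m):
    """Return names of containers without `resources.limits` set."""
    containers = m["spec"]["template"]["spec"]["containers"]
    return [c["name"] for c in containers
            if not c.get("resources", {}).get("limits")]

print(missing_limits(manifest))  # ['web']
```

Checks of this shape run at review time or as an admission control step, which is how manifest misconfigurations can be caught before they reach the cluster.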

DOI: 10.1145/3468264.3473495




Contents
  1. xurongchen/fse20: Artifacts for FSE2020 paper#633
    1. Abstract
  • java-ranger: v1.0.0
    1. Abstract
  • docable/docable v1.1
    1. Abstract
  • Programming and execution models for next generation code intelligence systems (keynote)
    1. Abstract
  • The 4ps: product, process, people, and productivity: a data-driven approach to improve software engineering (keynote)
    1. Abstract
  • Interactive analysis of large code bases (invited talk)
    1. Abstract
  • Managers hate uncertainty: good and bad experiences with adaptive project management (invited talk)
    1. Abstract
  • Industrial best practices for continuous integration (CI) and continuously delivery (CD) (invited talk)
    1. Abstract
  • Huawei’s practices on trusted software engineering capability improvement (invited talk)
    1. Abstract
  • Hazard Trees for Human-on-the-Loop Interactions in sUAS Systems
    1. Abstract
  • UAV bugs dataset and taxonomy
    1. Abstract
  • Code integrity attestation for PLCs using black box neural network predictions
    1. Abstract
  • Artifact for “PHYSFRAME: Type Checking Physical Frames of Reference for Robotic Systems”
    1. Abstract
  • Automating Serverless Deployments for DevOps Organizations: Root Artifact
    1. Abstract
  • Replication package for article: Algebraic-Datatype Taint Tracking, with Applications to Understanding Android Identifier Leaks
    1. Abstract
  • Vet: identifying and avoiding UI exploration tarpits
    1. Abstract
  • Checking conformance of applications against GUI policies
    1. Abstract
  • Data-driven accessibility repair revisited: on the effectiveness of generating labels for icons in Android apps
    1. Abstract
  • Replication Package for Article: Benchmarking Automated GUI Testing for Android against Real-World Bugs
    1. Abstract
  • Replication of checking LTL[F,G,X] on compressed traces in polynomial time
    1. Abstract
  • Conditional interpolation: making concurrent program verification more effective
    1. Abstract
  • Benchmark for Paper: AlloyMax: Bringing Maximum Satisfaction to Relational Specifications
    1. Abstract
  • MoD2: Model-guided Deviation Detector
    1. Abstract
  • Artifact for “Lightweight and Modular Resource Leak Verification”
    1. Abstract
  • zhangmx1997/fse21-jsisolate-artifact: JSIsolate version 1.1.0
    1. Abstract
  • Code and Data Repository for Article: Cross-Language Code Search using Static and Dynamic Analyses
    1. Abstract
  • Automating the removal of obsolete TODO comments
    1. Abstract
  • Estimating Residual Risk in Greybox Fuzzing - Artifacts
    1. Abstract
  • HeteroFuzz: fuzz testing to detect platform dependent divergence for heterogeneous applications
    1. Abstract
  • Sound and efficient concurrency bug prediction
    1. Abstract
  • ObjLupAnsys
    1. Abstract
  • Detecting concurrency vulnerabilities based on partial orders of memory and thread events
    1. Abstract
  • Vulnerability detection with fine-grained interpretations
    1. Abstract
  • Identifying casualty changes in software patches
    1. Abstract
  • Software tools for the paper - ACHyb: A Hybrid Analysis Approach to Detect Kernel Access Control Vulnerabilities
    1. Abstract
  • Replication Package for Article: Context-Aware and Data-Driven Feedback Generation for Programming Assignments
    1. Abstract
  • A Replication of “A Syntax-Guided Edit Decoder for Neural Program Repair”
    1. Abstract
  • VarFix: balancing edit expressiveness and search effectiveness in automated program repair
    1. Abstract
  • Flaky test detection in Android via event order exploration
    1. Abstract
  • Artifact for Article: SmartCommit: A Graph-Based Interactive Assistant for Activity-Oriented Commits
    1. Abstract
  • Data for article: A First Look at Developers’ Live Chat on Gitter
    1. Abstract
  • Reel life vs. real life: how software developers share their daily life through vlogs
    1. Abstract
  • An empirical study on challenges of application development in serverless computing
    1. Abstract
  • Bias in machine learning software: why? how? what to do?
    1. Abstract
  • Artifact for Article (SIVAND): Understanding Neural Code Intelligence Through Program Simplification
    1. Abstract
  • Package for Article: Multi-objectivizing Software Configuration Tuning
    1. Abstract
  • Embedding app-library graph for neural third party library recommendation
    1. Abstract
  • Replication Package for ESEC/FSE 2021 Paper “A Large-Scale Empirical Study of Java Library Migrations: Prevalence, Trends, and Rationales”
    1. Abstract
  • Learning-based extraction of first-order logic representations of API directives
    1. Abstract
  • Replication Data for: DIFFBASE: A Differential Factbase for Effective Software Evolution Management
    1. Abstract
  • Would you like a quick peek? providing logging support to monitor data processing in big data applications
    1. Abstract
  • Identifying bad software changes via multimodal anomaly detection for online service systems
    1. Abstract
  • JMocker: An Automatic Refactoring Framework for ReplacingTest-Production Inheritance by Mocking Mechanism
    1. Abstract
  • Implementation of the Detection Tool: \DH{
    1. Abstract
  • iBatch: saving Ethereum fees via secure and cost-effective batching of smart-contract invocations
    1. Abstract
  • Replication Package for smartExpander
    1. Abstract
  • Validation on machine reading comprehension software without annotated labels: a property-based method
    1. Abstract
  • FLEX: fixing flaky tests in machine learning projects by updating assertion bounds
    1. Abstract
  • PFPSanitizer - A Parallel Shadow Execution Tool for Debugging Numerical Errors
    1. Abstract
  • Exposing numerical bugs in deep learning via gradient back-propagation
    1. Abstract
  • Metamorphic testing of Datalog engines
    1. Abstract
  • Replication package for FSE '21, Synthesis of Web Layouts from Examples
    1. Abstract
  • Boosting coverage-based fault localization via graph-based representation learning
    1. Abstract
  • SynGuar: Guaranteeing Generalization in Programming by Example (Artifact)
    1. Abstract
  • Code for Article: StateFormer: Fine-Grained Type Recovery from Binaries using Generative State Modeling
    1. Abstract
  • Empirical study of transformers for source code
    1. Abstract
  • Explaining mispredictions of machine learning models using rule induction
    1. Abstract
  • Generalizable and interpretable learning for configuration extrapolation
    1. Abstract
  • Replication Package of the Article: Lightweight Global and Local Contexts Guided Method Name Recommendation with Prior Knowledge
    1. Abstract
  • To read or to rotate? comparing the effects of technical reading training and spatial skills training on novice programming ability
    1. Abstract
  • Connecting the dots: rethinking the relationship between code and prose writing with functional connectivity
    1. Abstract
  • LastPyMile Replication Package
    1. Abstract
  • A grounded theory of the role of coordination in software security patch management
    1. Abstract
  • TaintStream: fine-grained taint tracking for big data platforms through dynamic code translation
    1. Abstract
  • Demystifying “bad” error messages in data science libraries
    1. Abstract
  • NIL: large-scale detection of large-variance clones
    1. Abstract
  • ReqRacer: Dynamic framework for detecting and exposing server-side request races in database-backed web applications
    1. Abstract
  • Detecting and localizing keyboard accessibility failures in web applications
    1. Abstract
  • Swarmbug: debugging configuration bugs in swarm robotics
    1. Abstract
  • Artifact for article: Probabilistic Delta Debugging
    1. Abstract
  • Artifact from “Finding Broken Linux Configuration Specifications by Statically Analyzing the Kconfig Language”
    1. Abstract
  • Source code package for ‘Semantic Bug Seeding: A Learning-Based Approach for Creating Realistic Bugs’
    1. Abstract
  • Characterizing search activities on stack overflow
    1. Abstract
  • Authorship attribution of source code: a language-agnostic approach and applicability in software engineering
    1. Abstract
  • Probing model signal-awareness via prediction-preserving input minimization
    1. Abstract
  • Generating efficient solvers from constraint models
    1. Abstract
  • Replication Package for Article: “A Comprehensive Study of Deep Learning Compiler Bugs”
    1. Abstract
  • Replication Package for “Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline”
    1. Abstract
  • Replication package for article: Fairea: A Model Behaviour Mutation Approach to Benchmarking Bias Mitigation Methods
    1. Abstract
  • Library and Demo for Article: Feature Trace Recording
    1. Abstract
  • Data and script for the paper “A Longitudinal Analysis of Bloated Java Dependencies”
    1. Abstract
  • CSO Dataset Analysis for Article: XAI Tools in the Public Sector: A Case Study on Predicting Combined Sewer Overflows
    1. Abstract
  • How disabled tests manifest in test maintainability challenges?
    1. Abstract
  • Sustainability forecasting for Apache incubator projects
    1. Abstract
  • Graph-based seed object synthesis for search-based unit testing
    1. Abstract
  • LS-sampling: an effective local search based sampling approach for achieving high t-wise coverage
    1. Abstract
  • Replication Package for Article: GLIB: Towards Automated Test Oracle for Graphically-Rich Applications
    1. Abstract
  • Reassessing automatic evaluation metrics for code summarization tasks
    1. Abstract
  • Toward efficient interactions between Python and native libraries
    1. Abstract
  • Accelerating JavaScript Static Analysis via Dynamic Shortcuts (Artifact Evaluation)
    1. Abstract
  • Skeletal approximation enumeration for SMT solver testing
    1. Abstract
  • Boosting Static Analysis Accuracy With Instrumented Test Executions (Paper Artifact)
    1. Abstract
  • Replication Package for Paper: Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis
    1. Abstract
  • IDE support for cloud-based static analyses
    1. Abstract
  • Replication package for the paper: A Bounded Symbolic-Size Model for Symbolic Execution
    1. Abstract
  • Efficient Module-Level Dynamic Analysis for Dynamic Languages with Module Recontextualization (Lya Artifact)
    1. Abstract
  • Mono2Micro: a practical and effective tool for decomposing monolithic Java applications to microservices
    1. Abstract
  • Data-driven test selection at scale
    1. Abstract
  • Effective low capacity status prediction for cloud systems
    1. Abstract
  • Automated code transformation for context propagation in Go
    1. Abstract
  • Onion: identifying incident-indicating logs for cloud systems
    1. Abstract
  • Generating metamorphic relations for cyber-physical systems with genetic programming: an industrial case study
    1. Abstract
  • Domain adaptation for an automated classification of deontic modalities in software engineering contracts
    1. Abstract
  • How can manual testing processes be optimized? developer survey, optimization guidelines, and case studies
    1. Abstract
  • Turnover-induced knowledge loss in practice
    1. Abstract
  • One thousand and one stories: a large-scale survey of software refactoring
    1. Abstract
  • A comprehensive study on learning-based PE malware family classification methods
    1. Abstract
  • Infiltrating security into development: exploring the world’s largest software security study
    1. Abstract
  • Data-driven extract method recommendations: a study at ING
    1. Abstract
  • Duplicated code pattern mining in visual programming languages
    1. Abstract
  • Making smart contract development more secure and easier
    1. Abstract
  • Quantifying no-fault-found test failures to prioritize inspection of flaky tests at Ericsson
    1. Abstract
  • When life gives you oranges: detecting and diagnosing intermittent job failures at Mozilla
    1. Abstract
  • FuzzBench: an open fuzzer benchmarking platform and service
    1. Abstract
  • An empirical investigation of practical log anomaly detection for online service systems
    1. Abstract
  • RAPID: checking API usage for the cloud in the cloud
    1. Abstract
  • An empirical study of GUI widget detection for industrial mobile games
    1. Abstract
  • Intelligent container reallocation at Microsoft 365
    1. Abstract
  • Organizational implications of agile adoption: a case study from the public sector
    1. Abstract
  • Towards immersive software archaeology: regaining legacy systems’ design knowledge via interactive exploration in virtual reality
    1. Abstract
  • Replication Package for Article: Reducing the Search Space of Bug Inducing Commits using Failure Coverage
    1. Abstract
  • The gas triangle and its challenges to the development of blockchain-powered applications
    1. Abstract
  • Selecting test inputs for DNNs using differential testing with subspecialized model instances
    1. Abstract
  • Term interrelations and trends in software engineering
    1. Abstract
  • Software robustness: a survey, a theory, and prospects
    1. Abstract
  • Towards automating code review at scale
    1. Abstract
  • Learning type annotation: is big data enough?
    1. Abstract
  • New visions on metamorphic testing after a quarter of a century of inception
    1. Abstract
  • Health of smart ecosystems
    1. Abstract
  • LLSC: a parallel symbolic execution compiler for LLVM IR
    1. Abstract
  • OwlEyes-online: a fully automated platform for detecting and localizing UI display issues
    1. Abstract
  • Exploit those code reviews! bigger data for deeper learning
    1. Abstract
  • BRAID: an API recommender supporting implicit user feedback
    1. Abstract
  • KGAMD: an API-misuse detector driven by fine-grained API-constraint knowledge graph
    1. Abstract
  • Sangrahaka: a tool for annotating and querying knowledge graphs
    1. Abstract
  • Code2Que: a tool for improving question titles from mined code snippets in stack overflow
    1. Abstract
  • BF-detector: an automated tool for CI build failure detection
    1. Abstract
  • AlloyFL: a fault localization framework for Alloy
    1. Abstract
  • BiasRV: uncovering biased sentiment predictions at runtime
    1. Abstract
  • ICME: an informed consent management engine for conformance in smart building environments
    1. Abstract
  • StackEmo: towards enhancing user experience by augmenting stack overflow with emojis
    1. Abstract
  • AC²: towards understanding architectural changes in Python projects
    1. Abstract
  • csDetector: an open source tool for community smells detection
    1. Abstract
  • CrossVul: a cross-language vulnerability dataset with commit data
    1. Abstract
  • Slicer4J
    1. Abstract
  • CrossASR++: a modular differential testing framework for automatic speech recognition
    1. Abstract
  • Frontmatter dataset
    1. Abstract
  • GenSys: a scalable fixed-point engine for maximal controller synthesis over infinite state spaces
    1. Abstract
  • Analysis of specifications of multiparty sessions with dcj-lint
    1. Abstract
  • Documenting evidence of a reuse of ‘a systematic study of the class imbalance problem in convolutional neural networks’
    1. Abstract
  • Documenting evidence of a reuse of ‘on the number of linear regions of deep neural networks’
    1. Abstract
  • Documenting evidence of a reuse of ‘a systematic literature review of techniques and metrics to reduce the cost of mutation testing’
    1. Abstract
  • Documenting evidence of a reuse of ‘RefactoringMiner 2.0’
    1. Abstract
  • Documenting evidence of a reuse of ‘what is a feature? a qualitative study of features in industrial software product lines’
    1. Abstract
  • Documenting evidence of a reuse of ‘“why should I trust you?”: explaining the predictions of any classifier’
    1. Abstract
  • Documenting evidence of a replication of ‘populating a release history database from version control and bug tracking systems’
    1. Abstract
  • Documenting evidence of a replication of ‘analyze this! 145 questions for data scientists in software engineering’
    1. Abstract
  • Documenting evidence of a reproduction of ‘is there a “golden” feature set for static warning identification? — an experimental evaluation’
    1. Abstract
  • Replication Package for A Replication of ‘DeepBugs: A Learning Approach to Name-based Bug Detection’
    1. Abstract
  • Investigating documented information for accurate effort estimation in agile software development
    1. Abstract
  • Security guarantees for automated software testing
    1. Abstract
  • Unveiling multiple facets of design degradation in modern code review
    1. Abstract
  • Freeing hybrid distributed AI training configuration
    1. Abstract
  • Towards an approach for resource-driven adaptation
    1. Abstract
  • Deployment coordination for cross-functional DevOps teams
    1. Abstract
  • Lightweight verification via specialized typecheckers
    1. Abstract
  • Multi-location cryptographic code repair with neural-network-based methodologies
    1. Abstract
  • Improving the effectiveness of peer code review in identifying security defects
    1. Abstract
  • Reducing cost in continuous integration with a collection of build selection approaches
    1. Abstract
  • A live environment for inspection and refactoring of software systems
    1. Abstract
  • PorkFuzz: testing stateful software-defined network applications with property graphs
    1. Abstract
  • A qualitative study of cleaning in Jupyter notebooks
    1. Abstract
  • Automated generation of realistic test inputs for web APIs
    1. Abstract
  • Contextualizing toxicity in open source: a qualitative study
    1. Abstract
  • Accelerating redundancy-based program repair via code representation learning and adaptive patch filtering
    1. Abstract
  • SMT solver testing with type and grammar based mutation
    1. Abstract
  • Overcoming metric diversity in meta-analysis for software engineering: proposed approach and a case study on its usage on the effects of software reuse
    1. Abstract
  • A general approach to modeling Java framework behaviors
    1. Abstract
  • Discovering repetitive code changes in ML systems
    1. Abstract
  • Does reusing pre-trained NLP model propagate bugs?
    1. Abstract
  • Mitigating security attacks in kubernetes manifests for security best practices violation
    1. Abstract