ICSE 2020

Learning-to-rank vs ranking-to-learn: strategies for regression testing in continuous integration

Authors: Bertolino, Antonia and Guerriero, Antonio and Miranda, Breno and Pietrantuono, Roberto and Russo, Stefano
Keywords: continuous integration, machine learning, regression testing, test prioritization, test selection

Abstract

In Continuous Integration (CI), regression testing is constrained by the time between commits. This demands careful selection and/or prioritization of test cases within test suites too large to be run entirely. To this aim, some Machine Learning (ML) techniques have been proposed as an alternative to deterministic approaches. Two broad strategies for ML-based prioritization are learning-to-rank and what we call ranking-to-learn (i.e., reinforcement learning). Various ML algorithms can be applied in each strategy. In this paper we introduce ten such algorithms for adoption in CI practices, and perform a comprehensive study comparing them against each other using subjects from the Apache Commons project. We analyze the influence of several features of the code under test and of the test process. The results allow us to draw criteria to support testers in selecting and tuning the technique that best fits their context.
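To make the "ranking-to-learn" strategy concrete, here is a minimal Python sketch (not the paper's implementation; the function names and the exponential update rule are illustrative assumptions) in which tests are ordered by a learned failure estimate that is reinforced after each CI cycle:

```python
def prioritize(tests, scores):
    """Order tests by their current failure estimate, highest first."""
    return sorted(tests, key=lambda t: scores.get(t, 0.0), reverse=True)

def update(scores, results, alpha=0.5):
    """Reinforcement-style update: tests that just failed (i.e., revealed
    a regression) are rewarded and move up in future rankings."""
    for test, failed in results.items():
        prev = scores.get(test, 0.0)
        scores[test] = (1 - alpha) * prev + alpha * (1.0 if failed else 0.0)
    return scores
```

Each CI cycle the agent runs the top-ranked tests within the time budget, observes which fail, and updates the ranking; the ordering is thus learned online, rather than trained offline on historical data as in learning-to-rank.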

DOI: 10.1145/3377811.3380369


A cost-efficient approach to building in continuous integration

Authors: Jin, Xianhao and Servant, Francisco
Keywords: build prediction, continuous integration, maintenance cost

Abstract

Continuous integration (CI) is a widely used practice in modern software engineering. Unfortunately, it is also an expensive practice: Google and Mozilla estimate that their CI systems cost millions of dollars. In this paper, we propose a novel approach for reducing the cost of CI. The cost of CI lies in the computing power to run builds, and its value mostly lies in letting developers find bugs early, when their size is still small. Thus, we target reducing the number of builds that CI executes while still executing as many failing builds as early as possible. To achieve this goal, we propose SmartBuildSkip, a technique which predicts the first builds in a sequence of build failures and the remaining build failures separately. SmartBuildSkip is customizable, allowing developers to select different preferred trade-offs of saving many builds vs. observing build failures early. We evaluate the motivating hypothesis of SmartBuildSkip, its prediction power, and its cost savings in a realistic scenario. In its most conservative configuration, SmartBuildSkip saved a median of 30% of builds while incurring a median delay of only 1 build in a median of 15% of failing builds.
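The paper's core decision can be sketched as a simple rule (a hypothetical simplification, not SmartBuildSkip's actual model): once a failure is observed, keep building until the failure sequence ends; in a passing streak, build only when a first failure is predicted.

```python
def should_run_build(prev_build_failed, predicted_first_failure_prob, threshold=0.5):
    """Decide whether CI should execute this build or skip it.

    After a failure we always build, so the remaining failures in the
    sequence are observed quickly; during a passing streak we build only
    when the predictor suspects a *first* failure. Raising the threshold
    saves more builds at the cost of observing some failures later.
    """
    if prev_build_failed:
        return True
    return predicted_first_failure_prob >= threshold
```

The threshold is the customization knob the abstract mentions: it trades builds saved against delay in observing failures.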

DOI: 10.1145/3377811.3380437


Practical fault detection in puppet programs

Authors: Sotiropoulos, Thodoris and Mitropoulos, Dimitris and Spinellis, Diomidis
Keywords: notifiers, ordering relationships, program analysis, puppet, system calls

Abstract

Puppet is a popular computer system configuration management tool. By providing abstractions that model system resources it allows administrators to set up computer systems in a reliable, predictable, and documented fashion. Its use suffers from two potential pitfalls. First, if ordering constraints are not correctly specified whenever a Puppet resource depends on another, the non-deterministic application of resources can lead to race conditions and consequent failures. Second, if a service is not tied to its resources (through the notification construct), the system may operate in a stale state whenever a resource gets modified. Such faults can degrade a computing infrastructure’s availability and functionality. We have developed an approach that identifies these issues through the analysis of a Puppet program and its system call trace. Specifically, a formal model for traces allows us to capture the interactions of Puppet resources with the file system. By analyzing these interactions we identify (1) resources that are related to each other (e.g., operate on the same file), and (2) resources that should act as notifiers so that changes are correctly propagated. We then check the relationships from the trace’s analysis against the program’s dependency graph: a representation containing all the ordering constraints and notifications declared in the program. If a mismatch is detected, our system reports a potential fault. We have evaluated our method on a large set of popular Puppet modules, and discovered 92 previously unknown issues in 33 modules. Performance benchmarking shows that our approach can analyze in seconds real-world configurations with a magnitude measured in thousands of lines and millions of system calls.
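The cross-check between the trace analysis and the program's dependency graph can be sketched as follows (a hypothetical Python reduction for illustration; the actual tool derives the relations from system-call traces):

```python
def potential_faults(trace_relations, declared_edges):
    """Report resource pairs that interact in the trace (e.g., touch the
    same file) but have no declared ordering or notification between them.
    Each such mismatch is a potential race or stale-state fault."""
    return [pair for pair in trace_relations if pair not in declared_edges]
```

For example, if the trace shows `File[conf]` interacting with both `Service[app]` and `Exec[reload]`, but the program only declares an ordering between the first pair, the second pair is reported.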

DOI: 10.1145/3377811.3380384


Learning from, understanding, and supporting DevOps artifacts for docker

Authors: Henkel, Jordan and Bird, Christian and Lahiri, Shuvendu K. and Reps, Thomas
Keywords: DevOps, docker, mining, static checking

Abstract

With the growing use of DevOps tools and frameworks, there is an increased need for tools and techniques that support more than code. The current state-of-the-art in static developer assistance for tools like Docker is limited to shallow syntactic validation. We identify three core challenges in the realm of learning from, understanding, and supporting developers writing DevOps artifacts: (i) nested languages in DevOps artifacts, (ii) rule mining, and (iii) the lack of semantic rule-based analysis. To address these challenges we introduce a toolset, binnacle, that enabled us to ingest 900,000 GitHub repositories. Focusing on Docker, we extracted approximately 178,000 unique Dockerfiles, and also identified a Gold Set of Dockerfiles written by Docker experts. We addressed challenge (i) by reducing the number of effectively uninterpretable nodes in our ASTs by over 80% via a technique we call phased parsing. To address challenge (ii), we introduced a novel rule-mining technique capable of recovering two-thirds of the rules in a benchmark we curated. Through this automated mining, we were able to recover 16 new rules that were not found during manual rule collection. To address challenge (iii), we manually collected a set of rules for Dockerfiles from commits to the files in the Gold Set. These rules encapsulate best practices, avoid docker build failures, and improve image size and build latency. We created an analyzer that used these rules, and found that, on average, Dockerfiles on GitHub violated the rules five times more frequently than the Dockerfiles in our Gold Set. We also found that industrial Dockerfiles fared no better than those sourced from GitHub. The learned rules and analyzer in binnacle can be used to aid developers in the IDE when creating Dockerfiles, and in a post-hoc fashion to identify issues in, and to improve, existing Dockerfiles.
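A rule-based Dockerfile analysis of the kind described can be sketched in a few lines of Python (the rule shown is a widely known image-size best practice used here for illustration; it is not necessarily one of binnacle's mined rules):

```python
def check_dockerfile(lines):
    """Flag RUN instructions that install apt packages without
    --no-install-recommends, a rule that keeps image size down."""
    issues = []
    for lineno, line in enumerate(lines, start=1):
        if "apt-get install" in line and "--no-install-recommends" not in line:
            issues.append((lineno, "apt-get install without --no-install-recommends"))
    return issues
```

A real analyzer like binnacle works over a parsed AST rather than raw lines, which is exactly why the paper's phased parsing of nested languages matters.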

DOI: 10.1145/3377811.3380406


Adapting requirements models to varying environments

Authors: Alrajeh, Dalal and Cailliau, Antoine and van Lamsweerde, Axel
Keywords: context-dependent requirements, formal verification, logic-based learning, requirements adaptation, requirements evolution

Abstract

The engineering of high-quality software requirements generally relies on properties and assumptions about the environment in which the software-to-be has to operate. Such properties and assumptions, referred to as environment conditions in this paper, are highly subject to change over time or from one software variant to another. As a consequence, the requirements engineered for a specific set of environment conditions may no longer be adequate, complete and consistent for another set. The paper addresses this problem through a tool-supported requirements adaptation technique. A goal-oriented requirements modelling framework is considered to make requirements’ refinements and dependencies on environment conditions explicit. When environment conditions change, an adapted goal model is computed that is correct with respect to the new environment conditions. The space of possible adaptations is not fixed a priori; the required changes are expected to meet one or more environment-independent goal(s) to be satisfied in any version of the system. The adapted goal model is generated using a new counterexample-guided learning procedure that ensures the correctness of the updated goal model, and prefers more local adaptations and more similar goal models.

DOI: 10.1145/3377811.3380927


Comparing formal tools for system design: a judgment study

Authors: Ferrari, Alessio and Mazzanti, Franco and Basile, Davide and Beek, Maurice H. ter and Fantechi, Alessandro
Keywords: empirical formal methods, empirical software engineering, formal methods, formal methods diversity, formal tools, human aspects of formal design, judgment study, moving-block system, railway

Abstract

Formal methods and tools have a long history of successful applications in the design of safety-critical railway products. However, most experiences have focused on applying a single method at a time, and little work has compared the applicability of the different available frameworks to the railway context. As a result, companies willing to introduce formal methods in their development process have little guidance on the selection of tools that could fit their needs. To address this need, this paper presents a comparison of 9 different formal tools, namely Atelier B, CADP, FDR4, NuSMV, ProB, Simulink, SPIN, UMC, and UPPAAL SMC. We performed a judgment study involving 17 experts with experience in formal methods applied to railways. In the study, some of the experts were asked to model a railway signaling problem (a moving-block train distancing system) with the different tools, and to provide feedback on their experience. The information produced was then synthesized, and the results were validated by the remaining experts. Based on the outcome of this process, we provide a synthesis that describes when to use a certain tool, and what problems modelers may face. Our experience shows that the different tools serve different purposes, and multiple formal methods are required to fully cover the needs of the railway system design process.

DOI: 10.1145/3377811.3380373


Debugging inputs

Authors: Kirschner, Lukas and Soremekun, Ezekiel and Zeller, Andreas
Keywords: none

Abstract

When a program fails to process an input, it need not be the program code that is at fault. It can also be that the input data is faulty, for instance as a result of data corruption. To get the data processed, one then has to debug the input data, that is, (1) identify which parts of the input data prevent processing, and (2) recover as much of the (valuable) input data as possible. In this paper, we present a general-purpose algorithm called ddmax that addresses these problems automatically. ddmax maximizes the subset of the input that can still be processed by the program, thus recovering and repairing as much data as possible; the difference between the original failing input and the “maximized” passing input includes all input fragments that could not be processed. To the best of our knowledge, ddmax is the first approach that fixes faults in the input data without requiring program analysis. In our evaluation, ddmax repaired about 69% of input files and recovered about 78% of data within one minute per input.
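The idea behind ddmax can be illustrated with a greedy sketch (a drastic simplification of the actual algorithm, which searches far more systematically): start from an empty input and re-add fragments that keep the input processable.

```python
def ddmax_sketch(fragments, processes):
    """Greedily grow the largest processable subset of the input.
    Fragments whose re-addition makes processing fail are the faulty
    parts; everything kept is recovered data."""
    kept = []
    for fragment in fragments:
        candidate = kept + [fragment]
        if processes(candidate):
            kept = candidate
    return kept
```

For example, with a "parser" that rejects any input containing `{`, `ddmax_sketch(list("a{b"), lambda s: "{" not in s)` recovers `['a', 'b']`: the corrupt brace is isolated as the failure-inducing fragment.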

DOI: 10.1145/3377811.3380329


Causal testing: understanding defects’ root causes

Authors: Johnson, Brittany and Brun, Yuriy and Meliou, Alexandra
Keywords: Holmes, automated test generation, causal testing, causality, software debugging, test fuzzing, theory of counterfactual causality

Abstract

Understanding the root cause of a defect is critical to isolating and repairing buggy behavior. We present Causal Testing, a new method of root-cause analysis that relies on the theory of counterfactual causality to identify a set of executions that likely hold key causal information necessary to understand and repair buggy behavior. Using the Defects4J benchmark, we find that Causal Testing could be applied to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause of the defect. A controlled experiment with 37 developers shows that Causal Testing improves participants’ ability to identify the cause of the defect from 80% of the time with standard testing tools to 86% of the time with Causal Testing. The participants report that Causal Testing provides useful information they cannot get using tools such as JUnit. Holmes, our prototype open-source Eclipse plugin implementation of Causal Testing, is available at http://holmes.cs.umass.edu/.
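The counterfactual idea can be sketched as searching for inputs minimally different from a failing one that pass, so the developer can contrast the pair (an illustrative sketch under assumed interfaces; Holmes' actual search and similarity criteria are more involved):

```python
def causal_pairs(failing_input, mutate, passes, attempts=100):
    """Collect (failing, passing) input pairs. A minimally mutated input
    that passes the test is a counterfactual witness: the difference
    between the two inputs points at what causes the buggy behavior."""
    pairs = []
    for _ in range(attempts):
        candidate = mutate(failing_input)
        if passes(candidate):
            pairs.append((failing_input, candidate))
    return pairs
```

The smaller the difference between the failing input and its passing counterpart, the more precisely the contrast localizes the root cause.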

DOI: 10.1145/3377811.3380377


Impact analysis of cross-project bugs on software ecosystems

Authors: Ma, Wanwangying and Chen, Lin and Zhang, Xiangyu and Feng, Yang and Xu, Zhaogui and Chen, Zhifei and Zhou, Yuming and Xu, Baowen
Keywords: bug impact, cross-project bugs, dependence analysis, software ecosystems, symbolic constraints

Abstract

Software projects are increasingly forming social-technical ecosystems within which individual projects rely on the infrastructures or functional components provided by other projects, leading to complex inter-dependencies. Through inter-project dependencies, a bug in an upstream project may have profound impact on a large number of downstream projects, resulting in cross-project bugs. This emerging type of bug has brought new challenges in bug fixing due to its unclear influence on downstream projects. In this paper, we present an approach to estimating the impact of a cross-project bug within its ecosystem by identifying the affected downstream modules (classes/methods). Note that a downstream project that uses a buggy upstream function may not be affected if the usage does not satisfy the failure-inducing preconditions. For a reported bug with the known root cause function and failure-inducing preconditions, we first collect the candidate downstream modules that call the upstream function through an ecosystem-wide dependence analysis. Then, the paths to the call sites of the buggy upstream function are encoded as symbolic constraints. Solving the constraints, together with the failure-inducing preconditions, identifies the affected downstream modules. Our evaluation of 31 existing upstream bugs on the scientific Python ecosystem containing 121 versions of 22 popular projects (with a total of 16 million LOC) shows that the approach is highly effective: from the 25,490 candidate downstream modules that invoke the buggy upstream functions, it identifies 1,132 modules where the upstream bugs can be triggered, pruning 95.6% of the candidates. The technique has no false negatives and an average false positive rate of 7.9%. Only 49 downstream modules (out of the 1,132 we found) were reported before to be affected.

DOI: 10.1145/3377811.3380442


Taming behavioral backward incompatibilities via cross-project testing and analysis

Authors: Chen, Lingchao and Hassan, Foyzul and Wang, Xiaoyin and Zhang, Lingming
Keywords: none

Abstract

In modern software development, software libraries play a crucial role in reducing software development effort and improving software quality. However, at the same time, the asynchronous upgrades of software libraries and client software projects often result in incompatibilities between different versions of libraries and client projects. When libraries evolve, it is often very challenging for library developers to maintain the so-called backward compatibility and keep all their external behavior untouched, and behavioral backward incompatibilities (BBIs) may occur. In practice, the regression test suites of library projects often fail to detect all BBIs. Therefore, in this paper, we propose DeBBI to detect BBIs via cross-project testing and analysis, i.e., using the test suites of various client projects to detect library BBIs. Since executing all the possible client projects can be extremely time consuming, DeBBI transforms the problem of cross-project BBI detection into a traditional information retrieval (IR) problem to execute the client projects with higher probability to detect BBIs earlier. Furthermore, DeBBI considers project diversity and test relevance information for even faster BBI detection. The experimental results show that DeBBI can reduce the end-to-end testing time for detecting the first and average unique BBIs by 99.1% and 70.8% for JDK compared to naive cross-project BBI detection. Also, DeBBI has been applied to other popular 3rd-party libraries. To date, DeBBI has detected 97 BBI bugs with 19 already confirmed as previously unknown bugs.

DOI: 10.1145/3377811.3380436


Watchman: monitoring dependency conflicts for Python library ecosystem

Authors: Wang, Ying and Wen, Ming and Liu, Yepang and Wang, Yibo and Li, Zhenming and Wang, Chao and Yu, Hai and Cheung, Shing-Chi and Xu, Chang and Zhu, Zhiliang
Keywords: Python, dependency conflicts, software ecosystem

Abstract

The PyPI ecosystem has indexed millions of Python libraries to allow developers to automatically download and install dependencies of their projects based on the specified version constraints. Despite the convenience brought by automation, version constraints in Python projects can easily conflict, resulting in build failures. We refer to such conflicts as Dependency Conflict (DC) issues. Although DC issues are common in Python projects, developers lack tool support to gain the comprehensive knowledge needed to diagnose the root causes of these issues. In this paper, we conducted an empirical study on 235 real-world DC issues. We studied the manifestation patterns and fixing strategies of these issues and found several key factors that can lead to DC issues and their regressions. Based on our findings, we designed and implemented Watchman, a technique to continuously monitor dependency conflicts for the PyPI ecosystem. In our evaluation, Watchman analyzed PyPI snapshots between 11 Jul 2019 and 16 Aug 2019, and found 117 potential DC issues. We reported these issues to the developers of the corresponding projects. So far, 63 issues have been confirmed, 38 of which have been quickly fixed by applying our suggested patches.
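At its simplest, the kind of conflict Watchman monitors arises when one package is transitively required at two incompatible versions. A toy Python sketch (a hypothetical reduction, far simpler than Watchman's actual diagnosis, which handles version ranges and ecosystem snapshots):

```python
def find_conflicts(requirements):
    """requirements: (package, pinned_version) pairs collected across the
    direct and transitive dependencies of a project. A package pinned to
    two different versions is a dependency conflict (DC) candidate."""
    pinned = {}
    conflicts = []
    for package, version in requirements:
        if package in pinned and pinned[package] != version:
            conflicts.append((package, pinned[package], version))
        else:
            pinned[package] = version
    return conflicts
```

Real resolvers must also reconcile range constraints (e.g. `>=1.17,<2.0`), which is what makes continuous ecosystem-wide monitoring worthwhile.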

DOI: 10.1145/3377811.3380426


One size does not fit all: a grounded theory and online survey study of developer preferences for security warning types

Authors: Danilova, Anastasia and Naiakshina, Alena and Smith, Matthew
Keywords: code security, developer security warnings, software development

Abstract

A wide range of tools exist to assist developers in creating secure software. Many of these tools, such as static analysis engines or security checkers included in compilers, use warnings to communicate security issues to developers. The effectiveness of these tools relies on developers heeding these warnings, and there are many ways in which these warnings could be displayed. Johnson et al. [46] conducted qualitative research and found that warning presentation and integration are main issues. We built on Johnson et al.'s work and examined what developers want from security warnings, including what form they should take and how they should integrate into their workflow and work context. To this end, we conducted a Grounded Theory study with 14 professional software developers and 12 computer science students as well as a focus group with 7 academic researchers to gather qualitative insights. To back up the theory developed from the qualitative research, we ran a quantitative survey with 50 professional software developers. Our results show that there is significant heterogeneity amongst developers and that no one warning type is preferred over all others. The context in which the warnings are shown is also highly relevant, indicating that it is likely to be beneficial if IDEs and other development tools become more flexible in their warning interactions with developers. Based on our findings, we provide concrete recommendations for both future research as well as how IDEs and other security tools can improve their interaction with developers.

DOI: 10.1145/3377811.3380387


Schrödinger’s security: opening the box on app developers’ security rationale

Authors: van der Linden, Dirk and Anthonysamy, Pauline and Nuseibeh, Bashar and Tun, Thein Than and Petre, Marian and Levine, Mark and Towse, John and Rashid, Awais
Keywords: none

Abstract

Research has established the wide variety of security failures in mobile apps, their consequences, and how app developers introduce or exacerbate them. What is not well known is why developers do so: what is the rationale underpinning the decisions they make which eventually strengthen or weaken app security? This is all the more complicated in modern app development’s increasingly diverse demographic: growing numbers of independent, solo, or small-team developers who do not have the organizational structures and support that larger software development houses enjoy. Through two studies, we open the box on developer rationale by performing a holistic analysis of the rationale underpinning various activities in which app developers engage when developing an app. The first study does so through a task-based study with app developers (N=44) incorporating six distinct tasks for which this developer demographic must take responsibility: setting up a development environment, reviewing code, seeking help, seeking testers, selecting an advertisement SDK, and software licensing. We found that, while at first glance participants seemed to prioritize security in several activities, only in the code task was such prioritization underpinned by a security rationale, indicating that development behavior perceived to be secure may only be an illusion until the box is opened on its rationale. The second study confirms these findings through a wider survey of app developers (N=274) investigating to what extent they find the activities of the task-based study to affect their app’s security. In line with the task-based study, we found that developers perceived actively writing code and actively using external SDKs as the only security-relevant activities, while similarly disregarding the impact of other activities on app security. Our results suggest the need for a stronger focus on the tasks and activities surrounding the coding task, all of which need to be underpinned by a security rationale. Without such a holistic focus, developers may write “secure code” but not produce “secure apps”.

DOI: 10.1145/3377811.3380394


How software practitioners use informal local meetups to share software engineering knowledge

Authors: Ingram, Claire and Drachen, Anders
Keywords: developer communities, informal networks, knowledge sharing, meetup, tacit knowledge

Abstract

Informal technology ‘meetups’ have become an important aspect of the software development community, engaging many thousands of practitioners on a regular basis. However, although local technology meetups are well-attended by developers, little is known about their motivations for participating, the type or usefulness of information that they acquire, and how local meetups might differ from and complement other available communication channels for software engineering information. We interviewed the leaders of technology-oriented Meetup groups, and collected quantitative information via a survey distributed to participants in technology-oriented groups. Our findings suggest that participants in these groups are primarily experienced software practitioners, who use Meetup for staying abreast of new developments, building local networks and achieving transfer of rich tacit knowledge with peers to improve their practice. We also suggest that face to face meetings are useful forums for exchanging tacit knowledge and contextual information needed for software engineering practice.

DOI: 10.1145/3377811.3380333


Predicting developers’ negative feelings about code review

Authors: Egelman, Carolyn D. and Murphy-Hill, Emerson and Kammer, Elizabeth and Hodges, Margaret Morrow and Green, Collin and Jaspan, Ciera and Lin, James
Keywords: code review, interpersonal conflict

Abstract

During code review, developers critically examine each others’ code to improve its quality, share knowledge, and ensure conformance to coding standards. In the process, developers may have negative interpersonal interactions with their peers, which can lead to frustration and stress; these negative interactions may ultimately result in developers abandoning projects. In this mixed-methods study at one company, we surveyed 1,317 developers to characterize the negative experiences and cross-referenced the results with objective data from code review logs to predict these experiences. Our results suggest that such negative experiences, which we call “pushback”, are relatively rare in practice, but have negative repercussions when they occur. Our metrics can predict feelings of pushback with high recall but low precision, making them potentially appropriate for highlighting interactions that may benefit from a self-intervention.

DOI: 10.1145/3377811.3380414


Near-duplicate detection in web app model inference

Authors: Yandrapally, Rahulkrishna and Stocco, Andrea and Mesbah, Ali
Keywords: model-based testing, near-duplicate detection, reverse engineering

Abstract

Automated web testing techniques infer models from a given web app, which are used for test generation. From a testing viewpoint, such an inferred model should contain the minimal set of states that are distinct, yet adequately cover the app’s main functionalities. In practice, models inferred automatically are affected by near-duplicates, i.e., replicas of the same functional webpage differing only by small, insignificant changes. We present the first study of near-duplicate detection algorithms used within web app model inference. We first characterize functional near-duplicates by classifying a random sample of state-pairs, from 493k pairs of webpages obtained from over 6,000 websites, into three categories, namely clone, near-duplicate, and distinct. We systematically compute thresholds that define the boundaries of these categories for each detection technique. We then use these thresholds to evaluate 10 near-duplicate detection techniques from three different domains, namely, information retrieval, web testing, and computer vision on nine open-source web apps. Our study highlights the challenges posed in automatically inferring a model for any given web app. Our findings show that even with the best thresholds, no algorithm is able to accurately detect all functional near-duplicates within apps, without sacrificing coverage.
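The threshold-based classification into clone / near-duplicate / distinct can be sketched with a simple token-set (Jaccard) similarity standing in for the ten techniques the study compares (the thresholds below are illustrative, not the computed ones):

```python
def jaccard(page_a, page_b):
    """Token-set similarity between two page representations."""
    a, b = set(page_a.split()), set(page_b.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def classify(page_a, page_b, t_clone=0.95, t_near=0.7):
    """Map a similarity score to the study's three categories."""
    score = jaccard(page_a, page_b)
    if score >= t_clone:
        return "clone"
    if score >= t_near:
        return "near-duplicate"
    return "distinct"
```

The study's contribution is precisely in computing `t_clone`/`t_near` per technique from labeled state-pairs, since a threshold that works for one similarity measure rarely transfers to another.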

DOI: 10.1145/3377811.3380416


Extracting taint specifications for JavaScript libraries

Authors: Staicu, Cristian-Alexandru and Torp, Martin Toldam and Schäfer, Max and Møller, Anders and Pradel, Michael
Keywords: dynamic analysis, static analysis, taint analysis

Abstract

Modern JavaScript applications extensively depend on third-party libraries. Especially for the Node.js platform, vulnerabilities can have severe consequences to the security of applications, resulting in, e.g., cross-site scripting and command injection attacks. Existing static analysis tools that have been developed to automatically detect such issues are either too coarse-grained, looking only at package dependency structure while ignoring dataflow, or rely on manually written taint specifications for the most popular libraries to ensure analysis scalability. In this work, we propose a technique for automatically extracting taint specifications for JavaScript libraries, based on a dynamic analysis that leverages the existing test suites of the libraries and their available clients in the npm repository. Due to the dynamic nature of JavaScript, mapping observations from dynamic analysis to taint specifications that fit into a static analysis is non-trivial. Our main insight is that this challenge can be addressed by a combination of an access path mechanism that identifies entry and exit points, and the use of membranes around the libraries of interest. We show that our approach is effective at inferring useful taint specifications at scale. Our prototype tool automatically extracts 146 additional taint sinks and 7,840 propagation summaries spanning 1,393 npm modules. By integrating the extracted specifications into a commercial, state-of-the-art static analysis, 136 new alerts are produced, many of which correspond to likely security vulnerabilities. Moreover, many important specifications that were originally manually written are among the ones that our tool can now extract automatically.

DOI: 10.1145/3377811.3380390


SLACC: simion-based language agnostic code clones

Authors: Mathew, George and Parnin, Chris and Stolee, Kathryn T
Keywords: cross-language analysis, semantic code clone detection

Abstract

Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. However, identifying cross-language clones presents special challenges to the clone detection problem. A lack of a common underlying representation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis framework replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior. In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/output behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; as a result, we achieve better clusters than prior attempts. Since clusters are generated based on input/output behavior, SLACC supports cross-language clone detection. As an added challenge, we target a statically typed language, Java, and a dynamically typed language, Python. Compared to HitoshiIO, a recent clone detection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%). This is the first work to perform clone detection for dynamically typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying representation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools.
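The input/output clustering at the heart of the approach can be sketched as follows (an illustrative Python reduction; SLACC itself executes Java and Python snippets and amplifies the shared input set):

```python
def behavior_signature(snippet, inputs):
    """Run a snippet (here a callable) on shared inputs; the tuple of
    outputs is its behavioral fingerprint. Exceptions are recorded,
    not raised, so crashing snippets still get a signature."""
    outputs = []
    for value in inputs:
        try:
            outputs.append(snippet(value))
        except Exception:
            outputs.append("<error>")
    return tuple(outputs)

def cluster_by_behavior(snippets, inputs):
    """Snippets with identical fingerprints land in the same cluster.
    Because only I/O behavior is compared, the snippets may come from
    different languages with no shared representation."""
    clusters = {}
    for name, snippet in snippets.items():
        clusters.setdefault(behavior_signature(snippet, inputs), []).append(name)
    return list(clusters.values())
```

Amplifying `inputs` reduces the chance that two functionally different snippets coincidentally share a fingerprint, which is the precision lever the abstract describes.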

DOI: 10.1145/3377811.3380407


Finding client-side business flow tampering vulnerabilities

Authors: Kim, I Luk and Zheng, Yunhui and Park, Hogun and Wang, Weihang and You, Wei and Aafer, Yousra and Zhang, Xiangyu
Keywords: JavaScript, business flow tampering, dynamic analysis, vulnerability detection

Abstract

The sheer complexity of web applications leaves open a large attack surface of business logic. Particularly, in some scenarios, developers have to expose a portion of the logic to the client side in order to coordinate multiple parties (e.g. merchants, client users, and third-party payment services) involved in a business process. However, such client-side code can be tampered with on the fly, leading to business logic perturbations and financial loss. Although developers are familiar with the principle that the client should never be trusted, given the size and complexity of client-side code that may even be incorporated from third parties, it is extremely challenging to understand and pinpoint the vulnerability. To this end, we investigate client-side business flow tampering vulnerabilities and develop a dynamic-analysis-based approach to automatically identify such vulnerabilities. We evaluate our technique on 200 popular real-world websites. With negligible overhead, we have successfully identified 27 unique vulnerabilities on 23 websites, such as New York Times, HBO, and YouTube, where an adversary can interrupt business logic to bypass paywalls, disable adblocker detection, earn reward points illicitly, etc.

DOI: 10.1145/3377811.3380355


Securing unsafe rust programs with XRust

作者: Liu, Peiming and Zhao, Gang and Huang, Jeff
关键词: No keywords

Abstract

Rust is a promising systems programming language that embraces both high-level memory safety and low-level resource manipulation. However, the dark side of Rust, unsafe Rust, leaves a large security hole as it bypasses the Rust type system in order to support low-level operations. Recently, several real-world memory corruption vulnerabilities have been discovered in Rust’s standard libraries. We present XRust, a new technique that mitigates the security threat of unsafe Rust by ensuring the integrity of data flow from unsafe Rust code to safe Rust code. The cornerstone of XRust is a novel heap allocator that isolates the memory of unsafe Rust from that accessed only in safe Rust, and prevents any cross-region memory corruption. Our design of XRust supports both single- and multi-threaded Rust programs. Our extensive experiments on real-world Rust applications and standard libraries show that XRust is both highly efficient and effective in practice.

DOI: 10.1145/3377811.3380325


Is rust used safely by software developers?

作者: Evans, Ana Nora and Campbell, Bradford and Soffa, Mary Lou
关键词: No keywords

Abstract

Rust, an emerging programming language with explosive growth, provides a robust type system that enables programmers to write memory-safe and data-race free code. To allow access to a machine’s hardware and to support low-level performance optimizations, a second language, Unsafe Rust, is embedded in Rust. It contains support for operations that are difficult to statically check, such as C-style pointers for access to arbitrary memory locations and mutable global variables. When a program uses these features, the compiler is unable to statically guarantee the safety properties Rust promotes. In this work, we perform a large-scale empirical study to explore how software developers are using Unsafe Rust in real-world Rust libraries and applications. Our results indicate that software engineers use the keyword unsafe in less than 30% of Rust libraries, but more than half cannot be entirely statically checked by the Rust compiler because of Unsafe Rust hidden somewhere in a library’s call chain. We conclude that although the use of the keyword unsafe is limited, the propagation of unsafeness offers a challenge to the claim of Rust as a memory-safe language. Furthermore, we recommend changes to the Rust compiler and to the central Rust repository’s interface to help Rust software developers be aware of when their Rust code is unsafe.

DOI: 10.1145/3377811.3380413


Burn after reading: a shadow stack with microsecond-level runtime rerandomization for protecting return addresses

作者: Zou, Changwei and Xue, Jingling
关键词: control flow integrity, return-oriented programming, runtime rerandomization, shadow stack

Abstract

Return-oriented programming (ROP) is an effective code-reuse attack in which short code sequences (i.e., gadgets) ending in a ret instruction are found within existing binaries and then executed by taking control of the call stack. The shadow stack, control flow integrity (CFI) and code (re)randomization are three popular techniques for protecting programs against return address overwrites. However, existing runtime rerandomization techniques operate on concrete return addresses, requiring expensive pointer tracking. By adding one level of indirection, we introduce BarRA, the first shadow stack mechanism that applies continuous runtime rerandomization to abstract return addresses for protecting their corresponding concrete return addresses (protected also by CFI), thus avoiding expensive pointer tracking. As a nice side-effect, BarRA naturally combines the shadow stack, CFI and runtime rerandomization in the same framework. The key novelty of BarRA, however, is that once some abstract return addresses are leaked, BarRA will enforce the burn-after-reading property by rerandomizing the mapping from the abstract to the concrete return address space in the order of microseconds instead of the seconds required for rerandomizing a concrete return address space. As a result, BarRA can be used as a superior replacement for the shadow stack, as demonstrated by comparing both using the 19 C/C++ benchmarks in SPEC CPU2006 (totalling 2,047,447 LOC) and analyzing a proof-of-concept attack, provided that we can tolerate some slight binary code size increases (by an average of 29.44%) and are willing to use 8MB of dedicated memory for holding up to 2^20 return addresses (on a 64-bit platform). Under an information leakage attack (for some return addresses), the shadow stack is always vulnerable but BarRA is significantly more resilient (by reducing an attacker’s success rate to 1/2^20 on average). In terms of the average performance overhead introduced, both are comparable: 6.09% (BarRA) vs. 5.38% (the shadow stack).
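The indirection BarRA adds can be modeled in Python: the shadow stack holds only abstract indices, a secret permutation maps them to slots holding the concrete return addresses, and a suspected leak is answered by reshuffling just the permutation. This is a conceptual sketch, not the actual in-memory mechanism:

```python
import random

class BarRAShadowStack:
    """Sketch of BarRA's indirection: the shadow stack holds abstract
    indices; a secret permutation maps them to slots holding concrete
    return addresses. Rerandomizing only the permutation is cheap."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity      # concrete return addresses
        self.perm = list(range(capacity))   # abstract index -> slot
        random.shuffle(self.perm)
        self.free = list(range(capacity))
        self.stack = []                     # abstract indices only

    def push(self, ret_addr):
        slot = self.free.pop()
        self.slots[slot] = ret_addr
        self.stack.append(self.perm.index(slot))

    def pop(self):
        slot = self.perm[self.stack.pop()]
        self.free.append(slot)
        return self.slots[slot]

    def rerandomize(self):
        # Burn after reading: remap abstract indices; the data never moves.
        old, new = self.perm, self.perm[:]
        random.shuffle(new)
        self.stack = [new.index(old[a]) for a in self.stack]
        self.perm = new

random.seed(1)
ss = BarRAShadowStack()
ss.push(0x401000)
ss.push(0x402000)
leaked = ss.stack[-1]   # an attacker reads an abstract index...
ss.rerandomize()        # ...which is stale microseconds later
a, b = ss.pop(), ss.pop()
print(hex(a), hex(b))
```

Because only the abstract-to-concrete mapping changes, the leaked abstract index no longer points where the attacker expects, while legitimate pops remain correct.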

DOI: 10.1145/3377811.3380439


SAVER: scalable, precise, and safe memory-error repair

作者: Hong, Seongjoon and Lee, Junhee and Lee, Jeongsoo and Oh, Hakjoo
关键词: debugging, memory errors, program analysis, program repair

Abstract

We present SAVER, a new memory-error repair technique for C programs. Memory errors such as memory leaks, double-frees, and use-after-frees are highly prevalent and fixing them requires significant effort. Automated program repair techniques hold the promise of reducing this burden but the state-of-the-art is still unsatisfactory. In particular, no existing techniques are able to fix those errors in a scalable, precise, and safe way, all of which are required for a truly practical tool. SAVER aims to address these shortcomings. To this end, we propose a method based on a novel representation of the program called object flow graph, which summarizes the program’s heap-related behavior using static analysis. We show that fixing memory errors can be formulated as a graph labeling problem over the object flow graph and present an efficient algorithm. We evaluated SAVER in combination with Infer, an industrial-strength static bug-finder, and showed that 74% of the reported errors can be fixed automatically for a range of open-source C programs.
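The graph view of leak repair can be sketched on a toy object flow graph; the node names and graph shape below are made up for illustration, and SAVER's actual labeling algorithm is considerably more sophisticated:

```python
def find_leak_paths(graph, alloc, exits, frees):
    """Enumerate alloc-to-exit paths in a toy object flow graph and
    return those on which the object is never freed."""
    leaks = []

    def dfs(node, path):
        path = path + [node]
        if node in exits:
            if not any(n in frees for n in path):
                leaks.append(path)
            return
        for nxt in graph.get(node, []):
            dfs(nxt, path)

    dfs(alloc, [])
    return leaks

# Hypothetical flow: the object is freed on the normal branch only,
# so the error branch leaks and is the place to insert a free().
graph = {
    "alloc": ["use"],
    "use": ["err", "ok"],
    "err": ["exit"],
    "ok": ["free_p"],
    "free_p": ["exit"],
}
leaks = find_leak_paths(graph, "alloc", {"exit"}, {"free_p"})
print(leaks)
```

Labeling the nodes on leaking paths where a deallocation can be safely inserted is, in essence, the graph labeling problem the paper formulates.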

DOI: 10.1145/3377811.3380323


Revealing injection vulnerabilities by leveraging existing tests

作者: Hough, Katherine and Welearegai, Gebrehiwet and Hammer, Christian and Bell, Jonathan
关键词: injection attacks, taint tracking, vulnerability testing

Abstract

Code injection attacks, like the one used in the high-profile 2017 Equifax breach, have become increasingly common, now ranking #1 on OWASP’s list of critical web application vulnerabilities. Static analyses for detecting these vulnerabilities can overwhelm developers with false positive reports. Meanwhile, most dynamic analyses rely on detecting vulnerabilities as they occur in the field, which can introduce a high performance overhead in production code. This paper describes a new approach for detecting injection vulnerabilities in applications by harnessing the combined power of human developers’ test suites and automated dynamic analysis. Our new approach, Rivulet, monitors the execution of developer-written functional tests in order to detect information flows that may be vulnerable to attack. Then, Rivulet uses a white-box test generation technique to repurpose those functional tests to check if any vulnerable flow could be exploited. When applied to the version of Apache Struts exploited in the 2017 Equifax attack, Rivulet quickly identifies the vulnerability, leveraging only the tests that existed in Struts at that time. We compared Rivulet to the state-of-the-art static vulnerability detector Julia on benchmarks, finding that Rivulet outperformed Julia in both false positives and false negatives. We also used Rivulet to detect new vulnerabilities.
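Rivulet's two phases, tracking taint through an existing functional test and then re-running that test with an attack payload, can be sketched as follows; the `Tainted` wrapper and the vulnerable function are toy stand-ins, not Rivulet's actual instrumentation:

```python
class Tainted:
    """Minimal taint wrapper: string concatenation keeps the mark."""
    def __init__(self, s):
        self.s = s
    def __add__(self, other):
        return Tainted(self.s + getattr(other, "s", other))
    def __radd__(self, other):
        return Tainted(other + self.s)
    def __contains__(self, item):
        return item in self.s

def render_greeting(user_input):
    # Hypothetical app code: echoes user input into HTML unsanitized.
    return "<p>Hello " + user_input + "</p>"

# Phase 1: run the developer's functional test with a tainted input and
# check whether taint reaches the HTML sink.
out = render_greeting(Tainted("alice"))
flow_is_tainted = isinstance(out, Tainted)

# Phase 2: repurpose the same test with an attack payload to confirm the
# flow is exploitable rather than a false positive.
payload = "<script>alert(1)</script>"
exploitable = flow_is_tainted and payload in render_greeting(Tainted(payload))
print(flow_is_tainted, exploitable)
```

The second phase is what keeps false positives down: a tainted flow is only reported if the payload actually survives to the sink.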

DOI: 10.1145/3377811.3380326


RoScript: a visual script driven truly non-intrusive robotic testing system for touch screen applications

作者: Qian, Ju and Shang, Zhengyu and Yan, Shuoyan and Wang, Yan and Chen, Lin
关键词: GUI testing, computer vision, non-intrusive, robot, test automation

Abstract

Existing intrusive test automation techniques for touch screen applications (e.g., Appium and Sikuli) are difficult to apply to many closed or uncommon systems, such as a GoPro. Being non-intrusive can largely extend the application scope of test automation techniques. To this end, this paper presents RoScript, a truly non-intrusive test-script-driven robotic testing system for test automation of touch screen applications. RoScript leverages visual test scripts to express GUI actions on a touch screen application and uses a physical robot to drive automated test execution. To reduce the test script creation cost, a non-intrusive computer vision based technique is also introduced in RoScript to automatically record touch screen actions into test scripts from videos of human actions on the device under test. RoScript is applicable to touch screen applications running on almost arbitrary platforms, whatever the underlying operating systems or GUI frameworks are. We conducted experiments applying it to automate the testing of 21 touch screen applications on 6 different devices. The results show that RoScript is highly usable. In the experiments, it successfully automated 104 test scenarios containing over 650 different GUI actions on the subject applications. RoScript accurately performed GUI actions on over 90% of the test script executions and accurately recorded about 85% of human screen click actions into test code.

DOI: 10.1145/3377811.3380431


Translating video recordings of mobile app usages into replayable scenarios

作者: Bernal-Cárdenas, Carlos and Cooper, Nathan and Moran, Kevin and Chaparro, Oscar and Marcus, Andrian and Poshyvanyk, Denys
关键词: bug reporting, object detection, screen recordings

Abstract

Screen recordings of mobile applications are easy to obtain and capture a wealth of information pertinent to software developers (e.g., bugs or feature requests), making them a popular mechanism for crowdsourced app feedback. Thus, these videos are becoming a common artifact that developers must manage. In light of unique mobile development constraints, including swift release cycles and rapidly evolving platforms, automated techniques for analyzing all types of rich software artifacts provide benefit to mobile developers. Unfortunately, automatically analyzing screen recordings presents serious challenges, due to their graphical nature, compared to other types of (textual) artifacts. To address these challenges, this paper introduces V2S, a lightweight, automated approach for translating video recordings of Android app usages into replayable scenarios. V2S is based primarily on computer vision techniques and adapts recent solutions for object detection and image classification to detect and classify user actions captured in a video, and convert these into a replayable test scenario. We performed an extensive evaluation of V2S involving 175 videos depicting 3,534 GUI-based actions collected from users exercising features and reproducing bugs from over 80 popular Android apps. Our results illustrate that V2S can accurately replay scenarios from screen recordings, and is capable of reproducing ≈89% of our collected videos with minimal overhead. A case study with three industrial partners illustrates the potential usefulness of V2S from the viewpoint of developers.

DOI: 10.1145/3377811.3380328


Unblind your apps: predicting natural-language labels for mobile GUI components by deep learning

作者: Chen, Jieshan and Chen, Chunyang and Xing, Zhenchang and Xu, Xiwei and Zhu, Liming and Li, Guoqiang and Wang, Jinshui
关键词: accessibility, content description, image-based buttons, neural networks, user interface

Abstract

According to the World Health Organization (WHO), it is estimated that approximately 1.3 billion people live with some form of vision impairment globally, of whom 36 million are blind. Due to their disability, integrating this minority group into society is a challenging problem. The recent rise of smart mobile phones provides a new solution by enabling blind users’ convenient access to the information and services needed for understanding the world. Users with vision impairment can adopt the screen reader embedded in the mobile operating systems to read the content of each screen within the app, and use gestures to interact with the phone. However, the prerequisite of using screen readers is that developers have to add natural-language labels to the image-based components when they are developing the app. Unfortunately, more than 77% of apps have issues of missing labels, according to our analysis of 10,408 Android apps. Most of these issues are caused by developers’ lack of awareness and knowledge of this user group. And even if developers want to add labels to UI components, they may not come up with concise and clear descriptions, as most of them have no visual impairment themselves. To overcome these challenges, we develop a deep-learning based model, called LabelDroid, to automatically predict the labels of image-based buttons by learning from large-scale commercial apps in Google Play. The experimental results show that our model can make accurate predictions and the generated labels are of higher quality than those from real Android developers.
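The missing-label analysis that motivates LabelDroid can be sketched as a walk over a view hierarchy; the dictionary-based UI tree below is an illustrative stand-in for real Android layouts:

```python
def missing_labels(node, path="root"):
    """Walk a toy Android view hierarchy and collect image-based
    buttons that a screen reader cannot announce."""
    found = []
    if node.get("class") == "ImageButton" and not node.get("contentDescription"):
        found.append(path)
    for i, child in enumerate(node.get("children", [])):
        found += missing_labels(child, f"{path}/{i}")
    return found

screen = {
    "class": "LinearLayout",
    "children": [
        {"class": "ImageButton", "contentDescription": "Share"},
        {"class": "ImageButton"},                  # unlabeled: flagged
        {"class": "TextView", "text": "Settings"},
    ],
}
flagged = missing_labels(screen)
print(flagged)
```

LabelDroid's contribution is to go a step further and predict a natural-language `contentDescription` for each flagged button instead of merely reporting it.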

DOI: 10.1145/3377811.3380327


SLEMI: equivalence modulo input (EMI) based mutation of CPS models for finding compiler bugs in Simulink

作者: Chowdhury, Shafiul Azam and Shrestha, Sohil Lal and Johnson, Taylor T. and Csallner, Christoph
关键词: cyber-physical systems, differential testing, equivalence modulo input, model mutation, simulink

Abstract

Finding bugs in commercial cyber-physical system development tools (or “model-based design” tools) such as MathWorks’s Simulink is important in practice, as these tools are widely used to generate embedded code that gets deployed in safety-critical applications such as cars and planes. Equivalence Modulo Input (EMI) based mutation is a new twist on differential testing that promises lower use of computational resources and has already been successful at finding bugs in compilers for procedural languages. To provide EMI-based mutation for differential testing of cyber-physical system (CPS) development tools, this paper develops several novel mutation techniques. These techniques deal with CPS language features that are not found in procedural languages, such as an explicit notion of execution time and zombie code, which combines properties of live and dead procedural code. In our experiments the most closely related work (SLforge) found two bugs in the Simulink tool. In comparison, SLEMI found a super-set of issues, including 9 confirmed as bugs by MathWorks Support.
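The EMI idea, transposed here to a toy block-structured "model", is that mutating code which is dead for a given input must not change the output for that input; any observed difference exposes a tool bug. A minimal Python sketch, with an invented model and mutation:

```python
def run(model, x):
    """Execute a toy block-structured model; return the output and the
    set of blocks that were live for this input."""
    live = set()
    for i, (guard, op) in enumerate(model):
        if guard(x):
            live.add(i)
            x = op(x)
    return x, live

model = [
    (lambda v: v > 0, lambda v: v * 2),
    (lambda v: v < 0, lambda v: v - 999),   # dead for positive inputs
]

out, live = run(model, 5)
# EMI mutation: rewrite every block that was NOT live for this input;
# a correct tool must still produce the same output for the same input.
mutant = [(g, op if i in live else (lambda v: v + 123))
          for i, (g, op) in enumerate(model)]
mut_out, _ = run(mutant, 5)
print(out, mut_out, out == mut_out)
```

SLEMI's novelty lies in extending this recipe to CPS-specific features such as execution time and zombie code, which have no analogue in procedural programs.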

DOI: 10.1145/3377811.3380381


DeepBillboard: systematic physical-world testing of autonomous driving systems

作者: Zhou, Husheng and Li, Wei and Kong, Zelun and Guo, Junfeng and Zhang, Yuqun and Yu, Bei and Zhang, Lingming and Liu, Cong
关键词: No keywords

Abstract

Deep Neural Networks (DNNs) have been widely applied in autonomous systems such as self-driving vehicles. Recently, DNN testing has been intensively studied to automatically generate adversarial examples, which inject small-magnitude perturbations into inputs to test DNNs under extreme situations. While existing testing techniques prove to be effective, particularly for autonomous driving, they mostly focus on generating digital adversarial perturbations, e.g., changing image pixels, which may never happen in the physical world. Thus, there is a critical missing piece in the literature on autonomous driving testing: understanding and exploiting both digital and physical adversarial perturbation generation for impacting steering decisions. In this paper, we propose a systematic physical-world testing approach, namely DeepBillboard, targeting a common and practical driving scenario: drive-by billboards. DeepBillboard is capable of generating a robust and resilient printable adversarial billboard test, which works under dynamic changing driving conditions including viewing angle, distance, and lighting. The objective is to maximize the possibility, degree, and duration of the steering-angle errors of an autonomous vehicle driving by our generated adversarial billboard. We have extensively evaluated the efficacy and robustness of DeepBillboard by conducting both experiments with digital perturbations and physical-world case studies. The digital experimental results show that DeepBillboard is effective for various steering models and scenes. Furthermore, the physical case studies demonstrate that DeepBillboard is sufficiently robust and resilient for generating physical-world adversarial billboard tests for real-world driving under various weather conditions, misleading the average steering angle by up to 26.44 degrees.
To the best of our knowledge, this is the first study demonstrating the possibility of generating realistic and continuous physical-world tests for practical autonomous driving systems; moreover, DeepBillboard can be directly generalized to a variety of other physical entities/surfaces along the curbside, e.g., a graffiti painted on a wall.
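The digital side of such an attack can be sketched as black-box hill climbing over billboard pixels against a toy steering model; the model and one-dimensional "image" below are invented, and DeepBillboard itself optimizes real DNN steering models over entire video sequences:

```python
import random

def steering(image):
    """Toy steering model: steers by the brightness difference between
    the right and left half of a 1-D 'image'."""
    mid = len(image) // 2
    return sum(image[mid:]) - sum(image[:mid])

def attack_billboard(image, region, budget=200, step=5):
    """Hill-climb pixel values inside the billboard region to maximize
    deviation from the clean steering angle."""
    clean = steering(image)
    img = list(image)
    for _ in range(budget):
        i = random.choice(region)
        trial = list(img)
        trial[i] = max(0, min(255, trial[i] + random.choice([-step, step])))
        if abs(steering(trial) - clean) > abs(steering(img) - clean):
            img = trial   # keep only perturbations that worsen steering
    return img, abs(steering(img) - clean)

random.seed(42)
image = [100] * 8       # clean scene: steering angle 0
billboard = [5, 6, 7]   # only these pixels are printable on the billboard
adv, error = attack_billboard(image, billboard)
print(error > 0)
```

Restricting the perturbation to the billboard region is what makes the attack physically realizable: the rest of the scene is never touched.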

DOI: 10.1145/3377811.3380422


Misbehaviour prediction for autonomous driving systems

作者: Stocco, Andrea and Weiss, Michael and Calzana, Marco and Tonella, Paolo
关键词: anomaly detection, deep learning, misbehaviour prediction, testing

Abstract

Deep Neural Networks (DNNs) are the core component of modern autonomous driving systems. To date, it is still unrealistic that a DNN will generalize correctly to all driving conditions. Current testing techniques consist of offline solutions that identify adversarial or corner cases for improving the training phase. In this paper, we address the problem of estimating the confidence of DNNs in response to unexpected execution contexts with the purpose of predicting potential safety-critical misbehaviours and enabling online healing of DNN-based vehicles. Our approach SelfOracle is based on a novel concept of self-assessment oracle, which monitors the DNN confidence at runtime, to predict unsupported driving scenarios in advance. SelfOracle uses autoencoder- and time-series-based anomaly detection to reconstruct the driving scenarios seen by the car, and to determine the confidence boundary between normal and unsupported conditions. In our empirical assessment, we evaluated the effectiveness of different variants of SelfOracle at predicting injected anomalous driving contexts, using DNN models and a simulation environment from Udacity. Results show that, overall, SelfOracle can predict 77% of misbehaviours, up to six seconds in advance, outperforming the online input validation approach of DeepRoad.
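The self-assessment-oracle idea, flagging frames whose reconstruction error exceeds a threshold learned from normal driving, can be sketched with a deliberately trivial "reconstructor"; SelfOracle itself uses autoencoders and time-series models, and all frames below are invented:

```python
def reconstruction_error(frame, mean):
    return sum((a - b) ** 2 for a, b in zip(frame, mean))

def fit(normal_frames):
    """Train a trivial self-oracle: the mean of normal frames acts as the
    'reconstruction'; the threshold is the worst error seen on normal data."""
    mean = [sum(col) / len(normal_frames) for col in zip(*normal_frames)]
    return mean, max(reconstruction_error(f, mean) for f in normal_frames)

normal = [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8]]   # frames from normal driving
mean, threshold = fit(normal)

def misbehaviour_predicted(frame):
    return reconstruction_error(frame, mean) > threshold

normal_ok = misbehaviour_predicted([1.1, 2.0])   # close to training data
anomaly = misbehaviour_predicted([9.0, -3.0])    # unsupported context
print(normal_ok, anomaly)
```

Frames the oracle cannot reconstruct well are, by construction, frames unlike anything seen in training, which is exactly when the driving DNN's own output should not be trusted.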

DOI: 10.1145/3377811.3380353


Approximation-refinement testing of compute-intensive cyber-physical models: an approach based on system identification

作者: Menghi, Claudio and Nejati, Shiva and Briand, Lionel and Parache, Yago Isasi
关键词: cyber-physical systems, falsification, model testing, robustness, search-based testing

Abstract

Black-box testing has been extensively applied to test models of Cyber-Physical systems (CPS) since these models are not often amenable to static and symbolic testing and verification. Black-box testing, however, requires executing the model under test on a large number of candidate test inputs. This poses a challenge for a large and practically-important category of CPS models, known as compute-intensive CPS (CI-CPS) models, where a single simulation may take hours to complete. We propose a novel approach, namely ARIsTEO, to enable effective and efficient testing of CI-CPS models. Our approach embeds black-box testing into an iterative approximation-refinement loop. At the start, some sampled inputs and outputs of the CI-CPS model under test are used to generate a surrogate model that is faster to execute and can be subjected to black-box testing. Any failure-revealing test identified for the surrogate model is checked on the original model. If spurious, the test results are used to refine the surrogate model to be tested again. Otherwise, the test reveals a valid failure. We evaluated ARIsTEO by comparing it with S-Taliro, an open-source and industry-strength tool for testing CPS models. Our results, obtained based on five publicly-available CPS models, show that, on average, ARIsTEO is able to find 24% more requirements violations than S-Taliro and is 31% faster than S-Taliro in finding those violations. We further assessed the effectiveness and efficiency of ARIsTEO on a large industrial case study from the satellite domain. In contrast to S-Taliro, ARIsTEO successfully tested two different versions of this model and could identify three requirements violations, requiring four hours, on average, for each violation.
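The approximation-refinement loop can be sketched with a linear surrogate standing in for a system-identification model; the "expensive" model, candidates, and threshold below are illustrative, not ARIsTEO's actual machinery:

```python
def expensive_model(x):
    # Stand-in for a compute-intensive simulation.
    return x * x

def fit_surrogate(samples):
    """Least-squares line through the sampled (x, y) points."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def falsify(limit, candidates, samples, max_rounds=10):
    """Approximation-refinement: search the cheap surrogate, confirm on
    the real model, and refine the surrogate on spurious findings."""
    for _ in range(max_rounds):
        surrogate = fit_surrogate(samples)
        suspects = [x for x in candidates if surrogate(x) > limit]
        if not suspects:
            return None
        x = suspects[0]
        if expensive_model(x) > limit:
            return x                              # confirmed violation
        samples.append((x, expensive_model(x)))   # spurious: refine
    return None

violation = falsify(limit=50, candidates=[1, 3, 6, 9],
                    samples=[(0, 0), (10, 100)])
print(violation)
```

Every expensive simulation is either a confirmed violation or a new training point for the surrogate, so no call to the slow model is wasted.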

DOI: 10.1145/3377811.3380370


A comprehensive study of autonomous vehicle bugs

作者: Garcia, Joshua and Feng, Yang and Shen, Junjie and Almanee, Sumaya and Xia, Yuan and Chen, Qi Alfred
关键词: autonomous vehicles, bugs, defects, empirical software engineering

Abstract

Self-driving cars, or Autonomous Vehicles (AVs), are increasingly becoming an integral part of our daily life. About 50 corporations are actively working on AVs, including large companies such as Google, Ford, and Intel. Some AVs are already operating on public roads, with at least one unfortunate fatality recently on record. As a result, understanding bugs in AVs is critical for ensuring their security, safety, robustness, and correctness. While previous studies have focused on a variety of domains (e.g., numerical software; machine learning; and error-handling, concurrency, and performance bugs) to investigate bug characteristics, AVs have not been studied in a similar manner. Recently, two software systems for AVs, Baidu Apollo and Autoware, have emerged as frontrunners in the open-source community and have been used by large companies and governments (e.g., Lincoln, Volvo, Ford, Intel, Hitachi, LG, and the US Department of Transportation). From these two leading AV software systems, this paper describes our investigation of 16,851 commits and 499 AV bugs and introduces our classification of those bugs into 13 root causes, 20 bug symptoms, and 18 categories of software components those bugs often affect. We identify 16 major findings from our study and draw broader lessons from them to guide the research community towards future directions in software bug detection, localization, and repair.

DOI: 10.1145/3377811.3380397


Studying the use of Java logging utilities in the wild

作者: Chen, Boyuan and Jiang, Zhen Ming (Jack)
关键词: empirical software engineering, logging code, logging practices

Abstract

Software logging is widely used in practice. Logs have been used for a variety of purposes like debugging, monitoring, security compliance, and business analytics. Instead of directly invoking the standard output functions, developers usually prefer to use logging utilities (LUs) (e.g., SLF4J), which provide additional functionalities like thread-safety and verbosity level support, to instrument their source code. Many of the previous research works on software logging are focused on the log printing code. There are very few works studying the use of LUs, although new LUs are constantly being introduced by companies and researchers. In this paper, we conducted a large-scale empirical study on the use of Java LUs in the wild. We analyzed the use of 3,856 LUs from 11,194 projects in GitHub and found that many projects have complex usage patterns for LUs. For example, 75.8% of the large-sized projects have implemented their own LUs in their projects. More than 50% of these projects use at least three LUs. We conducted further qualitative studies to better understand and characterize the complex use of LUs. Our findings show that different LUs are used for a variety of reasons (e.g., internationalization of the log messages). Some projects develop their own LUs to satisfy project-specific logging needs (e.g., defining the logging format). Multiple uses of LUs in one project are pretty common for large and very large-sized projects, mainly for purposes like enabling and configuring the logging behavior for the imported packages. Interviews with 13 industrial developers showed that our findings are also generally true for industrial projects and are considered very helpful for better configuring and managing the logging behavior of their projects. The findings and the implications presented in this paper will be useful for developers and researchers who are interested in developing and maintaining LUs.

DOI: 10.1145/3377811.3380408


A study on the prevalence of human values in software engineering publications, 2015 – 2018

作者: Perera, Harsha and Hussain, Waqar and Whittle, Jon and Nurwidyantoro, Arif and Mougouei, Davoud and Shams, Rifat Ara and Oliver, Gillian
关键词: human values, paper classification, software engineering

Abstract

Failure to account for human values in software (e.g., equality and fairness) can result in user dissatisfaction and negative socio-economic impact. Engineering these values in software, however, requires technical and methodological support throughout the development life cycle. This paper investigates to what extent top Software Engineering (SE) conferences and journals have included research on human values in SE. We investigate the prevalence of human values in recent (2015 – 2018) publications in these top venues. We classify these publications, based on their relevance to different values, against a widely used value structure adopted from the social sciences. Our results show that: (a) only a small proportion of the publications directly consider values, classified as directly relevant publications; (b) for the majority of the values, very few or no directly relevant publications were found; and (c) the prevalence of directly relevant publications was higher in SE conferences compared to SE journals. This paper shares these and other insights that may motivate future research on human values in software engineering.

DOI: 10.1145/3377811.3380393


Explaining pair programming session dynamics from knowledge gaps

作者: Zieris, Franz and Prechelt, Lutz
关键词: No keywords

Abstract

Background: Despite a lot of research on the effectiveness of Pair Programming (PP), the question of when it is useful or less useful remains unsettled. Method: We analyze recordings of many industrial PP sessions with Grounded Theory Methodology and build on prior work that identified various phenomena related to within-session knowledge build-up and transfer. We validate our findings with practitioners. Result: We identify two fundamentally different types of required knowledge and explain how different constellations of knowledge gaps in these two respects lead to different session dynamics. Gaps in project-specific systems knowledge are more hampering than gaps in general programming knowledge and are dealt with first and foremost in a PP session. Conclusion: Partner constellations with complementary knowledge make PP a particularly effective practice. In PP sessions, differences in system understanding are more important than differences in general software development knowledge.

DOI: 10.1145/3377811.3380925


Engineering gender-inclusivity into software: ten teams’ tales from the trenches

作者: Hilderbrand, Claudia and Perdriau, Christopher and Letaw, Lara and Emard, Jillian and Steine-Hanson, Zoe and Burnett, Margaret and Sarma, Anita
关键词: gendermag, inclusive software, software engineering practices

Abstract

Although the need for gender-inclusivity in software is gaining attention among SE researchers and SE practitioners, and at least one method (GenderMag) has been published to help, little has been reported on how to make such methods work in real-world settings. Real-world teams are ever-mindful of the practicalities of adding new methods on top of their existing processes. For example, how can they keep the time costs viable? How can they maximize impacts of using it? What about controversies that can arise in talking about gender? To find out how software teams “in the trenches” handle these and similar questions, we collected the GenderMag-based processes of 10 real-world software teams—more than 50 people—for periods ranging from 5 months to 3.5 years. We present these teams’ insights and experiences in the form of 9 practices, 2 potential pitfalls, and 2 open issues, so as to provide their insights to other real-world software teams trying to engineer gender-inclusivity into their software products.

DOI: 10.1145/3377811.3380371


How has forking changed in the last 20 years? a study of hard forks on GitHub

作者: Zhou, Shurui and Vasilescu, Bogdan and Kästner, Christian
关键词: No keywords

Abstract

The notion of forking has changed with the rise of distributed version control systems and social coding environments, like GitHub. Traditionally, forking refers to splitting off an independent development branch (which we call hard forks); research on hard forks, conducted mostly in pre-GitHub days, showed that hard forks were often seen as critical, as they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call social forks), which may have also influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among others, hard forks often evolve out of social forks rather than being planned deliberately, and that perceptions about hard forks have indeed changed dramatically, with hard forks now often seen as a positive, noncompetitive alternative to the original project.

DOI: 10.1145/3377811.3380412


Multiple-entry testing of Android applications by constructing activity launching contexts

作者: Yan, Jiwei and Liu, Hao and Pan, Linjie and Yan, Jun and Zhang, Jian and Liang, Bin
关键词: Android app, ICC, multiple-entry testing, static analysis

Abstract

Existing GUI testing approaches for Android apps usually test apps from a single entry. In this way, the marginal activities far away from the default entry are difficult to cover. The marginal activities may fail to be launched due to requiring a great number of activity transitions or involving complex user operations, leading to uneven coverage of activity components. Besides, since the test space of GUI programs is infinite, it is difficult to test activities under complete launching contexts using single-entry testing approaches. In this paper, we address these issues by constructing activity launching contexts and proposing a multiple-entry testing framework. We perform an inter-procedural, flow-, context- and path-sensitive analysis to build activity launching models and generate complete launching contexts. By activity exposing and static analysis, we can launch activities directly under various contexts without performing long event sequences on the GUI. Besides, to achieve an in-depth exploration, we design an adaptive exploration framework which supports multiple-entry exploration and dynamically assigns weights to entries in each turn. Our approach is implemented in a tool called Fax, with an activity launching strategy Faxla and an exploration strategy Faxex. The experiments on 20 real-world apps show that Faxla can cover 96.4% of activities and successfully launch 60.6% of them, based on which Faxex further achieves a relative 19.7% improvement in method coverage compared with the most popular tool Monkey. Our tool also behaves well in revealing hidden bugs. Fax can trigger over seven hundred unique crashes, including 180 Errors and 539 Warnings, which is significantly more than other tools. Among the 46 bugs reported to developers on GitHub, 33 have been fixed so far.
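The adaptive, weight-driven multi-entry exploration can be sketched as follows; the entries, coverage sets, and reweighting rule are simplified stand-ins for the Faxex strategy:

```python
import random

def explore(entries, rounds=100, seed=0):
    """Adaptive multi-entry exploration: each turn picks an entry with
    probability proportional to its weight, then reweights entries by
    how much new coverage they yielded."""
    rng = random.Random(seed)
    weights = {e: 1.0 for e in entries}
    covered = set()
    for _ in range(rounds):
        names = list(weights)
        entry = rng.choices(names, [weights[n] for n in names])[0]
        new = entries[entry]() - covered        # newly covered methods
        covered |= new
        weights[entry] = max(0.1, len(new))     # favour productive entries
    return covered

# Hypothetical activities: a rich main screen plus a marginal screen
# that a single-entry tool would rarely reach.
entries = {
    "MainActivity":     lambda: {"m1", "m2", "m3"},
    "SettingsActivity": lambda: {"m4", "m5"},
}
covered = explore(entries)
print(sorted(covered))
```

Because an entry's weight drops once it stops yielding new coverage, the budget naturally drifts toward marginal activities that a single-entry tool would starve.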

DOI: 10.1145/3377811.3380347


ComboDroid: generating high-quality test inputs for Android apps via use case combinations

作者: Wang, Jue and Jiang, Yanyan and Xu, Chang and Cao, Chun and Ma, Xiaoxing and Lu, Jian
关键词: mobile apps, software testing

Abstract

Android apps demand high-quality test inputs, whose generation remains an open challenge. Existing techniques fall short on exploring complex app functionalities reachable only by a long, meaningful, and effective test input. Observing that such test inputs can usually be decomposed into relatively independent short use cases, this paper presents ComboDroid, a fundamentally different Android app testing framework. ComboDroid obtains use cases for manifesting a specific app functionality (either manually provided or automatically extracted), and systematically enumerates combinations of use cases, yielding high-quality test inputs. The evaluation results of ComboDroid on real-world apps are encouraging. Our fully automatic variant outperformed the best existing technique, APE, by covering 4.6% more code (APE outperformed Monkey by only 2.1%), and revealed four previously unknown bugs in extensively tested subjects. Our semi-automatic variant boosts manual use cases obtained with little manual labor, achieving coverage comparable (only 3.2% less) to that of a white-box human testing expert.
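The core enumeration step, combining short use cases into longer test inputs, can be sketched with ordered combinations. This is a toy illustration (the function name `combine_use_cases` and the length cap are ours, not ComboDroid's actual algorithm):

```python
from itertools import permutations

def combine_use_cases(use_cases, max_len=2):
    """Enumerate ordered combinations (permutations up to max_len) of
    short use cases; each flattened combination is one candidate test
    input. Each use case is a list of UI events."""
    combos = []
    for k in range(1, max_len + 1):
        for p in permutations(use_cases, k):
            # Flatten the chosen use cases into one event sequence.
            combos.append([event for case in p for event in case])
    return combos
```

For example, combining a one-event login use case with a two-event purchase use case yields, among others, the long sequence login → add_item → checkout that a random explorer is unlikely to produce.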

DOI: 10.1145/3377811.3380382


Time-travel testing of Android apps

作者: Dong, Zhen and Böhme, Marcel and Lucia and Roychoudhury, Abhik
关键词: No keywords

Abstract

Android testing tools generate sequences of input events to exercise the state space of the app under test. Existing search-based techniques systematically evolve a population of event sequences so as to achieve certain objectives, such as maximal code coverage. The hope is that the mutation of fit event sequences leads to the generation of even fitter sequences. However, the evolution of event sequences may be ineffective. Our key insight is that pertinent app states which contributed to the original sequence’s fitness may not be reached by a mutated event sequence: the original path through the state space is truncated at the point of mutation. In this paper, we propose instead to evolve a population of states, which can be captured upon discovery and resumed when needed. The hope is that generating events on a fit program state leads to transitions to even fitter states. For instance, we can quickly deprioritize testing the main-screen state, which is visited by most event sequences, and instead focus our limited resources on testing more interesting states that are otherwise difficult to reach. We call our approach time-travel testing because of this ability to travel back to any state that has been observed in the past. We implemented time-travel testing in TimeMachine, a time-travel-enabled version of the successful automated Android testing tool Monkey. In our experiments on a large number of open- and closed-source Android apps, TimeMachine outperforms the state-of-the-art search-based/model-based Android testing tools Sapienz and Stoat, both in coverage achieved and in crashes found.

DOI: 10.1145/3377811.3380402


HeteroRefactor: refactoring for heterogeneous computing with FPGA

作者: Lau, Jason and Sivaraman, Aishwarya and Zhang, Qian and Gulzar, Muhammad Ali and Cong, Jason and Kim, Miryung
关键词: FPGA, automated refactoring, dynamic analysis, heterogeneous computing, high-level synthesis

Abstract

Heterogeneous computing with field-programmable gate arrays (FPGAs) has demonstrated orders-of-magnitude improvements in computing efficiency for many applications. However, the use of such platforms so far has been limited to a small subset of programmers with specialized hardware knowledge. High-level synthesis (HLS) tools have made significant progress in raising the level of programming abstraction from hardware description languages to C/C++, but they usually cannot compile and generate accelerators for kernel programs with pointers, memory management, and recursion, and they require manual refactoring to make such programs HLS-compatible. In addition, experts need to provide heavily handcrafted optimizations to improve resource efficiency, which affects the maximum operating frequency, parallelization, and power efficiency. We propose a new dynamic invariant analysis and automated refactoring technique called HeteroRefactor. First, HeteroRefactor monitors FPGA-specific dynamic invariants: the required bitwidth of integer and floating-point variables, and the size of recursive data structures and stacks. Second, using this knowledge of dynamic invariants, it refactors the kernel to make traditionally HLS-incompatible programs synthesizable and to further optimize the accelerator’s resource usage and frequency. Third, to guarantee correctness, it selectively offloads the computation from CPU to FPGA only if an input falls within the dynamic invariant. On average, for a recursive program of 175 LOC, an expert FPGA programmer would need to write 185 more LOC to implement an HLS-compatible version, while HeteroRefactor automates this transformation. Our results on a Xilinx FPGA show that HeteroRefactor reduces BRAM by 83% and increases frequency by 42% for recursive programs; reduces BRAM by 41% through integer bitwidth reduction; and reduces DSP by 50% through floating-point precision tuning.
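The bitwidth invariant at the heart of the first step can be illustrated with a small helper: given the values a variable actually takes at runtime, compute the minimal signed bitwidth that covers all of them. This is a sketch of the idea only (the function name `required_bitwidth` is ours; HeteroRefactor's real monitoring works on instrumented kernels):

```python
def required_bitwidth(observed_values):
    """Minimal signed two's-complement bitwidth covering every observed
    value -- the kind of dynamic invariant used to shrink FPGA variables
    from a default 32/64 bits down to what the data actually needs."""
    lo, hi = min(observed_values), max(observed_values)
    bits = 1  # start from the sign bit alone
    while not (-(1 << (bits - 1)) <= lo and hi < (1 << (bits - 1))):
        bits += 1
    return bits
```

A counter observed only in [0, 200] needs just 9 signed bits instead of 32, which is where the reported BRAM and DSP savings come from.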

DOI: 10.1145/3377811.3380340


HARP: holistic analysis for refactoring Python-based analytics programs

作者: Zhou, Weijie and Zhao, Yue and Zhang, Guoqiang and Shen, Xipeng
关键词: computation graph, dynamic language, machine learning program, program analysis

Abstract

Modern machine learning programs are often written in Python, with the main computations specified through calls to highly optimized libraries (e.g., TensorFlow, PyTorch). Maximizing the computing efficiency of such programs is essential for many application domains and has drawn much recent attention. This work points out a common limitation of existing efforts: they focus only on the static computation graphs specified by library APIs, leaving the influence of the hosting Python code largely unconsidered. This limitation often causes them to miss the big picture and hence many important optimization opportunities. This work proposes a new approach named HARP to address the problem. HARP enables holistic analysis that spans computation graphs and their hosting Python code. HARP achieves this through a set of novel techniques: analytics-conscious speculative analysis to circumvent Python complexities, a unified representation that augments computation graphs to capture all dimensions of knowledge related to the holistic analysis, and a conditioned feedback mechanism to allow risk-controlled aggressive analysis. Refactoring based on HARP yields 1.3–3X speedups (2.07X on average) on a set of TensorFlow and PyTorch programs.

DOI: 10.1145/3377811.3380434


CC2Vec: distributed representations of code changes

作者: Hoang, Thong and Kang, Hong Jin and Lo, David and Lawall, Julia
关键词: No keywords

Abstract

Existing work on software patches often uses features specific to a single task. These works often rely on manually identified features, and human effort is required to identify such features for each task. In this work, we propose CC2Vec, a neural network model that learns a representation of code changes guided by their accompanying log messages, which capture the semantic intent of the code changes. CC2Vec models the hierarchical structure of a code change with the help of an attention mechanism and uses multiple comparison functions to identify the differences between the removed and added code. To evaluate whether CC2Vec can produce a distributed representation of code changes that is general and useful across multiple tasks on software patches, we use the vectors produced by CC2Vec for three tasks: log message generation, bug-fixing patch identification, and just-in-time defect prediction. In all three tasks, the models using CC2Vec outperform the state-of-the-art techniques.
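The "multiple comparison functions" between the removed-code and added-code vectors can be sketched concretely: element-wise subtraction, element-wise product, and cosine similarity, concatenated into one feature vector. This is an illustration under our own naming (`compare`), not CC2Vec's exact operator set:

```python
import math

def compare(removed_vec, added_vec):
    """Concatenate several comparison functions between the embedding of
    the removed code and the embedding of the added code; the result is
    a feature vector characterizing the change itself."""
    sub = [a - r for a, r in zip(added_vec, removed_vec)]      # direction of change
    mul = [a * r for a, r in zip(added_vec, removed_vec)]      # agreement per dimension
    dot = sum(mul)
    norm = (math.sqrt(sum(a * a for a in added_vec))
            * math.sqrt(sum(r * r for r in removed_vec)))
    cosine = dot / norm if norm else 0.0                       # overall similarity
    return sub + mul + [cosine]
```

Orthogonal before/after vectors produce a zero cosine term, signaling a change that replaces rather than tweaks the code.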

DOI: 10.1145/3377811.3380361


Empirical review of automated analysis tools on 47,587 Ethereum smart contracts

作者: Durieux, Thomas and Ferreira, João F. and Abreu, Rui and Cruz, Pedro
关键词: Ethereum, blockchain, debugging, reproducible bugs, smart contracts, solidity, testing, tools

Abstract

Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research. To address this, we present an empirical evaluation of 9 state-of-the-art automated analysis tools using two new datasets: (i) a dataset of 69 annotated vulnerable smart contracts that can be used to evaluate the precision of analysis tools; and (ii) a dataset of all the smart contracts in the Ethereum blockchain that have Solidity source code available on Etherscan (a total of 47,518 contracts). The datasets are part of SmartBugs, a new extendable execution framework that we created to facilitate the integration and comparison of multiple analysis tools and the analysis of Ethereum smart contracts. We used SmartBugs to execute the 9 automated analysis tools on the two datasets. In total, we ran 428,337 analyses that took approximately 564 days and 3 hours, making this the largest experimental setup to date in both the number of tools and execution time. We found that only 42% of the vulnerabilities from our annotated dataset are detected by all the tools combined, with the tool Mythril having the highest accuracy (27%). On the largest dataset, we observed that 97% of contracts are tagged as vulnerable, suggesting a considerable number of false positives. Indeed, only a small number of vulnerabilities (and of only two categories) were detected simultaneously by four or more tools.

DOI: 10.1145/3377811.3380364


Gap between theory and practice: an empirical study of security patches in solidity

作者: Hwang, Sungjae and Ryu, Sukyoung
关键词: empirical study, security patches, smart contracts, solidity

Abstract

Ethereum, one of the most popular blockchain platforms, supports financial transactions such as payments and auctions through smart contracts. Due to the immense interest in smart contracts in academia, the research community of smart contract security has made significant progress recently. Researchers have reported various security vulnerabilities in smart contracts and developed static analysis tools and verification frameworks to detect them. However, it is unclear whether such great efforts from academia have indeed enhanced the security of smart contracts in reality. To understand the security level of smart contracts in the wild, we empirically studied 55,046 real-world Ethereum smart contracts written in Solidity, the most popular programming language among Ethereum smart contract developers. We first examined how many well-known vulnerabilities the Solidity compiler has patched and how frequently the Solidity team publishes compiler releases. Unfortunately, we observed that many known vulnerabilities are not yet patched, and some patches are not even sufficient to avoid their target vulnerabilities. Subsequently, we investigated whether smart contract developers use the most recent compiler with vulnerabilities patched. We found that developers of more than 98% of real-world Solidity contracts still use older compilers without vulnerability patches, and that more than 25% of the contracts are potentially vulnerable due to missing security patches. To understand the actual impact of the missing patches, we manually investigated potentially vulnerable contracts detected by our static analyzer and identified common mistakes by Solidity developers that may cause serious security issues such as financial loss. We detected hundreds of vulnerable contracts, and about one fourth of them are used by thousands of people. We recommend that the Solidity team make patches that resolve known vulnerabilities correctly, and that developers use the latest Solidity compiler to avoid missing security patches.

DOI: 10.1145/3377811.3380424


An investigation of cross-project learning in online just-in-time software defect prediction

作者: Tabassum, Sadia and Minku, Leandro L. and Feng, Danyi and Cabral, George G. and Song, Liyan
关键词: class imbalance, concept drift, cross-project learning, online learning, software defect prediction, transfer learning, verification latency

Abstract

Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean based on machine learning classifiers. Building such classifiers requires a sufficient amount of training data, which is not available at the beginning of a software project. Cross-Project (CP) JIT-SDP can overcome this issue by using data from other projects to build the classifier, achieving similar (not better) predictive performance to classifiers trained on Within-Project (WP) data. However, such approaches have never been investigated in realistic online learning scenarios, where WP software changes arrive continuously over time and can be used to update the classifiers. It is unknown to what extent CP data can be helpful in such a situation. In particular, it is unknown whether CP data are only useful during the very initial phase of the project, when there is little WP data, or whether they can be helpful for extended periods of time. This work thus provides the first investigation of when and to what extent CP data are useful for JIT-SDP in a realistic online learning scenario. For that, we develop three different CP JIT-SDP approaches that can operate in online mode and be updated with both incoming CP and WP training examples over time. We also collect 2048 commits from three software repositories developed by a software company over the course of 9 to 10 months, and use 198,468 commits from 10 active open source GitHub projects developed over the course of 6 to 14 years. The study shows that training classifiers with incoming CP+WP data can lead to improvements in G-mean of up to 53.90% compared to classifiers using only WP data at the initial stage of the projects. For the open source projects, which have been running for longer periods of time, using CP data to supplement WP data also helped the classifiers to reduce or prevent large drops in predictive performance that may occur over time, leading to up to around 40% better G-mean during such periods. Such use of CP data was shown to be beneficial even after a large number of WP data were received, leading to overall G-means up to 18.5% better than those of WP classifiers.
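The G-mean metric the study reports can be computed directly from a confusion matrix as the geometric mean of the per-class recalls, which is why it is favored under class imbalance. A minimal sketch (the function name `g_mean` is ours):

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of the recalls on the defect-inducing (positive)
    and clean (negative) classes. Unlike accuracy, it collapses to 0
    when either class is entirely misclassified."""
    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(recall_pos * recall_neg)
```

A classifier that labels everything "clean" gets high accuracy on an imbalanced stream but a G-mean of 0, so improvements of 40-50% in G-mean reflect genuinely better minority-class prediction.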

DOI: 10.1145/3377811.3380403


Understanding the automated parameter optimization on transfer learning for cross-project defect prediction: an empirical study

作者: Li, Ke and Xiang, Zilin and Chen, Tao and Wang, Shuo and Tan, Kay Chen
关键词: automated parameter optimization, classification techniques, cross-project defect prediction, transfer learning

Abstract

Data-driven defect prediction has become increasingly important in the software engineering process. Since it is not uncommon that data from a software project is insufficient for training a reliable defect prediction model, transfer learning that borrows data/knowledge from other projects to facilitate model building for the current project, namely cross-project defect prediction (CPDP), is naturally plausible. Most CPDP techniques involve two major steps, i.e., transfer learning and classification, each of which has at least one parameter to be tuned to achieve optimal performance. This practice fits well with the purpose of automated parameter optimization. However, there is a lack of thorough understanding of the impacts of automated parameter optimization on various CPDP techniques. In this paper, we present the first empirical study of such impacts on 62 CPDP techniques, 13 of which are chosen from the existing CPDP literature while the other 49 have not been explored before. We build defect prediction models over 20 real-world software projects of different scales and characteristics. Our findings demonstrate that: (1) automated parameter optimization substantially improves the defect prediction performance of 77% of CPDP techniques at a manageable computational cost, so more effort on this aspect is warranted in future CPDP studies; (2) transfer learning is of ultimate importance in CPDP, and given a tight computational budget, it is more cost-effective to focus on optimizing the parameter configuration of transfer learning algorithms; (3) the research on CPDP is far from mature, and it is ‘not difficult’ to find a better alternative by combining existing transfer learning and classification techniques. This finding provides important insights for the future design of CPDP techniques.

DOI: 10.1145/3377811.3380360


Software visualization and deep transfer learning for effective software defect prediction

作者: Chen, Jinyin and Hu, Keke and Yu, Yue and Chen, Zhuangzhi and Xuan, Qi and Liu, Yi and Filkov, Vladimir
关键词: cross-project defect prediction, deep transfer learning, self-attention, software visualization, within-project defect prediction

Abstract

Software defect prediction aims to automatically locate defective code modules so as to better focus testing resources and human effort. Typically, software defect prediction pipelines comprise two parts: the first extracts program features, like abstract syntax trees, using external tools, and the second applies machine learning-based classification models to those features to predict defective modules. Since such approaches depend on specific feature extraction tools, the machine learning classifiers have to be custom-tailored to build the most accurate models. To bridge the gap between deep learning and defect prediction, we propose an end-to-end framework that produces prediction results for programs directly, without feature-extraction tools. To that end, we first visualize programs as images, apply a self-attention mechanism to extract image features, use transfer learning to reduce the difference in sample distributions between projects, and finally feed the image files into a pre-trained deep learning model for defect prediction. Experiments with 10 open source projects from the PROMISE dataset show that our method can improve both cross-project and within-project defect prediction. Our code and data pointers are available at https://zenodo.org/record/3373409#.XV0Oy5Mza35.
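The first step, visualizing a program as an image, amounts to laying out the source bytes as a grayscale pixel grid. A toy version (the function name `source_to_image` and the fixed-width layout are our simplifications of the paper's visualization):

```python
def source_to_image(code, width=8):
    """Render source bytes as a row-major grayscale pixel grid,
    zero-padding the last row, so that an image classifier can consume
    the program without any language-specific feature extraction."""
    data = code.encode("utf-8")
    rows = []
    for i in range(0, len(data), width):
        row = list(data[i:i + width])
        row += [0] * (width - len(row))  # pad the final partial row
        rows.append(row)
    return rows
```

Each byte becomes one pixel intensity, so lexical and structural regularities of the code show up as visual texture that a pre-trained CNN can pick up.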

DOI: 10.1145/3377811.3380389


Software documentation: the practitioners’ perspective

作者: Aghajani, Emad and Nagy, Csaba and Linares-Vásquez, Mario and Moreno, Laura and Bavota, Gabriele and Lanza, Michele and Shepherd, David C.
关键词: documentation, empirical study

Abstract

In theory, (good) documentation is an invaluable asset to any software project, as it helps stakeholders use, understand, maintain, and evolve a system. In practice, however, documentation is generally affected by numerous shortcomings and issues, such as insufficient and inadequate content and obsolete, ambiguous information. To counter this, researchers are investigating the development of advanced recommender systems that automatically suggest high-quality documentation useful for a given task. A crucial first step is to understand what quality means for practitioners and what information is actually needed for specific tasks. We present two surveys performed with 146 practitioners to investigate (i) the documentation issues they perceive as most relevant, together with the solutions they apply when these issues arise; and (ii) the types of documentation considered important for different tasks. Our findings can help researchers in designing the next generation of documentation recommender systems.

DOI: 10.1145/3377811.3380405


DLFix: context-based code transformation learning for automated program repair

作者: Li, Yi and Wang, Shaohua and Nguyen, Tien N.
关键词: automated program repair, context-based code transformation learning, deep learning

Abstract

Automated Program Repair (APR) is very useful in helping developers during software development and maintenance. Despite recent advances in deep learning (DL), DL-based APR approaches still have limitations in learning bug-fixing code changes and the context of the surrounding source code of those changes. These limitations lead to incorrect fixing locations or fixes. In this paper, we introduce DLFix, a two-tier DL model that treats APR as code transformation learning from prior bug fixes and the code contexts surrounding them. The first layer is a tree-based RNN model that learns the contexts of bug fixes; its result is used as an additional weighting input for the second layer, which learns the bug-fixing code transformations. We conducted several experiments to evaluate DLFix on two benchmarks, Defects4J and Bugs.jar, and on a newly built dataset with more than 20K real-world bugs from eight projects. We compared DLFix against 13 state-of-the-art pattern-based APR tools. Our results show that DLFix can auto-fix more bugs than 11 of them and is comparable and complementary to the top two pattern-based APR tools, with 7 and 11 unique bugs, respectively, that they cannot fix but DLFix can. Importantly, DLFix is fully automated and data-driven, and does not require hard-coding of bug-fixing patterns as those tools do. We also compared DLFix against 4 state-of-the-art deep learning-based APR models; DLFix fixes 2.5 times more bugs than the best-performing baseline.

DOI: 10.1145/3377811.3380345


On the efficiency of test suite based program repair: a systematic assessment of 16 automated repair systems for Java programs

作者: Liu, Kui and Wang, Shangwen and Koyuncu, Anil and Kim, Kisub and Bissyandé, Tegawendé F. and Kim, Dongsun and Wu, Peng and Klein, Jacques and Mao, Xiaoguang and Le Traon, Yves
关键词: efficiency, empirical assessment, patch generation, program repair

Abstract

Test-based automated program repair has been a prolific field of research in software engineering over the last decade. Many approaches have been proposed that leverage test suites as a weak, but affordable, approximation of program specifications. Although the literature regularly sets new records on the number of benchmark bugs that can be fixed, several studies increasingly raise concerns about the limitations and biases of state-of-the-art approaches. For example, the correctness of generated patches has been questioned in a number of studies, while other researchers have pointed out that evaluation schemes may be misleading with respect to the processing of fault localization results. Nevertheless, there is little work addressing the efficiency of patch generation with regard to the practicality of program repair. In this paper, we fill this gap in the literature by providing an extensive review of the efficiency of test suite based program repair. Our objective is to assess the number of generated patch candidates, since this information is correlated with (1) the strategy for traversing the search space efficiently in order to select sensible repair attempts, (2) the strategy for minimizing the test effort needed to identify a plausible patch, and (3) the strategy for prioritizing the generation of a correct patch. To that end, we perform a large-scale empirical study of the efficiency, in terms of the quantity of generated patch candidates, of 16 open-source repair tools for Java programs. The experiments are carefully conducted under the same fault localization configurations to limit biases. Among other findings, we note that: (1) many irrelevant patch candidates are generated by changing wrong code locations; (2) however, if the search space is carefully triaged, fault localization noise has little impact on patch generation efficiency; (3) yet, current template-based repair systems, which are known to be most effective in fixing a large number of bugs, are actually the least efficient, as they tend to generate mostly irrelevant patch candidates.

DOI: 10.1145/3377811.3380338


Caspar: extracting and synthesizing user stories of problems from app reviews

作者: Guo, Hui and Singh, Munindar P.
关键词: No keywords

Abstract

A user’s review of an app often describes the user’s interactions with the app. These interactions, which we interpret as mini stories, are prominent in reviews with negative ratings. In general, a story in an app review contains at least two types of events: user actions and associated app behaviors. Being able to identify such stories would enable an app’s developers to better maintain and improve the app’s functionality and enhance user experience. We present Caspar, a method for extracting and synthesizing user-reported mini stories about app problems from reviews. By extending and applying natural language processing techniques, Caspar extracts ordered events from app reviews, classifies them as user actions or app problems, and synthesizes action-problem pairs. Our evaluation shows that Caspar is effective in finding action-problem pairs in reviews. First, Caspar classifies the events with an accuracy of 82.0% on manually labeled data. Second, relative to human evaluators, Caspar extracts event pairs with 92.9% precision and 34.2% recall. In addition, we train an inference model on the extracted action-problem pairs that automatically predicts possible app problems for different use cases. Preliminary evaluation shows that our method yields promising results. Caspar illustrates the potential for a deeper understanding of app reviews and possibly of other natural language artifacts arising in software engineering.

DOI: 10.1145/3377811.3380924


Detection of hidden feature requests from massive chat messages via deep siamese network

作者: Shi, Lin and Xing, Mingzhe and Li, Mingyang and Wang, Yawen and Li, Shoubin and Wang, Qing
关键词: deep learning, feature requests, requirements engineering, siamese network

Abstract

Online chatting is gaining popularity and plays an increasingly significant role in software development. When discussing functionalities, developers might reveal their desired features to other developers. Automated mining techniques for retrieving feature requests from massive chat messages can benefit the requirements gathering process. But such techniques are quite challenging to build: detecting feature requests in dialogues requires a thorough understanding of contextual information, and annotating feature-request dialogues for learning is extremely expensive. To bridge that gap, we recast the traditional text classification task of mapping a single dialogue to its class into the task of determining whether two dialogues are similar, by incorporating few-shot learning. We propose a novel approach, named FRMiner, which detects feature-request dialogues in chat messages via a deep Siamese network. We design a BiLSTM-based dialogue model that learns the contextual information of a dialogue in both forward and reverse directions. Evaluation on real-world projects shows that our approach achieves average precision, recall, and F1-score of 88.52%, 88.50%, and 88.51%, which confirms that it can effectively detect hidden feature requests in chat messages and thus facilitate gathering comprehensive requirements from the crowd in an automated way.
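The recasting from "classify a dialogue" to "are these two dialogues similar?" can be sketched with a similarity-based few-shot classifier: embed the new dialogue, compare it against a handful of labeled support examples, and adopt the label of the most similar one. This is a minimal illustration with our own names (`cosine`, `classify_by_similarity`); FRMiner's Siamese network learns the embedding and similarity jointly:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def classify_by_similarity(dialog_vec, support):
    """Few-shot classification via pairwise similarity: compare the
    dialogue embedding against labeled support embeddings and return
    the label of the most similar one."""
    best_label, _ = max(((label, cosine(dialog_vec, vec))
                         for label, vec in support),
                        key=lambda pair: pair[1])
    return best_label
```

Because the model only needs to judge pairwise similarity, a few labeled feature-request dialogues suffice as the support set, which is the point of the few-shot reformulation.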

DOI: 10.1145/3377811.3380356


A tale from the trenches: cognitive biases and software development

作者: Chattopadhyay, Souti and Nelson, Nicholas and Au, Audrey and Morales, Natalia and Sanchez, Christopher and Pandita, Rahul and Sarma, Anita
关键词: cognitive bias, field study, interviews, software development

Abstract

Cognitive biases are hard-wired behaviors that influence developer actions and can set them on an incorrect course of action, necessitating backtracking. While researchers have found that cognitive biases occur in development tasks in controlled lab studies, we still don’t know how these biases affect developers’ everyday behavior. Without such an understanding, development tools and practices remain inadequate. To close this gap, we conducted a 2-part field study to examine the extent to which cognitive biases occur, the consequences of these biases on developer behavior, and the practices and tools that developers use to deal with them. About 70% of observed actions that were reversed were associated with at least one cognitive bias. Further, even though developers recognized that biases frequently occur, they are routinely forced to deal with such issues through ad hoc processes and sub-optimal tool support. As one participant (IP12) lamented: “There is no salvation!”

DOI: 10.1145/3377811.3380330


Recognizing developers’ emotions while programming

作者: Girardi, Daniela and Novielli, Nicole and Fucci, Davide and Lanubile, Filippo
关键词: biometric sensors, emotion awareness, emotion detection, empirical software engineering, human factors in software engineering

Abstract

Developers experience a wide range of emotions during programming tasks, which may have an impact on job performance. In this paper, we present an empirical study aimed at (i) investigating the link between emotion and progress, (ii) understanding the triggers for developers’ emotions and the strategies to deal with negative ones, (iii) identifying the minimal set of non-invasive biometric sensors for emotion recognition during programming tasks. Results confirm previous findings about the relation between emotions and perceived productivity. Furthermore, we show that developers’ emotions can be reliably recognized using only a wristband capturing the electrodermal activity and heart-related metrics.

DOI: 10.1145/3377811.3380374


Neurological divide: an fMRI study of prose and code writing

作者: Krueger, Ryan and Huang, Yu and Liu, Xinyu and Santander, Tyler and Weimer, Westley and Leach, Kevin
关键词: attention, keyboard, medical imaging, memory, spatial, synthesis

Abstract

Software engineering involves writing new code or editing existing code. Recent efforts have investigated the neural processes associated with reading and comprehending code; however, we lack a thorough understanding of the human cognitive processes underlying code writing. While prose reading and writing have been studied thoroughly, the same scrutiny has not been applied to code writing. In this paper, we leverage functional brain imaging to investigate neural representations of code writing in comparison to prose writing. We present the first human study in which participants wrote code and prose while undergoing a functional magnetic resonance imaging (fMRI) brain scan, making use of a full-sized fMRI-safe QWERTY keyboard. We find that code writing and prose writing are significantly dissimilar neural tasks. While prose writing entails significant left-hemisphere activity associated with language, code writing involves more activation of the right hemisphere, including regions associated with attention control, working memory, planning, and spatial cognition. These findings are unlike existing work on code and prose comprehension; we present the first evidence suggesting that code and prose writing are quite dissimilar at the neural level.

DOI: 10.1145/3377811.3380348


Here we go again: why is it difficult for developers to learn another programming language?

作者: Shrestha, Nischal and Botta, Colton and Barik, Titus and Parnin, Chris
关键词: interference theory, learning, program comprehension, programming environments, programming languages

Abstract

Once a programmer knows one language, they can leverage concepts and knowledge already learned and easily pick up another programming language. But is that always the case? To understand whether programmers have difficulty learning additional programming languages, we conducted an empirical study of Stack Overflow questions across 18 different programming languages. We hypothesized that previous knowledge could potentially interfere with learning a new programming language. From our inspection of 450 Stack Overflow questions, we found 276 instances of interference that occurred due to faulty assumptions originating from knowledge about a different language. To understand why these difficulties occurred, we conducted semi-structured interviews with 16 professional programmers. The interviews revealed that programmers make failed attempts to relate a new programming language to what they already know. Our findings inform design implications for technical authors, toolsmiths, and language designers, such as designing documentation and automated tools that reduce interference, anticipating uncommon language transitions during language design, and welcoming programmers not just into a language, but into its entire ecosystem.

DOI: 10.1145/3377811.3380352


Importance-driven deep learning system testing

作者: Gerasimou, Simos and Eniser, Hasan Ferit and Sen, Alper and Cakan, Alper
关键词: deep learning systems, safety-critical systems, test adequacy

Abstract

Deep Learning (DL) systems are key enablers for engineering intelligent applications due to their ability to solve complex tasks such as image recognition and machine translation. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. Recent research in this direction focuses on adapting testing criteria from traditional software engineering as a means of increasing confidence in their correct behaviour. However, such criteria are inadequate for capturing the intrinsic properties exhibited by these systems. We bridge this gap by introducing DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC makes it possible to establish a layer-wise functional understanding of the importance of DL system components and to use this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems, across multiple DL datasets and with state-of-the-art adversarial generation techniques, demonstrates the usefulness and effectiveness of DeepImportance and its ability to support the engineering of more robust DL systems.
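The criterion can be pictured with a small sketch (the clustering setup and all names here are our own illustration, not DeepImportance's implementation): once the important neurons are identified and their activation values clustered, adequacy is the fraction of cluster combinations that the test set exercises.

```python
from itertools import product

def importance_driven_coverage(test_activations, clusters_per_neuron):
    """Toy importance-driven adequacy: the fraction of all combinations of
    activation clusters of the important neurons that some test reaches.
    test_activations: one tuple per test, giving a cluster id per neuron."""
    combos = set(product(*(range(k) for k in clusters_per_neuron)))
    covered = set(test_activations) & combos
    return len(covered) / len(combos)

# Two important neurons with 2 and 3 activation clusters -> 6 combinations;
# the duplicate (0, 0) is only counted once.
tests = [(0, 0), (0, 2), (1, 1), (0, 0)]
print(importance_driven_coverage(tests, [2, 3]))  # 0.5
```

A semantically diverse test set would push this ratio toward 1 by reaching previously unexercised cluster combinations.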

DOI: 10.1145/3377811.3380391


ReluDiff: differential verification of deep neural networks

作者: Paulsen, Brandon and Wang, Jingbo and Wang, Chao
关键词: No keywords

Abstract

As deep neural networks are increasingly being deployed in practice, their efficiency has become an important issue. While there are compression techniques for reducing a network’s size, energy consumption, and computational requirements, they are only shown empirically to preserve accuracy, and lack formal guarantees about the compressed network, e.g., in the presence of adversarial examples. Existing verification techniques such as Reluplex, ReluVal, and DeepPoly provide formal guarantees, but they are designed for analyzing a single network rather than the relationship between two networks. To fill the gap, we develop a new method for differential verification of two closely related networks. Our method consists of a fast but approximate forward interval analysis pass followed by a backward pass that iteratively refines the approximation until the desired property is verified. We have two main innovations. During the forward pass, we exploit structural and behavioral similarities of the two networks to more accurately bound the difference between the output neurons of the two networks. Then, in the backward pass, we leverage the gradient differences to more accurately compute the most beneficial refinement. Our experiments show that, compared to state-of-the-art verification tools, our method can achieve orders-of-magnitude speedup and prove many more properties than existing tools.
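To see why reasoning about the two networks jointly pays off, consider the naive alternative: propagate an input box through each network separately with interval arithmetic and subtract the output intervals. A minimal sketch, using toy one-layer networks (not the paper's refined algorithm):

```python
def relu_interval(W, b, lo, hi):
    """Propagate an input box [lo, hi] through one ReLU layer y = relu(Wx + b)
    with interval arithmetic."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        out_lo.append(max(bias + sum(w * (lo[j] if w >= 0 else hi[j])
                                     for j, w in enumerate(row)), 0.0))
        out_hi.append(max(bias + sum(w * (hi[j] if w >= 0 else lo[j])
                                     for j, w in enumerate(row)), 0.0))
    return out_lo, out_hi

def naive_diff_bounds(net1, net2, lo, hi):
    """Bound f1(x) - f2(x) by analysing each network separately and then
    subtracting the output intervals: the loose baseline that joint,
    difference-aware analysis improves on."""
    (l1, h1), (l2, h2) = relu_interval(*net1, lo, hi), relu_interval(*net2, lo, hi)
    return [a - b for a, b in zip(l1, h2)], [a - b for a, b in zip(h1, l2)]

# Two one-layer nets differing in a single (e.g. quantised) weight.
net1 = ([[1.0, -1.0]], [0.0])
net2 = ([[0.75, -1.0]], [0.0])
lo, hi = naive_diff_bounds(net1, net2, [0.0, 0.0], [1.0, 1.0])
print(lo, hi)  # [-0.75] [1.0]
```

The bounds stay this loose even for two identical networks, which is exactly the imprecision that bounding the neuron-wise differences directly avoids.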

DOI: 10.1145/3377811.3380337


Dissector: input validation for deep learning applications by crossing-layer dissection

作者: Wang, Huiyan and Xu, Jingwei and Xu, Chang and Ma, Xiaoxing and Lu, Jian
关键词: deep learning, fault tolerance, input validation

Abstract

Deep learning (DL) applications are becoming increasingly popular. Their reliability largely depends on the performance of the DL models integrated into these applications as central classification modules. Traditional techniques need to retrain the models or rebuild and redeploy the applications to cope with unexpected conditions beyond the models’ handling capabilities. In this paper, we take a fault tolerance approach, Dissector, to distinguishing inputs that represent unexpected conditions (beyond-inputs) from normal inputs that are still within the models’ handling capabilities (within-inputs), thus keeping the applications functioning with the expected reliability. The key insight of Dissector is that a DL model should interpret a within-input with increasing confidence, while a beyond-input would probably cause confused guesses in the prediction process. Dissector works in an application-specific way, adapting to the DL models used in an application, and is extremely efficient, scaling to large datasets from complex scenarios. The experimental evaluation shows that Dissector outperformed state-of-the-art techniques in effectiveness (AUC: avg. 0.8935 and up to 0.9894) and efficiency (runtime overhead: only 3.3–5.8 milliseconds). Besides, it also exhibited encouraging usefulness in defending against adversarial inputs (AUC: avg. 0.9983) and improving a DL model’s actual accuracy in use (up to 16% for CIFAR-100 and 20% for ImageNet).
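The key insight can be turned into a toy validity score (the linear weighting below is our simplification, not Dissector's actual formula): take the per-layer confidence assigned to the final prediction and weight later layers more heavily, so growing confidence scores high and confused guesses score low.

```python
def validity_score(layer_confidences):
    """Toy Dissector-style score: a within-input's confidence in the final
    prediction should grow across layers, so later layers get larger weights
    (weight i+1 for layer i; the real weighting scheme differs)."""
    weights = range(1, len(layer_confidences) + 1)
    return sum(w * c for w, c in zip(weights, layer_confidences)) / sum(weights)

within = [0.3, 0.6, 0.9]   # confidence grows: likely a within-input
beyond = [0.5, 0.4, 0.2]   # confused guesses: likely a beyond-input
print(validity_score(within) > validity_score(beyond))  # True
```

Thresholding such a score then lets the application reject beyond-inputs instead of silently misclassifying them.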

DOI: 10.1145/3377811.3380379


Towards characterizing adversarial defects of deep learning software from the lens of uncertainty

作者: Zhang, Xiyue and Xie, Xiaofei and Ma, Lei and Du, Xiaoning and Hu, Qiang and Liu, Yang and Zhao, Jianjun and Sun, Meng
关键词: adversarial attack, deep learning, software testing, uncertainty

Abstract

Over the past decade, deep learning (DL) has been successfully applied to many industrial domain-specific tasks. However, current state-of-the-art DL software still suffers from quality issues, which raises great concern, especially in safety- and security-critical scenarios. Adversarial examples (AEs), inputs on which DL software makes incorrect decisions, represent a typical and important type of defect that urgently needs to be addressed. Such defects occur through either intentional attack or physical-world noise perceived by input sensors, potentially hindering further industry deployment. The intrinsic uncertainty of deep learning decisions can be a fundamental reason for this incorrect behavior. Although some testing, adversarial attack, and defense techniques have recently been proposed, a systematic study uncovering the relationship between AEs and DL uncertainty is still lacking. In this paper, we conduct a large-scale study towards bridging this gap. We first investigate the capability of multiple uncertainty metrics to differentiate benign examples (BEs) and AEs, which enables us to characterize the uncertainty patterns of input data. Then, we identify and categorize the uncertainty patterns of BEs and AEs, and find that while BEs and AEs generated by existing methods do follow common uncertainty patterns, some other uncertainty patterns are largely missed. Based on this, we propose an automated testing technique to generate multiple types of uncommon AEs and BEs that are largely missed by existing techniques. Our further evaluation reveals that the uncommon data generated by our method are hard for existing defense techniques to defend against, with the average defense success rate reduced by 35%. Our results call attention to the necessity of generating more diverse data for evaluating the quality assurance solutions of DL software.

DOI: 10.1145/3377811.3380368


Gang of eight: a defect taxonomy for infrastructure as code scripts

作者: Rahman, Akond and Farhana, Effat and Parnin, Chris and Williams, Laurie
关键词: bug, category, configuration as code, configuration scripts, defect, devops, infrastructure as code, puppet, software quality, taxonomy

Abstract

Defects in infrastructure as code (IaC) scripts can have serious consequences, for example, creating large-scale system outages. A taxonomy of IaC defects can be useful for understanding the nature of defects and identifying the activities needed to fix and prevent defects in IaC scripts. The goal of this paper is to help practitioners improve the quality of infrastructure as code (IaC) scripts by developing a defect taxonomy for IaC scripts through qualitative analysis. We develop a taxonomy of IaC defects by applying qualitative analysis to 1,448 defect-related commits collected from open source software (OSS) repositories of the Openstack organization. We conduct a survey with 66 practitioners to assess whether they agree with the identified defect categories included in our taxonomy. We quantify the frequency of the identified defect categories by analyzing 80,425 commits collected from 291 OSS repositories spanning 2005 to 2019. Our defect taxonomy for IaC consists of eight categories, including a category specific to IaC called idempotency (i.e., defects that lead to incorrect system provisioning when the same IaC script is executed multiple times). The surveyed practitioners agree most with the idempotency category. The most frequent defect category is configuration data, i.e., providing erroneous configuration data in IaC scripts. Our taxonomy and the quantified frequency of the defect categories may help in advancing the science of IaC script quality.
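The idempotency category has a crisp operational reading that can be sketched directly (the provisioning steps below are hypothetical examples, not taken from the paper): a step is idempotent exactly when applying it twice leaves the system in the same state as applying it once.

```python
def is_idempotent(provision, initial_state):
    """An IaC step is idempotent iff applying it twice leaves the system in
    the same state as applying it once."""
    once = provision(dict(initial_state))
    twice = provision(dict(once))
    return once == twice

# Hypothetical idempotent step: ensure a package is present.
ensure_pkg = lambda s: {**s, "packages": set(s.get("packages", set())) | {"nginx"}}
# Hypothetical non-idempotent step: blindly append a line to a config file.
append_line = lambda s: {**s, "config": s.get("config", "") + "port 80\n"}

print(is_idempotent(ensure_pkg, {}))   # True
print(is_idempotent(append_line, {}))  # False
```

Tools like Puppet expect every resource declaration to behave like `ensure_pkg` here; the `append_line` pattern is the kind of defect the idempotency category captures.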

DOI: 10.1145/3377811.3380409


MemLock: memory usage guided fuzzing

作者: Wen, Cheng and Wang, Haijun and Li, Yuekang and Qin, Shengchao and Liu, Yang and Xu, Zhiwu and Chen, Hongxu and Xie, Xiaofei and Pu, Geguang and Liu, Ting
关键词: fuzz testing, memory consumption, software vulnerability

Abstract

Uncontrolled memory consumption is a critical kind of software security weakness. It can become a security-critical vulnerability when attackers can take control of the input to consume a large amount of memory and launch a Denial-of-Service attack. However, detecting such vulnerabilities is challenging, as state-of-the-art fuzzing techniques focus on code coverage rather than memory consumption. To this end, we propose a memory-usage-guided fuzzing technique, named MemLock, to generate excessive-memory-consumption inputs and trigger uncontrolled memory consumption bugs. The fuzzing process is guided with memory consumption information, so our approach is general and does not require any domain knowledge. We perform a thorough evaluation of MemLock on 14 widely used real-world programs. Our experiment results show that MemLock substantially outperforms state-of-the-art fuzzing techniques, including AFL, AFLfast, PerfFuzz, FairFuzz, Angora, and QSYM, in discovering memory consumption bugs. During the experiments, we discovered many previously unknown memory consumption bugs and received 15 new CVEs.
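The guidance idea fits in a few lines (the target function and byte-level mutator below are stand-ins for instrumented execution, not MemLock's implementation): the classic coverage feedback is replaced by peak-memory feedback, so inputs that consume more memory are kept and mutated further.

```python
import random

def memory_guided_fuzz(target, seed, rounds=200, rng=random.Random(0)):
    """Toy memory-usage-guided fuzzing loop: an input is kept when it raises
    the peak memory observed so far, instead of when it covers new code.
    `target` stands in for instrumented execution reporting peak memory."""
    best_input, best_mem = seed, target(seed)
    for _ in range(rounds):
        mutant = mutate(best_input, rng)
        mem = target(mutant)
        if mem > best_mem:                 # memory feedback, not coverage
            best_input, best_mem = mutant, mem
    return best_input, best_mem

def mutate(data, rng):
    """Replace one random byte (a deliberately minimal mutator)."""
    i = rng.randrange(len(data))
    return data[:i] + bytes([rng.randrange(256)]) + data[i + 1:]

# Hypothetical target that allocates a buffer of size first_byte ** 2.
peak_mem = lambda data: data[0] ** 2
inp, mem = memory_guided_fuzz(peak_mem, b"\x01\x00\x00\x00")
print(mem)  # climbs toward 255 ** 2
```

Because the feedback signal is just a number per execution, the same loop works for any program, which is why no domain knowledge is needed.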

DOI: 10.1145/3377811.3380396


sFuzz: an efficient adaptive fuzzer for solidity smart contracts

作者: Nguyen, Tai D. and Pham, Long H. and Sun, Jun and Lin, Yun and Minh, Quang Tran
关键词: No keywords

Abstract

Smart contracts are Turing-complete programs that execute on the infrastructure of the blockchain and often manage valuable digital assets. Solidity is one of the most popular programming languages for writing smart contracts on the Ethereum platform. Like traditional programs, smart contracts may contain vulnerabilities. Unlike traditional programs, smart contracts cannot be easily patched once they are deployed. It is thus important that smart contracts are tested thoroughly before deployment. In this work, we present an adaptive fuzzer for smart contracts on the Ethereum platform called sFuzz. Compared to existing Solidity fuzzers, sFuzz combines the strategy in the AFL fuzzer with an efficient lightweight multi-objective adaptive strategy targeting hard-to-cover branches. sFuzz has been applied to more than 4,000 smart contracts and the experimental results show that (1) sFuzz is efficient, e.g., two orders of magnitude faster than state-of-the-art tools; (2) sFuzz is effective in achieving high code coverage and discovering vulnerabilities; and (3) the different fuzzing strategies in sFuzz complement each other.

DOI: 10.1145/3377811.3380334


Targeted greybox fuzzing with static lookahead analysis

作者: Wüstholz, Valentin and Christakis, Maria
关键词: No keywords

Abstract

Automatic test generation typically aims to generate inputs that explore new paths in the program under test in order to find bugs. Existing work has, therefore, focused on guiding the exploration toward program parts that are more likely to contain bugs by using an offline static analysis. In this paper, we introduce a novel technique for targeted greybox fuzzing using an online static analysis that guides the fuzzer toward a set of target locations, for instance, located in recently modified parts of the program. This is achieved by first semantically analyzing each program path that is explored by an input in the fuzzer’s test suite. The results of this analysis are then used to control the fuzzer’s specialized power schedule, which determines how often to fuzz inputs from the test suite. We implemented our technique by extending a state-of-the-art, industrial fuzzer for Ethereum smart contracts and evaluate its effectiveness on 27 real-world benchmarks. Using an online analysis is particularly suitable for the domain of smart contracts since it does not require any code instrumentation; adding instrumentation to contracts changes their semantics. Our experiments show that targeted fuzzing significantly outperforms standard greybox fuzzing for reaching 83% of the challenging target locations (up to 14x median speed-up).
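The power-schedule idea can be illustrated with a toy (the exponential decay is our assumption; the paper derives energies from its online semantic analysis): inputs whose explored paths lie closer to a target location receive more fuzzing energy.

```python
def power_schedule(inputs, distance, base_energy=10):
    """Toy targeted power schedule: inputs whose explored paths lie closer to
    the target locations are fuzzed more often. `distance` stands in for the
    online semantic analysis (0 = the path reaches a target)."""
    return {inp: base_energy * 2 ** -distance(inp) for inp in inputs}

dist = {"a": 0, "b": 2, "c": 5}.get   # hypothetical analysis results
print(power_schedule(["a", "b", "c"], dist))  # {'a': 10, 'b': 2.5, 'c': 0.3125}
```

Re-running the analysis online, as each new input joins the test suite, is what lets the schedule track the fuzzer's actual progress toward the targets.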

DOI: 10.1145/3377811.3380388


Planning for untangling: predicting the difficulty of merge conflicts

作者: Brindescu, Caius and Ahmed, Iftekhar and Leano, Rafael and Sarma, Anita
关键词: empirical analysis, merge conflict difficulty prediction, merge conflict resolution

Abstract

Merge conflicts are inevitable in collaborative software development and are disruptive. When they occur, developers have to stop their current work, understand the conflict and the surrounding code, and plan an appropriate resolution. However, not all conflicts are equally problematic—some can be easily fixed, while others might be complicated enough to need multiple people. Currently, there is not much support to help developers plan their conflict resolution. In this work, we aim to predict the difficulty of a merge conflict so as to help developers plan their conflict resolution. The ability to predict the difficulty of a merge conflict and to identify the underlying factors for its difficulty can help tool builders improve their conflict detection tools to prioritize and warn developers of difficult conflicts. We investigate the characteristics of difficult merge conflicts and automatically classify them. We analyzed 6,380 conflicts across 128 Java projects and found that merge conflict difficulty can be accurately predicted (AUC of 0.76) through machine learning algorithms, such as bagging.
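As a rough illustration of the classification setup (a hand-rolled bagging ensemble over two hypothetical features; the paper uses a richer feature set and standard ML tooling), difficulty prediction reduces to a supervised binary classifier over conflict metrics:

```python
import random

def train_bagging(X, y, n_estimators=25, rng=random.Random(0)):
    """Toy bagging for conflict difficulty: each estimator is a one-feature
    midpoint-threshold stump fit on a bootstrap sample; prediction is the
    majority vote. The features (LOC in the conflict, number of chunks) are
    hypothetical stand-ins for the paper's feature set."""
    data, stumps = list(zip(X, y)), []
    for _ in range(n_estimators):
        sample = [rng.choice(data) for _ in data]        # bootstrap resample
        f = rng.randrange(len(X[0]))                     # random feature
        easy = [x[f] for x, lab in sample if lab == 0]
        hard = [x[f] for x, lab in sample if lab == 1]
        if not easy or not hard:                         # degenerate sample
            continue
        thr = (sum(easy) / len(easy) + sum(hard) / len(hard)) / 2
        stumps.append((f, thr))
    def predict(x):
        votes = [int(x[f] > thr) for f, thr in stumps]   # above midpoint -> hard
        return max(set(votes), key=votes.count)
    return predict

# Synthetic conflicts: (LOC, chunks); label 1 = difficult.
X = [(5, 1), (8, 2), (120, 9), (90, 7), (6, 1), (150, 12)]
y = [0, 0, 1, 1, 0, 1]
clf = train_bagging(X, y)
print(clf((100, 8)), clf((7, 1)))  # 1 0
```

Bagging's bootstrap averaging is what stabilizes the prediction despite noisy, heterogeneous conflict features.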

DOI: 10.1145/3377811.3380344


Conquering the extensional scalability problem for value-flow analysis frameworks

作者: Shi, Qingkai and Wu, Rongxin and Fan, Gang and Zhang, Charles
关键词: compositional program analysis, demand-driven analysis, static bug finding, value-flow analysis

Abstract

Modern static analyzers often need to simultaneously check a few dozen or even hundreds of value-flow properties, causing serious scalability issues when high precision is required. A major factor in this deficiency, as we observe, is that the core static analysis engine is oblivious to the mutual synergy among the properties being checked, thus inevitably losing many optimization opportunities. Our work leverages this inter-property awareness to capture redundancies and inconsistencies when many properties are considered at the same time. We have evaluated our approach by checking twenty value-flow properties in standard benchmark programs and ten real-world software systems. The results demonstrate that our approach is more than 8× faster than existing techniques.

DOI: 10.1145/3377811.3380346


Tailoring programs for static analysis via program transformation

作者: van Tonder, Rijnard and Le Goues, Claire
关键词: No keywords

Abstract

Static analysis is a proven technique for catching bugs during software development. However, analysis tooling must approximate, both theoretically and in the interest of practicality. False positives are a pervasive manifestation of such approximations—tool configuration and customization is therefore crucial for usability and directing analysis behavior. To suppress false positives, developers readily disable bug checks or insert comments that suppress spurious bug reports. Existing work shows that these mechanisms fall short of developer needs and present a significant pain point for using or adopting analyses. We draw on the insight that an analysis user always has one notable ability to influence analysis behavior regardless of analyzer options and implementation: modifying their program. We present a new technique for automated, generic, and temporary code changes that tailor programs to suppress spurious analysis errors. We adopt a rule-based approach where simple, declarative templates describe general syntactic changes for code patterns that are known to be problematic for the analyzer. Our technique promotes program transformation as a general primitive for improving the fidelity of analysis reports (we treat any given analyzer as a black box). We evaluate using five different static analyzers supporting three different languages (C, Java, and PHP) on large, real world programs (up to 800KLOC). We show that our approach is effective in sidestepping long-standing and complex issues in analysis implementations.
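The rule-based idea can be sketched as a declarative template rewrite (the rule below is a hypothetical example, not one of the paper's templates): a pattern the analyzer is known to misjudge is temporarily rewritten into an equivalent form it handles, before the analyzer runs.

```python
import re

# Hypothetical declarative rule: rewrite an assert-based null check into an
# explicit if/abort that the (black-box) analyzer recognizes.
RULE = (r"assert\((\w+) != NULL\);",
        r"if (\1 == NULL) { abort(); }")

def tailor(source, rules):
    """Apply each template rule as a temporary, purely syntactic rewrite of
    the program text before handing it to the analyzer."""
    for pattern, replacement in rules:
        source = re.sub(pattern, replacement, source)
    return source

print(tailor("assert(p != NULL);\nuse(p);", [RULE]))
```

Because the rewrite is temporary and behavior-preserving, the analyzed program can differ from the shipped one without affecting the developer's code.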

DOI: 10.1145/3377811.3380343


Pipelining bottom-up data flow analysis

作者: Shi, Qingkai and Zhang, Charles
关键词: IFDS/IDE, bottom-up analysis, compositional program analysis, data flow analysis, modular program analysis

Abstract

Bottom-up program analysis has been traditionally easy to parallelize because functions without caller-callee relations can be analyzed independently. However, such function-level parallelism is significantly limited by the calling dependence - functions with caller-callee relations have to be analyzed sequentially because the analysis of a function depends on the analysis results, a.k.a., function summaries, of its callees. We observe that the calling dependence can be relaxed in many cases and, as a result, the parallelism can be improved. In this paper, we present Coyote, a framework of bottom-up data flow analysis, in which the analysis task of each function is elaborately partitioned into multiple sub-tasks to generate pipelineable function summaries. These sub-tasks are pipelined and run in parallel, even though the calling dependence exists. We formalize our idea under the IFDS/IDE framework and have implemented an application to checking null-dereference bugs and taint issues in C/C++ programs. We evaluate Coyote on a series of standard benchmark programs and open-source software systems, which demonstrates significant speedup over a conventional parallel design.
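For contrast, the conventional scheme that Coyote relaxes can be sketched as a post-order traversal of an acyclic call graph, where a function's summary waits for all callee summaries (toy "taint source" summaries; Coyote instead splits each summary into pipelineable sub-tasks that can start before callees finish):

```python
def bottom_up_summaries(call_graph, summarize):
    """Conventional bottom-up analysis: a function's summary is computed only
    after all of its callees' summaries are available (post-order over an
    acyclic call graph; recursion is not handled in this sketch)."""
    summaries, order = {}, []
    def visit(f):
        if f in summaries:
            return
        summaries[f] = None                      # mark as in progress
        for callee in call_graph.get(f, []):
            visit(callee)
        summaries[f] = summarize(f, [summaries[c] for c in call_graph.get(f, [])])
        order.append(f)
    for f in call_graph:
        visit(f)
    return summaries, order

# Toy summary domain: the set of taint "sources" a function may return.
cg = {"main": ["parse", "log"], "parse": ["read"], "log": [], "read": []}
own = {"read": {"stdin"}, "log": set(), "parse": set(), "main": set()}
summ, order = bottom_up_summaries(cg, lambda f, cs: set().union(own[f], *cs))
print(order)          # callees strictly precede their callers
print(summ["main"])   # {'stdin'}
```

The strict callee-before-caller ordering visible in `order` is precisely the sequential bottleneck that partial, pipelined summaries break up.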

DOI: 10.1145/3377811.3380425


A novel approach to tracing safety requirements and state-based design models

作者: Alenazi, Mounifah and Niu, Nan and Savolainen, Juha
关键词: mutation analysis, process mining, requirements engineering, systems modeling language (SysML), traceability

Abstract

Traceability plays an essential role in assuring that software and systems are safe to use. Automated requirements traceability faces a low-precision challenge due to the large number of false positives returned and mingled with the true links. To overcome this challenge, we present a mutation-driven method built on the novel idea of proactively creating many seemingly correct tracing targets (i.e., mutants of a state machine diagram), and then exploiting model checking within process mining to automatically verify whether the safety requirement’s properties hold in the mutants. A mutant is killed if its model checking fails; otherwise, it survives. We leverage the underlying killed-survived distinction and develop a correlation analysis procedure to identify the traceability links. Experimental evaluation results on two automotive systems with 27 safety requirements show considerable precision improvements compared with the state of the art.

DOI: 10.1145/3377811.3380332


Establishing multilevel test-to-code traceability links

作者: White, Robert and Krinke, Jens and Tan, Raymond
关键词: No keywords

Abstract

Test-to-code traceability links model the relationships between test artefacts and code artefacts. When utilised during the development process, these links help developers to keep test code in sync with tested code, reducing the rate of test failures and missed faults. Test-to-code traceability links can also help developers to maintain an accurate mental model of the system, reducing the risk of architectural degradation when making changes. However, establishing and maintaining these links manually places an extra burden on developers and is error-prone. This paper presents TCtracer, an approach and implementation for the automatic establishment of test-to-code traceability links. Unlike existing work, TCtracer operates at both the method level and the class level, allowing us to establish links between tests and functions, as well as between test classes and tested classes. We improve over existing techniques by combining an ensemble of new and existing techniques and exploiting a synergistic flow of information between the method and class levels. An evaluation of TCtracer using four large, well-studied open source systems demonstrates that, on average, we can establish test-to-function links with a mean average precision (MAP) of 78% and test-class-to-class links with a MAP of 93%.
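One of the simpler signals such an ensemble can include is naming-convention matching, sketched here (a toy scorer of our own, not TCtracer's actual combination of techniques):

```python
from difflib import SequenceMatcher

def link_score(test_name, function_name):
    """Toy naming-based link scorer: strip the test/test_ prefix and fall
    back to plain string similarity. This is one signal of the several that
    an ensemble approach would combine."""
    stripped = test_name.lower().removeprefix("test").lstrip("_")
    if stripped == function_name.lower():
        return 1.0
    return SequenceMatcher(None, stripped, function_name.lower()).ratio()

print(link_score("test_parse_header", "parse_header"))  # 1.0
print(link_score("testSort", "sort"))                   # 1.0
```

On its own, such a heuristic misses tests with unconventional names, which is why combining it with other signals (e.g., co-evolution or call information) across the method and class levels pays off.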

DOI: 10.1145/3377811.3380921


Improving the effectiveness of traceability link recovery using hierarchical Bayesian networks

作者: Moran, Kevin and Palacio, David N. and Bernal-Cárdenas, Carlos and McCrystal, Daniel and Poshyvanyk, Denys and Shenefiel, Chris and Johnson, Jeff
关键词: information retrieval, probabilistic modeling, software traceability

Abstract

Traceability is a fundamental component of the modern software development process that helps to ensure properly functioning, secure programs. Due to the high cost of manually establishing trace links, researchers have developed automated approaches that draw relationships between pairs of textual software artifacts using similarity measures. However, the effectiveness of such techniques is often limited, as they only utilize a single measure of artifact similarity and cannot simultaneously model (implicit and explicit) relationships across groups of diverse development artifacts. In this paper, we illustrate how these limitations can be overcome through the use of a tailored probabilistic model. To this end, we design and implement a HierarchiCal PrObabilistic Model for SoftwarE Traceability (Comet) that is able to infer candidate trace links. Comet is capable of modeling relationships between artifacts by combining the complementary observational prowess of multiple measures of textual similarity. Additionally, our model can holistically incorporate information from a diverse set of sources, including developer feedback and transitive (often implicit) relationships among groups of software artifacts, to improve inference accuracy. We conduct a comprehensive empirical evaluation of Comet that illustrates an improvement over a set of optimally configured baselines of ≈14% in the best case and ≈5% across all subjects in terms of average precision. The comparative effectiveness of Comet in practice, where optimal configuration is typically not possible, is likely to be higher. Finally, we illustrate Comet’s potential for practical applicability in a survey with developers from Cisco Systems who used a prototype Comet Jenkins plugin.

DOI: 10.1145/3377811.3380418


How Android developers handle evolution-induced API compatibility issues: a large-scale study

作者: Xia, Hao and Zhang, Yuan and Zhou, Yingtian and Chen, Xiaoting and Wang, Yang and Zhang, Xiangyu and Cui, Shuaishuai and Hong, Geng and Zhang, Xiaohan and Yang, Min and Yang, Zhemin
关键词: API evolution, Android app analysis, compatibility issues

Abstract

As the Android platform evolves at a fast pace, API-related compatibility issues have become a significant challenge for developers. To handle an incompatible API invocation, developers mainly have two choices: merely performing sufficient checks to avoid invoking incompatible APIs on platforms that do not support them, or gracefully providing replacement implementations on those incompatible platforms. As it provides more consistent app behavior, the latter is more recommended and more challenging to adopt. However, it is still unknown how these issues are handled in the real world, whether developers encounter difficulties, and what can be done to help them. In light of this, this paper performs the first large-scale study on the current practice of handling evolution-induced API compatibility issues in about 300,000 Android market apps and, more importantly, on their solutions (if any). It is in general very challenging to determine whether developers have put in a countermeasure for a compatibility issue, as different APIs have diverse behaviors, calling for various repairs. To facilitate a large-scale study, this paper proposes RAPID, an automated tool to determine whether a compatibility issue has been addressed, by incorporating both static analysis and machine learning techniques. Results show that our trained classifier is quite effective, achieving F1-scores of 95.21% and 91.96% in the training and validation stages, respectively. With the help of RAPID, our study yields many interesting findings, e.g.: developers are often unwilling to provide alternative implementations when handling incompatible API invocations (only 38.4% do); for those incompatible APIs for which Google gives replacement recommendations, the ratio of providing alternative implementations is significantly higher than for those without recommendations; and developers find more ways to repair compatibility issues than Google's recommendations, so the knowledge acquired from these experienced developers could be extremely useful to novice developers and may significantly improve the current status of compatibility issue handling.

DOI: 10.1145/3377811.3380357


An empirical study on API parameter rules

作者: Zhong, Hao and Meng, Na and Li, Zexuan and Jia, Li
关键词: No keywords

Abstract

Developers build programs based on software libraries to reduce coding effort. If a program inappropriately sets an API parameter, the program may exhibit unexpected runtime behaviors. To help developers correctly use library APIs, researchers have built tools to mine API parameter rules. However, it is still unknown (1) what types of parameter rules there are, and (2) how these rules are distributed across documents and source files. In this paper, we conducted an empirical study to investigate these questions. To analyze as many parameter rules as possible, we took a hybrid approach that combines automatic localization of constrained parameters with manual inspection. Our automatic approach—PaRu—locates parameters that have constraints either documented in Javadoc (i.e., document rules) or implied by source code (i.e., code rules). Our manual inspection (1) identifies and categorizes rules for the located parameters, and (2) establishes a mapping between document and code rules. By applying PaRu to 9 widely used libraries, we located 5,334 parameters with either document or code rules. Interestingly, only 187 parameters have both types of rules, and 79 pairs of these parameter rules are unmatched. Additionally, PaRu extracted 1,688 rule sentences from Javadoc and code. We manually classified these sentences into six categories, two of which are overlooked by prior approaches. We found that 86.2% of parameters have only code rules; 10.3% have only document rules; and only 3.5% have both document and code rules. Our research reveals the challenges of automating parameter rule extraction. Based on our findings, we discuss the potential of prior approaches and present our insights for future tool design.

DOI: 10.1145/3377811.3380922


When APIs are intentionally bypassed: an exploratory study of API workarounds

作者: Lamothe, Maxime and Shang, Weiyi
关键词: No keywords

Abstract

Application programming interfaces (APIs) have become ubiquitous in software development. However, external APIs are not guaranteed to contain every desirable feature, nor are they immune to software defects. Therefore, API users will sometimes be faced with situations where a current API does not satisfy all of their requirements, but migrating to another API is costly. In these cases, due to the lack of communication channels between API developers and users, API users may intentionally bypass an existing API after inquiring into workarounds for their API problems with online communities. This mechanism takes the API developer out of the conversation, potentially leaving API defects unreported and desirable API features undiscovered. In this paper we explore API workaround inquiries from API users on Stack Overflow. We uncover general reasons why API users inquire about API workarounds, and general solutions to API workaround requests. Furthermore, using workaround implementations in Stack Overflow answers, we develop three API workaround implementation patterns. We identify instances of these patterns in real-life open source projects and determine their value for API developers from their responses to feature requests based on the identified API workarounds.

DOI: 10.1145/3377811.3380433


Demystify official API usage directives with crowdsourced API misuse scenarios, erroneous code examples and patches

作者: Ren, Xiaoxue and Sun, Jiamou and Xing, Zhenchang and Xia, Xin and Sun, Jianling
关键词: API misuse, API usage directive, open information extraction, stack overflow

Abstract

API usage directives in official API documentation describe the contracts, constraints, and guidelines for using APIs in natural language. Through an investigation of API misuse scenarios on Stack Overflow, we identify three barriers that hinder the understanding of API usage directives, i.e., lack of specific usage context, indirect relationships to cooperative APIs, and confusing APIs with subtle differences. To overcome these barriers, we develop a text mining approach to discover crowdsourced API misuse scenarios on Stack Overflow and to extract from these scenarios erroneous code examples and patches, as well as related and confusing APIs, to construct demystification reports that help developers understand the official API usage directives described in natural language. We apply our approach to API usage directives in the official Android API documentation and android-tagged discussion threads on Stack Overflow. We extract 159,116 API misuse scenarios for 23,969 API usage directives of 3,138 classes and 7,471 methods, from which we generate the demystification reports. Our manual examination confirms that the extracted information in the generated demystification reports is highly accurate. In a user study with 14 developers on 8 API-misuse-related error scenarios, we show that our demystification reports help developers understand and debug API-misuse-related program errors faster and more accurately, compared with reading only plain API usage-directive sentences.

DOI: 10.1145/3377811.3380430


Simulee: detecting CUDA synchronization bugs via memory-access modeling

作者: Wu, Mingyuan and Ouyang, Yicheng and Zhou, Husheng and Zhang, Lingming and Liu, Cong and Zhang, Yuqun
关键词: No keywords

Abstract

While CUDA has become a mainstream parallel computing platform and programming model for general-purpose GPU computing, how to effectively and efficiently detect CUDA synchronization bugs remains a challenging open problem. In this paper, we propose the first lightweight CUDA synchronization bug detection framework, namely Simulee, to model CUDA program execution by interpreting the corresponding LLVM bytecode and collecting the memory-access information for automatically detecting general CUDA synchronization bugs. To evaluate the effectiveness and efficiency of Simulee, we construct a benchmark with 7 popular CUDA-related projects from GitHub, upon which we conduct an extensive set of experiments. The experimental results suggest that Simulee can detect 21 out of the 24 manually identified bugs in our preliminary study and also 24 previously unknown bugs among all projects, 10 of which have already been confirmed by the developers. Furthermore, Simulee significantly outperforms state-of-the-art approaches for CUDA synchronization bug detection.

DOI: 10.1145/3377811.3380358


White-box fairness testing through adversarial sampling

作者: Zhang, Peixin and Wang, Jingyi and Sun, Jun and Dong, Guoliang and Wang, Xinyu and Wang, Xingen and Dong, Jin Song and Dai, Ting
关键词: No keywords

Abstract

Although deep neural networks (DNNs) have demonstrated astonishing performance in many applications, there are still concerns about their dependability. One desirable property of DNNs for applications with societal impact is fairness (i.e., non-discrimination). In this work, we propose a scalable approach for searching for individual discriminatory instances of a DNN. Compared with state-of-the-art methods, our approach employs only lightweight procedures such as gradient computation and clustering, which makes it significantly more scalable than existing methods. Experimental results show that our approach explores the search space more effectively (9 times) and generates far more individual discriminatory instances (25 times) in much less time (1/2 to 1/7 of the time).
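At its core, an individual discriminatory instance is an input whose model decision changes when only the protected attribute is varied. A minimal, hypothetical check of that property (the paper's contribution is the gradient-guided search for such inputs, not this check itself):

```python
def is_discriminatory(model, x, protected_idx, protected_values):
    # An input x is individually discriminatory if varying only the protected
    # attribute (e.g., gender) changes the model's decision.
    decisions = set()
    for v in protected_values:
        variant = list(x)
        variant[protected_idx] = v
        decisions.add(model(variant))
    return len(decisions) > 1
```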

DOI: 10.1145/3377811.3380331


Structure-invariant testing for machine translation

作者: He, Pinjia and Meister, Clara and Su, Zhendong
关键词: machine translation, metamorphic testing, structural invariance

Abstract

In recent years, machine translation software has increasingly been integrated into our daily lives. People routinely use machine translation for various applications, such as describing symptoms to a foreign doctor and reading political news in a foreign language. However, the complexity and intractability of neural machine translation (NMT) models that power modern machine translation make the robustness of these systems difficult to even assess, much less guarantee. Machine translation systems can return inferior results that lead to misunderstanding, medical misdiagnoses, threats to personal safety, or political conflicts. Despite its apparent importance, validating the robustness of machine translation systems is very difficult and has, therefore, been much under-explored. To tackle this challenge, we introduce structure-invariant testing (SIT), a novel metamorphic testing approach for validating machine translation software. Our key insight is that the translation results of “similar” source sentences should typically exhibit similar sentence structures. Specifically, SIT (1) generates similar source sentences by substituting one word in a given sentence with semantically similar, syntactically equivalent words; (2) represents sentence structure by syntax parse trees (obtained via constituency or dependency parsing); (3) reports sentence pairs whose structures differ quantitatively by more than some threshold. To evaluate SIT, we use it to test Google Translate and Bing Microsoft Translator with 200 source sentences as input, which led to 64 and 70 buggy issues with 69.5% and 70% top-1 accuracy, respectively. The translation errors are diverse, including under-translation, over-translation, incorrect modification, word/phrase mistranslation, and unclear logic.
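The metamorphic relation can be sketched in a few lines. Here parse trees are stood in for by flat label sequences and the distance is a simple positional mismatch count; SIT itself compares constituency or dependency parse trees:

```python
def structure_distance(struct_a, struct_b):
    # toy stand-in for a tree distance: positional label mismatches
    # plus the difference in length
    mismatches = sum(1 for x, y in zip(struct_a, struct_b) if x != y)
    return mismatches + abs(len(struct_a) - len(struct_b))

def sit_report(original, variants, threshold=1):
    # report indices of variant translations whose structure diverges
    # from the original translation's structure by more than the threshold
    return [i for i, s in enumerate(variants)
            if structure_distance(original, s) > threshold]
```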

DOI: 10.1145/3377811.3380339


Automatic testing and improvement of machine translation

作者: Sun, Zeyu and Zhang, Jie M. and Harman, Mark and Papadakis, Mike and Zhang, Lu
关键词: machine translation, testing and repair, translation consistency

Abstract

This paper presents TransRepair, a fully automatic approach for testing and repairing the consistency of machine translation systems. TransRepair combines mutation with metamorphic testing to detect inconsistency bugs (without access to human oracles). It then adopts probability-reference or cross-reference to post-process the translations, in a grey-box or black-box manner, to repair the inconsistencies. Our evaluation on two state-of-the-art translators, Google Translate and Transformer, indicates that TransRepair has a high precision (99%) on generating input pairs with consistent translations. With these tests, using automatic consistency metrics and manual assessment, we find that Google Translate and Transformer have approximately 36% and 40% inconsistency bugs. Black-box repair fixes 28% and 19% bugs on average for Google Translate and Transformer. Grey-box repair fixes 30% bugs on average for Transformer. Manual inspection indicates that the translations repaired by our approach improve consistency in 87% of cases (degrading it in 2%), and that our repairs have better translation acceptability in 27% of the cases (worse in 8%).
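The black-box ("cross-reference") repair step can be approximated as a consistency vote: among the translations of the mutated variants, keep the one most similar to the rest. A toy sketch using token-set Jaccard similarity (TransRepair's actual consistency metrics are more refined):

```python
def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cross_reference_pick(candidates):
    # choose the candidate translation with the highest average
    # similarity to all other candidates
    scores = []
    for i, c in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        scores.append(sum(jaccard(c, o) for o in others) / len(others))
    return candidates[scores.index(max(scores))]
```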

DOI: 10.1145/3377811.3380420


TRADER: trace divergence analysis and embedding regulation for debugging recurrent neural networks

作者: Tao, Guanhong and Ma, Shiqing and Liu, Yingqi and Xu, Qiuling and Zhang, Xiangyu
关键词: No keywords

Abstract

Recurrent Neural Networks (RNN) can deal with (textual) input with various length and hence have a lot of applications in software systems and software engineering applications. RNNs depend on word embeddings that are usually pre-trained by third parties to encode textual inputs to numerical values. It is well known that problematic word embeddings can lead to low model accuracy. In this paper, we propose a new technique to automatically diagnose how problematic embeddings impact model performance, by comparing model execution traces from correctly and incorrectly executed samples. We then leverage the diagnosis results as guidance to harden/repair the embeddings. Our experiments show that TRADER can consistently and effectively improve accuracy for real world models and datasets by 5.37% on average, which represents substantial improvement in the literature of RNN models.

DOI: 10.1145/3377811.3380423


Typestate-guided fuzzer for discovering use-after-free vulnerabilities

作者: Wang, Haijun and Xie, Xiaofei and Li, Yi and Wen, Cheng and Li, Yuekang and Liu, Yang and Qin, Shengchao and Chen, Hongxu and Sui, Yulei
关键词: fuzzing, typestate-guided fuzzing, use-after-free vulnerabilities

Abstract

Existing coverage-based fuzzers usually use the individual control flow graph (CFG) edge coverage to guide the fuzzing process, which has shown great potential in finding vulnerabilities. However, CFG edge coverage is not effective in discovering vulnerabilities such as use-after-free (UaF). This is because, to trigger UaF vulnerabilities, one needs not only to cover individual edges, but also to traverse some (long) sequence of edges in a particular order, which is challenging for existing fuzzers. To this end, we propose to model UaF vulnerabilities as typestate properties, and develop a typestate-guided fuzzer, named UAFL, for discovering vulnerabilities violating typestate properties. Given a typestate property, we first perform a static typestate analysis to find operation sequences potentially violating the property. Our fuzzing process is then guided by the operation sequences in order to progressively generate test cases triggering property violations. In addition, we also employ an information flow analysis to improve the efficiency of the fuzzing process. We have performed a thorough evaluation of UAFL on 14 widely-used real-world programs. The experiment results show that UAFL substantially outperforms the state-of-the-art fuzzers, including AFL, AFLFast, FairFuzz, MOpt, Angora and QSYM, in terms of the time taken to discover vulnerabilities. We have discovered 10 previously unknown vulnerabilities, and received 5 new CVEs.
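The typestate property for UaF is a small automaton: a pointer moves from live (after alloc) to freed (after free); any use or second free in the freed state violates the property. A minimal monitor sketch (illustrative, not UAFL's implementation):

```python
class UafMonitor:
    """Tracks each pointer's typestate and records property violations."""
    def __init__(self):
        self.state = {}          # ptr -> "live" | "freed"
        self.violations = []

    def event(self, op, ptr):
        current = self.state.get(ptr)
        if op == "alloc":
            self.state[ptr] = "live"
        elif op == "free":
            if current == "freed":
                self.violations.append(("double-free", ptr))
            self.state[ptr] = "freed"
        elif op == "use":
            if current == "freed":
                self.violations.append(("use-after-free", ptr))
```

A fuzzer guided by such a property rewards inputs that drive a pointer further along the alloc → free → use operation sequence.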

DOI: 10.1145/3377811.3380386


JVM fuzzing for JIT-induced side-channel detection

作者: Brennan, Tegan and Saha, Seemanta and Bultan, Tevfik
关键词: No keywords

Abstract

Timing side channels arise in software when a program’s execution time can be correlated with security-sensitive program input. Recent results on software side-channel detection focus on analysis of a program’s source code. However, runtime behavior, in particular optimizations introduced during just-in-time (JIT) compilation, can impact or even introduce timing side channels in programs. In this paper, we present a technique for automatically detecting such JIT-induced timing side channels in Java programs. We first introduce patterns to detect partitions of secret input potentially separable by side channels. Then we present an automated approach for exploring behaviors of the Java Virtual Machine (JVM) to identify states where timing channels separating these partitions arise. We evaluate our technique on three datasets used in recent work on side-channel detection. We find that many code variants labeled “safe” with respect to side-channel vulnerabilities are in fact vulnerable to JIT-induced timing side channels. Our results directly contradict the conclusions of four separate state-of-the-art program analysis tools for side-channel detection and demonstrate that JIT-induced side channels are prevalent and can be detected automatically.
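The detection criterion can be sketched as a separability test over measured running times grouped by secret partition (a simplified stand-in for the paper's analysis; the threshold rule here is illustrative):

```python
from statistics import mean, pstdev

def timing_channel(samples_by_secret, factor=3.0):
    # flag a pair of secret partitions whose mean running times differ by
    # more than `factor` times the measurement noise
    keys = list(samples_by_secret)
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            a = samples_by_secret[keys[i]]
            b = samples_by_secret[keys[j]]
            noise = max(pstdev(a), pstdev(b), 1e-12)
            if abs(mean(a) - mean(b)) > factor * noise:
                return (keys[i], keys[j])    # separable: potential channel
    return None
```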

DOI: 10.1145/3377811.3380432


Ankou: guiding grey-box fuzzing towards combinatorial difference

作者: Manès, Valentin J. M. and Kim, Soomin and Cha, Sang Kil
关键词: fuzz testing, grey-box fuzzing, guided fuzzing, principal component analysis, software testing

Abstract

Grey-box fuzzing is an evolutionary process, which maintains and evolves a population of test cases with the help of a fitness function. Fitness functions used by current grey-box fuzzers are not informative in that they cannot distinguish different program executions as long as those executions achieve the same coverage. The problem is that current fitness functions only consider a union of data, but not their combination. As such, fuzzers often get stuck in a local optimum during their search. In this paper, we introduce Ankou, the first grey-box fuzzer that recognizes different combinations of execution information, and present several scalability challenges encountered while designing and implementing Ankou. Our experimental results show that Ankou is 1.94× and 8.0× more effective in finding bugs than AFL and Angora, respectively.
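The key idea, a fitness function that distinguishes executions even when their branch-coverage sets coincide, can be sketched as novelty over execution vectors (the distance threshold is hypothetical; Ankou additionally uses PCA to keep such comparisons scalable):

```python
import math

def novelty(exec_vec, population):
    # distance to the nearest previously kept execution vector; two runs with
    # identical coverage *sets* but different hit counts still differ here
    if not population:
        return math.inf
    return min(math.dist(exec_vec, p) for p in population)

def maybe_keep(exec_vec, population, threshold=1.0):
    if novelty(exec_vec, population) > threshold:
        population.append(exec_vec)
        return True
    return False
```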

DOI: 10.1145/3377811.3380421


BCFA: bespoke control flow analysis for CFA at scale

作者: Ramu, Ramanathan and Upadhyaya, Ganesha B and Nguyen, Hoan Anh and Rajan, Hridesh
关键词: No keywords

Abstract

Many data-driven software engineering tasks such as discovering programming patterns, mining API specifications, etc., perform source code analysis over control flow graphs (CFGs) at scale. Analyzing millions of CFGs can be expensive, and the performance of the analysis heavily depends on the underlying CFG traversal strategy. State-of-the-art analysis frameworks use a fixed traversal strategy. We argue that a single traversal strategy does not fit all kinds of analyses and CFGs and propose bespoke control flow analysis (BCFA). Given a control flow analysis (CFA) and a large number of CFGs, BCFA selects the most efficient traversal strategy for each CFG. BCFA extracts a set of properties of the CFA by analyzing the code of the CFA and combines it with properties of the CFG, such as branching factor and cyclicity, for selecting the optimal traversal strategy. We have implemented BCFA in Boa, and evaluated BCFA using a set of representative static analyses that mainly involve traversing CFGs and two large datasets containing 287 thousand and 162 million CFGs. Our results show that BCFA can speed up large-scale analyses by 1%-28%. Further, BCFA has low overhead (less than 0.2%) and a low misprediction rate (less than 0.01%).
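The selection step can be caricatured as a decision table over analysis and graph properties (entirely illustrative; BCFA derives its choice from a deeper analysis of the CFA's code):

```python
def pick_traversal(cfg_is_cyclic, needs_fixpoint, branching_factor):
    if not needs_fixpoint:
        return "any-order"       # a single pass in any order suffices
    if not cfg_is_cyclic:
        return "topological"     # acyclic: one ordered pass reaches the fixpoint
    if branching_factor > 2:
        return "worklist"        # revisit only nodes whose inputs changed
    return "round-robin"         # iterate all nodes until nothing changes
```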

DOI: 10.1145/3377811.3380435


On the recall of static call graph construction in practice

作者: Sui, Li and Dietrich, Jens and Tahir, Amjed and Fourtounis, George
关键词: Java, call graph construction, soundness, static program analysis, test case generation, testing

Abstract

Static analyses have problems modelling dynamic language features soundly while retaining acceptable precision. The problem is well-understood in theory, but there is little evidence on how this impacts the analysis of real-world programs. We have studied this issue for call graph construction on a set of 31 real-world Java programs, using an oracle of actual program behaviour recorded from executions of built-in and synthesised test cases with high coverage. We measured the recall achieved by various static analysis algorithms and configurations, and investigated which language features lead to static analysis false negatives. We report that (1) the median recall is 0.884, suggesting that standard static analyses have significant gaps with respect to the proportion of the program modelled; (2) built-in tests are significantly better than synthesised tests at exposing dynamic program behaviour; (3) adding precision to the static analysis has little impact on recall, indicating that those are separate concerns; (4) state-of-the-art support for dynamic language features can significantly improve recall (the median observed is 0.935), but it comes with a hefty performance penalty; and (5) the main sources of unsoundness are not reflective method invocations, but objects allocated or accessed via native methods, and invocations initiated by the JVM without matching call sites in the program under analysis. These results provide novel insights into the interaction between static and dynamic program analyses that can be used to assess the utility of static analysis results and to guide the development of future static and hybrid analyses.
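Recall here is measured against the dynamically observed call edges, e.g.:

```python
def call_graph_recall(static_edges, dynamic_edges):
    # fraction of dynamically observed (caller, callee) edges that the
    # static call graph also contains
    dynamic_edges = set(dynamic_edges)
    return len(dynamic_edges & set(static_edges)) / len(dynamic_edges)
```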

DOI: 10.1145/3377811.3380441


Heaps’n leaks: how heap snapshots improve Android taint analysis

作者: Benz, Manuel and Kristensen, Erik Krogh and Luo, Linghui and Borges, Nataniel P. and Bodden, Eric and Zeller, Andreas
关键词: heap snapshot, points-to analysis, soot, taint analysis

Abstract

The assessment of information flows is an essential part of analyzing Android apps, and is frequently supported by static taint analysis. Its precision, however, can suffer from the analysis not being able to precisely determine what elements a pointer can (and cannot) point to. Recent advances in static analysis suggest that incorporating dynamic heap snapshots, taken at one point at runtime, can significantly improve general static analysis. In this paper, we investigate to what extent this also holds for taint analysis, and how various design decisions, such as when and how many snapshots are collected during execution, and how exactly they are used, impact soundness and precision. We have extended FlowDroid to incorporate heap snapshots, yielding our prototype Heapster, and evaluated it on DroidMacroBench, a novel benchmark comprising real-world Android apps that we also make available as an artifact. The results show (1) the use of heap snapshots lowers analysis time and memory consumption while increasing precision; (2) a very good trade-off between precision and recall is achieved by a mixed mode in which the analysis falls back to static points-to relations for objects for which no dynamic data was recorded; and (3) while a single heap snapshot (ideally taken at the end of the execution) suffices to improve performance and precision, a better trade-off can be obtained by using multiple snapshots.
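The "mixed mode" that achieved the best trade-off can be sketched as a simple fallback rule (a sketch of the idea, not FlowDroid/Heapster code):

```python
def points_to(var, snapshot, static_relation):
    # prefer the concrete aliases recorded in the heap snapshot; fall back to
    # the over-approximate static points-to set for objects never recorded
    if var in snapshot:
        return snapshot[var]
    return static_relation.get(var, set())
```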

DOI: 10.1145/3377811.3380438


Big code != big vocabulary: open-vocabulary models for source code

作者: Karampatsis, Rafael-Michael and Babii, Hlib and Robbes, Romain and Sutton, Charles and Janes, Andrea
关键词: byte-pair encoding, naturalness of code, neural language models

Abstract

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.
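Open-vocabulary NLMs of this kind rest on byte-pair encoding (BPE): repeatedly merge the most frequent adjacent symbol pair, so rare identifiers decompose into learned subwords instead of falling out of vocabulary. A minimal sketch of the merge-learning loop:

```python
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:                      # each word is a list of symbols
        pairs.update(zip(w, w[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

def learn_bpe(words, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words
```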

DOI: 10.1145/3377811.3380342


Improving data scientist efficiency with provenance

作者: Hu, Jingmei and Joung, Jiwon and Jacobs, Maia and Gajos, Krzysztof Z. and Seltzer, Margo I.
关键词: data analysis, dependency tracking, incremental execution, provenance

Abstract

Data scientists frequently analyze data by writing scripts. We conducted a contextual inquiry with interdisciplinary researchers, which revealed that parameter tuning is a highly iterative process and that debugging is time-consuming. As analysis scripts evolve and become more complex, analysts have difficulty conceptualizing their workflow. In particular, after editing a script, it becomes difficult to determine precisely which code blocks depend on the edit. Consequently, scientists frequently re-run entire scripts instead of re-running only the necessary parts. We present ProvBuild, a tool that leverages language-level provenance to streamline the debugging process by reducing programmer cognitive load and decreasing subsequent runtimes, leading to an overall reduction in elapsed debugging time. ProvBuild uses provenance to track dependencies in a script. When an analyst debugs a script, ProvBuild generates a simplified script that contains only the information necessary to debug a particular problem. We demonstrate that debugging the simplified script lowers a programmer’s cognitive load and permits faster re-execution when testing changes. The combination of reduced cognitive load and shorter runtime reduces the time necessary to debug a script. We quantitatively and qualitatively show that even though ProvBuild introduces overhead during a script’s first execution, it is a more efficient way for users to debug and tune complex workflows. ProvBuild demonstrates a novel use of language-level provenance, in which it is used to proactively improve programmer productivity rather than merely providing a way to retroactively gain insight into a body of code.
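The provenance-based slicing at ProvBuild's heart can be sketched as reachability over a recorded dependency graph (the data model here is hypothetical):

```python
def affected_blocks(reads, edited):
    # reads: block -> set of blocks whose outputs it consumes
    # returns the edited block plus everything transitively depending on it,
    # i.e. the only parts of the script that must be re-run
    dirty = {edited}
    changed = True
    while changed:
        changed = False
        for block, inputs in reads.items():
            if block not in dirty and inputs & dirty:
                dirty.add(block)
                changed = True
    return dirty
```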

DOI: 10.1145/3377811.3380366


Managing data constraints in database-backed web applications

作者: Yang, Junwen and Sethi, Utsav and Yan, Cong and Cheung, Alvin and Lu, Shan
关键词: No keywords

Abstract

Database-backed web applications manipulate large amounts of persistent data, and such applications often contain constraints that restrict data length, data value, and other data properties. Such constraints are critical in ensuring the reliability and usability of these applications. In this paper, we present a comprehensive study on where data constraints are expressed, what they are about, how often they evolve, and how their violations are handled. The results show that developers struggle with maintaining consistent data constraints and checking them across different components and versions of their web applications, leading to various problems. Guided by our study, we developed checking tools and API enhancements that can automatically detect such problems and improve the quality of such applications.
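One concrete problem the study highlights is constraints duplicated across application and database layers drifting out of sync. A toy checker sketch, where each constraint is a named predicate attached to a field (illustrative API, not the paper's tooling):

```python
def check_constraints(record, constraints):
    # constraints: field -> list of (description, predicate) pairs,
    # mirroring e.g. model validations and database column constraints
    errors = []
    for field, rules in constraints.items():
        value = record.get(field)
        for description, predicate in rules:
            if not predicate(value):
                errors.append((field, description))
    return errors
```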

DOI: 10.1145/3377811.3380375


Taxonomy of real faults in deep learning systems

作者: Humbatova, Nargiz and Jahangirova, Gunel and Bavota, Gabriele and Riccio, Vincenzo and Stocco, Andrea and Tonella, Paolo
关键词: deep learning, real faults, software testing, taxonomy

Abstract

The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch) and from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners describing the problems they have encountered in their experience have enriched our taxonomy with a variety of additional faults that did not emerge from the other two sources. Our final taxonomy was validated with a survey involving an additional set of 21 developers, confirming that almost all fault categories (13/15) were experienced by at least 50% of the survey participants.

DOI: 10.1145/3377811.3380395


Testing DNN image classifiers for confusion & bias errors

作者: Tian, Yuchi and Zhong, Ziyuan and Ordonez, Vicente and Kaiser, Gail and Ray, Baishakhi
关键词: DNNs, bias, deep learning, image classifiers, whitebox testing

Abstract

Image classifiers are an important component of today’s software, from consumer and business applications to safety-critical domains. The advent of Deep Neural Networks (DNNs) is the key catalyst behind such wide-spread success. However, wide adoption comes with serious concerns about the robustness of software systems dependent on DNNs for image classification, as several severe erroneous behaviors have been reported under sensitive and critical circumstances. We argue that developers need to rigorously test their software’s image classifiers and delay deployment until acceptable. We present an approach to testing image classifier robustness based on class property violations. We found that many of the reported erroneous cases in popular DNN image classifiers occur because the trained models confuse one class with another or show biases towards some classes over others. These bugs usually violate some class properties of one or more of those classes. Most DNN testing techniques focus on per-image violations, so fail to detect class-level confusions or biases. We developed a testing technique to automatically detect class-based confusion and bias errors in DNN-driven image classification software. We evaluated our implementation, DeepInspect, on several popular image classifiers with precision up to 100% (avg. 72.6%) for confusion errors, and up to 84.3% (avg. 66.8%) for bias errors. DeepInspect found hundreds of classification mistakes in widely-used models, many exposing errors indicating confusion or bias.
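A class-level confusion error shows up most plainly in the confusion matrix: two classes the model frequently mistakes for each other. A toy detector over a row-normalised confusion matrix (DeepInspect's actual metrics are computed from the model's internals, not just its predictions):

```python
def confused_pairs(conf_matrix, threshold=0.2):
    # conf_matrix[i][j]: fraction of true-class-i inputs predicted as class j
    pairs = []
    n = len(conf_matrix)
    for i in range(n):
        for j in range(i + 1, n):
            if conf_matrix[i][j] + conf_matrix[j][i] > threshold:
                pairs.append((i, j))    # mutually confused class pair
    return pairs
```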

DOI: 10.1145/3377811.3380400


Repairing deep neural networks: fix patterns and challenges

作者: Islam, Md Johirul and Pan, Rangeet and Nguyen, Giang and Rajan, Hridesh
关键词: bug fix, bug fix patterns, bugs, deep neural networks

Abstract

Significant interest in applying Deep Neural Networks (DNNs) has fueled the need to support the engineering of software that uses them. Repairing software that uses DNNs is one such unmistakable SE need where automated tools could be beneficial; however, we do not fully understand the challenges to repairing and the patterns that are utilized when manually repairing DNNs. What challenges should automated repair tools address? What are the repair patterns whose automation could help developers? Which repair patterns should be assigned a higher priority for building automated bug repair tools? This work presents a comprehensive study of bug fix patterns to address these questions. We have studied 415 repairs from Stack Overflow and 555 repairs from GitHub for five popular deep learning libraries: Caffe, Keras, TensorFlow, Theano, and Torch, to understand challenges in repairs and bug repair patterns. Our key findings reveal that DNN bug fix patterns are distinctive compared to traditional bug fix patterns; the most common bug fix patterns are fixing data dimension and neural network connectivity; DNN bug fixes have the potential to introduce adversarial vulnerabilities; DNN bug fixes frequently introduce new bugs; and DNN bug localization, reuse of trained model, and coping with frequent releases are major challenges faced by developers when fixing bugs. We also contribute a benchmark of 667 DNN (bug, repair) instances.

DOI: 10.1145/3377811.3380378


Fuzz testing based data augmentation to improve robustness of deep neural networks

作者: Gao, Xiang and Saha, Ripon K. and Prasad, Mukul R. and Roychoudhury, Abhik
关键词: DNN, data augmentation, genetic algorithm, robustness

Abstract

Deep neural networks (DNN) have been shown to be notoriously brittle to small perturbations in their input data. This problem is analogous to the over-fitting problem in test-based program synthesis and automatic program repair, which is a consequence of the incomplete specification, i.e., the limited tests or training examples, that the program synthesis or repair algorithm has to learn from. Recently, test generation techniques have been successfully employed to augment existing specifications of intended program behavior, to improve the generalizability of program synthesis and repair. Inspired by these approaches, in this paper, we propose a technique that re-purposes software testing methods, specifically mutation-based fuzzing, to augment the training data of DNNs, with the objective of enhancing their robustness. Our technique casts the DNN data augmentation problem as an optimization problem. It uses genetic search to generate the most suitable variant of an input to use for training the DNN, while simultaneously identifying opportunities to accelerate training by skipping augmentation in many instances. We instantiate this technique in two tools, Sensei and Sensei-SA, and evaluate them on 15 DNN models spanning 5 popular image data-sets. Our evaluation shows that Sensei can improve the robust accuracy of the DNN, compared to the state of the art, on each of the 15 models, by up to 11.9% and 5.5% on average. Further, Sensei-SA can reduce the average DNN training time by 25%, while still improving robust accuracy.
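The augmentation loop can be sketched with random search standing in for Sensei's genetic search: per training sample, generate perturbed variants and train on the one the current model handles worst (all names and parameters here are illustrative):

```python
import random

def select_training_variant(x, perturb, loss, pop_size=8, seed=0):
    # generate candidate perturbations of x and return the one with the
    # highest loss under the current model, i.e. the most robustness-relevant
    rng = random.Random(seed)
    variants = [perturb(x, rng) for _ in range(pop_size)]
    return max(variants, key=loss)
```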

DOI: 10.1145/3377811.3380415


An empirical study on program failures of deep learning jobs

作者: Zhang, Ru and Xiao, Wencong and Zhang, Hongyu and Liu, Yu and Lin, Haoxiang and Yang, Mao
关键词: deep learning jobs, empirical study, program failures

Abstract

Deep learning has made significant achievements in many application areas. To train and test models more efficiently, enterprise developers submit and run their deep learning programs on a shared, multi-tenant platform. However, some of the programs fail after a long execution time due to code/script defects, which reduces the development productivity and wastes expensive resources such as GPU, storage, and network I/O. This paper presents the first comprehensive empirical study on program failures of deep learning jobs. 4960 real failures are collected from a deep learning platform in Microsoft. We manually examine their failure messages and classify them into 20 categories. In addition, we identify the common root causes and bug-fix solutions on a sample of 400 failures. To better understand the current testing and debugging practices for deep learning, we also conduct developer interviews. Our major findings include: (1) 48.0% of the failures occur in the interaction with the platform rather than in the execution of code logic, mostly due to the discrepancies between local and platform execution environments; (2) Deep learning specific failures (13.5%) are mainly caused by inappropriate model parameters/structures and framework API misunderstanding; (3) Current debugging practices are not efficient for fault localization in many cases, and developers need more deep learning specific tools. Based on our findings, we further suggest possible research topics and tooling support that could facilitate future deep learning development.

DOI: 10.1145/3377811.3380362


Primers or reminders? the effects of existing review comments on code review

作者: Spadini, Davide and Çalıklı, Gül and Bacchelli, Alberto
关键词: availability heuristic, code review, priming

Abstract

In contemporary code review, the comments put by reviewers on a specific code change are immediately visible to the other reviewers involved. Could this visibility prime new reviewers’ attention (due to the human’s proneness to availability bias), thus biasing the code review outcome? In this study, we investigate this topic by conducting a controlled experiment with 85 developers who perform a code review and a psychological experiment. With the psychological experiment, we find that ≈70% of participants are prone to availability bias. However, when it comes to the code review, our experiment results show that participants are primed only when the existing code review comment is about a type of bug that is not normally considered; when this comment is visible, participants are more likely to find another occurrence of this type of bug. Moreover, this priming effect does not influence reviewers’ likelihood of detecting other types of bugs. Our findings suggest that the current code review practice is effective because existing review comments about bugs in code changes are not negative primers, rather positive reminders for bugs that would otherwise be overlooked during code review. Data and materials: https://doi.org/10.5281/zenodo.3653856

DOI: 10.1145/3377811.3380385


Mitigating turnover with code review recommendation: balancing expertise, workload, and knowledge distribution

作者: Mirsaeedi, Ehsan and Rigby, Peter C.
关键词: code review, knowledge distribution, recommenders, tool support, turnover

Abstract

Developer turnover is inevitable on software projects and leads to knowledge loss, a reduction in productivity, and an increase in defects. Mitigation strategies to deal with turnover tend to disrupt and increase workloads for developers. In this work, we suggest that through code review recommendation we can distribute knowledge and mitigate turnover with minimal impact on the development process. We evaluate review recommenders in the context of ensuring expertise during review (Expertise), reducing the review workload of the core team (CoreWorkload), and reducing the Files at Risk to turnover (FaR). We find that prior work that assigns reviewers based on file ownership concentrates knowledge on a small group of core developers, increasing the risk of knowledge loss from turnover by up to 65%. We propose learning- and retention-aware review recommenders that, when combined, are effective at reducing the risk of turnover by 29%, but they unacceptably reduce the overall expertise during reviews by 26%. We develop the Sofia recommender that suggests experts when none of the files under review are hoarded by developers, but distributes knowledge when files are at risk. In this way, we are able to simultaneously increase expertise during review (ΔExpertise of 6%), with a negligible impact on workload (ΔCoreWorkload of 0.09%), and reduce the files at risk (ΔFaR of -28%). Sofia is integrated into GitHub pull requests, allowing developers to select an appropriate expert or “learner” based on the context of the review. We release the Sofia bot as well as the code and data for replication purposes.
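Sofia's core decision can be caricatured in a few lines (hypothetical data model): if any file under review is known to too few developers ("hoarded"), suggest a learner to spread knowledge; otherwise suggest the expert.

```python
def recommend_reviewer(file_owners, expert, learner, hoard_threshold=1):
    # file_owners: file -> set of developers with prior knowledge of it
    hoarded = any(len(owners) <= hoard_threshold
                  for owners in file_owners.values())
    return learner if hoarded else expert
```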

DOI: 10.1145/3377811.3380335


How do companies collaborate in open source ecosystems? an empirical study of OpenStack

作者: Zhang, Yuxia and Zhou, Minghui and Stol, Klaas-Jan and Wu, Jianyu and Jin, Zhi
关键词: OSS ecosystem, company participation, open collaboration, open source software, openstack, software development

Abstract

Open Source Software (OSS) has come to play a critical role in the software industry. Some large ecosystems enjoy the participation of large numbers of companies, each of which has its own focus and goals. Indeed, companies that otherwise compete may become collaborators within the OSS ecosystem they participate in. Prior research has largely focused on commercial involvement in OSS projects, but there is a scarcity of research focusing on company collaborations within OSS ecosystems. Some of these ecosystems have become critical building blocks for organizations worldwide; hence, a clear understanding of how companies collaborate within large ecosystems is essential. This paper presents the results of an empirical study of the OpenStack ecosystem, in which hundreds of companies collaborate on thousands of project repositories to deliver cloud distributions. Based on a detailed analysis, we identify clusters of collaborations, and identify four strategies that companies adopt to engage with the OpenStack ecosystem. We also find that companies may engage in intentional or passive collaborations, or may work in an isolated fashion. Further, we find that a company’s position in the collaboration network is positively associated with its productivity in OpenStack. Our study sheds light on how large OSS ecosystems work, and in particular on the patterns of collaboration within one such large ecosystem.

DOI: 10.1145/3377811.3380376


How to not get rich: an empirical study of donations in open source

作者: Overney, Cassandra and Meinicke, Jens and Kästner, Christian and Vasilescu, Bogdan
关键词: No keywords

Abstract

Open source is ubiquitous and many projects act as critical infrastructure, yet funding and sustaining the whole ecosystem is challenging. While there are many different funding models for open source and concerted efforts through foundations, donation platforms like PayPal, Patreon, and OpenCollective are popular and low-bar platforms to raise funds for open-source development. With a mixed-method study, we investigate the emerging and largely unexplored phenomenon of donations in open source. Specifically, we quantify how commonly open-source projects ask for donations, statistically model characteristics of projects that ask for and receive donations, analyze for what the requested funds are needed and used, and assess whether the received donations achieve the intended outcomes. We find 25,885 projects asking for donations on GitHub, often to support engineering activities; however, we also find no clear evidence that donations influence the activity level of a project. In fact, we find that donations are used in a multitude of ways, raising new research questions about effective funding.

DOI: 10.1145/3377811.3380410


Scaling open source communities: an empirical study of the Linux kernel

作者: Tan, Xin and Zhou, Minghui and Fitzgerald, Brian
关键词: Linux kernel, maintainer, multiple committers, open source communities, scalability, sustainability, workload

Abstract

Large-scale open source communities, such as the Linux kernel, have gone through decades of development, substantially growing in scale and complexity. In the traditional workflow, maintainers serve as “gatekeepers” for the subsystems that they maintain. As the number of patches and authors significantly increases, maintainers come under considerable pressure, which may hinder the operation and even the sustainability of the community. A few subsystems have begun to use new workflows to address these issues. However, it is unclear to what extent these new workflows are successful, or how to apply them. Therefore, we conduct an empirical study on the multiple-committer model (MCM) that has provoked extensive discussion in the Linux kernel community. We explore the effect of the model on the i915 subsystem with respect to four dimensions: pressure, latency, complexity, and quality assurance. We find that after this model was adopted, the burden of the i915 maintainers was significantly reduced. Also, the model scales well to allow more committers. After analyzing the online documents and interviewing the maintainers of i915, we propose that overloaded subsystems which have trustworthy candidate committers are suitable for adopting the model. We further suggest that the success of the model is closely related to a series of measures for risk mitigation—sufficient precommit testing, strict review process, and the use of tools to simplify work and reduce errors. We employ a network analysis approach to locate candidate committers for the target subsystems and validate this approach and contextual success factors through email interviews with their maintainers. To the best of our knowledge, this is the first study focusing on how to scale open source communities. We expect that our study will help the rapidly growing Linux kernel and other similar communities to adapt to changes and remain sustainable.

DOI: 10.1145/3377811.3380920


SpecuSym: speculative symbolic execution for cache timing leak detection

作者: Guo, Shengjian and Chen, Yueqi and Li, Peng and Cheng, Yueqiang and Wang, Huibo and Wu, Meng and Zuo, Zhiqiang
关键词: cache, side-channel leak, speculative execution, symbolic execution, timing

Abstract

CPU cache is a limited but crucial storage component in modern processors, and the cache timing side-channel may inadvertently leak information through physically measurable timing variance. Speculative execution, an essential processor optimization and a source of such variances, can cause severe leakage upon deliberate branch mispredictions. Although static analysis can qualitatively verify the timing-leakage-free property under speculative execution, it is incapable of producing witnesses, including inputs and speculated flows, to diagnose leaks in depth. This work proposes a new symbolic execution based method, SpecuSym, for precisely detecting cache timing leaks introduced by speculative execution. Given a program (leakage-free in non-speculative execution), SpecuSym systematically explores the program state space, models speculative behavior at conditional branches, and accumulates the cache side effects along subsequent path explorations. During the dynamic execution, SpecuSym constructs leak predicates for memory visits according to the specified cache model and conducts a constraint-solving based cache behavior analysis to inspect the new cache behaviors. We have implemented SpecuSym atop KLEE and evaluated it against 15 open-source benchmarks. Experimental results show that SpecuSym successfully detected from 2 to 61 leaks in 6 programs under 3 different cache settings and identified false positives in 2 programs reported by recent work.

DOI: 10.1145/3377811.3380428


Symbolic verification of message passing interface programs

作者: Yu, Hengbiao and Chen, Zhenbang and Fu, Xianjin and Wang, Ji and Su, Zhendong and Sun, Jun and Huang, Chun and Dong, Wei
关键词: message passing interface, model checking, symbolic execution, symbolic verification, synergy

Abstract

Message passing is the standard paradigm of programming in high-performance computing. However, verifying Message Passing Interface (MPI) programs is challenging, due to the complex program features (such as non-determinism and non-blocking operations). In this work, we present MPI symbolic verifier (MPI-SV), the first symbolic execution based tool for automatically verifying MPI programs with non-blocking operations. MPI-SV combines symbolic execution and model checking in a synergistic way to tackle the challenges in MPI program verification. The synergy improves the scalability and enlarges the scope of verifiable properties. We have implemented MPI-SV and evaluated it with 111 real-world MPI verification tasks. The pure symbolic execution-based technique successfully verifies 61 out of the 111 tasks (55%) within one hour, while in comparison, MPI-SV verifies 100 tasks (90%). On average, compared with pure symbolic execution, MPI-SV achieves 19x speedups on verifying the satisfaction of the critical property and 5x speedups on finding violations.

DOI: 10.1145/3377811.3380419


Efficient generation of error-inducing floating-point inputs via symbolic execution

作者: Guo, Hui and Rubio-González, Cindy
关键词: catastrophic cancellation, floating-point, roundoff errors, symbolic execution, testing

Abstract

Floating point is widely used in software to emulate arithmetic over reals. Unfortunately, floating point leads to rounding errors that propagate and accumulate during execution. Generating inputs to maximize the numerical error is critical when evaluating the accuracy of floating-point code. In this paper, we formulate the problem of generating high error-inducing floating-point inputs as a code coverage maximization problem solved using symbolic execution. Specifically, we define inaccuracy checks to detect large precision loss and cancellation. We inject these checks at strategic program locations to construct specialized branches that, when covered by a given input, are likely to lead to large errors in the result. We apply symbolic execution to generate inputs that exercise these specialized branches, and describe optimizations that make our approach practical. We implement a tool named FPGen and present an evaluation on 21 numerical programs including matrix computation and statistics libraries. We show that FPGen exposes errors for 20 of these programs and triggers errors that are, on average, over 2 orders of magnitude larger than the state of the art.
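
The idea of an injected inaccuracy check can be illustrated with a small sketch (ours, not FPGen's actual instrumentation): flag a subtraction whose operands are nearly equal, since catastrophic cancellation there wipes out most significant digits. Covering the injected branch then corresponds to finding an error-inducing input.

```python
def cancellation_check(x: float, y: float, threshold: float = 1e-6) -> bool:
    """Flag potential catastrophic cancellation in x - y: subtracting
    nearly equal values loses most significant digits."""
    if x == y:
        return True
    m = max(abs(x), abs(y))
    return m > 0 and abs(x - y) / m < threshold

def variance_naive(xs):
    """Textbook one-pass variance E[x^2] - E[x]^2, which cancels badly
    when the mean dwarfs the standard deviation."""
    n = len(xs)
    mean_sq = sum(v * v for v in xs) / n
    sq_mean = (sum(xs) / n) ** 2
    # Injected inaccuracy check: an input covering this branch is likely
    # to produce a large relative error in the final result.
    if cancellation_check(mean_sq, sq_mean):
        print("warning: cancellation at variance subtraction")
    return mean_sq - sq_mean

# A mean that is huge relative to the spread exercises the cancellation branch.
print(variance_naive([1e8, 1e8 + 1, 1e8 + 2]))
```

In FPGen such checks become symbolic branch conditions, so the symbolic executor can solve for inputs that reach them rather than waiting for them to fire at runtime.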

DOI: 10.1145/3377811.3380359


HyDiff: hybrid differential software analysis

作者: Noller, Yannic and Păsăreanu, Corina S. and Böhme, Marcel and Sun, Youcheng and Nguyen, Hoang Lam and Grunske, Lars
关键词: differential program analysis, fuzzing, symbolic execution

Abstract

Detecting regression bugs in software evolution, analyzing side-channels in programs and evaluating robustness in deep neural networks (DNNs) can all be seen as instances of differential software analysis, where the goal is to generate diverging executions of program paths. Two executions are said to be diverging if the observable program behavior differs, e.g., in terms of program output, execution time, or (DNN) classification. The key challenge of differential software analysis is to simultaneously reason about multiple program paths, often across program variants. This paper presents HyDiff, the first hybrid approach for differential software analysis. HyDiff integrates and extends two very successful testing techniques: feedback-directed greybox fuzzing for efficient program testing and shadow symbolic execution for systematic program exploration. HyDiff extends greybox fuzzing with divergence-driven feedback based on novel cost metrics that also take into account the control flow graph of the program. Furthermore, HyDiff extends shadow symbolic execution by applying four-way forking in a systematic exploration while still being able to incorporate concrete inputs in the analysis. HyDiff applies divergence-revealing heuristics based on resource consumption and control-flow information to efficiently guide the symbolic exploration, which allows its efficient usage beyond regression testing applications. We introduce differential metrics such as output, decision and cost difference, as well as patch distance, to assist the fuzzing and symbolic execution components in maximizing the execution divergence. We implemented our approach on top of the fuzzer AFL and the symbolic execution framework Symbolic PathFinder. We illustrate HyDiff on regression and side-channel analysis for Java bytecode programs, and further show how to use HyDiff for robustness analysis of neural networks.

DOI: 10.1145/3377811.3380363


Seenomaly: vision-based linting of GUI animation effects against design-don’t guidelines

作者: Zhao, Dehai and Xing, Zhenchang and Chen, Chunyang and Xu, Xiwei and Zhu, Liming and Li, Guoqiang and Wang, Jinshui
关键词: GUI animation, design guidelines, lint, unsupervised learning

Abstract

GUI animations, such as card movement, menu slide in/out, and snackbar display, provide an appealing user experience and enhance the usability of mobile applications. These GUI animations should not violate the platform’s UI design guidelines (referred to as design-don’t guidelines in this work) regarding component motion and interaction, content appearing and disappearing, and elevation and shadow changes. However, none of the existing static code analysis, functional GUI testing, and GUI image comparison techniques can “see” the GUI animations on the screen, and thus they cannot support the linting of GUI animations against design-don’t guidelines. In this work, we formulate this GUI animation linting problem as a multi-class screencast classification task, but we do not have sufficient labeled GUI animations to train the classifier. Instead, we propose an unsupervised, computer-vision based adversarial autoencoder to solve this linting problem. Our autoencoder learns to group similar GUI animations by “seeing” lots of unlabeled real-application GUI animations and learning to generate them. As the first work of its kind, we build datasets of synthetic and real-world GUI animations. Through experiments on these datasets, we systematically investigate the learning capability of our model and its effectiveness and practicality for linting GUI animations, and identify the challenges in this linting problem for future work.

DOI: 10.1145/3377811.3380411


Low-overhead deadlock prediction

作者: Cai, Yan and Meng, Ruijie and Palsberg, Jens
关键词: deadlock detection, lock reachability graph, multithreaded programs

Abstract

Multithreaded programs can have deadlocks, even after deployment, so users may want to run deadlock tools on deployed programs. However, current deadlock predictors such as MagicLock and UnDead have large overheads that make them impractical for end-user deployment and confine their use to development time. Such overhead stems from running an exponential-time algorithm on a large execution trace. In this paper, we present the first low-overhead deadlock predictor, called AirLock, that is fit for both in-house testing and deployed programs. AirLock maintains a small predictive lock reachability graph, searches the graph for cycles, and runs an exponential-time algorithm only for each cycle. This approach lets AirLock find the same deadlocks as MagicLock and UnDead but with much less overhead because the number of cycles is small in practice. Our experiments with real-world benchmarks show that the average time overhead of AirLock is 3.5%, which is three orders of magnitude less than that of MagicLock and UnDead. AirLock’s low overhead makes it suitable for use with fuzz testers like AFL and on-the-fly after deployment.
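
The core idea of a lock-order graph with cycle search can be sketched in a few lines (a textbook simplification, not AirLock's predictive reachability graph): record an edge a → b whenever a thread acquires lock b while already holding a; a cycle in the graph signals a potential deadlock.

```python
from collections import defaultdict

class LockGraph:
    """Toy lock-order graph: edge a -> b means some thread acquired b
    while holding a. A cycle indicates a potential deadlock."""
    def __init__(self):
        self.edges = defaultdict(set)
        self.held = defaultdict(list)  # thread id -> stack of held locks

    def acquire(self, tid, lock):
        for h in self.held[tid]:
            self.edges[h].add(lock)
        self.held[tid].append(lock)

    def release(self, tid, lock):
        self.held[tid].remove(lock)

    def has_cycle(self):
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)
        def dfs(u):
            color[u] = GRAY
            for v in self.edges[u]:
                if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                    return True
            color[u] = BLACK
            return False
        return any(color[u] == WHITE and dfs(u) for u in list(self.edges))

# Classic AB-BA pattern: thread 1 takes A then B, thread 2 takes B then A.
g = LockGraph()
g.acquire(1, "A"); g.acquire(1, "B"); g.release(1, "B"); g.release(1, "A")
g.acquire(2, "B"); g.acquire(2, "A")
print(g.has_cycle())  # True: edges A -> B and B -> A form a cycle
```

AirLock's contribution is keeping such a graph small and predictive so that the expensive exponential-time confirmation runs only on the (rare) cycles, not on the whole trace.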

DOI: 10.1145/3377811.3380367


An empirical assessment of security risks of global Android banking apps

作者: Chen, Sen and Fan, Lingling and Meng, Guozhu and Su, Ting and Xue, Minhui and Xue, Yinxing and Liu, Yang and Xu, Lihua
关键词: empirical study, mobile banking apps, vulnerability, weakness

Abstract

Mobile banking apps, belonging to the most security-critical app category, render massive and dynamic transactions susceptible to security risks. Given the huge potential financial loss caused by vulnerabilities, existing research lacks a comprehensive empirical study on the security risks of global banking apps that could provide useful insights and improve the security of banking apps. Since data-related weaknesses in banking apps are critical and may directly cause serious financial loss, this paper first revisits the state-of-the-art available tools and finds that they have limited capability in identifying data-related security weaknesses of banking apps. To complement the capability of existing tools in data-related weakness detection, we propose a three-phase automated security risk assessment system, named Ausera, which leverages static program analysis techniques and sensitive keyword identification. By leveraging Ausera, we collect 2,157 weaknesses in 693 real-world banking apps across 83 countries, which we use as a basis to conduct a comprehensive empirical study from different aspects, such as global distribution and weakness evolution during version updates. We find that apps owned by subsidiary banks are always less secure than or equivalent to those owned by parent banks. In addition, we track the patching of weaknesses and have received much positive feedback from banking entities, helping to improve the security of banking apps in practice. We further find that weaknesses derived from outdated versions of banking apps or third-party libraries are highly prone to being exploited by attackers. To date, 21 banks have confirmed the weaknesses we reported (126 weaknesses in total). We also exchanged insights with 7 banks, such as HSBC in the UK and OCBC in Singapore, via in-person or online meetings to help them improve their apps. We hope that the insights developed in this paper will inform the communities about the gaps among multiple stakeholders, including banks, academic researchers, and third-party security companies.
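
As a rough illustration of sensitive keyword identification combined with a simple flow check (the keyword list and sink patterns below are hypothetical; Ausera's static analysis is far more involved), one can flag lines where a sensitive-looking identifier reaches an obvious sink such as logging:

```python
import re

# Hypothetical keyword and sink lists for illustration only.
SENSITIVE = re.compile(r"password|passwd|pin|cvv|account", re.IGNORECASE)
SINK = re.compile(r"Log\.[dviwe]\(|System\.out|SharedPreferences")

def scan(source):
    """Flag lines where a sensitive identifier meets a logging or
    plaintext-storage sink: a toy stand-in for data-related weakness
    detection in banking apps."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), 1):
        if SENSITIVE.search(line) and SINK.search(line):
            findings.append((lineno, line.strip()))
    return findings

app = '''String pin = field.getText();
Log.d("auth", "user pin=" + pin);
prefs.edit().putString("session", token).apply();'''
for lineno, line in scan(app):
    print(lineno, line)
```

A real analysis would track data flow across statements rather than match single lines, but the sketch conveys why keyword identification complements generic taint tools for this app category.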

DOI: 10.1145/3377811.3380417


Accessibility issues in Android apps: state of affairs, sentiments, and ways forward

作者: Alshayban, Abdulaziz and Ahmed, Iftekhar and Malek, Sam
关键词: No keywords

Abstract

Mobile apps are an integral component of our daily life. Ability to use mobile apps is important for everyone, but arguably even more so for approximately 15% of the world population with disabilities. This paper presents the results of a large-scale empirical study aimed at understanding accessibility of Android apps from three complementary perspectives. First, we analyze the prevalence of accessibility issues in over 1,000 Android apps. We find that almost all apps are riddled with accessibility issues, hindering their use by disabled people. We then investigate the developer sentiments through a survey aimed at understanding the root causes of so many accessibility issues. We find that in large part developers are unaware of accessibility design principles and analysis tools, and the organizations in which they are employed do not place a premium on accessibility. We finally investigate user ratings and comments on app stores. We find that due to the disproportionately small number of users with disabilities, user ratings and app popularity are not indicative of the extent of accessibility issues in apps. We conclude the paper with several observations that form the foundation for future research and development.

DOI: 10.1145/3377811.3380392


Collaborative bug finding for Android apps

作者: Tan, Shin Hwei and Li, Ziqiang
关键词: Android apps, collaborative programming, recommendation system, test generation

Abstract

Many automated test generation techniques have been proposed for finding crashes in Android apps. Despite recent advancement in these approaches, a study shows that Android app developers prefer reading test cases written in natural language. Meanwhile, there exist redundancies in bug reports (written in natural language) across different apps that have not been previously reused. We propose collaborative bug finding, a novel approach that uses bugs in other similar apps to discover bugs in the app under test. We design three settings with varying degrees of interactions between programmers: (1) bugs from programmers who develop a different app, (2) bugs from manually searching for bug reports in GitHub repositories, (3) bugs from a bug recommendation system, Bugine. Our studies of the first two settings in a software testing course show that collaborative bug finding helps students who are novice Android app testers to discover 17 new bugs. As students admit that searching for relevant bug reports could be time-consuming, we introduce Bugine, an approach that automatically recommends relevant GitHub issues for a given app. Bugine uses (1) natural language processing to find GitHub issues that mention common UI components shared between the app under test and other apps in our database, and (2) a ranking algorithm to select GitHub issues that are of the best quality. Our results show that Bugine is able to find 34 new bugs. In total, collaborative bug finding helps us find 51 new bugs, of which eight have been confirmed and 11 have been fixed by the developers. These results confirm our intuition that our proposed technique is useful in discovering new bugs for Android apps.

DOI: 10.1145/3377811.3380349


POSIT: simultaneously tagging natural and programming languages

作者: Pârțachi, Profir-Petru and Dash, Santanu Kumar and Allamanis, Miltiadis and Barr, Earl T.
关键词: code-switching, language identification, mixed-code, part-of-speech tagging

Abstract

Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and Developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems — traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to apply to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens, part-of-speech tags for natural language words, and predicts the source language of a token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. POSIT improves the state-of-the-art on language identification by 10.6% and PoS/AST tagging by 23.7% in accuracy.

DOI: 10.1145/3377811.3380440


CPC: automatically classifying and propagating natural language comments via program analysis

作者: Zhai, Juan and Xu, Xiangzhe and Shi, Yu and Tao, Guanhong and Pan, Minxue and Ma, Shiqing and Xu, Lei and Zhang, Weifeng and Tan, Lin and Zhang, Xiangyu
关键词: No keywords

Abstract

Code comments provide abundant information that has been leveraged to help perform various software engineering tasks, such as bug detection, specification inference, and code synthesis. However, developers are less motivated to write and update comments, making it infeasible and error-prone to leverage comments to facilitate software engineering tasks. In this paper, we propose to leverage program analysis to systematically derive, refine, and propagate comments. For example, by propagation via program analysis, comments can be passed on to code entities that are not commented, such that code bugs can be detected leveraging the propagated comments. Developers usually comment on different aspects of code elements like methods, and use comments to describe various contents, such as functionalities and properties. To more effectively utilize comments, a fine-grained and elaborated taxonomy of comments and a reliable classifier to automatically categorize a comment are needed. In this paper, we build a comprehensive taxonomy and propose using program analysis to propagate comments. We develop a prototype CPC, and evaluate it on 5 projects. The evaluation results demonstrate that 41,573 new comments can be derived by propagation from other code locations with 88% accuracy. Among them, we can derive precise functional comments for 87 native methods that have neither existing comments nor source code. Leveraging the propagated comments, we detect 37 new bugs in open source large projects, 30 of which have been confirmed and fixed by developers, and 304 defects in existing comments (by looking at inconsistencies between existing and propagated comments), including 12 incomplete comments and 292 wrong comments. This demonstrates the effectiveness of our approach. Our user study confirms that propagated comments align well with existing comments in terms of quality.

DOI: 10.1145/3377811.3380427


Suggesting natural method names to check name consistencies

作者: Nguyen, Son and Phan, Hung and Le, Trinh and Nguyen, Tien N.
关键词: deep learning, naturalness of source code, program entity name suggestion

Abstract

Misleading names of the methods in a project or the APIs in a software library confuse developers about program functionality and API usages, leading to API misuses and defects. In this paper, we introduce MNire, a machine learning approach to check the consistency between the name of a given method and its implementation. MNire first generates a candidate name and compares the current name against it. If the two names are sufficiently similar, we consider the method consistent. To generate the method name, we draw our ideas and intuition from an empirical study on the nature of method names in a large dataset. Our key finding is that high proportions of the tokens of method names can be found in three contexts of a given method: its body, its interface (the method’s parameter types and return type), and the enclosing class’ name. Even when such tokens are not there, MNire uses the contexts to predict the tokens due to the high likelihoods of their co-occurrences. Our unique idea is to treat name generation as abstract summarization over the tokens collected from the names of the program entities in the three above contexts. We conducted several experiments to evaluate MNire in method name consistency checking and in method name recommendation on large datasets with 14M+ methods. In detecting inconsistent method names, MNire improves the state-of-the-art approach by 10.4% and 11% in recall and precision, respectively. In method name recommendation, MNire improves over the state-of-the-art technique, code2vec, in both recall (18.2% higher) and precision (11.1% higher). To assess MNire’s usefulness, we used it to detect inconsistent methods and suggest new names in several active GitHub projects. We made 50 pull requests (PRs) and received 42 responses. Among them, five PRs were merged into the main branch, and 13 were approved for later merging. In total, in 31/42 cases, the developer teams agree that our suggested names are more meaningful than the current names, showing MNire’s usefulness.
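
The key finding, that most subtokens of a method's name also occur in its body, interface, and enclosing class name, suggests a naive consistency score. The sketch below is illustrative only (a token-overlap heuristic, not MNire's summarization model):

```python
import re

def subtokens(identifier):
    """Split a camelCase / snake_case identifier into lowercase subtokens."""
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", identifier)
    return [p.lower() for p in parts]

def name_consistency(method_name, body_ids, interface_ids, class_name):
    """Fraction of the method name's subtokens found among tokens from
    the body, the interface (parameter/return types), and the class name."""
    context = set()
    for ident in body_ids + interface_ids + [class_name]:
        context.update(subtokens(ident))
    name = subtokens(method_name)
    return sum(t in context for t in name) / len(name)

# "user" and "name" appear in the contexts; "get" does not, so 2/3.
score = name_consistency("getUserName", ["userName", "return"], ["String"], "User")
print(round(score, 2))  # 0.67
```

MNire goes further by predicting missing tokens from co-occurrence likelihoods and generating a full candidate name, but a low overlap score like this is the intuition behind flagging a name as potentially inconsistent.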

DOI: 10.1145/3377811.3380926


Retrieval-based neural source code summarization

作者: Zhang, Jian and Wang, Xu and Zhang, Hongyu and Sun, Hailong and Liu, Xudong
关键词: deep neural network, information retrieval, source code summarization

Abstract

Source code summarization aims to automatically generate concise summaries of source code in natural language texts, in order to help developers better understand and maintain source code. Traditional work generates a source code summary by utilizing information retrieval techniques, which select terms from the original source code or adapt summaries of similar code snippets. Recent studies adopt Neural Machine Translation techniques and generate summaries from code snippets using encoder-decoder neural networks. The neural-based approaches prefer the high-frequency words in the corpus and have trouble with the low-frequency ones. In this paper, we propose a retrieval-based neural source code summarization approach in which we enhance the neural model with the most similar code snippets retrieved from the training set. Our approach takes advantage of both neural and retrieval-based techniques. Specifically, we first train an attentional encoder-decoder model based on the code snippets and the summaries in the training set; second, given an input code snippet for testing, we retrieve its two most similar code snippets in the training set, from the aspects of syntax and semantics, respectively; third, we encode the input and the two retrieved code snippets, and predict the summary by fusing them during decoding. We conduct extensive experiments to evaluate our approach, and the experimental results show that our proposed approach can improve the state-of-the-art methods.
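
The retrieval step can be approximated with token-level Jaccard similarity (a crude stand-in for the paper's syntactic and semantic similarity measures, and skipping the neural fusion entirely):

```python
import re

def tokenize(code):
    """Crude lexer: keep identifiers and keywords only."""
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code)

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_summary(query, corpus):
    """Return the summary paired with the training snippet most similar
    to the query snippet under token-level Jaccard similarity."""
    best = max(corpus, key=lambda pair: jaccard(tokenize(query), tokenize(pair[0])))
    return best[1]

corpus = [
    ("def add(a, b): return a + b", "add two numbers"),
    ("def read_file(path): return open(path).read()", "read a file into a string"),
]
print(retrieve_summary("def plus(x, y): return x + y", corpus))  # add two numbers
```

In the paper, the retrieved snippets are not returned directly; they are encoded alongside the input and fused during decoding, so the neural model can copy low-frequency words that pure generation would miss.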

DOI: 10.1145/3377811.3380383


On learning meaningful assert statements for unit test cases

作者: Watson, Cody and Tufano, Michele and Moran, Kevin and Bavota, Gabriele and Poshyvanyk, Denys
关键词: No keywords

Abstract

Software testing is an essential part of the software lifecycle and requires a substantial amount of time and effort. It has been estimated that software developers spend close to 50% of their time on testing the code they write. For these reasons, a long-standing goal within the research community is to (partially) automate software testing. While several techniques and tools have been proposed to automatically generate test methods, recent work has criticized the quality and usefulness of the assert statements they generate. Therefore, we employ a Neural Machine Translation (NMT) based approach called Atlas (<u>A</u>u<u>T</u>omatic <u>L</u>earning of <u>A</u>ssert <u>S</u>tatements) to automatically generate meaningful assert statements for test methods. Given a test method and a focal method (i.e., the main method under test), Atlas can predict a meaningful assert statement to assess the correctness of the focal method. We applied Atlas to thousands of test methods from GitHub projects, and it was able to predict the exact assert statement manually written by developers in 31% of the cases when only considering the top-1 predicted assert. When considering the top-5 predicted assert statements, Atlas is able to predict exact matches in 50% of the cases. These promising results hint at the potential usefulness of our approach as (i) a complement to automatic test case generation techniques, and (ii) a code completion support for developers, who can benefit from the recommended assert statements while writing test code.

DOI: 10.1145/3377811.3380429


Quickly generating diverse valid test inputs with reinforcement learning

作者: Reddy, Sameer and Lemieux, Caroline and Padhye, Rohan and Sen, Koushik
关键词: No keywords

Abstract

Property-based testing is a popular approach for validating the logic of a program. An effective property-based test quickly generates many diverse valid test inputs and runs them through a parameterized test driver. However, when the test driver requires strict validity constraints on the inputs, completely random input generation fails to generate enough valid inputs. Existing approaches to solving this problem rely on whitebox or greybox information collected by instrumenting the input generator and/or test driver. However, collecting such information reduces the speed at which tests can be executed. In this paper, we propose and study a black-box approach for generating valid test inputs. We first formalize the problem of guiding random input generators towards producing a diverse set of valid inputs. This formalization highlights the role of a guide which governs the space of choices within a random input generator. We then propose a solution based on reinforcement learning (RL), using a tabular, on-policy RL approach to guide the generator. We evaluate this approach, RLCheck, against pure random input generation as well as a state-of-the-art greybox evolutionary algorithm, on four real-world benchmarks. We find that in the same time budget, RLCheck generates an order of magnitude more diverse valid inputs than the baselines.
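
A toy version of a tabular, on-policy guide makes the formalization concrete. The assumptions below are ours, not the paper's: the validity constraint is "strictly increasing digit sequences", the state is (position, last digit), and the guide is rewarded only for valid inputs it has not produced before, which pushes it toward diverse valid ones.

```python
import random

random.seed(0)

def generate(q, length=4, eps=0.25):
    """One rollout: at each position the guide picks a digit, epsilon-
    greedily favoring choices that previously led to novel valid inputs."""
    seq, trajectory = [], []
    for pos in range(length):
        state = (pos, seq[-1] if seq else -1)
        choices = list(range(10))
        if random.random() < eps or state not in q:
            c = random.choice(choices)
        else:
            c = max(choices, key=lambda a: q[state].get(a, 0.0))
        trajectory.append((state, c))
        seq.append(c)
    return seq, trajectory

def rl_guide(trials=2000):
    """Reward 1 only for a *new* valid input; update every (state, choice)
    on the trajectory toward the observed reward (on-policy, tabular)."""
    q, seen = {}, set()
    for _ in range(trials):
        seq, traj = generate(q)
        valid = all(a < b for a, b in zip(seq, seq[1:]))
        reward = 1.0 if valid and tuple(seq) not in seen else 0.0
        if valid:
            seen.add(tuple(seq))
        for state, c in traj:
            table = q.setdefault(state, {})
            table[c] = table.get(c, 0.0) + 0.1 * (reward - table.get(c, 0.0))
    return seen

valid_inputs = rl_guide()
print(len(valid_inputs))  # distinct valid inputs found in the budget
```

Note that the guide is black-box with respect to the test driver: it only observes the validity verdict, which is exactly why this style of guidance avoids the instrumentation overhead of whitebox and greybox approaches.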

DOI: 10.1145/3377811.3380399


An evidence-based inquiry into the use of grey literature in software engineering

作者: Zhang, He and Zhou, Xin and Huang, Xin and Huang, Huang and Babar, Muhammad Ali
关键词: empirical software engineering, evidence-based software engineering, grey literature, survey, systematic (literature) review

Abstract

Context: Following other scientific disciplines, such as health sciences, the use of Grey Literature (GL) has become widespread in Software Engineering (SE) research. Whilst the number of papers incorporating GL in SE is increasing, little is empirically known about different aspects of the use of GL in SE research. Method: We used a mixed-methods approach for this research. We carried out a Systematic Literature Review (SLR) of the use of GL in SE, and surveyed the authors of the selected papers included in the SLR (as GL users) and invited experts in the SE community on the use of GL in SE research. Results: We systematically selected and reviewed 102 SE secondary studies that incorporate GL in SE research, from which we identified two groups based on their reporting: 1) 76 reviews only claim their use of GL; 2) 26 reviews report the results by including GL. We also obtained 20 replies from the GL users and 24 replies from the invited SE experts. Conclusion: There is no common understanding of the meaning of GL in SE. Researchers define the scopes and definitions of GL in a variety of ways. We found five main reasons for using GL in SE research. The findings have enabled us to propose a conceptual model for how GL works in the SE research lifecycle. There is an apparent need for research to develop guidelines for using GL in SE and for assessing the quality of GL. The current work provides a panorama of the state of the art of using GL in SE for follow-up research, and helps establish the position of GL in SE research.

DOI: 10.1145/3377811.3380336


Towards the use of the readily available tests from the release pipeline as performance tests: are we there yet?

作者: Ding, Zishuo and Chen, Jinfu and Shang, Weiyi
关键词: performance issues, performance testing, software performance

Abstract

Performance is one of the important aspects of software quality. Performance issues exist widely in software systems, and the process of fixing them is an essential step in the release cycle. Although performance testing is widely adopted in practice, it is still expensive and time-consuming. In particular, performance testing is usually conducted after the system is built, in a dedicated testing environment. These challenges make performance testing difficult to fit into the common DevOps process in software development. On the other hand, there exist a large number of readily available tests that are executed regularly within the release pipeline during software development. In this paper, we perform an exploratory study to determine whether such readily available tests are capable of serving as performance tests. In particular, we would like to see whether the performance of these tests can demonstrate the performance improvements obtained from fixing real-life performance issues. We collect 127 performance issues from Hadoop and Cassandra, and evaluate the performance of the readily available tests on the commits before and after the performance issue fixes. We find that most of the improvements from the fixes to performance issues can be demonstrated using the readily available tests in the release pipeline. However, only a very small portion of the tests can be used for demonstrating the improvements. By manually examining the tests, we identify eight reasons why a test cannot demonstrate performance improvement even though it covers the changed source code of the issue fix. Finally, we build random forest classifiers to determine the important metrics influencing whether the readily available tests are able to demonstrate performance improvements from issue fixes. We find that the test code itself and the source code covered by the test are important factors, while the factors related to the code changes in the performance issue fixes have low importance. Practitioners may focus on designing and improving the tests, instead of fine-tuning tests for different performance issue fixes. Our findings can serve as a guideline for practitioners to reduce the effort spent on leveraging and designing tests that run in the release pipeline for performance assurance activities.

DOI: 10.1145/3377811.3380351


Verifying object construction

作者: Kellogg, Martin and Ran, Manli and Sridharan, Manu and Schäf, Martin and Ernst, Michael D.
关键词: AMI sniping, autovalue, builder pattern, lightweight verification, lombok, pluggable type systems

Abstract

In object-oriented languages, constructors often have a combination of required and optional formal parameters. It is tedious and inconvenient for programmers to write a constructor by hand for each combination. The multitude of constructors is error-prone for clients, and client code is difficult to read due to the large number of constructor arguments. Therefore, programmers often use design patterns that enable more flexible object construction—the builder pattern, dependency injection, or factory methods. However, these design patterns can be too flexible: not all combinations of logical parameters lead to the construction of well-formed objects. When a client uses the builder pattern to construct an object, the compiler does not check that a valid set of values was provided. Incorrect use of builders can lead to security vulnerabilities, run-time crashes, and other problems. This work shows how to statically verify uses of object construction, such as the builder pattern. Using a simple specification language, programmers specify which combinations of logical arguments are permitted. Our compile-time analysis detects client code that may construct objects unsafely. Our analysis is based on a novel special case of typestate checking, accumulation analysis, that modularly reasons about accumulations of method calls. Because accumulation analysis does not require precise aliasing information for soundness, our analysis scales to industrial programs. We evaluated it on over 9 million lines of code, discovering defects that included previously unknown security vulnerabilities and potential null-pointer violations in heavily used open-source codebases. Our analysis has a low false positive rate and a low annotation burden. Our implementation and experimental data are publicly available.

DOI: 10.1145/3377811.3380341
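
The hazard the paper targets — a builder that accepts any call sequence and only checks well-formedness (if at all) when `build()` runs — can be illustrated with a minimal, hypothetical Python builder. This is an illustrative sketch, not the paper's analysis: the `RequestBuilder` class and its parameters are invented for the example, and the paper's accumulation analysis performs the equivalent check statically, at compile time, for Java code.

```python
class RequestBuilder:
    """Hypothetical builder: 'url' is required, 'timeout' is optional."""

    def __init__(self):
        self._url = None
        self._timeout = 30  # default for the optional parameter

    def url(self, value):
        self._url = value
        return self  # return self to allow chaining, as builders typically do

    def timeout(self, seconds):
        self._timeout = seconds
        return self

    def build(self):
        # The well-formedness check only runs here, at run time.  The
        # accumulation analysis described above instead verifies statically
        # that every call chain reaching build() has called url().
        if self._url is None:
            raise ValueError("required parameter 'url' was never set")
        return {"url": self._url, "timeout": self._timeout}


ok = RequestBuilder().url("https://example.com").timeout(5).build()

# A chain that never calls .url() compiles and runs fine until .build():
try:
    RequestBuilder().timeout(5).build()
    caught_at_runtime = False
except ValueError:
    caught_at_runtime = True
```

The point of the static analysis is that the second chain would be rejected at compile time rather than discovered in production.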


Automatically testing string solvers

作者: Bugariu, Alexandra and Müller, Peter
关键词: SMT solvers, automatic testing, soundness testing, string solvers

Abstract

SMT solvers are at the basis of many applications, such as program verification, program synthesis, and test case generation. For all these applications to provide reliable results, SMT solvers must answer queries correctly. However, since they are complex, highly-optimized software systems, ensuring their correctness is challenging. In particular, state-of-the-art testing techniques do not reliably detect when an SMT solver is unsound. In this paper, we present an automatic approach for generating test cases that reveal soundness errors in the implementations of string solvers, as well as potential completeness and performance issues. We synthesize input formulas that are satisfiable or unsatisfiable by construction and use this ground truth as test oracle. We automatically apply satisfiability-preserving transformations to generate increasingly-complex formulas, which allows us to detect many errors with simple inputs and, thus, facilitates debugging. The experimental evaluation shows that our technique effectively reveals bugs in the implementation of widely-used SMT solvers and applies also to other types of solvers, such as automata-based solvers. We focus on strings here, but our approach carries over to other theories and their combinations.

DOI: 10.1145/3377811.3380398
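
The "satisfiable by construction" idea can be sketched concretely: pick a concrete witness string, emit only constraints that the witness satisfies, then grow the formula with transformations that conjoin further witness-true constraints. The sketch below is illustrative, not the paper's generator; the SMT-LIB string operators (`str.len`, `str.prefixof`, `str.suffixof`) are standard, but the generation logic here is a toy assumption.

```python
import random


def synthesize_sat_formula(seed=0):
    """Build conjuncts of an SMT-LIB string formula that is satisfiable by
    construction: every conjunct is derived from a concrete witness string,
    so the witness itself is a model."""
    rng = random.Random(seed)
    witness = "".join(rng.choice("ab") for _ in range(rng.randint(1, 6)))
    conjuncts = [
        f"(= (str.len x) {len(witness)})",
        f'(str.prefixof "{witness[:1]}" x)',
    ]
    return witness, conjuncts


def strengthen(witness, conjuncts):
    """Satisfiability-preserving transformation: conjoin one more
    constraint that the witness already satisfies."""
    return conjuncts + [f'(str.suffixof "{witness[-1:]}" x)']


def to_smtlib(conjuncts):
    return f'(assert (and {" ".join(conjuncts)}))'


witness, base = synthesize_sat_formula(seed=42)
formula = to_smtlib(strengthen(witness, base))
# Oracle: a solver answering 'unsat' for this formula has a soundness bug,
# because `witness` is a model of every conjunct by construction.
```

Repeating `strengthen` yields increasingly complex formulas whose ground truth stays known, which is what makes failures easy to debug.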


A study on the lifecycle of flaky tests

作者: Lam, Wing and Muşlu, Kıvanç and Sajnani, Hitesh and Thummalapenta, Suresh
关键词: empirical study, flaky test, lifecycle

Abstract

During regression testing, developers rely on the pass or fail outcomes of tests to check whether changes broke existing functionality. Thus, flaky tests, which nondeterministically pass or fail on the same code, are problematic because they provide misleading signals during regression testing. Although flaky tests are the focus of several existing studies, none of them study (1) the reoccurrence, runtimes, and time-before-fix of flaky tests, and (2) flaky tests in depth on proprietary projects. This paper fills this knowledge gap about flaky tests and investigates whether prior categorization work on flaky tests also applies to proprietary projects. Specifically, we study the lifecycle of flaky tests in six large-scale proprietary projects at Microsoft. We find, as in prior work, that asynchronous calls are the leading cause of flaky tests in these Microsoft projects. Therefore, we propose the first automated solution, called Flakiness and Time Balancer (FaTB), to reduce the frequency of flaky-test failures caused by asynchronous calls. Our evaluation of five such flaky tests shows that FaTB can reduce the running times of these tests by up to 78% without empirically affecting the frequency of their flaky-test failures. Lastly, our study finds several cases where developers claim they “fixed” a flaky test but our empirical experiments show that their changes do not fix or reduce these tests’ frequency of flaky-test failures. Future studies should be more cautious when basing their results on changes that developers claim to be “fixes”.

DOI: 10.1145/3377811.3381749


Testing file system implementations on layered models

作者: Chen, Dongjie and Jiang, Yanyan and Xu, Chang and Ma, Xiaoxing and Lu, Jian
关键词: file system, model-based testing, workload generation

Abstract

Generating high-quality system call sequences is not only important for testing file system implementations, but also challenging due to the astronomically large input space. This paper introduces a new approach to the workload generation problem based on layered models and abstract workload refinement. The approach is instantiated as a three-layer file system model for file system workload generation. In a short experimental run, sequential workloads (system call sequences) manifested over a thousand crashes in mainline Linux kernel file systems, with 12 previously unknown bugs being reported. We also provide evidence that such workloads benefit other domain-specific testing techniques, including crash consistency testing and concurrency testing.

DOI: 10.1145/3377811.3380350


Co-evolving code with evolving metamodels

作者: Khelladi, Djamel Eddine and Combemale, Benoit and Acher, Mathieu and Barais, Olivier and Jézéquel, Jean-Marc
关键词: No keywords

Abstract

Metamodels play a significant role in describing and analyzing the relations between domain concepts. They are also a cornerstone for building a software language (SL) for a domain and its associated tooling. A metamodel definition generally drives the code generation of a core API. The latter is further enriched by developers with additional code implementing advanced functionalities, e.g., checkers, recommenders, etc. When an SL is evolved to its next version, the metamodels are evolved as well before the core API code is re-generated. As a result, the code added by developers, both in the core API and in the SL tooling, may be impacted and thus may need to be co-evolved accordingly. Many approaches support the co-evolution of various artifacts when metamodels evolve; however, none support the co-evolution of code. This paper fills this gap. We propose a semi-automatic co-evolution approach based on change propagation. The premise is that knowledge of the metamodel evolution changes can be propagated by means of resolutions to drive the code co-evolution. Our approach leverages the abstraction level of metamodels, where a given metamodel element often has different usages in the code. It supports alternative co-evolutions to meet different developers' needs. Our work is evaluated on three Eclipse SL implementations, namely OCL, Modisco, and Papyrus, over several evolved versions of metamodels and code. In response to five different evolved metamodels, we co-evolved 976 impacts over 18 projects. A comparison of our co-evolved code with the versioned one shows the usefulness of our approach. Our approach reached a weighted average precision of 87.4% and recall of 88.9%, while supporting useful alternative co-evolutions that developers have manually performed.

DOI: 10.1145/3377811.3380324


Lazy product discovery in huge configuration spaces

作者: Lienhardt, Michael and Damiani, Ferruccio and Johnsen, Einar Broch and Mauro, Jacopo
关键词: Linux distribution, composition, configurable software, feature models, software product lines, variability modeling

Abstract

Highly-configurable software systems can have thousands of interdependent configuration options across different subsystems. In the resulting configuration space, discovering a valid product configuration for some selected options can be complex and error-prone. The configuration space can be organized using a feature model, fragmented into smaller interdependent feature models reflecting the configuration options of each subsystem. We propose a method for lazy product discovery in large fragmented feature models with interdependent features. We formalize the method and prove its soundness and completeness. The evaluation explores an industrial-size configuration space. The results show that lazy product discovery has significant performance benefits compared to standard product discovery, which, in contrast to our method, requires all fragments to be composed to analyze the feature model. Furthermore, the method succeeds when more efficient, heuristics-based engines fail to find a valid configuration.

DOI: 10.1145/3377811.3380372


Reducing run-time adaptation space via analysis of possible utility bounds

作者: Stevens, Clay and Bagheri, Hamid
关键词: formal methods, multi-objective optimization, run-time adaptation, self-adaptive systems

Abstract

Self-adaptive systems often employ dynamic programming or similar techniques to select optimal adaptations at run-time. These techniques suffer from the “curse of dimensionality”, increasing the cost of run-time adaptation decisions. We propose a novel approach that improves upon the state-of-the-art proactive self-adaptation techniques to reduce the number of possible adaptations that need to be considered for each run-time adaptation decision. The approach, realized in a tool called Thallium, employs a combination of automated formal modeling techniques to (i) analyze a structural model of the system showing which configurations are reachable from other configurations and (ii) compute the utility that can be generated by the optimal adaptation over a bounded horizon in both the best- and worst-case scenarios. It then constructs triangular possibility values using those optimized bounds to automatically compare adjacent adaptations for each configuration, keeping only the alternatives with the best range of potential results. The experimental results corroborate Thallium’s ability to significantly reduce the number of states that need to be considered with each adaptation decision, freeing up vital resources at run-time.

DOI: 10.1145/3377811.3380365
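
The core pruning intuition behind bound-based comparison can be sketched with plain interval dominance: an adaptation whose best-case utility over the horizon falls below another adaptation's worst-case can never be the optimal choice, so it need not be considered at run-time. This is a simplified sketch, not Thallium itself — the paper compares adaptations via triangular possibility values built from the bounds, and the adaptation names and numbers below are hypothetical.

```python
def prune_dominated(adaptations):
    """Drop any adaptation whose best-case bounded-horizon utility is
    strictly below some other adaptation's worst-case utility.

    `adaptations` maps an adaptation name to a (worst_case, best_case)
    utility pair (hypothetical numbers for illustration)."""
    best_worst = max(worst for worst, _ in adaptations.values())
    return {name: (worst, best)
            for name, (worst, best) in adaptations.items()
            if best >= best_worst}  # survives only if it could still win


bounds = {
    "scale_out":  (40, 90),
    "scale_up":   (55, 70),
    "do_nothing": (10, 50),  # best case 50 < scale_up's worst case 55
}
survivors = prune_dominated(bounds)  # "do_nothing" is pruned
```

Only the survivors need to enter the expensive run-time optimization, which is where the savings come from.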


Context-aware in-process crowdworker recommendation

作者: Wang, Junjie and Yang, Ye and Wang, Song and Hu, Yuanzhe and Wang, Dandan and Wang, Qing
关键词: No keywords

Abstract

Identifying and optimizing open participation is essential to the success of open software development. Existing studies have highlighted the importance of worker recommendation for crowdtesting tasks in order to detect more bugs with fewer workers. However, these studies mainly focus on one-time recommendations with respect to the initial context at the beginning of a new task. This paper argues for the need for in-process crowdtesting worker recommendation. We motivate this study through a pilot study, revealing the prevalence of long non-yielding windows, i.e., stretches of consecutive test reports in which no new bugs are revealed during a crowdtesting task. This indicates a potential opportunity for accelerating crowdtesting by recommending appropriate workers in a dynamic manner, so that the non-yielding windows can be shortened. To that end, this paper proposes a context-aware in-process crowdworker recommendation approach, iRec, to detect more bugs earlier and potentially shorten the non-yielding windows. It consists of three main components: 1) the modeling of dynamic testing context, 2) a learning-based ranking component, and 3) a diversity-based re-ranking component. The evaluation is conducted on 636 crowdtesting tasks from one of the largest crowdtesting platforms, and the results show the potential of iRec to improve the cost-effectiveness of crowdtesting by saving cost and shortening the testing process.

DOI: 10.1145/3377811.3380380


A large-scale empirical study on vulnerability distribution within projects and the lessons learned

作者: Liu, Bingchang and Meng, Guozhu and Zou, Wei and Gong, Qi and Li, Feng and Lin, Min and Sun, Dandan and Huo, Wei and Zhang, Chao
关键词: empirical study, vulnerability distribution

Abstract

The number of vulnerabilities has increased rapidly in recent years, due to advances in vulnerability discovery solutions. This enables a thorough analysis of vulnerability distributions and provides support for correlation analysis and prediction of vulnerabilities. Previous research either focuses on analyzing bugs rather than vulnerabilities, or only studies the general vulnerability distribution among projects rather than the distribution within each project. In this paper, we collected a large vulnerability dataset, consisting of all known vulnerabilities associated with five representative open-source projects, by utilizing automated crawlers and months of manual effort. We then analyzed the vulnerability distribution within each project over four dimensions: files, functions, vulnerability types, and responsible developers. Based on the analysis results, we present 12 practical insights on the distribution of vulnerabilities. Finally, we applied these insights to several vulnerability discovery solutions (including static analysis and dynamic fuzzing) and helped them find 10 zero-day vulnerabilities in the target projects, showing that our insights are useful.

DOI: 10.1145/3377811.3380923


Unsuccessful story about few shot malware family classification and siamese network to the rescue

作者: Bai, Yude and Xing, Zhenchang and Li, Xiaohong and Feng, Zhiyong and Ma, Duoyuan
关键词: few shot learning, malware family classification, siamese network

Abstract

To battle ever-increasing Android malware, malware family classification, which classifies malware with common features into a malware family, has been proposed as an effective malware analysis method. Several machine-learning based approaches have been proposed for the task of malware family classification. Our study shows that malware families suffer from severe data imbalance, with many families having only a small number of malware applications (referred to as few-shot malware families in this work). Unfortunately, this issue has been overlooked in existing approaches. Although existing approaches achieve high classification performance at the overall level and for large malware families, our experiments show that they suffer from poor performance and generalizability for few-shot malware families, and traditional downsampling methods cannot solve the problem. To address the challenge of few-shot malware family classification, we propose a novel siamese-network based learning method, which allows us to train an effective MultiLayer Perceptron (MLP) network for embedding malware applications into a real-valued, continuous vector space by contrasting malware applications from the same or different families. In the embedding space, the performance of malware family classification can be significantly improved for all scales of malware families, especially for few-shot malware families, which also leads to significant performance improvement at the overall level.

DOI: 10.1145/3377811.3380354
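
The contrastive training objective behind siamese networks — pull same-family samples together, push different-family samples at least a margin apart — can be shown in a few lines. This is a generic contrastive-loss sketch, not the paper's model: the toy 2-dimensional embeddings stand in for the MLP outputs described above, and the margin value is an assumption.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors (plain lists)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb_a, emb_b, same_family, margin=1.0):
    """Standard contrastive loss for siamese training: same-family pairs
    are penalized by their squared distance (pulled together); pairs from
    different families are penalized only if closer than `margin`
    (pushed apart)."""
    d = euclidean(emb_a, emb_b)
    if same_family:
        return d ** 2
    return max(0.0, margin - d) ** 2


# A same-family pair that is already close incurs a small loss; a
# different-family pair beyond the margin incurs zero loss.
loss_same = contrastive_loss([0.1, 0.2], [0.1, 0.25], same_family=True)
loss_diff = contrastive_loss([0.0, 0.0], [2.0, 0.0], same_family=False)
```

Minimizing this loss over many pairs is what shapes the embedding space so that even families with very few samples cluster usably.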


How does misconfiguration of analytic services compromise mobile privacy?

作者: Zhang, Xueling and Wang, Xiaoyin and Slavin, Rocky and Breaux, Travis and Niu, Jianwei
关键词: analytic services, configuration, mobile application, privacy, program analysis

Abstract

Mobile application (app) developers commonly utilize analytic services to analyze their app users’ behavior to support debugging, improve service quality, and facilitate advertising. Anonymization and aggregation can reduce the sensitivity of such behavioral data, and analytic services therefore often encourage the use of such protections. However, these protections are not directly enforced, so it is possible for developers to misconfigure the analytic services and expose personal information, which may cause greater privacy risks. Since people use apps in many aspects of their daily lives, such misconfigurations may lead to the leaking of sensitive personal information such as a user’s real-time location, health data, or dating preferences. To study this issue and identify potential privacy risks due to such misconfigurations, we developed a semi-automated approach, the Privacy-Aware Analytics Misconfiguration Detector (PAMDroid), which enables our empirical study of misconfigurations of analytic services. This paper describes a study of 1,000 popular apps using top analytic services, in which we found misconfigurations in 120 apps. In 52 of the 120 apps, the misconfigurations lead to a violation of either the analytic service providers’ terms of service or the app’s own privacy policy.

DOI: 10.1145/3377811.3380401


Interpreting cloud computer vision pain-points: a mining study of stack overflow

作者: Cummaudo, Alex and Vasa, Rajesh and Barnett, Scott and Grundy, John and Abdelrazek, Mohamed
关键词: computer vision, documentation, empirical study, intelligent services, pain points, stack overflow

Abstract

Intelligent services are becoming increasingly pervasive; application developers want to leverage the latest advances in areas such as computer vision to provide new services and products to users, and large technology firms enable this via RESTful APIs. While such APIs promise easy-to-integrate on-demand machine intelligence, their current design, documentation and developer interface hides much of the underlying machine learning techniques that power them. Such APIs look and feel like conventional APIs but abstract away data-driven probabilistic behaviour—the implications of a developer treating these APIs in the same way as other, traditional cloud services, such as cloud storage, are of concern. The objective of this study is to determine the various pain-points developers face when implementing systems that rely on the most mature of these intelligent services, specifically those that provide computer vision. We use Stack Overflow to mine indications of the frustrations that developers appear to face when using computer vision services, classifying their questions against two recent classification taxonomies (documentation-related and general questions). We find that, unlike in mature fields such as mobile development, the types of questions developers ask differ, indicating a shallow understanding of the underlying technology that empowers such systems. We discuss several implications of these findings through the lens of learning taxonomies to suggest how the software engineering community can improve these services, and comment on how developers use them.

DOI: 10.1145/3377811.3380404


SLEMI: finding simulink compiler bugs through equivalence modulo input (EMI)

作者: Chowdhury, Shafiul Azam and Shrestha, Sohil Lal and Johnson, Taylor T. and Csallner, Christoph
关键词: cyber-physical systems, differential testing, equivalence modulo input, model mutation, simulink

Abstract

This demo presents usage and implementation details of SLEMI. SLEMI is the first tool to automatically find compiler bugs in the widely used cyber-physical system development tool Simulink via Equivalence Modulo Input (EMI). EMI is a recent twist on differential testing that promises more efficiency. SLEMI implements several novel mutation techniques that deal with CPS language features that are not found in procedural languages. This demo also introduces a new EMI-based mutation strategy that has already found a new confirmed bug in Simulink version R2018a. To increase SLEMI’s efficiency further, this paper presents parallel generation of random, valid Simulink models. A video demo of SLEMI is available at https://www.youtube.com/watch?v=oliPgOLT6eY.

DOI: 10.1145/3377812.3382147
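
The EMI oracle is simple to state: profile a program on an input, mutate only code that this input never executes, and then require the original and the mutant to produce identical output for that input. SLEMI applies this to Simulink model blocks; the Python toy below only illustrates the oracle on a hypothetical two-branch function, and is not the tool's mutation machinery.

```python
def run_model(x):
    """Toy stand-in for a compiled program: the negative branch is dead
    for the non-negative profiled input below."""
    if x >= 0:
        return x * 2
    return -x  # never executed for the profiled input


def emi_mutant(x):
    """EMI variant: the branch that the profiled input never covered has
    been mutated away.  By construction, for the profiled input the
    variant is equivalent to the original."""
    return x * 2


profiled_input = 7                      # covers only the x >= 0 path
original = run_model(profiled_input)
mutant = emi_mutant(profiled_input)
mismatch = original != mutant           # True would flag a compiler bug
```

In differential testing, both versions are fed through the compiler under test; any output mismatch on the profiled input implicates the compiler, since the two sources are equivalent modulo that input.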


PROMISE: high-level mission specification for multiple robots

作者: García, Sergio and Pelliccione, Patrizio and Menghi, Claudio and Berger, Thorsten and Bureš, Tomáš
关键词: No keywords

Abstract

Service robots, a type of robot that performs useful tasks for humans, are foreseen to be broadly used in the near future in both social and industrial scenarios. Those robots will be required to operate in dynamic environments, collaborating among themselves or with users. Specifying the list of requested tasks to be achieved by a robotic team is far from trivial. Therefore, mission specification languages and tools need to be expressive enough to allow the specification of complex missions (e.g., detailing recovery actions), while remaining accessible to domain experts who might not be knowledgeable about programming languages. To support domain experts, we developed PROMISE, a Domain-Specific Language that allows mission specification for multiple robots in a user-friendly, yet rigorous manner. PROMISE is built as an Eclipse plugin that provides a textual and a graphical interface for mission specification. Our tool is in turn integrated into a software framework which provides functionalities such as: (1) automatic generation of missions from the specification; (2) sending of missions to the robotic team; and (3) interpretation and management of missions during execution time. PROMISE and its framework implementation have been validated through simulation and real-world experiments with four different robot models. Video: https://youtu.be/RMtqwY2GOlQ

DOI: 10.1145/3377812.3382143


SMRL: a metamorphic security testing tool for web systems

作者: Mai, Phu X. and Goknil, Arda and Pastore, Fabrizio and Briand, Lionel C.
关键词: No keywords

Abstract

We present a metamorphic testing tool that alleviates the oracle problem in security testing. The tool enables engineers to specify metamorphic relations that capture security properties of Web systems. It automatically tests Web systems to detect vulnerabilities based on those relations. We provide a domain-specific language accompanied by an Eclipse editor to facilitate the specification of metamorphic relations. The tool automatically collects the input data and transforms the metamorphic relations into executable Java code in order to automatically perform security testing based on the collected data. The tool has been successfully evaluated on a commercial system and a leading open source system (Jenkins). Demo video: https://youtu.be/9kx6u9LsGxs.

DOI: 10.1145/3377812.3382152
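
A metamorphic relation sidesteps the oracle problem because it constrains pairs of executions rather than any single expected output. The sketch below shows one such security-flavoured relation against a hypothetical web system (the `http_get` stub and the `/private` path are invented for illustration; the paper's tool generates executable Java checks from relations written in its DSL).

```python
def http_get(path, authenticated):
    """Hypothetical system under test: /private requires authentication."""
    if path == "/private" and not authenticated:
        return ("403", "")
    return ("200", f"contents of {path}")


def relation_holds(path):
    """Example metamorphic relation: replaying the same request for a
    protected resource without credentials must not yield the same body
    as the authenticated request.  No oracle describing the exact page
    content is needed -- only the two runs are compared."""
    _, body_auth = http_get(path, authenticated=True)
    _, body_anon = http_get(path, authenticated=False)
    if path.startswith("/private"):
        return body_anon != body_auth
    return True  # relation places no constraint on public resources


secure = relation_holds("/private")   # True: the stub enforces auth
public_ok = relation_holds("/index")  # public paths are unconstrained
```

A system that served the protected body to the unauthenticated follow-up request would violate the relation, exposing an authorization vulnerability without anyone having specified what `/private` should contain.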


WasmView: visual testing for webassembly applications

作者: Romano, Alan and Wang, Weihang
关键词: No keywords

Abstract

WebAssembly is the newest language to arrive on the web. It features a binary code format to serve as a compilation target for native languages such as C, C++, and Rust and allows native applications to be ported for web usage. In the current implementation, WebAssembly requires interaction with JavaScript at a minimum to initialize, and additionally to interact with Web APIs. As a result, understanding the function calls between WebAssembly and JavaScript is crucial for testing, debugging, and maintaining applications utilizing this new language. To this end, we develop a tool, WasmView, to visualize function calls made between WebAssembly and JavaScript in a web application. WasmView also records the stack traces and screenshots of applications. This tool can help in supporting visual testing for interactive applications and assisting refactoring for code updates. The demo video for WasmView can be viewed at https://youtu.be/kjKxL7L7zxI and the source code can be found at https://github.com/wasmview/wasmview.github.io.

DOI: 10.1145/3377812.3382155


AppTestMigrator: a tool for automated test migration for Android apps

作者: Behrang, Farnaz and Orso, Alessandro
关键词: GUI testing, mobile apps, test migration

Abstract

The use of mobile apps is increasingly widespread, and much effort is put into testing these apps to make sure they behave as intended. In this demo, we present AppTestMigrator, a technique and tool for migrating test cases between apps with similar functionality. The intuition behind AppTestMigrator is that many apps share similarities in their functionality, and these similarities often result in conceptually similar user interfaces (through which that functionality is accessed). AppTestMigrator attempts to automatically transform the sequence of events and oracles in a test case for an app (source app) into events and oracles for another app (target app). The results of our preliminary evaluation show the effectiveness of AppTestMigrator in migrating test cases between mobile apps with similar functionality. Video URL: https://youtu.be/WQnfEcwYqa4

DOI: 10.1145/3377812.3382149


Seraph: enabling cross-platform security analysis for EVM and WASM smart contracts

作者: Yang, Zhiqiang and Liu, Han and Li, Yue and Zheng, Huixuan and Wang, Lei and Chen, Bangdao
关键词: connector API, smart contracts, symbolic semantic graph

Abstract

As blockchain has become increasingly popular across various industries in recent years, many companies have started designing and developing their own smart contract platforms to enable better services on blockchain. While smart contracts are notoriously vulnerable to external attacks, such platform diversity further amplifies the security challenge. To mitigate this problem, we designed the first cross-platform security analyzer, Seraph, for smart contracts. Specifically, Seraph enables automated security analysis for different platforms built on two mainstream virtual machine architectures, i.e., EVM and WASM. To this end, Seraph introduces a set of general connector APIs to abstract interactions between the virtual machine and the blockchain, e.g., loading and updating storage data on the blockchain. Moreover, we propose a symbolic semantic graph to model critical dependencies and decouple security analysis from contract code. Our preliminary evaluation on four existing smart contract platforms demonstrates the potential of Seraph to find security threats both flexibly and accurately. A video of Seraph is available at https://youtu.be/wxixZkVqUsc.

DOI: 10.1145/3377812.3382157


The SmartSHARK ecosystem for software repository mining

作者: Trautsch, Alexander and Trautsch, Fabian and Herbold, Steffen and Ledel, Benjamin and Grabowski, Jens
关键词: No keywords

Abstract

Software repository mining is the foundation for many empirical software engineering studies. The collection and analysis of detailed data can be challenging, especially if the data is to be shared to enable replicable research and open science practices. SmartSHARK is an ecosystem that supports replicable and reproducible research based on software repository mining.

DOI: 10.1145/3377812.3382139


DeepMutation: a neural mutation tool

作者: Tufano, Michele and Kimko, Jason and Wang, Shiya and Watson, Cody and Bavota, Gabriele and Di Penta, Massimiliano and Poshyvanyk, Denys
关键词: mutation testing, neural networks, software testing

Abstract

Mutation testing can be used to assess the fault-detection capabilities of a given test suite. To this aim, two characteristics of mutation testing frameworks are of paramount importance: (i) they should generate mutants that are representative of real faults; and (ii) they should provide a complete tool chain able to automatically generate, inject, and test the mutants. To address the first point, we recently proposed an approach using a Recurrent Neural Network Encoder-Decoder architecture to learn mutants from ~787k faults mined from real programs. The empirical evaluation of this approach confirmed its ability to generate mutants representative of real faults. In this paper, we address the second point, presenting DeepMutation, a tool wrapping our deep learning model into a fully automated tool chain able to generate, inject, and test mutants learned from real faults. Video: https://sites.google.com/view/learning-mutation/deepmutation

DOI: 10.1145/3377812.3382146


TimeTracer: a tool for back in time traceability replaying

作者: Mayr-Dorn, Christoph and Vierhauser, Michael and Keplinger, Felix and Bichler, Stefan and Egyed, Alexander
关键词: Jira, history, process artifacts, replay, tracing

Abstract

Ensuring correct trace links between different types of artifacts (requirements, architecture, or code) is crucial for compliance in safety-critical domains, for consistency checking, or for change impact assessment. The point in time when trace links are created (i.e., immediately during development, or weeks/months later), however, has a significant impact on their quality. Assessing quality thus relies on obtaining a historical view of artifacts and their trace links at a certain point in the past, which provides valuable insights into when, how, and by whom trace links were created. This work presents TimeTracer, a tool that allows engineers to go back in time - not just to view the history of artifacts, but also the history of the trace links associated with these artifacts. TimeTracer allows easy integration with different development support tools such as Jira, and it stores artifacts, traces, and changes thereof in a unified artifact model.

DOI: 10.1145/3377812.3382141


SimilarAPI: mining analogical APIs for library migration

作者: Chen, Chunyang
关键词: analogical API, skip thoughts, word embedding

Abstract

Establishing API mappings between libraries is a prerequisite step for library migration tasks. Manually establishing API mappings is tedious due to the large number of APIs to be examined, and existing methods based on supervised learning require already-ported or functionally similar applications, which are often unavailable. Therefore, we propose an unsupervised deep-learning-based approach that embeds both API usage semantics and API description (name and document) semantics into a vector space for inferring likely analogical API mappings between libraries. We implement a proof-of-concept website SimilarAPI (https://similarapi.appspot.com) which can recommend analogical APIs for 583,501 APIs of 111 pairs of analogical Java libraries with diverse functionalities. Video: https://youtu.be/EAwD6l24vLQ
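The embedding-based matching idea can be sketched with toy vectors. The API names and numbers below are hypothetical; a real system would learn the embeddings from API usage sequences and API descriptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings for illustration only.
EMBEDDINGS = {
    "json.JSONObject.put":     [0.90, 0.10, 0.20],
    "gson.JsonObject.add":     [0.85, 0.15, 0.25],
    "junit.Assert.assertTrue": [0.10, 0.90, 0.30],
}

def analogical_api(source_api, candidate_apis):
    """Recommend the candidate API whose embedding is closest to the source API's."""
    return max(candidate_apis,
               key=lambda c: cosine(EMBEDDINGS[source_api], EMBEDDINGS[c]))
```

Under these toy vectors, `analogical_api("json.JSONObject.put", ...)` ranks `gson.JsonObject.add` above an unrelated API, mirroring how SimilarAPI suggests counterparts across libraries.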

DOI: 10.1145/3377812.3382140


FeatureNET: diversity-driven generation of deep learning models

作者: Ghamizi, Salah and Cordy, Maxime and Papadakis, Mike and Traon, Yves Le
关键词: AutoML, NAS, configuration search, neural architecture search

Abstract

We present FeatureNET, an open-source Neural Architecture Search (NAS) tool that generates diverse sets of Deep Learning (DL) models. FeatureNET relies on a meta-model of deep neural networks, consisting of generic configurable entities. Then, it uses tools developed in the context of software product lines to generate diverse DL models (maximizing the differences between the generated models). The models are translated to Keras and can be integrated into typical machine learning pipelines. FeatureNET allows researchers to seamlessly generate a large variety of models. Thereby, it helps in choosing appropriate DL models and in performing experiments with diverse models (mitigating potential threats to validity). As a NAS method, FeatureNET successfully generates models that perform on par with handcrafted models.

DOI: 10.1145/3377812.3382153


EvalDNN: a toolbox for evaluating deep neural network models

作者: Tian, Yongqiang and Zeng, Zhihua and Wen, Ming and Liu, Yepang and Kuo, Tzu-yang and Cheung, Shing-Chi
关键词: deep learning model, evaluation

Abstract

Recent studies have shown that the performance of deep learning models should be evaluated using various important metrics such as robustness and neuron coverage, besides the widely-used prediction accuracy metric. However, major deep learning frameworks currently only provide APIs to evaluate a model’s accuracy. In order to comprehensively assess a deep learning model, framework users and researchers often need to implement new metrics by themselves, which is a tedious job. What is worse, due to the large number of hyper-parameters and inadequate documentation, evaluation results of some deep learning models are hard to reproduce, especially when the models and metrics are both new. To ease the model evaluation in deep learning systems, we have developed EvalDNN, a user-friendly and extensible toolbox supporting multiple frameworks and metrics with a set of carefully designed APIs. Using EvalDNN, evaluation of a pre-trained model with respect to different metrics can be done with a few lines of code. We have evaluated EvalDNN on 79 models from TensorFlow, Keras, GluonCV, and PyTorch. As a result of our effort made to reproduce the evaluation results of existing work, we release a performance benchmark of popular models, which can be a useful reference to facilitate future research. The tool and benchmark are available at https://github.com/yqtianust/EvalDNN and https://yqtianust.github.io/EvalDNN-benchmark/, respectively. A demo video of EvalDNN is available at: https://youtu.be/v69bNJN2bJc.

DOI: 10.1145/3377812.3382133


FuRong: fusing report of automated Android testing on multi-devices

作者: Tian, Yuanhan and Yu, Shengcheng and Fang, Chunrong and Li, Peiyuan
关键词: Android testing, automated testing, bug classification, bug report

Abstract

Automated testing has been widely used to ensure the quality of Android applications. However, incomprehensible testing results make it difficult for developers to understand and fix potential bugs. This paper proposes FuRong, a novel tool that fuses automated testing results from multiple devices into bug reports of high readability and strong guiding ability. FuRong builds a bug model with complete context information, such as screenshots, operation sequences, and logs from multiple devices, and then leverages a pre-trained Decision Tree classifier (with 18 bug category labels) to classify bugs. FuRong deduplicates the classified bugs via Levenshtein distance and finally generates an easy-to-understand report that provides not only the context information of each bug but also possible causes and fix suggestions for each bug category. An empirical study of 8 open-source Android applications with automated testing on 20 devices has been conducted; the results show the effectiveness of FuRong, which achieves a bug classification precision of 93.4% and a bug classification accuracy of 87.9%. Video URL: https://youtu.be/LUkFTc32B6k
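The Levenshtein-based deduplication step can be illustrated with a minimal sketch; the bug traces and threshold below are made up for illustration, not taken from FuRong:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def deduplicate(bug_traces, threshold=3):
    """Keep one representative per cluster of near-identical bug traces."""
    kept = []
    for trace in bug_traces:
        if all(levenshtein(trace, k) > threshold for k in kept):
            kept.append(trace)
    return kept
```

Traces within the edit-distance threshold of an already-kept trace are treated as duplicates of the same bug and dropped.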

DOI: 10.1145/3377812.3382138


Phoenix: a tool for automated data-driven synthesis of repairs for static analysis violations

作者: Yoshida, Hiroaki and Bavishi, Rohan and Hotta, Keisuke and Nemoto, Yusuke and Prasad, Mukul R. and Kikuchi, Shinji
关键词: program repair, programming-by-example, static analysis

Abstract

One of the major drawbacks of traditional automatic program repair (APR) techniques is their dependence on a test suite as a repair specification. In practice, it is often hard to obtain specification-quality test suites. This limits the performance and hence the viability of such test-suite-based approaches. On the other hand, static-analysis-based bug finding tools are increasingly being adopted in industry but still face adoption challenges, since the reported violations are viewed as not easily actionable. In previous work, we proposed a novel technique that addresses both challenges by automatically generating high-quality patches for static analysis violations, learning from previous repair examples. In this paper, we present a tool, Phoenix, implementing this technique. We describe the architecture, user interfaces, and salient features of Phoenix, as well as specific practical use cases of its technology. A video demonstrating Phoenix is available at https://phoenix-tool.github.io/demo-video.html.

DOI: 10.1145/3377812.3382150


VITALSE: visualizing eye tracking and biometric data

作者: Roy, Devjeet and Fakhoury, Sarah and Arnaoudova, Venera
关键词: No keywords

Abstract

Recent research in empirical software engineering is applying techniques from neurocognitive science and breaking new ground in the ways that researchers can model and analyze the cognitive processes of developers as they interact with software artifacts. However, given the novelty of this line of research, only one tool exists to help researchers represent and analyze this kind of multi-modal biometric data. While this tool does help with visualizing temporal eye tracking and physiological data, it does not allow for the mapping of physiological data to source code elements, instead projecting information over images of code. One drawback of this is that researchers are still unable to meaningfully combine and map physiological and eye tracking data to source code artifacts. The use of images also bars the support of long or multiple code files, which prevents researchers from analyzing data from experiments conducted in realistic settings. To address these drawbacks, we propose VITALSE, a tool for the interactive visualization of combined multi-modal biometric data for software engineering tasks. VITALSE provides interactive and customizable temporal heatmaps created with synchronized eye tracking and biometric data. The tool supports analysis on multiple files, user-defined annotations for points of interest over source code elements, and high-level customizable metric summaries for the provided dataset. VITALSE, a video demonstration, and sample data to demonstrate its capabilities can be found at http://www.vitalse.app.

DOI: 10.1145/3377812.3382154


BigTest: a symbolic execution based systematic test generation tool for Apache spark

作者: Gulzar, Muhammad Ali and Musuvathi, Madanlal and Kim, Miryung
关键词: data intensive scalable computing, dataflow programs, map reduce, symbolic execution, test generation

Abstract

Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are prevalent in many production services. Despite their popularity, the quality of DISC applications suffers due to a lack of exhaustive and automated testing. Current practices of testing DISC applications are limited to using a small random sample of the entire input dataset, which rarely exposes program faults. Unlike SQL queries, testing DISC applications poses new challenges due to a composition of both dataflow and relational operators, and user-defined functions (UDF) that could be arbitrarily long and complex. To address this problem, we demonstrate a new white-box testing framework called BigTest that takes an Apache Spark program as input and automatically generates synthetic, concrete data for effective and efficient testing. BigTest combines the symbolic execution of UDFs with the logical specifications of dataflow and relational operators to explore all paths in a DISC application. Our experiments show that BigTest is capable of generating test data that can reveal up to 2X more faults than the entire data set with 194X less testing time. We implement BigTest in a Java-based command line tool with a pre-compiled binary jar. It exposes a configuration file in which a user can edit preferences, including the path of a target program, the upper bound of loop exploration, and a choice of theorem solver. The demonstration video of BigTest is available at https://youtu.be/OeHhoKiDYso and BigTest is available at https://github.com/maligulzar/BigTest.

DOI: 10.1145/3377812.3382145


PG-KLEE: trading soundness for coverage

作者: Rutledge, Richard and Orso, Alessandro
关键词: input generation, program analysis, symbolic execution

Abstract

Comprehensive test inputs are an essential ingredient for dynamic software analysis techniques, yet are typically impossible to obtain and maintain. Automated input generation techniques can supplant manual effort in many contexts, but they also exhibit inherent limitations in practical applications. Therefore, the best approach to input generation for a given application task necessarily entails compromise. Most symbolic execution approaches maintain soundness by sacrificing completeness. In this paper, we take the opposite approach and demonstrate PG-KLEE, an input generation tool that over-approximates program behavior to achieve complete coverage. We also summarize some empirical results that validate our claims. Our technique is detailed in an earlier paper [16], and the source code of PG-KLEE is available from [2]. Video URL: https://youtu.be/b1ajzW6YWds

DOI: 10.1145/3377812.3382148


RTj: a Java framework for detecting and refactoring rotten green test cases

作者: Martinez, Matias and Etien, Anne and Ducasse, Stéphane
关键词: No keywords

Abstract

Rotten green tests are passing tests which have at least one assertion that is not executed. They give developers a false sense of trust in the code. In this paper, we present RTj, a framework that analyzes test cases from Java projects with the goal of detecting and refactoring rotten test cases. RTj automatically discovered 418 rotten tests from 26 open-source Java projects hosted on GitHub. Using RTj, developers have an automated recommendation of the tests that need to be modified for improving the quality of the applications under test. A video is available at: https://youtu.be/Uqxf-Wzp3Mg
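A minimal example of the phenomenon, sketched in Python's unittest for brevity (RTj itself targets Java): a test that stays green although its only assertion is never executed. The `discount` function and the execution-recording list are hypothetical, added just to make the rot observable:

```python
import unittest

executed_assertions = []  # records which assertions actually ran

def discount(price, is_member):
    return price * (0.9 if is_member else 1.0)

class DiscountTest(unittest.TestCase):
    def test_member_discount(self):
        price = discount(100, is_member=False)  # wrong fixture: member flag is off
        if price < 100:                         # this guard is never true...
            executed_assertions.append("assertEqual")
            self.assertEqual(price, 90.0)       # ...so the assertion never runs

suite = unittest.TestLoader().loadTestsFromTestCase(DiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)

# The test passes, yet no assertion was executed: a rotten green test.
is_rotten = result.wasSuccessful() and not executed_assertions
```

Detecting this requires both coverage of assertion statements and the test verdict, which is the kind of combined analysis RTj automates.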

DOI: 10.1145/3377812.3382151


GeekyNote: a technical documentation tool with coverage, backtracking, traces, and couplings

作者: Cheng, Yung-Pin and Hsiung, Wei-Nien and Wu, Yu-Shan and Chen, Li-Hsuan
关键词: document coverage, document effectiveness, program documentation, technical documentation

Abstract

Understanding an unfamiliar program is always a daunting task for any programmer, either experienced or inexperienced. Many studies have shown that even an experienced programmer who is already familiar with the code may still need to rediscover the code frequently during software maintenance. The difficulty of program comprehension is much more intense when a system is completely new. One well-known solution to this notorious problem is to create effective technical documentation to make up for the lack of knowledge. The purpose of technical documentation is to achieve the transfer of knowledge. However, creating effective technical documentation has been impeded by many problems in practice [1]. In this paper, we propose a novel tool called GeekyNote to address the major challenges in technical documentation. The key ideas GeekyNote proposes are: (1) documents are annotated to versioned source code transparently; (2) formal textual writings are discouraged and screencasts (or other forms of documents) are encouraged; (3) the up-to-dateness between documents and code can be detected, measured, and managed; (4) documentation that works like a debugging trace is supported; (5) couplings can be easily created and managed for future maintenance needs; (6) how well a system is documented can be measured. A demo video can be accessed at https://youtu.be/cBueuPVDgWM.

DOI: 10.1145/3377812.3382136


DroidMutator: an effective mutation analysis tool for Android applications

作者: Liu, Jian and Xiao, Xusheng and Xu, Lihua and Dou, Liang and Podgurski, Andy
关键词: Android, mutation analysis, operators

Abstract

With the rapid growth of Android devices, techniques that ensure high quality of mobile applications (i.e., apps) are receiving more and more attention. It is well-accepted that mutation analysis is an effective approach to simulate and locate realistic faults in a program. However, there exist few practical mutation analysis tools for Android apps. Even worse, existing mutation analysis tools tend to generate a large number of mutants, hindering broader adoption of mutation analysis, not to mention the high number of stillborn mutants among them. Additionally, mutation operators are usually pre-defined by such tools, giving users little ability to define specific operators to meet their own needs. To address the aforementioned problems, we propose DroidMutator, a mutation analysis tool specifically for Android apps with configurability and extensibility. DroidMutator reduces the number of generated stillborn mutants through type checking, and the scope of mutation operators can be customized so that it only generates mutants in specific code blocks, thus generating fewer mutants with more concentrated purposes. Furthermore, it allows users to easily extend their mutation operators. We have applied DroidMutator to 50 open-source Android apps, and our experimental results show that DroidMutator effectively reduces the number of stillborn mutants and improves the efficiency of mutation analysis. Demo link: https://github.com/SQS-JLiu/DroidMutator Video link: https://youtu.be/dtD0oTVioHM

DOI: 10.1145/3377812.3382134


An SLR-tool: search process in practice: a tool to conduct and manage systematic literature review (SLR)

作者: Hinderks, Andreas and Mayo, Francisco José
关键词: SLR, systematic literature review, tool

Abstract

Systematic Literature Reviews (SLRs) have established themselves as a method in the field of software engineering. The aim of an SLR is to systematically analyze existing literature in order to answer a research question. In this paper, we present a tool to support an SLR process. The main focus of the SLR tool (https://www.slr-tool.com/) is to create and manage an SLR project, to import search results from search engines, and to manage search results by including or excluding each paper. A demo video of our SLR tool is available at https://youtu.be/Jan8JbwiE4k.

DOI: 10.1145/3377812.3382137


Nimbus: improving the developer experience for serverless applications

作者: Chatley, Robert and Allerton, Thomas
关键词: developer experience, developer tools, serverless

Abstract

We present Nimbus, a framework for writing and deploying Java applications on a Function-as-a-Service (“serverless”) platform. Nimbus aims to soothe four main pain points experienced by developers working on serverless applications: that testing can be difficult, that deployment can be a slow and painful process, that it is challenging to avoid vendor lock-in, and that long cold start times can introduce unwelcome latency to function invocations. Nimbus provides a number of features that aim to overcome these challenges when working with serverless applications. It uses an annotation-based configuration to avoid having to work with large configuration files. It aims to allow the code written to be cloud-agnostic. It provides an environment for local testing where the complete application can be run locally before deployment. Lastly, Nimbus provides mechanisms for optimising the contents and size of the artifacts that are deployed to the cloud, which helps to reduce both deployment times and cold start times. Video: https://www.youtube.com/watch?v=0nYchh8jdY4

DOI: 10.1145/3377812.3382135


mCoq: mutation analysis for Coq verification projects

作者: Jain, Kush and Palmskog, Karl and Celik, Ahmet and Arias, Emilio Jesús
关键词: Coq, deductive verification, mutation analysis, proof assistants

Abstract

Software developed and verified using proof assistants, such as Coq, can provide trustworthiness beyond that of software developed using traditional programming languages and testing practices. However, guarantees from formal verification are only as good as the underlying definitions and specification properties. If properties are incomplete, flaws in definitions may not be captured during verification, which can lead to unexpected system behavior and failures. Mutation analysis is a general technique for evaluating specifications for adequacy and completeness, based on making small-scale changes to systems and observing the results. We demonstrate mCoq, the first mutation analysis tool for Coq projects. mCoq changes Coq definitions, with each change producing a modified project version, called a mutant, whose proofs are exhaustively checked. If checking succeeds, i.e., the mutant is live, this may indicate specification incompleteness. Since proof checking can take a long time, we optimized mCoq to perform incremental and parallel processing of mutants. By applying mCoq to popular Coq libraries, we found several instances of incomplete and missing specifications manifested as live mutants. We believe mCoq can be useful to proof engineers and researchers for analyzing software verification projects. The demo video for mCoq can be viewed at: https://youtu.be/QhigpfQ7dNo.

DOI: 10.1145/3377812.3382156


MPI-SV: a symbolic verifier for MPI programs

作者: Chen, Zhenbang and Yu, Hengbiao and Fu, Xianjin and Wang, Ji
关键词: MPI-SV, message passing interface, symbolic verification

Abstract

Message passing is the primary programming paradigm in high-performance computing. However, developing message passing programs is challenging due to the non-determinism caused by parallel execution and complex programming features such as non-deterministic communications and asynchrony. We present MPI-SV, a symbolic verifier for parallel C programs that use the Message Passing Interface (MPI). MPI-SV combines symbolic execution and model checking in a synergistic manner to improve scalability and enlarge the scope of verifiable properties. We have applied MPI-SV to real-world MPI C programs. The experimental results indicate that MPI-SV can, on average, achieve 19x speedups in verifying deadlock-freedom and 5x speedups in finding counter-examples. MPI-SV can be accessed at https://mpi-sv.github.io, and the demonstration video is at https://youtu.be/zzCY0CPDNCw.

DOI: 10.1145/3377812.3382144


DCO analyzer: local controllability and observability analysis and enforcement of distributed test scenarios

作者: Lima, Bruno and Faria, João
关键词: distributed systems testing, integration testing, local controllability, local observability

Abstract

To ensure interoperability and the correct behavior of heterogeneous distributed systems in key scenarios, it is important to conduct automated integration tests, based on distributed test components (called local testers) that are deployed close to the system components to simulate inputs from the environment and monitor the interactions with the environment and other system components. We say that a distributed test scenario is locally controllable and locally observable if test inputs can be decided locally and conformance errors can be detected locally by the local testers, without the need for exchanging coordination messages between the test components during test execution (which may reduce the responsiveness and fault detection capability of the test harness). DCO Analyzer is the first tool that checks if distributed test scenarios specified by means of UML sequence diagrams exhibit those properties, and automatically determines a minimum number of coordination messages to enforce them. The demo video for DCO Analyzer can be found at https://youtu.be/LVIusK36_bs.

DOI: 10.1145/3377812.3382142


Uncertainty-guided testing and robustness enhancement for deep learning systems

作者: Zhang, Xiyue
关键词: No keywords

Abstract

Deep learning (DL) systems, though being widely used, still suffer from quality and reliability issues. Researchers have put much effort into investigating these issues. One promising direction is to leverage uncertainty, an intrinsic characteristic of DL systems when making decisions, to better understand their erroneous behavior. DL system testing is an effective method to reveal potential defects before deployment into safety- and security-critical applications. Various techniques and criteria have been designed to generate defect-triggers, i.e., adversarial examples (AEs). However, whether these test inputs can achieve a full-spectrum examination of DL systems remains unknown, and there is still a lack of understanding of the relation between AEs and DL uncertainty. In this work, we first conduct an empirical study to uncover the characteristics of AEs from the perspective of uncertainty. Then, we propose a novel approach to generate inputs that are missed by existing techniques. Further, we investigate the usefulness and effectiveness of the data for DL robustness enhancement.

DOI: 10.1145/3377812.3382160


Evaluation of brain activity while pair programming

作者: Thapaliya, Ananga
关键词: No keywords

Abstract

In this research, we investigate the effect of pair programming on the mind of software developers using EEG data, and how it affects the overall outcome of their task. We use an EEG device to measure the brain-behavior relations of the developer and analyze the electromagnetic waves using ERD and correlation. We measure the concentration level, whether high or low, under three different cases: solo programming, pair programming (navigator), and pair programming (driver). The preliminary results confirm a higher concentration level in pair programming compared to solo programming.

DOI: 10.1145/3377812.3382161


Machine translation testing via pathological invariance

作者: Gupta, Shashij
关键词: No keywords

Abstract

Due to the rapid development of deep neural networks, in recent years, machine translation software has been widely adopted in people’s daily lives, such as communicating with foreigners or understanding political news from neighbouring countries. However, machine translation software could return incorrect translations because of the complexity of the underlying network. To address this problem, we introduce a novel methodology called PaInv for validating machine translation software. Our key insight is that sentences of different meanings should not have the same translation (i.e., pathological invariance). Specifically, PaInv generates syntactically similar but semantically different sentences by replacing one word in the sentence, and filters out unsuitable sentences based on both syntactic and semantic information. We have applied PaInv to Google Translate using 200 English sentences as input with three language settings: English→Hindi, English→Chinese, and English→German. PaInv can accurately find 331 pathological invariants in total, revealing more than 100 translation errors.
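The core check can be sketched as follows. The toy "translator" below is a hypothetical stand-in that deliberately drops adjectives, simulating the kind of bug this style of testing hunts for; it is not how PaInv queries real translation systems:

```python
def toy_translate(sentence):
    # Deliberately buggy stand-in translator: it silently drops some adjectives,
    # so sentences differing only in those words get identical "translations".
    dropped = {"tall", "short"}
    return " ".join(w for w in sentence.split() if w not in dropped)

def find_pathological_invariants(sentence, replacements, translate):
    """Replace one word at a time; flag variants whose translation is unchanged.

    A semantically different variant with an identical translation violates
    pathological invariance and points at a likely translation error.
    """
    base = translate(sentence)
    findings = []
    for old, new in replacements:
        variant = sentence.replace(old, new, 1)
        if variant != sentence and translate(variant) == base:
            findings.append(variant)
    return findings
```

In a real setting, `replacements` would be limited to word pairs that keep the sentence syntactically valid while changing its meaning, which is exactly the filtering step the abstract describes.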

DOI: 10.1145/3377812.3382162


作者: Shrestha, Sohil Lal
关键词: No keywords

Abstract

Testing cyber-physical system (CPS) development tools such as MathWorks’ Simulink is very important, as they are widely used in the design, simulation, and verification of CPS data-flow models. Existing randomized differential testing frameworks such as SLforge leverage semi-formal Simulink specifications to guide random model generation, which requires significant research and engineering investment along with the need to manually update the tool whenever MathWorks updates the model validity rules. To address these limitations, we propose to learn validity rules automatically by training a language model with our framework DeepFuzzSL on an existing corpus of Simulink models. In our experiments, DeepFuzzSL consistently generates over 90% valid Simulink models and also found 2 bugs confirmed by MathWorks Support.

DOI: 10.1145/3377812.3382163


Playing with your project data in scrum retrospectives

作者: Matthies, Christoph
关键词: agile, retrospective, software repository mining

Abstract

Modern, agile software development methods rely on iterative work and improvement cycles to deliver their claimed benefits. In Scrum, the most popular agile method, process improvement is implemented through regular Retrospective meetings. In these meetings, team members reflect on the latest development iteration and decide on improvement actions. To identify potential issues, data on the completed iteration needs to be gathered. The Scrum method itself does not prescribe these steps in detail. However, Retrospective games, i.e. interactive group activities, have been proposed to encourage the sharing of experiences and problems. These activities mostly rely on the collected perceptions of team members. However, modern software development practices produce a large variety of digital project artifacts, e.g. commits in version control systems or test run results, which contain detailed information on performed teamwork. We propose taking advantage of this information in new, data-driven Retrospective activities, allowing teams to gain additional insights based on their own team-specific data.

DOI: 10.1145/3377812.3382164


An empirical study of the first contributions of developers to open source projects on GitHub

作者: Subramanian, Vikram N.
关键词: No keywords

Abstract

The popularity of Open Source Software (OSS) is at an all-time high, and for it to remain so, it is vital for new developers to continually join and contribute to the OSS community. In this paper, to better understand the first-time contributor, we study the characteristics of the first pull request (PR) made to an OSS project by developers. We mine GitHub for the first OSS PR of 3,501 developers to study certain characteristics of PRs, such as language and size. We find that over a third of the PRs were in Java, while C++ was very unpopular. A large fraction of PRs did not even involve writing code, and were a mixture of trivial and non-trivial changes.

DOI: 10.1145/3377812.3382165


Stress testing SMT solvers via type-aware mutation

作者: Zhang, Chengyu
关键词: SMT solver, software testing, test case generation

Abstract

This paper introduces type-aware mutation, a simple, but effective methodology for stress testing Satisfiability Modulo Theories (SMT) solvers. The key idea is mutating the operators of the formula to generate test inputs for differential testing, while considering the types of the operators to ensure the mutants are still valid. The realization of type-aware mutation was evaluated on finding bugs in two state-of-the-art SMT solvers, Z3 and CVC4. During the three months of empirical evaluation, 101 unique, confirmed bugs were found by type-aware mutation, and 87 of them have been fixed. The testing efforts and bugs were well-appreciated by the developers.
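The key idea, swapping an operator only for another with the same type signature so that mutants remain well-typed, can be sketched as follows. The operator groups and flat token-list representation are simplifications for illustration, not the paper's actual implementation over SMT-LIB terms:

```python
# Operators grouped by type signature: swapping within a group keeps the
# formula well-typed, avoiding invalid (stillborn) test inputs.
MUTATION_GROUPS = [
    {"+", "-", "*"},            # Int x Int -> Int
    {"<", "<=", ">", ">="},     # Int x Int -> Bool
    {"and", "or"},              # Bool x Bool -> Bool
]

def type_aware_mutants(tokens):
    """Return all formulas with exactly one operator swapped for a
    same-type alternative; non-operator tokens are left untouched."""
    mutants = []
    for i, tok in enumerate(tokens):
        for group in MUTATION_GROUPS:
            if tok in group:
                for alt in sorted(group - {tok}):
                    mutants.append(tokens[:i] + [alt] + tokens[i + 1:])
    return mutants
```

Each mutant is still a valid formula, so it can be fed to two solvers and any disagreement in their sat/unsat verdicts flags a candidate bug, which is the differential-testing half of the approach.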

DOI: 10.1145/3377812.3382166


The role of egocentric bias in undergraduate Agile software development teams

作者: Ramin, Frederike
关键词: agile software development, egocentric bias, scrum, teamwork

Abstract

The egocentric bias describes the tendency to value one’s own input and perspective higher than that of others. This phenomenon impacts collaboration and teamwork. However, current research on the subject concerning modern software development is lacking. We conducted a case study of 26 final year software engineering students and collected the perceptions of individual contributions to team efforts through regular surveys. We report evidence of an egocentric bias in engineering team members, which decreased over time. In contrast, we found no in-group bias, i.e. favoritism regarding contributions of own team members. We discuss our initial analyses and results, which we hypothesize can be explained by group cohesiveness as well as non-competition and group similarity, respectively.

DOI: 10.1145/3377812.3382167


Studying and suggesting logging locations in code blocks

作者: Li, Zhenhao
关键词: No keywords

Abstract

Developers write logging statements to generate logs and record system execution behaviors to assist in debugging and software maintenance. However, there exist no practical guidelines on where to write logging statements. On one hand, adding too many logging statements may introduce superfluous trivial logs and performance overheads. On the other hand, logging too little may miss necessary runtime information. Thus, properly deciding the logging location is a challenging task, and a finer-grained understanding of where to write logging statements is needed to assist developers in making logging decisions. In this paper, we conduct a comprehensive study to uncover guidelines on logging locations at the code block level. We analyze logging statements and their surrounding code by combining both deep learning techniques and manual investigations. From our preliminary results, we find that our deep learning models achieve over 90% in precision and recall when trained using both syntactic (e.g., nodes in the abstract syntax tree) and semantic (e.g., variable names) features. However, cross-system models trained using only semantic features achieve 45.6% precision and 73.2% recall, while models trained using syntactic features still have over 90% precision and recall. Our current progress highlights that there is an implicit syntactic logging guideline across systems, and such information may be leveraged to uncover general logging guidelines.

DOI: 10.1145/3377812.3382168


Exploring the relationship between dockerfile quality and project characteristics

作者: Wu, Yiwen
关键词: dockerfile, dockerfile quality, project characteristics

Abstract

Dockerfiles play an important role in the Docker-based software development process, but in practice many Dockerfiles suffer from quality issues. Previous empirical studies showed an association between code quality and project characteristics. However, the relationship between Dockerfile quality and project characteristics has never been explored. In this paper, we seek to empirically study this relation through a large dataset of 6,334 projects. Using linear regression analysis, controlling for various variables, we statistically identify and quantify the relationship between Dockerfile quality and project characteristics.

DOI: 10.1145/3377812.3382169


Hanging by the thread: an empirical study of donations in open source

作者: Overney, Cassandra
关键词: No keywords

Abstract

Open source plays a critical role in our software infrastructure. It is used in the creation of almost every product and makes it increasingly easy to create powerful software cheaply and quickly, which many companies benefit from. However, its importance and our dependence on it are often not recognized [2]. Like all software projects, open source needs maintenance to fix bugs and adapt code to evolving technologies [12]. With increasing popularity, demands for maintenance and support work also rise, resulting in many requests and reported issues. How to supply all of the needed maintenance and development work is an open and sometimes controversial question.

DOI: 10.1145/3377812.3382170


An automated framework for gaming platform to test multiple games

作者: Song, Zihe
关键词: adversarial inverse reinforcement learning, automated game testing, evolutionary multi-objective optimization

Abstract

Game testing is a necessary but challenging task for gaming platforms. Current game testing practice requires significant manual effort. In this paper, we propose an automated game testing framework combining an adversarial inverse reinforcement learning algorithm with evolutionary multi-objective optimization. This framework aims to help gaming platforms assure market-wide game quality, as it is suitable for testing different games with minimal manual customization for each game.

DOI: 10.1145/3377812.3382171


Improving bug detection and fixing via code representation learning

作者: Li, Yi
关键词: No keywords

Abstract

Software quality and reliability have proved to be important during program development. Many existing studies try to improve them through bug detection and automated program repair. However, each has its own limitations, and the overall performance still leaves room for improvement. In this paper, we propose a deep learning framework to improve software quality and reliability in these two detect-and-fix processes. We use advanced code modeling and AI models to improve on state-of-the-art approaches. The evaluation results show that our approach achieves a relative improvement of up to 206% in F1 score over baselines on bug detection, and correctly fixes up to 19.8 times as many bugs as baselines in automated program repair. These results show that our framework performs strongly at improving software quality and reliability in both bug detection and automated program repair.

DOI: 10.1145/3377812.3382172


Automated analysis of inter-parameter dependencies in web APIs

作者: Martin-Lopez, Alberto
关键词: DSL, automated analysis, interdependence CSP, web service

Abstract

Web services often impose constraints that restrict the way in which two or more input parameters can be combined to form valid calls to the service, i.e., inter-parameter dependencies. Current web API specification languages like the OpenAPI Specification (OAS) provide no support for the formal description of such dependencies, making it hard to interact with the services without human intervention. We propose specifying and automatically analyzing inter-parameter dependencies in web APIs. To this end, we propose a domain-specific language to describe these dependencies, a constraint programming-aided tool supporting their automated analysis, and an OAS extension integrating our approach and easing its adoption. Together, these contributions open a new range of possibilities in areas such as source code generation and testing.
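To make the idea of inter-parameter dependencies concrete, here is a toy Python encoding of two common dependency types checked against candidate calls (a sketch only; the parameter names mimic a hypothetical video-search API, and the paper's actual approach uses a DSL plus constraint programming, not this code):

```python
from typing import Callable, Dict

Call = Dict[str, object]

def requires(p: str, q: str) -> Callable[[Call], bool]:
    """If parameter p is present, parameter q must be present too."""
    return lambda call: p not in call or q in call

def only_one(p: str, q: str) -> Callable[[Call], bool]:
    """Exactly one of p and q must be present."""
    return lambda call: (p in call) != (q in call)

# Dependencies for a hypothetical video-search API.
deps = [
    requires("videoDefinition", "type"),  # definition filter needs a type
    only_one("relatedToVideoId", "q"),    # search by relation OR by query
]

def valid(call: Call) -> bool:
    """A call is valid when every declared dependency holds."""
    return all(dep(call) for dep in deps)

print(valid({"q": "icse", "type": "video", "videoDefinition": "high"}))  # True
print(valid({"q": "icse", "videoDefinition": "high"}))                   # False
```

A checker like this only validates given calls; the constraint-programming analysis the abstract describes can additionally solve for valid parameter combinations.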

DOI: 10.1145/3377812.3382173


Detection and mitigation of JIT-induced side channels

作者: Brennan, Tegan
关键词: No keywords

Abstract

Cyber-attacks stealing confidential information are becoming increasingly frequent and devastating as modern software systems store and manipulate greater amounts of sensitive data. Leaking information about private user data, such as the financial and medical records of individuals, trade secrets of companies, and military secrets of states, can have drastic consequences. Confidentiality of such private data is critical for users of these systems. Many software development practices, such as the encryption of packets sent over a network, aim to protect the confidentiality of private data by ensuring that an observer is unable to learn anything meaningful about a program’s secret input from its public output. Under these protections, the software system’s main communication channels, such as the content of the network packets it sends, or the output it writes to a public file, should not leak information about the private data. However, many software systems still contain serious security vulnerabilities. Side channels are an important class of information leaks where secret information can be captured through the observation of non-functional side effects of software systems. Potential side channels include those in execution time, memory usage, size and timings of network packets, and power consumption. Although side-channel vulnerabilities due to hardware (such as vulnerabilities that exploit cache behavior) have been extensively studied [1, 2, 10, 13, 15-17, 19, 23], software side channels have only recently become an active area of research, including recent results on software side-channel detection [4, 8, 11, 12, 18, 22, 24] and quantification [5, 20, 21], and my own work on a static analysis framework for detection of software side channels called CoCo-Channel [8] and a constraint caching framework to accelerate side-channel quantification called Cashew [9].
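A classic timing side channel of the kind discussed above is early-exit string comparison. The sketch below counts loop iterations as a deterministic stand-in for execution time (an illustrative toy, not taken from the cited work):

```python
def insecure_equals(secret: str, guess: str):
    """Early-exit comparison: the loop count leaks how long a prefix matched."""
    steps = 0
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:
            return False, steps
    return len(secret) == len(guess), steps

SECRET = "hunter2"
_, bad = insecure_equals(SECRET, "xxxxxxx")    # wrong from the first char
_, close = insecure_equals(SECRET, "huntexx")  # 5-char matching prefix
print(bad, close)  # 1 6
```

An attacker who can observe the running time (proportional to `steps`) can recover the secret one character at a time. Python's standard library offers `hmac.compare_digest` for constant-time comparison, which closes this particular channel.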

DOI: 10.1145/3377812.3382174


Does fixing bug increase robustness in deep learning?

作者: Pan, Rangeet
关键词: bug fix, bugs, deep neural networks, robustness

Abstract

Deep Learning (DL) based systems are used widely. Developers update code to fix bugs in these systems, but how such fixes impact the robustness of these systems has not been clear. Does fixing code increase robustness, or does it deteriorate the learning capability of DL-based systems? To answer these questions, we studied 321 Stack Overflow posts based on a published dataset. In this study, we built a classification scheme to analyze how bug fixes changed the robustness of the DL model and found that most of the bug fixes can increase robustness. We also found evidence of bug fixes that decrease robustness. Our preliminary results suggest that 12.5% and 2.4% of the bug fixes in Stack Overflow posts increased and decreased the robustness of DL models, respectively.

DOI: 10.1145/3377812.3382175


An empirical study on the evolution of test smell

作者: Kim, Dong Jae
关键词: No keywords

Abstract

A test smell, analogous to a code smell, is a poor design choice in the implementation of test code. Recently, the concept of test smells has attracted great interest from researchers and practitioners. Surveys show that developers are aware of test smells and their potential consequences for the software system. However, there is limited empirical evidence on how developers address test smells during software evolution. Thus, in this paper, we study two research questions: (RQ1) How do test smells evolve? (RQ2) What is the motivation for removing test smells? Our results show that Assertion Roulette, Conditional Test Logic, and Unknown Test smells have a high rate of churn, and that feature additions and improvements motivate refactoring, yet test smells persist, indicating sub-optimal practice. With this study, we hope to fill the gap between academia and industry by providing evidence of sub-optimal practice in the way developers address test smells, and of how it may be detrimental to the software.

DOI: 10.1145/3377812.3382176


Efficient test execution in end to end testing: resource optimization in end to end testing through a smart resource characterization and orchestration

作者: Augusto, Cristian
关键词: containerization, efficient resource usage, end to end testing, orchestration

Abstract

Virtualization and containerization have been two disruptive technologies in the past few years. Both technologies allow isolating applications with fewer resources and have impacted fields such as software testing. In the field of testing, the execution of containerized/virtualized test suites has achieved great savings, but when complexity increases or the cost of deployment rises, open challenges remain, such as the efficient execution of End to End (E2E) test suites. This paper proposes a research problem and a feasible solution that aims to improve resource usage in E2E tests through smart resource identification and proper organization of test execution, in order to achieve efficient and effective resource usage. The resources are characterized by a series of attributes that provide information about the resource and its usage during the E2E testing phase. The test cases are grouped and scheduled with the resources (i.e., parallelized on the same machine or executed in a fixed arrangement), achieving efficient test suite execution and reducing its total cost/time.
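The grouping-and-scheduling idea can be sketched as a greedy batching of tests whose declared resources do not conflict (the resource names and the greedy strategy are illustrative assumptions, not the paper's actual algorithm):

```python
# Each test declares the resources it uses; tests with pairwise-disjoint
# resource sets can share a batch (run in parallel); batches run in sequence.
tests = {
    "t1": {"db"},
    "t2": {"browser"},
    "t3": {"db", "queue"},
    "t4": {"browser"},
}

def schedule(tests):
    """Greedily pack tests into parallel batches with disjoint resources."""
    batches = []
    for name, res in sorted(tests.items()):
        for batch in batches:
            if all(res.isdisjoint(tests[other]) for other in batch):
                batch.append(name)
                break
        else:  # conflicts with every existing batch: open a new one
            batches.append([name])
    return batches

print(schedule(tests))  # [['t1', 't2'], ['t3', 't4']]
```

Here t1 and t2 touch different resources and can run concurrently, while t3 (db) and t4 (browser) must wait for the first batch, halving the number of sequential rounds versus running all four serially.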

DOI: 10.1145/3377812.3382177


Towards automated migration for blockchain-based decentralized application

作者: Xu, Xiufeng
关键词: blockchain, decentralized application, ethereum, migration, smart contract

Abstract

Blockchain-based decentralized applications (DApps) are becoming more widely accepted because they run publicly on the blockchain and cannot be modified covertly. However, the fact that only a few developers master both blockchain and front-end programming skills results in error-prone DApps, especially when smart contracts undergo migration. Existing techniques rarely pay attention to the automated migration of DApps. In this paper, we first summarize six migration categories and propose an approach to identify where changes occur and which categories they belong to. In addition, we design a function call graph structure to keep the mapping relationship accurate, and compare the differences between two versions of the ABI to offer revision suggestions. We have developed an automated tool implementing our approach on real-world DApps and obtained positive preliminary evaluation results that illustrate its practical value for automated DApp migration.

DOI: 10.1145/3377812.3382178


Skill-based engineering in industrial automation domain: skills modeling and orchestration

作者: Dorofeev, Kirill
关键词: No keywords

Abstract

Software engineering in the industrial automation domain requires generic methods to keep development complexity at an acceptable level. However, nowadays various PLC vendors use different dialects of the standardized programming languages in their tools, which hinders re-usability and interoperability across platforms. Service-oriented approaches can serve to overcome interoperability issues. In distributed control systems, the functionality of an automation component can be offered to the other parties that constitute a production system via a standardized interface, easing the orchestration of the whole system. This paper proposes such a generic interface that hides away the low-level implementation details of a particular functionality and provides a common semantic model for the execution. Further, we show how using such an interface can help to support and automate the overall engineering process, combining the functionality of different components to fulfill a production task. The reference implementation of the proposed concept was used in an industrial demonstrator, which shows benefits in system flexibility due to component interoperability and re-usability compared to traditional control approaches.

DOI: 10.1145/3377812.3381394


Scalable and approximate program dependence analysis

作者: Lee, Seongmin
关键词: ORBS, program analysis, program slicing

Abstract

Program dependence is a fundamental concept in many software engineering tasks, yet traditional dependence analysis struggles to cope with common modern development practices such as multi-lingual implementations and the use of third-party libraries. While Observation-Based Slicing (ORBS) solves these issues and produces accurate slices, it has a scalability problem due to the need to build and execute the target program multiple times. We would like to propose a radical change of perspective: a useful dependence analysis needs to be scalable even if it only approximates the dependency. Our goal is a scalable, approximate program dependence analysis that estimates the likelihood of dependence. We claim that 1) using external information such as lexical analysis or development history, 2) learning a dependence model from partial observations, and 3) merging static and observation-based approaches will support this proposition. We expect our technique to introduce a new perspective on program dependence analysis based on the likelihood of dependence. It will also broaden the capability of dependence analysis towards large and complex software.

DOI: 10.1145/3377812.3381392


The effects of required security on software development effort

作者: Venson, Elaine
关键词: secure software development, software cost model, software security practices

Abstract

Problem: developers are increasingly adopting security practices in software projects in response to cyber threats. Despite the additional effort required to perform those practices, current cost models either do not consider security as an input or were not properly validated with empirical data. Hypothesis: increasing degrees of application of security practices and security features, motivated by security risks, lead to growing levels of added software development effort. Such an effort increase can be quantified through a parametric model that takes as input the usage degrees of security practices and requirements and outputs the additional software development effort. Contributions: the accurate prediction of secure software development effort will support the provision of a proper amount of resources to projects. We also expect that the quantification of the security effort will contribute to advance research on the cost-effectiveness of software security.
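The shape of the parametric model hypothesized above can be illustrated with a COCOMO II-style sketch, where security practices act as effort multipliers (the coefficients and multiplier values below are purely illustrative placeholders, not the dissertation's calibrated model):

```python
def effort_pm(ksloc, security_multipliers, a=2.94, e=1.10):
    """COCOMO II-style effort estimate in person-months.

    security_multipliers: one illustrative multiplier (>= 1.0) per security
    practice applied at a given degree; these values are hypothetical.
    """
    m = 1.0
    for em in security_multipliers:
        m *= em
    return a * ksloc ** e * m

baseline = effort_pm(10, [])          # 10 KSLOC, no security practices
secure = effort_pm(10, [1.20, 1.10])  # e.g. threat modeling + security review
print(round(secure / baseline, 2))    # 1.32, i.e. 32% added effort
```

The research question then becomes an empirical one: estimating multipliers like these from project data so that the added security effort can be predicted up front.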

DOI: 10.1145/3377812.3381393


Towards greener Android application development

作者: Anwar, Hina
关键词: code smell refactoring, energy consumption, green Android development, green software engineering, third-party libraries

Abstract

Empirical studies have shown that mobile applications that do not drain the battery usually get good ratings from users. To make mobile applications energy efficient, many studies have been published that present refactoring guidelines and tools to optimize the code. However, these guidelines cannot be generalized with respect to energy efficiency, as there is not enough energy-related data for every context. Existing energy enhancement tools/profilers are mostly prototypes applicable to only a small subset of energy-related problems. In addition, the existing guidelines and tools mostly address energy issues once they have already been introduced. My goal is to add to the existing energy-related data by evaluating the energy consumption of various code smell refactorings and third-party libraries used in Android development. Data from such evaluations could provide generalized contextual guidelines that could be used during application development to prevent the introduction of energy-related problems. I also aim to develop a support tool for the Android Studio IDE that could give meaningful recommendations to developers during development to make application code more energy efficient.

DOI: 10.1145/3377812.3381390


Towards DO-178C certification of adaptive learning UAV agents designed with a cognitive architecture

作者: Pyrgies, John
关键词: DO-178C certification, adaptive and learning agent (ALA), cognitive architecture, formal methods, model-based development and verification, reinforcement learning, requirements engineering, unmanned aerial vehicle (UAV)

Abstract

Adaptive and Learning Agents (ALAs) bring computational intelligence to their cyber-physical host systems to adapt to novel situations encountered in their complex operational environment. They do so by learning from experience to improve their performance. RTCA DO-178C specifies a stringent certification process for airborne software, which presents several challenges when applied to an ALA with regard to functional completeness, functional correctness, testability, and adaptability. This research claims that it is possible to certify an Adaptive Learning Unmanned Aerial Vehicle (UAV) Agent designed as per a Cognitive Architecture under the current DO-178C certification process, when leveraging a qualified tool (DO-330), Model-Based Development and Verification (DO-331), and Formal Methods (DO-333). The research consists of developing, as a case study, an ALA embedded in a UAV aimed at neutralizing rogue UAVs in the vicinity of civil airports, and testing it in the field. This article is the plan to complete, by end 2022, a dissertation currently in its confirmation phase.

DOI: 10.1145/3377812.3381398


Bridging the divide between API users and API developers by mining public code repositories

作者: Lamothe, Maxime
关键词: No keywords

Abstract

Software application programming interfaces (APIs) are a ubiquitous part of software engineering. The evolution of these APIs requires constant effort from their developers and users alike. API developers must constantly balance keeping their products modern whilst keeping them as stable as possible. Meanwhile, API users must continually be on the lookout to adapt to changes that could break their applications. As APIs become more numerous, users are challenged by a myriad of choices and information on which API to use. Current research attempts to provide automatic documentation, code examples, and code completion to make API evolution more scalable for users. Our work will attempt to establish practical and scalable API evolution guidelines and tools based on public code repositories, to aid both API users and API developers. This thesis focuses on investigating the use of public code repositories provided by the open-source community to improve software API engineering practices. More specifically, I seek to improve software engineering practices linked to API evolution, both from the perspective of API users and API developers. To achieve this goal, I will apply quantitative and qualitative research methods to understand the problems at hand. I will then mine public code repositories to develop novel solutions to these problems.

DOI: 10.1145/3377812.3382124


Refactoring operations grounded in manual code changes

作者: Eilertsen, Anna Maria
关键词: productivity, refactoring, refactoring tools

Abstract

Refactoring tools automate tedious and error-prone source code changes. The prevalence and difficulty of refactorings in software development makes this a high-impact area for successful automation of manual operations. Automated refactorings tools can improve the speed and accuracy of software development and are easily accessible in many programming environments. Even so, developers frequently eschew automation in favor of manual refactoring and cite reasons like lack of support for real usage scenarios and unpredictable tools. In this paper, we propose to redesign refactoring operations into transformations that are useful and applicable in real software evolution scenarios with the help of repository mining and user studies.

DOI: 10.1145/3377812.3381395


A composed technical debt identification methodology to predict software vulnerabilities

作者: Halepmollasi, Ruş
关键词: feature engineering, machine learning, software security, technical debt

Abstract

Technical debt (TD), its impact on development, and its consequences such as defects and vulnerabilities are of common interest and great importance to software researchers and practitioners. Although many studies investigate TD, the majority focus on identifying and detecting TD from a single stage of development. There are also studies that analyze vulnerabilities focusing on some phases of the life cycle. Moreover, several approaches have investigated the relationship between TD and vulnerabilities; however, the generalizability and validity of their findings are limited due to small datasets. In this study, we aim to identify TD through multiple phases of development and to automatically measure it through data and text mining techniques to form a comprehensive feature model. We plan to utilize neural network based classifiers that will incorporate evolutionary changes in TD measures into predicting vulnerabilities. Our approach will be empirically assessed on open source and industrial projects.

DOI: 10.1145/3377812.3381396


Variability aware requirements reuse analysis

作者: Abbas, Muhammad
关键词: product line, requirements, similarities, software reuse, variability

Abstract

Problem: The goal of a software product line is to aid quick, high-quality delivery of software products sharing common features. Effectively achieving these goals requires reuse analysis of the product line features. Existing requirements reuse analysis approaches do not focus on recommending product line features that can be reused to realize new customer requirements. Hypothesis: Given that customer requirements are linked to descriptions of the product line features satisfying them, customer requirements can be clustered based on patterns and similarities, preserving historic reuse information. New customer requirements can then be evaluated against existing customer requirements, and reuse of product line features can be recommended. Contributions: We treat the problem of feature reuse analysis as a text classification problem at the requirements level. We use natural language processing and clustering to recommend reuse of features based on similarities and historic reuse information. The recommendations can be used to realize new customer requirements.
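The similarity-based recommendation step can be sketched with a stdlib-only bag-of-words cosine match against requirements whose feature reuse is already known (the requirements, feature IDs, and nearest-neighbor strategy here are illustrative assumptions, not the paper's actual NLP pipeline):

```python
import math
from collections import Counter

def vec(text: str) -> Counter:
    """Bag-of-words vector over lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Historic requirements mapped to the product-line feature reused for them.
history = {
    "driver shall receive brake warning alarm": "FEAT_BRAKE_ALERT",
    "system shall log all door open events": "FEAT_DOOR_LOGGING",
}

new_req = "driver shall receive a collision warning alarm"
best = max(history, key=lambda r: cosine(vec(r), vec(new_req)))
print(history[best])  # FEAT_BRAKE_ALERT
```

The new requirement shares most of its vocabulary with the brake-warning requirement, so that feature's reuse history is recommended first.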

DOI: 10.1145/3377812.3381399


KNOCAP: capturing and delivering important design bits in whiteboard design meetings

作者: Soria, Adriana Meza
关键词: design knowledge, design meetings, design rationale, important design bits

Abstract

It is well known that it is desirable to capture the most essential parts of software design meetings that take place at the whiteboard. It is equally well known, however, that actual capture rarely takes place. A few photos may be taken, informal notes might be scribbled down, and at best one of the developers may be tasked with creating a summary. Regardless, problems persist with important information being lost and, even when information is captured, that information not being easily located and accessed. To address these problems, I propose to design and evaluate a novel suite of tools that enables software designers working at the whiteboard to: (1) efficiently and in-the-moment capture important information produced during that meeting, and (2) be delivered, either by request or proactively by the tools, relevant information captured in the past when it is needed in a future design meeting.

DOI: 10.1145/3377812.3381397


Towards providing automated supports to developers on writing logging statements

作者: Li, Zhenhao
关键词: No keywords

Abstract

Developers write logging statements to generate logs and record system execution behaviors. Such logs are widely used for a variety of tasks, such as debugging, testing, program comprehension, and performance analysis. However, there exist no practical guidelines on how to write logging statements, which makes the logging decision a very challenging task. Developers face two main challenges when making logging decisions: 1) it is difficult to accurately and succinctly record execution behaviors; and 2) it is hard to decide where to write logging statements. This thesis proposes a series of approaches to address these problems and help developers make logging decisions in two aspects: deciding on logging contents and on logging locations. Through case studies on large-scale open source and commercial systems, we anticipate that our study will provide useful suggestions and support to developers for writing better logging statements.

DOI: 10.1145/3377812.3381385


AI-driven web API testing

作者: Martin-Lopez, Alberto
关键词: artificial intelligence, automated software testing, restful API, testing framework, web API

Abstract

Testing of web APIs is nowadays more critical than ever before, as they are the current standard for software integration. A bug in an organization’s web API could have a huge impact both internally (services relying on that API) and externally (third-party applications and end users). Most existing tools and testing approaches require writing tests or instrumenting the system under test (SUT). The main aim of this dissertation is to take web API testing to an unprecedented level of automation and thoroughness. To this end, we plan to apply artificial intelligence (AI) techniques for the autonomous detection of software failures. Specifically, the idea is to develop intelligent programs (we call them “bots”) capable of generating hundreds, thousands or even millions of test inputs and to evaluate whether the test outputs are correct based on: 1) patterns learned from previous executions of the SUT; and 2) knowledge gained from analyzing thousands of similar programs. Evaluation results of our initial prototype are promising, with bugs being automatically detected in some real-world APIs.

DOI: 10.1145/3377812.3381388


Performance regression detection in DevOps

作者: Chen, Jinfu
关键词: load test, performance regression detection, software performance

Abstract

Performance is an important aspect of software quality. Performance goals are typically defined by setting upper and lower bounds for the response time and throughput of a system, and for physical-level measurements such as CPU, memory, and I/O. To meet such goals, several performance-related activities are needed in development (Dev) and operations (Ops). In fact, large software system failures are often due to performance issues rather than functional bugs. One of the most important performance issues is the performance regression. Although not all performance regressions are bugs, they often have a direct impact on users’ experience of the system. The process of detecting performance regressions in development and operations faces challenges. First, detection is conducted after the fact, i.e., after the system is built and deployed in the field or in dedicated performance testing environments. Large amounts of resources are required to detect, locate, understand, and fix performance regressions at such a late stage in the development cycle. Second, even if we can detect a performance regression, it is extremely hard to fix because other changes are applied to the system after the introduction of the regression. These challenges call for further in-depth analyses of performance regressions. In this dissertation, to avoid performance regressions slipping into operations, we first perform an exploratory study on the source code changes that introduce performance regressions, in order to understand their root causes at the source code level. Second, we propose an approach that automatically predicts whether a test would manifest performance regressions in a code commit. To assist practitioners in analyzing system performance with operational data, we propose an approach to recovering field-representative workloads that can be used to detect performance regressions. We also propose using execution logs generated by unit tests to predict performance regressions in load tests.
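The core detection step, comparing latency samples from a baseline and a candidate commit, can be sketched with a simple median-shift check (a toy stand-in for the statistical tests, such as Mann-Whitney U with effect sizes, commonly used in this line of work; the threshold and data are illustrative):

```python
import statistics

def is_regression(old_ms, new_ms, threshold=0.10):
    """Flag a regression when the median latency grows by more than `threshold`."""
    old_med = statistics.median(old_ms)
    new_med = statistics.median(new_ms)
    return (new_med - old_med) / old_med > threshold

old_run = [101, 99, 100, 102, 98]       # baseline commit, response times in ms
new_run = [121, 119, 120, 123, 118]     # candidate commit
print(is_regression(old_run, new_run))  # True: median rose 100 -> 120 ms
```

Using the median rather than the mean makes the check robust to the occasional outlier measurement that load tests routinely produce.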

DOI: 10.1145/3377812.3381386


Formalization and analysis of quantitative attributes of distributed systems

作者: Suñ
关键词: No keywords

Abstract

While there is not much discussion of the importance of formally describing and analyzing quantitative requirements in the process of software construction, in the paradigm of API-based software systems it could be vital. Quantitative attributes can be thought of as attributes determining the Quality of Service (QoS) provided by a software component published as a service. In this sense, they play a determinant role in classifying software artifacts according to specific needs stated as requirements. In this work, we present a research program consisting of the development of formal languages and tools to characterize and analyze the Quality of Service attributes of software components in the context of distributed systems. More specifically, our main motivational scenario lies in the execution of a service-oriented architecture.

DOI: 10.1145/3377812.3381387


Bridging fault localisation and defect prediction

作者: Sohn, Jeongju
关键词: SBSE, defect prediction, fault localisation

Abstract

Identifying the source of a program failure plays an integral role in maintaining software quality. Both fault localisation and defect prediction aim to locate faults: fault localisation aims to locate faults after they are revealed, while defect prediction aims to locate yet-to-happen faults. Despite sharing a similar goal, fault localisation and defect prediction have been studied as separate topics, mainly due to the difference in the data available to exploit. In our doctoral research, we aim to bridge fault localisation and defect prediction. Our work is divided into three parts: 1) applying defect prediction to fault localisation, i.e., DP2FL; 2) applying fault localisation to defect prediction, i.e., FL2DP; and 3) the consecutive application of DP2FL and FL2DP in a single framework. We expect the synergy between fault localisation and defect prediction not only to improve the accuracy of each process but also to allow us to build a single model that gradually improves the overall software quality throughout the entire software development life-cycle.
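A standard fault localisation building block of the kind this research builds on is spectrum-based suspiciousness scoring, e.g. the Ochiai metric (the abstract does not name a specific formula; this is a common textbook choice, with a toy coverage spectrum):

```python
import math

def ochiai(cov, results):
    """Ochiai suspiciousness per statement from a coverage spectrum.

    cov[test]: set of statements the test executed;
    results[test]: True if the test passed.
    """
    failed = [t for t in results if not results[t]]
    stmts = set().union(*cov.values())
    scores = {}
    for s in stmts:
        ef = sum(1 for t in failed if s in cov[t])                  # failing tests covering s
        ep = sum(1 for t in results if results[t] and s in cov[t])  # passing tests covering s
        denom = math.sqrt(len(failed) * (ef + ep))
        scores[s] = ef / denom if denom else 0.0
    return scores

cov = {"t1": {1, 2, 4}, "t2": {1, 2}, "t3": {1}}
results = {"t1": False, "t2": True, "t3": True}  # only t1 fails
scores = ochiai(cov, results)
print(max(scores, key=scores.get))  # 4: covered only by the failing test
```

Bridging the two fields then amounts to, e.g., weighting such suspiciousness scores with defect prediction's risk estimates for each statement's enclosing module (DP2FL), and feeding localized faults back as training labels (FL2DP).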

DOI: 10.1145/3377812.3381403


Improving students’ testing practices

作者: Bai, Gina R.
关键词: No keywords

Abstract

Software testing prevents and detects the introduction of faults and bugs during the process of evolving and delivering reliable software. As an important software development activity, testing has been intensively studied to measure test code quality and effectiveness, and to assist professional developers and testers with automated test generation tools. In recent years, testing has been attracting educators’ attention and has been integrated into some computer science education programs. Understanding the challenges and problems faced by students can inform educators of the topics that require extra attention and practice when presenting testing concepts and techniques. In my research, I study how students implement and modify source code given unit tests, and how they perceive and perform unit testing. I propose to quantitatively measure the quality of student-written test code, and to qualitatively identify the common mistakes and bad smells observed in student-written test code. We compare the performance of students and professionals, who vary in prior testing experience, to investigate the factors that lead to high-quality test code. The ultimate goal of my research is to address the challenges students encounter during test code composition and to improve their testing skills with supportive tools or guidance.

DOI: 10.1145/3377812.3381401


The sustainability of quality in free and open source software

作者: Alami, Adam
关键词: FOSS sustainability, software quality sustainability, software sustainability

Abstract

We have learned from the history of software that great software is the software that manages to sustain its quality. Free and open source software (FOSS) has become a serious software supply channel. However, trust in FOSS products is still an issue. Quality is a trait that enhances trust. In my study, I investigate the following question: how do FOSS communities sustain their software quality? I argue that human and social factors contribute to the sustainability of quality in FOSS communities. Amongst these factors are: the motivation of participants, a robust governance style for the software change process, and the exercise of good practices in the pull request evaluation process.

DOI: 10.1145/3377812.3381402


Understanding software changes: extracting, classifying, and presenting fine-grained source code changes

作者: Frick, Veit
关键词: change classification, change extraction, change summarization, software evolution, source code dependencies

Abstract

In modern software engineering, developers have to work with constantly evolving, interconnected software systems. Understanding how and why these systems and the dependencies between them change is therefore an essential step in improving or maintaining them. For this, it is important to know what changed and how these changes influence the system. Most currently used tools that help developers understand source code changes either use the textual representation of source code, allowing for a coarse-grained overview, or use the AST (abstract syntax tree) representation of source code to extract more fine-grained changes. We plan to improve the accuracy and classification of the extracted source code changes and to extend them by analysing the fine-grained changes of source code dependencies. We also propose a dynamic analysis of the impact of the previously extracted changes on performance metrics. This helps to understand which changes caused a certain change in program behaviour. We plan to use and combine this information to generate accurate and detailed change overviews that bridge the gap between existing coarse-grained solutions and the raw changes contained in the code, aiming to reduce the time developers spend reading changed code and to help them quickly understand the changes between two versions of source code.

DOI: 10.1145/3377812.3381400


Search-based test generation for Android apps

作者: Moreno, Iván
关键词: Android testing, evolutionary algorithms, search-based

Abstract

Despite their growing popularity, apps tend to contain defects which can ultimately manifest as failures (or crashes) to end-users. Different automated tools for testing Android apps have been proposed in order to improve software quality. Although Genetic Algorithms and other Evolutionary Algorithms (EAs) have shown promise in recent years, recent results suggest they are not yet fully tailored to the problem of Android test generation. Thus, this thesis aims to design and evaluate algorithms for alleviating the burden of testing Android apps. In particular, I plan to investigate which search-based algorithm is best suited to this particular problem. As the thesis advances, I expect to develop a fully open-source test case generator for Android applications that will serve as a framework for comparing different algorithms. These algorithms will be compared using statistical analysis on both open-source (i.e., from F-Droid) and commercial applications (i.e., from the Google Play Store).

DOI: 10.1145/3377812.3381389


Extracting archival information from software-related chats

作者: Chatterjee, Preetha
关键词: archival quality social content, online software developer chats

Abstract

Software developers are increasingly having conversations about software development via online chat services. Many of those chat communications contain valuable information, such as code descriptions, good programming practices, and causes of common errors/exceptions. However, the nature of chat community content is transient, as opposed to the archival nature of other developer communications such as email, bug reports, and Q&A forums. As a result, important information and advice are lost over time. The focus of this dissertation is Extracting Archival Information from Software-Related Chats, specifically to (1) automatically identify conversations that contain archival-quality information, (2) accurately reduce the granularity of the information reported as archival information, and (3) conduct a case study to investigate how archival-quality information extracted from chats compares to related posts in Q&A forums. Archiving knowledge from developer chats could potentially be used in several applications, such as creating a new archival mechanism available to a given chat community, augmenting Q&A forums, or facilitating the mining of specific information and improving software maintenance tools.

DOI: 10.1145/3377812.3381391


Software startups in growth phase: SE practices adopted to SEE

作者: Cico, Orges
关键词: empirical studies, software engineering, software engineering education, software startups

Abstract

Context: Software has become ubiquitous in every corner of modern societies. During the last five decades, software engineering has also changed significantly to advance the development of various types and scales of software products. In this context, Software Engineering Education plays an essential role in keeping students updated with software technologies, processes, and practices that are popular in industry. Aim: In this PhD work, I want to answer the following research questions: To what extent are software engineering trends present in software engineering education? In what way can the characteristics of growth-phase software startups be transferred into a software engineering education context? What is the impact of software startup engineering on the curriculum and on software engineering students? Method: I utilize literature review and mixed-methods approaches (quantitative and qualitative data and methods triangulation) in gathering empirical evidence. More precisely, I split my research method into two phases. The first phase of the research will acquire knowledge and insight based on the existing literature. The second research phase will split the focus in two directions. Firstly, I shall gather empirical evidence on how software startup practices are present in software engineering education. Secondly, I will conduct parallel investigations into SE practices in growth-phase software startups. Expected Results: I argue that software startup engineering practices are a valuable tool for software engineering education. I expect students to acquire software engineering skills in a more realistic context while using growth-phase software startup practices.

DOI: 10.1145/3377812.3381406


Towards better technical debt detection with NLP and machine learning methods

作者: Rantala, Leevi
关键词: machine learning, natural language processing, technical debt

Abstract

Technical debt (TD) is an economic metaphor used to depict non-optimal choices made in the software development process. It usually occurs when developers take shortcuts instead of following agreed-upon development practices, and unchecked growth of technical debt can start to incur negative effects on software development processes. Technical debt detection and management is mainly done manually, which is a slow and costly way of detecting technical debt. Automatic detection would solve this issue, but even today's state-of-the-art tools do not accurately detect the appearance of technical debt. Therefore, increasing the accuracy of automatic classification is of high importance, so that we can eliminate a significant portion of the costs related to technical debt detection. This research aims to solve the problem of detection accuracy by bringing together static code analysis and natural language processing. This combination of techniques will allow more accurate detection of technical debt than either technique used separately. The research also aims to discover themes and topics in written developer messages that can be linked to technical debt. These can help us understand technical debt from the developers’ viewpoint. Finally, we will build an open-source tool/plugin that can be used to accurately detect technical debt using both static analysis and natural language processing methods.

DOI: 10.1145/3377812.3381404


Towards data integrity in Cassandra database applications using conceptual models

作者: Suárez-Otero, Pablo
关键词: Cassandra, NoSQL, consistency, data integrity, data modeling

Abstract

Data modeling in Cassandra databases follows a query-driven approach where each table is created to satisfy a query, leading to repeated data, as the Cassandra model is not normalized by design. Consequently, developers bear the responsibility of maintaining data integrity at the application level, as opposed to when the model is normalized. This is done by embedding in the client application the appropriate statements to perform data changes, which is error-prone. Cassandra data modeling methodologies have emerged to cope with this problem by proposing the use of a conceptual model to generate the logical model, solving the data modeling problem but not the data integrity one. In this thesis we address the problem of preserving the integrity of these data by proposing a method that, given a data change at either the conceptual or the logical level, determines the executable statements that should be issued to preserve data integrity. Additionally, as this integrity may also be lost as a consequence of creating new data structures in the logical model, we complement our method to preserve data integrity in these scenarios. Furthermore, we address the creation of data structures at the conceptual level to represent a normalized version of newly created data structures in the logical model.
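To make the integrity problem concrete, here is a minimal sketch (not the thesis' method) of deriving the statements needed to keep duplicated data consistent across query-driven tables. The table and column names (`users_by_id`, `users_by_country`, `email`) are invented for illustration.

```python
# Hypothetical sketch: given a conceptual attribute that is duplicated across
# several query-driven Cassandra tables, derive the CQL UPDATE statements
# needed to keep every copy consistent after a single conceptual change.
# Table and column names below are illustrative, not from the thesis.

LOGICAL_MODEL = {
    # table -> (primary key columns, duplicated attribute columns)
    "users_by_id":      (["user_id"], ["email"]),
    "users_by_country": (["country", "user_id"], ["email"]),
}

def integrity_statements(attribute, new_value, key_values):
    """Return CQL statements that propagate one conceptual change."""
    stmts = []
    for table, (pk_cols, attr_cols) in LOGICAL_MODEL.items():
        if attribute not in attr_cols:
            continue  # this table does not replicate the attribute
        where = " AND ".join(f"{c} = {key_values[c]!r}" for c in pk_cols)
        stmts.append(f"UPDATE {table} SET {attribute} = {new_value!r} WHERE {where}")
    return stmts

# One conceptual change to `email` must touch both tables that store it.
stmts = integrity_statements(
    "email", "new@example.org",
    {"user_id": 42, "country": "ES"},
)
```

The point of the sketch is that the set of statements is a function of the logical model alone, which is why a method operating on the conceptual/logical mapping can generate them automatically.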

DOI: 10.1145/3377812.3381405


Experiential learning in computing accessibility education

作者: Shi, Weishi and Khan, Saad and El-Glaly, Yasmine and Malachowsky, Samuel and Yu, Qi and Krutz, Daniel E.
关键词: accessibility education, computing accessibility, computing education

Abstract

Many developers do not understand how to develop accessible software, or do not recognize the need to do so. To address this, we have created five educational Accessibility Learning Labs (ALL) using an experiential learning structure. Each of these labs addresses a foundational concept in computing accessibility, informing participants about the concept while also demonstrating the necessity of creating accessible software. The hosted labs provide a complete educational experience, containing materials such as lecture slides, activities, and quizzes. We evaluated the labs in ten sections of a CS2 course at our university, with 276 students participating. Our primary findings include: (I) the labs are an effective way to inform participants about foundational topics in creating accessible software; (II) the labs demonstrate the potential benefits of our proposed experiential learning format in motivating participants about the importance of creating accessible software; (III) the labs demonstrate that empathy material increases learning retention. The created labs and project materials are publicly available on the project website: http://all.rit.edu

DOI: 10.1145/3377812.3390901


Large-scale patch recommendation at Alibaba

作者: Zhang, Xindong and Zhu, Chenguang and Li, Yi and Guo, Jianmei and Liu, Lihua and Gu, Haobo
关键词: defect detection, patch generation, patch recommendation

Abstract

We present Precfix, a pragmatic approach targeting large-scale industrial codebases and making recommendations based on previously observed debugging activities. Precfix collects defect-patch pairs from development histories, performs clustering, and extracts generic reusable patching patterns as recommendations. Our approach is able to make recommendations within milliseconds and achieves a false positive rate of 22%. Precfix has been rolled out at Alibaba to support various critical businesses.

DOI: 10.1145/3377812.3390902


Do preparatory programming lab sessions contribute to even work distribution in student teams?

作者: Borg, Markus
关键词: No keywords

Abstract

Unfair work distribution is common in project-based learning with teams of students. One contributing factor is that students are differently skilled developers. To mitigate the differences in a course with group work, we introduced mandatory programming lab sessions. The intervention did not affect the work distribution, showing that more is needed to balance the workload. Contrary to our goal, the intervention was very well received among experienced students, but unpopular with students weak at programming.

DOI: 10.1145/3377812.3390893


A practical, collaborative approach for modeling big data analytics application requirements

作者: Khalajzadeh, Hourieh and Simmons, Andrew and Abdelrazek, Mohamed and Grundy, John and Hosking, John and He, Qiang and Ratnakanthan, Prasanna and Zia, Adil and Law, Meng
关键词: No keywords

Abstract

Data analytics application development introduces many challenges, including: new roles not present in traditional software engineering practice, e.g., data scientists and data engineers; use of sophisticated machine learning (ML) model-based approaches; uncertainty inherent in the models; interfacing with models to fulfill software functionalities; deploying models at scale; and rapid evolution of business goals and data sources. We describe our Big Data Analytics Modeling Languages (BiDaML) toolset, which brings all stakeholders around one tool to specify, model, and document big data applications. We report on our experience applying BiDaML to three real-world large-scale applications. Our approach successfully supports complex data analytics application development in industrial settings.

DOI: 10.1145/3377812.3390811


A mixed methods research agenda to identify undergraduate misconceptions in software engineering, lecturers’ handling, and didactical implications

作者: Gold-Veerkamp, Carolin and Diethelm, Ira and Abke, Jörg
关键词: No keywords

Abstract

Due to the growing value of software technology in our everyday life, young professionals and undergraduates need to be well-qualified for Software Engineering (SE) careers. Additionally, the didactic basis of SE education is a comparatively recent development.

DOI: 10.1145/3377812.3390810


Understanding DevOps education with Grounded theory

作者: Pang, Candy and Hindle, Abram and Barbosa, Denilson
关键词: DevOps, Grounded theory, continuous delivery, continuous integration, education, software engineering

Abstract

DevOps stands for Development-Operations. It arose in the IT industry as a movement to align development and operations teams. DevOps is broadly recognized as an IT standard, and there is high demand for DevOps practitioners in industry. Therefore, we studied whether undergraduates acquire adequate DevOps skills to fulfill the demand for DevOps practitioners in industry. We employed Grounded Theory (GT), a social science qualitative research methodology, to study DevOps education from academic and industrial perspectives. In academia, academics were not motivated to learn or adopt DevOps, and we did not find strong evidence of academics teaching DevOps. Academics need incentives to adopt DevOps in order to stimulate interest in teaching it. In industry, DevOps practitioners lack clearly defined roles and responsibilities, because the DevOps field is diverse and growing fast. Therefore, practitioners can only learn DevOps through hands-on working experience. As a result, academic institutions should provide fundamental DevOps education (in culture, procedure, and technology) to prepare students for their future DevOps advancement in industry. Based on our findings, we propose five groups of future studies to advance DevOps education in academia.

DOI: 10.1145/3377812.3390808


Understanding and handling alert storm for online service systems

作者: Zhao, Nengwen and Chen, Junjie and Peng, Xiao and Wang, Honglin and Wu, Xinya and Zhang, Yuanzong and Chen, Zikai and Zheng, Xiangzhong and Nie, Xiaohui and Wang, Gang and Wu, Yong and Zhou, Fang and Zhang, Wenchi and Sui, Kaixin and Pei, Dan
关键词: alert storm, alert summary, failure diagnosis, problem identification

Abstract

Alerts are a key data source in monitoring systems for online service systems; they record anomalies in service components and report them to engineers. In general, the occurrence of a service failure tends to be accompanied by a large number of alerts, a phenomenon called an alert storm. Alert storms pose great challenges for diagnosing failures, since it is time-consuming and tedious for engineers to investigate such an overwhelming number of alerts manually. To help understand alert storms, we conduct the first empirical study of them based on large-scale real-world alert data, and gain some valuable insights. Based on the findings, we propose a novel approach to handling alert storms. Specifically, this approach includes alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection can achieve a high F1-score (larger than 0.9). Besides, our alert storm summary can reduce the number of alerts that need to be examined by more than 98% and discover useful alerts accurately.
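The detection/summary split can be illustrated with a deliberately simple sketch. This is not the paper's algorithm: the count threshold, the Jaccard word-overlap similarity, and the greedy deduplication below are stand-in choices for illustration only.

```python
# Illustrative sketch (not the paper's approach): flag an alert storm when
# the per-window alert count spikes, then summarize the storm by collapsing
# textually similar alerts into one representative each.

def storm_windows_over(counts_per_window, threshold=100):
    """Detect time windows whose alert count exceeds a fixed threshold."""
    return [i for i, c in enumerate(counts_per_window) if c > threshold]

def jaccard(a, b):
    """Word-set similarity between two alert messages."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def summarize(alerts, sim_threshold=0.6):
    """Greedy dedup: keep an alert only if unlike all kept representatives."""
    reps = []
    for alert in alerts:
        if all(jaccard(alert, r) < sim_threshold for r in reps):
            reps.append(alert)
    return reps

storm = storm_windows_over([3, 5, 240, 180, 4])
reps = summarize([
    "disk usage high on host-a",
    "disk usage high on host-b",   # near-duplicate of the first, collapsed
    "database connection refused",
])
```

Three raw alerts collapse to two representatives here; the paper's reported >98% reduction comes from the same idea applied to storms of thousands of alerts with a much more careful notion of similarity.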

DOI: 10.1145/3377812.3390809


Factors influencing software engineering career choice of Andean indigenous

作者: S'{a
关键词: career choice, diversity, ethnicity, genre, software engineering

Abstract

A diverse workforce is not just “nice to have”; it is a reflection of a changing world. Such a diverse workforce brings high value to organizations and is essential for national technological innovation, economic vitality, and global competitiveness. Despite the importance of diversity in the broad field of computing, there is a comparatively low representation not only of women but also of other underrepresented minorities, such as indigenous people. To gain insights into their career choice, we conducted 10 interviews with Andean indigenous people. The findings reveal that seven factors (social support, exposure to digital technology, autonomy of use, purpose of use, digital skill, identity, and work ethic) help to understand how and why indigenous people choose a career related to Software Engineering. This exploratory study also contributes to challenging common stereotypes and perceptions of indigenous people as low-qualified workers, academically untalented, and unmotivated.

DOI: 10.1145/3377812.3390899


ProvBuild: improving data scientist efficiency with provenance

作者: Hu, Jingmei and Joung, Jiwon and Jacobs, Maia and Gajos, Krzysztof Z. and Seltzer, Margo I.
关键词: No keywords

Abstract

Data scientists frequently analyze data by writing scripts. We conducted a contextual inquiry with interdisciplinary researchers, which revealed that parameter tuning is a highly iterative process and that debugging is time-consuming. As analysis scripts evolve and become more complex, analysts have difficulty conceptualizing their workflow. In particular, after editing a script, it becomes difficult to determine precisely which code blocks depend on the edit. Consequently, scientists frequently re-run entire scripts instead of re-running only the necessary parts. We present ProvBuild, a data analysis environment that uses change impact analysis [1] to improve the iterative debugging process in script-based workflow pipelines. ProvBuild is a tool that leverages language-level provenance [2] to streamline the debugging process by reducing programmer cognitive load and decreasing subsequent runtimes, leading to an overall reduction in elapsed debugging time. ProvBuild uses provenance to track dependencies in a script. When an analyst debugs a script, ProvBuild generates a simplified script that contains only the information necessary to debug a particular problem. We demonstrate that debugging the simplified script lowers a programmer’s cognitive load and permits faster re-execution when testing changes. The combination of reduced cognitive load and shorter runtime reduces the time necessary to debug a script. We quantitatively and qualitatively show that even though ProvBuild introduces overhead during a script’s first execution, it is a more efficient way for users to debug and tune complex workflows. ProvBuild demonstrates a novel use of language-level provenance, in which it is used to proactively improve programmer productivity rather than merely providing a way to retroactively gain insight into a body of code. To the best of our knowledge, ProvBuild is a novel application of change impact analysis, and it is the first debugging tool to leverage language-level provenance to reduce cognitive load and execution time.
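The core dependency-tracking idea can be sketched statically. ProvBuild itself works on runtime (language-level) provenance; the AST-based version below is only an illustration of how tracked def/use dependencies let a tool re-run just the statements affected by an edit. The example script and line numbers are invented.

```python
import ast

# Hedged sketch of provenance-driven incremental re-execution: record which
# names each top-level statement defines and uses, then, after an edit,
# re-run only the statements that transitively read edited definitions.

SCRIPT = """
a = 1
b = a + 1
c = 10
d = b * c
"""

def affected_statements(source, edited_line):
    """Return line numbers of statements that must re-run after an edit."""
    tree = ast.parse(source)
    dirty, to_rerun = set(), []
    for stmt in tree.body:
        uses = {n.id for n in ast.walk(stmt)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
        defs = {n.id for n in ast.walk(stmt)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
        if stmt.lineno == edited_line or uses & dirty:
            to_rerun.append(stmt.lineno)
            dirty |= defs  # anything defined here is now stale downstream
    return to_rerun

# Editing `b = a + 1` (line 3) forces re-running line 3 and `d = b * c`
# (line 5), but not the independent statement `c = 10`.
rerun = affected_statements(SCRIPT, 3)
```

Skipping the unaffected `c = 10` is exactly the kind of saving that, scaled up to expensive analysis cells, yields the shorter re-execution times the abstract describes.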

DOI: 10.1145/3377812.3390912


How has forking changed in the last 20 years? a study of hard forks on GitHub

作者: Zhou, Shurui and Vasilescu, Bogdan and Kästner, Christian
关键词: No keywords

Abstract

The notion of forking has changed with the rise of distributed version control systems and social coding environments like GitHub. Traditionally, forking refers to splitting off an independent development branch (which we call a hard fork); research on hard forks, conducted mostly in pre-GitHub days, showed that hard forks were often seen as critical, as they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call a social fork), which may also have influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find, among other things, that hard forks often evolve out of social forks rather than being planned deliberately, and that perceptions of hard forks have indeed changed dramatically: they are now often seen as a positive, non-competitive alternative to the original project.

DOI: 10.1145/3377812.3390911


Scaling application-level dynamic taint analysis to enterprise-scale distributed systems

作者: Fu, Xiaoqin and Cai, Haipeng
关键词: distributed systems, dynamic taint analysis, new bugs, scalability

Abstract

With the increasing deployment of enterprise-scale distributed systems, effective and practical defenses for such systems against various security vulnerabilities, such as sensitive data leaks, are urgently needed. However, most existing solutions are limited to centralized programs. For real-world distributed systems, which operate at large scale, current solutions commonly face one or more of scalability, applicability, and portability challenges. To overcome these challenges, we develop a novel dynamic taint analysis for enterprise-scale distributed systems. To achieve scalability, we use a multi-phase analysis strategy to reduce the overall cost. To address the applicability challenge, we infer implicit dependencies by partially ordering method events in distributed programs. To achieve greater portability, the analysis is designed to work at the application level without customizing platforms. Empirical results have shown promising scalability and capabilities of our approach.

DOI: 10.1145/3377812.3390910


Evolutionary hot-spots in software systems

作者: Benkoczi, Robert and Gaur, Daya and Hossain, Shahadat and Khan, Muhammad and Tedlapu, Ajay Raj
关键词: No keywords

Abstract

We propose a methodology to study and visualize the evolution of the modular structure of a network of functional dependencies in a software system. Our method identifies periods of significant refactoring activities, also known as the evolutionary hot spots in software systems. Our approach is based on clustering design structure matrices of functional dependencies and Kleinberg’s method of identifying evolutionary hot-spots in dynamic networks. As a case study, we characterize the evolution of the modular structure of Octave over its entire life cycle.

DOI: 10.1145/3377812.3390909


Clairvoyance: cross-contract static analysis for detecting practical reentrancy vulnerabilities in smart contracts

作者: Ye, Jiaming and Ma, Mingliang and Lin, Yun and Sui, Yulei and Xue, Yinxing
关键词: cross contract analysis, path feasibility analysis, reentrancy detection, smart contract security

Abstract

Reentrancy bugs in smart contracts caused devastating financial losses in 2016 and are considered among the most severe vulnerabilities in smart contracts. Most existing general-purpose security tools for smart contracts claim to be able to detect reentrancy bugs. In this paper, we present Clairvoyance, a cross-function and cross-contract static analysis that detects reentrancy vulnerabilities in smart contracts by identifying infeasible paths. To reduce false positives, we have summarized five major path protective techniques (PPTs) to support fast yet precise path feasibility checking. We have implemented our approach and compared Clairvoyance with three state-of-the-art tools on 17,770 real-world contracts. The results show that Clairvoyance yields the best detection accuracy among all the tools.
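For readers unfamiliar with the bug class, here is a toy sketch of the classic pattern such static checkers look for: an external call that happens before the contract's state update. The statement labels are simplified stand-ins, not Solidity or any real intermediate representation, and this is far simpler than Clairvoyance's path-feasibility analysis.

```python
# Toy illustration of the classic reentrancy pattern a static checker looks
# for: an external call that occurs *before* the contract's state update,
# letting a malicious callee re-enter while the balance is still unspent.
# Statement kinds are simplified labels, not real Solidity IR.

def has_reentrancy_risk(statements):
    """statements: ordered list of 'external_call' / 'state_write' / other."""
    seen_call = False
    for kind in statements:
        if kind == "external_call":
            seen_call = True
        elif kind == "state_write" and seen_call:
            return True   # state changed after funds already left: risky
    return False

# A withdraw() that sends ether, then zeroes the balance: flagged.
risky = has_reentrancy_risk(["require_check", "external_call", "state_write"])
# Checks-effects-interactions ordering: not flagged.
safe = has_reentrancy_risk(["require_check", "state_write", "external_call"])
```

Real tools must additionally decide whether the flagged path is actually feasible, which is exactly what the paper's path protective techniques address.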

DOI: 10.1145/3377812.3390908


Towards automatic assessment of object-oriented programs

作者: Ardimento, Pasquale and Bernardi, Mario Luca and Cimitile, Marta
关键词: abstract data types, inheritance, object oriented development, polymorphism, student assessment

Abstract

The computing education community has shown a long-standing interest in how to analyze the Object-Oriented (OO) source code developed by students to provide them with useful formative tips. In this paper, we propose and evaluate an approach to analyze how students use Java and its language constructs. The approach is implemented through a cloud-based integrated development environment (IDE) and is based on the analysis of the most common violations of the OO paradigm in student source code. Moreover, the IDE supports the automatic generation of reports about students’ mistakes and misconceptions that instructors can use to improve course design. The paper discusses the preliminary results of an experiment performed in a class of a Programming II course to investigate the effects of the provided reports in terms of coding ability (concerning the correctness of the produced code).

DOI: 10.1145/3377812.3390907


Bugine: a bug report recommendation system for Android apps

作者: Li, Ziqiang and Tan, Shin Hwei
关键词: Android apps, bug report, recommendation system

Abstract

Many automated test generation tools have been proposed for finding bugs in Android apps. However, a recent study revealed that developers prefer reading test cases written in natural language. We present Bugine, a new bug recommendation system that automatically selects relevant bug reports from other applications that have similar bugs. Bugine (1) searches for GitHub issues that mention common UI components shared between the app under test and the apps in our database, and (2) ranks the quality and relevance of these issues. Our results show that Bugine found 34 new bugs in five evaluated apps.

DOI: 10.1145/3377812.3390906


作者: Xu, Chen and Xiong, Yan and Huang, Wenchao and Meng, Zhaoyi and Miao, Fuyou and Su, Cheng and Mo, Guangshuai
关键词: No keywords

Abstract

With the prosperity of Android, developers need to deal with compatibility issues among different devices, which is costly. In this paper, we propose an automated and general approach named ICARUS to identify compatibility-related APIs in Android apps. The insight behind our approach is that compatibility-related APIs have a biased distribution among code segments, similar to the distribution of keywords among documents. This motivates us to leverage statistical features to discriminate compatibility-related APIs from normal APIs. Experimental results on apps demonstrate the effectiveness of our work.
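The keyword analogy can be made concrete with a tf-idf-style score. This is a hedged illustration of the stated intuition, not the ICARUS feature set; the API names (`Build.VERSION`, `log`) and the scoring formula are invented for the example.

```python
import math

# Hedged sketch of the stated intuition (illustrative, not the ICARUS tool):
# an API whose calls concentrate in a few code segments gets a high "bias"
# score, much like tf-idf separates keywords from common words in documents.

def bias_score(api, segments):
    """segments: list of lists of API names appearing in each code segment."""
    containing = sum(1 for seg in segments if api in seg)
    if containing == 0:
        return 0.0
    tf = max(seg.count(api) for seg in segments)    # peak local frequency
    idf = math.log(len(segments) / containing)      # rarity across segments
    return tf * idf

segments = [
    ["Build.VERSION", "Build.VERSION", "log"],  # compat checks cluster here
    ["log", "toast"],
    ["log", "toast"],
    ["log"],
]
compat = bias_score("Build.VERSION", segments)  # concentrated: high score
common = bias_score("log", segments)            # everywhere: score of zero
```

An API like the hypothetical `Build.VERSION` that clusters in one segment scores high, while a utility call spread across every segment scores zero, which is the separation a statistical classifier can exploit.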

DOI: 10.1145/3377812.3390905


Lean kanban in an industrial context: a success story

作者: Pato, Roberto Hens and Granada, David and Vara, Juan M. and Marcos, Esperanza
关键词: No keywords

Abstract

Even though Lean principles have already been broadly applied to the manufacturing industry [1], we cannot say the same regarding software development. The objective of this article is therefore to present a real experience where the Lean Kanban method [2] was applied by a software development team from an IT consulting firm. The team (7 people) is responsible for the maintenance of internal management applications at a large governmental organization (over 4,000 employees). It had to combine new evolutionary developments with corrective maintenance and incident resolution within the production area of 20 to 25 information systems with heterogeneous purposes and technologies.

DOI: 10.1145/3377812.3390904


Preliminary findings on FOSS dependencies and security: a qualitative study on developers’ attitudes and experience

作者: Pashchenko, Ivan and Vu, Duc-Ly and Massacci, Fabio
关键词: dependency management, interviews, qualitative study, security, vulnerable dependencies

Abstract

Developers are known to keep third-party dependencies of their projects outdated even if some of them are affected by known vulnerabilities. In this study we aim to understand why they do so. For this, we conducted 25 semi-structured interviews with developers of both large and small-medium enterprises located in nine countries. All interviews were transcribed, coded, and analyzed according to applied thematic analysis. The results of the study reveal important aspects of developers’ practices that should be considered by security researchers and dependency tool developers to improve the security of the dependency management process.

DOI: 10.1145/3377812.3390903


What disconnects practitioner belief and empirical evidence?

作者: Shrikanth, N. C. and Menzies, Tim
关键词: beliefs, defects, empirical software engineering, practitioner

Abstract

Just because software developers say they believe in “X”, that does not necessarily mean that “X” is true. As shown here, there exist numerous beliefs listed in the recent Software Engineering literature which are only supported by small portions of the available data. Hence we ask: what is the source of this disconnect between beliefs and evidence? To answer this question we look for evidence for ten beliefs within 300,000+ changes seen in dozens of open-source projects. Some of those beliefs had strong support across all the projects; specifically, “A commit that involves more added and removed lines is more bug-prone” and “Files with fewer lines contributed by their owners (who contribute most changes) are bug-prone”. Most of the widely-held beliefs studied are only sporadically supported in the data; i.e., large effects can appear in project data and then disappear in subsequent releases. Such sporadic support explains why developers believe things that were relevant to their prior work, but not necessarily their current work.
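To make the belief-checking idea concrete, here is a minimal sketch of testing one of the supported beliefs ("commits with more added and removed lines are more bug-prone") against data. The commit records below are invented toy data, and comparing group means is a deliberately crude stand-in for the paper's statistical analysis.

```python
# Minimal sketch of checking one belief against data, in the spirit of the
# paper: compare the churn (added + removed lines) of bug-inducing commits
# against clean ones. The numbers below are invented toy data.

commits = [
    # (lines_added, lines_removed, introduced_bug)
    (500, 200, True),
    (350, 400, True),
    (20,  5,   False),
    (10,  2,   False),
    (60,  30,  False),
]

def mean_churn(rows):
    """Average added+removed lines over a set of commit records."""
    return sum(a + r for a, r, _ in rows) / len(rows)

buggy = [c for c in commits if c[2]]
clean = [c for c in commits if not c[2]]

# The belief holds on this toy sample if buggy commits have higher churn.
belief_supported = mean_churn(buggy) > mean_churn(clean)
```

The paper's point is that this check must be repeated per project and per release: a belief that holds on one such sample can vanish on the next.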

DOI: 10.1145/3377812.3390802


Restoring reproducibility of Jupyter notebooks

作者: Wang, Jiawei and Kuo, Tzu-yang and Li, Li and Zeller, Andreas
关键词: Jupyter notebooks, Osiris, Python, reproducibility

Abstract

Jupyter notebooks (documents that contain live code, equations, visualizations, and narrative text) are now among the most popular means to compute, present, discuss, and disseminate scientific findings. In principle, Jupyter notebooks should easily allow one to reproduce and extend scientific computations and their findings; in practice, this is not the case. The individual code cells in a Jupyter notebook can be executed in any order, with identifier usages preceding their definitions and results preceding their computations. In a sample of 936 published notebooks that would be executable in principle, we found that 73% of them would not be reproducible with straightforward approaches, requiring humans to infer (and often guess) the order in which the authors created the cells. In this paper, we present an approach to (1) automatically satisfy dependencies between code cells to reconstruct possible execution orders of the cells; and (2) instrument code cells to mitigate the impact of non-reproducible statements (e.g., random functions) in Jupyter notebooks. Our Osiris prototype takes a notebook as input and outputs the possible execution schemes that reproduce the exact notebook results. In our sample, Osiris was able to reconstruct such schemes for 82.23% of all executable notebooks, which is more than three times better than the state of the art; the resulting reordered code is valid program code and thus available for further testing and analysis.
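The dependency-recovery idea in step (1) can be sketched with a toy scheduler: pick a cell order in which every name is defined before it is used. This is only an illustration of the concept, not the Osiris implementation, and the def/use sets per cell are given by hand rather than extracted from real notebook code.

```python
# Toy sketch of the dependency-recovery idea (not the Osiris implementation):
# choose a cell execution order in which every name is defined before it is
# used, given simple def/use sets per cell.

def valid_order(cells):
    """cells: list of (defs, uses) pairs; returns indices in a runnable order,
    or None if no order satisfies all dependencies."""
    remaining = dict(enumerate(cells))
    defined, order = set(), []
    while remaining:
        progress = False
        for idx, (defs, uses) in sorted(remaining.items()):
            if set(uses) <= defined:          # all inputs already defined
                order.append(idx)
                defined |= set(defs)
                del remaining[idx]
                progress = True
                break
        if not progress:                      # circular or missing definition
            return None
    return order

# Cells listed in notebook order, but cell 0 uses `df`, which cell 1 defines,
# and cell 1 in turn needs the import from cell 2.
order = valid_order([
    (["plot"], ["df"]),       # cell 0: plot = df.plot()
    (["df"], ["pd"]),         # cell 1: df = pd.read_csv(...)
    (["pd"], []),             # cell 2: import pandas as pd
])
```

The recovered order runs the cells bottom-up (2, 1, 0), which is exactly the kind of schedule a human would otherwise have to guess.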

DOI: 10.1145/3377812.3390803


Identification of cultural influences on requirements engineering activities

作者: Alsanoosy, Tawfeeq and Spichkova, Maria and Harland, James
关键词: culture, global software development, requirement engineering

Abstract

Requirements Engineering (RE) involves critical activities to ensure the accurate elicitation and documentation of clients’ requirements. RE is a socio-technical activity and requires intensive communication with several clients. RE activities might be considerably influenced by individuals’ cultural background, because culture has a deep impact on the way in which people communicate. However, there has been limited exploration of this issue. We present a framework that identifies and analyses cultural influences on RE activities. To build the framework, we employed Hofstede’s cultural model and a mixed-methods design comprising two case studies involving two cultures: Saudi Arabia and Australia. The evaluation highlighted that the framework identifies cultural influences with high accuracy in other cultures as well.

DOI: 10.1145/3377812.3390804


An exploratory study on improving automated issue triage with attached screenshots

作者: Aktas, Ethem Utku and Yilmaz, Cemal
关键词: issue report assignment, issue triage, optical character recognition, text mining

Abstract

Issue triage is a manual and time-consuming process for both open- and closed-source software projects. Triagers first validate the issue reports and then find the appropriate developers or teams to solve them. In our industrial case, we automated the assignment part of the problem with a machine-learning-based approach. However, the automated system’s average accuracy is 3% below the human triagers’ performance. In our effort to improve our approach, we analyzed the incorrectly assigned issue reports and realized that many of them have attachments, mostly screenshots. Such issue reports generally have shorter descriptions than the ones without attachments, which we consider one of the reasons for incorrect classification. In this study, we describe our proposed approach to include this new piece of information for issue triage and present the initial results.
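
The idea of enriching a short description with text recovered from the attached screenshot can be illustrated with a toy nearest-centroid assigner; this is a hedged sketch, not the paper's classifier, and the team vocabularies and OCR output below are invented for illustration:

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign_team(description, ocr_text, team_profiles):
    """Pick the team whose historical vocabulary best matches description + OCR text."""
    combined = bow(description + " " + ocr_text)
    return max(team_profiles, key=lambda t: cosine(combined, team_profiles[t]))

# Hypothetical team vocabularies built from previously resolved issues.
teams = {
    "payments": bow("transfer payment card declined amount invoice"),
    "login": bow("password login session token expired authentication"),
}
# The description alone is ambiguous; OCR text from the screenshot disambiguates.
team = assign_team("app shows error", "session token expired please login again", teams)
```

Here the screenshot text is what carries the discriminating terms, mirroring the paper's observation that attachment-bearing reports have short descriptions.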

DOI: 10.1145/3377812.3390805


Open-vocabulary models for source code

作者: Karampatsis, Rafael-Michael and Babii, Hlib and Robbes, Romain and Sutton, Charles and Janes, Andrea
关键词: byte-pair encoding, naturalness of code, neural language models

Abstract

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open-vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.
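
The keywords point to byte-pair encoding (BPE) as the open-vocabulary mechanism: frequent character pairs are iteratively merged into subword units, so rare identifiers decompose into known pieces instead of becoming out-of-vocabulary tokens. A minimal sketch of the merge-learning loop on a toy identifier corpus (invented frequencies, illustration only):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn byte-pair merges from a {word: frequency} dict (toy sketch of BPE)."""
    vocab = {tuple(w): f for w, f in words.items()}  # words as tuples of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for word, freq in vocab.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Identifiers sharing prefixes/suffixes, as common in code ("get...", "...name").
merges, vocab = bpe_merges({"getname": 3, "getsize": 2, "setname": 2}, 4)
```

After a few merges, shared fragments like `get` or `name` become single subword units, which is why BPE keeps the vocabulary bounded even as identifiers proliferate.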

DOI: 10.1145/3377812.3390806


Building a theory of software teams organization in a continuous delivery context

作者: Leite, Leonardo and Kon, Fabio and Pinto, Gustavo and Meirelles, Paulo
关键词: DevOps, continuous delivery, release process, software teams

Abstract

Based on Grounded Theory guidelines, we interviewed 27 IT professionals to investigate how organizations pursuing continuous delivery should organize their development and operations teams. In this paper, we present the discovered organizational structures: (1) siloed departments, (2) classical DevOps, (3) cross-functional teams, and (4) platform teams.

DOI: 10.1145/3377812.3390807


An intelligent tool for combatting contract cheating behaviour by facilitating scalable student-tutor discussions

作者: Renzella, Jake and Cain, Andrew and Schneider, Jean-Guy
关键词: computing assessment, computing education, contract cheating detection, human computer interaction, learning management system, online education, programming education

Abstract

With the global increase in demand for online tertiary education, teachers face unique challenges in scaling assessment activities and meaningful student engagement. One such aspect is contract cheating behaviours exhibited in the modern online environment, posing a threat to the academic integrity of tertiary education. These obstacles are amplified in traditionally difficult domains like introductory programming education. Prior research on contract cheating identification proposes that, while challenging, techniques such as developing strong teacher-student relationships and real-time discussions may lead to identifying contract cheating behaviours. The proposition, then, is to scale real-time, student-teacher discussions with large, online cohorts, similar to the discussions that traditionally took place in the classroom. This poster paper presents Intelligent Discussion Comments (IDCs): a scalable, teacher-asynchronous system which engages students in real-time discussions to extract authentic student understanding. Artificial intelligence services such as voice identification and transcription enrich the discussion process, supporting the teaching team in their decision-making.

DOI: 10.1145/3377812.3390795


Debugging inputs

作者: Kirschner, Lukas and Soremekun, Ezekiel and Zeller, Andreas
关键词: No keywords

Abstract

Program failures are often caused by invalid inputs, for instance due to input corruption. To obtain a passing input, one needs to debug the data. In this paper we present a generic technique called ddmax that (1) identifies which parts of the input data prevent processing, and (2) recovers as much of the (valuable) input data as possible. To the best of our knowledge, ddmax is the first approach that fixes faults in the input data without requiring program analysis. In our evaluation, ddmax repaired about 69% of input files and recovered about 78% of data within one minute per input.
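
The flavor of input repair without program analysis can be shown with a brute-force stand-in for ddmax: treat the program as a black-box pass/fail oracle and drop the fewest characters that make the input acceptable, thereby both localizing the corruption and maximizing the recovered data. The actual ddmax algorithm uses maximizing delta debugging and scales far better than this exhaustive sketch:

```python
import json
from itertools import combinations

def ddmax_greedy(data, passes):
    """Drop the fewest characters until passes(data) holds.

    A much-simplified, exponential stand-in for ddmax; fine for tiny inputs only.
    """
    if passes(data):
        return data
    for k in range(1, len(data) + 1):        # smallest deletions first
        for drop in combinations(range(len(data)), k):
            candidate = "".join(c for i, c in enumerate(data) if i not in drop)
            if passes(candidate):
                return candidate              # maximal data recovered
    return ""

def parses(s):
    """Black-box oracle: does the consuming program accept the input?"""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

broken = '{"a": 1,x "b": 2}'                  # one corrupted byte
fixed = ddmax_greedy(broken, parses)          # drops only the stray 'x'
```

Note that only the oracle is consulted; no knowledge of the JSON grammar or the parser's internals is needed, which is the property the abstract highlights.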

DOI: 10.1145/3377812.3390797


Managing data constraints in database-backed web applications

作者: Yang, Junwen and Sethi, Utsav and Yan, Cong and Cheung, Alvin and Lu, Shan
关键词: No keywords

Abstract

There are often constraints associated with data used in software, describing the expected length, value, uniqueness, and other properties of the stored data. Correctly specifying and checking such constraints is crucial for the reliability, maintainability, and usability of software. This is particularly important for database-backed web applications, where a huge amount of data generated by millions of users plays a central role in user interaction and application logic. Furthermore, such data persists in the database and needs to continue serving users despite frequent software upgrades [2] and data migration [1]. As a result, consistently and comprehensively specifying data constraints, checking them, and handling constraint violations are of the utmost importance.
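
To make the kinds of constraints concrete, here is a hedged sketch of declarative length, presence, and uniqueness rules checked in application code, in the spirit of Rails-style model validations; the field names and rule set are invented for illustration and are not from the paper:

```python
# Hypothetical declarative constraints on two fields of a web-app model.
CONSTRAINTS = {
    "username": {"presence": True, "max_length": 20, "unique": True},
    "email": {"presence": True, "max_length": 255, "unique": True},
}

def violations(record, existing, constraints=CONSTRAINTS):
    """Return the list of constraint violations for one record.

    `existing` stands in for rows already persisted in the database; a real
    system would also enforce uniqueness with a DB constraint to avoid races.
    """
    errors = []
    for field, rules in constraints.items():
        value = record.get(field)
        if rules.get("presence") and not value:
            errors.append(f"{field}: must be present")
            continue
        if "max_length" in rules and len(value) > rules["max_length"]:
            errors.append(f"{field}: longer than {rules['max_length']}")
        if rules.get("unique") and any(r.get(field) == value for r in existing):
            errors.append(f"{field}: already taken")
    return errors

existing = [{"username": "alice", "email": "a@x.com"}]
errs = violations({"username": "alice", "email": ""}, existing)
```

The gap the paper targets is that such application-level rules and the database's own constraints can drift apart across upgrades and migrations.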

DOI: 10.1145/3377812.3390798


Testing DNN image classifiers for confusion &amp; bias errors

作者: Tian, Yuchi and Zhong, Ziyuan and Ordonez, Vicente and Kaiser, Gail and Ray, Baishakhi
关键词: DNNs, bias, deep learning, image classifiers, whitebox testing

Abstract

We found that many of the reported erroneous cases in popular DNN image classifiers occur because the trained models confuse one class with another or show biases towards some classes over others. Most existing DNN testing techniques focus on per-image violations and so fail to detect class-level confusions or biases. We developed a testing technique to automatically detect class-based confusion and bias errors in DNN-driven image classification software. We evaluated our implementation, DeepInspect, on several popular image classifiers with precision up to 100% (avg. 72.6%) for confusion errors, and up to 84.3% (avg. 66.8%) for bias errors.
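
The difference between per-image and class-level checking can be sketched by aggregating prediction errors into directed class pairs and flagging pairs whose misclassification rate is disproportionately high; the labels, rate-based criterion, and threshold below are illustrative assumptions, not DeepInspect's actual (neuron-coverage-based) metric:

```python
from collections import Counter

def confusion_pairs(true_labels, predicted, threshold=0.1):
    """Flag directed class pairs (t -> p) misclassified disproportionately often."""
    errors = Counter()
    totals = Counter(true_labels)
    for t, p in zip(true_labels, predicted):
        if t != p:
            errors[(t, p)] += 1
    # A pair is suspicious if the rate of t being predicted as p exceeds threshold.
    return {pair: n / totals[pair[0]]
            for pair, n in errors.items()
            if n / totals[pair[0]] > threshold}

# Synthetic test set: the model systematically confuses cats with dogs,
# while the single car->cat error stays below the threshold.
true_ = ["cat"] * 10 + ["dog"] * 10 + ["car"] * 10
pred = ["cat"] * 7 + ["dog"] * 3 + ["dog"] * 10 + ["car"] * 9 + ["cat"] * 1
suspicious = confusion_pairs(true_, pred)
```

A per-image oracle would report each of the four errors identically; only the aggregation reveals that cat/dog confusion is a systematic, class-level defect.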

DOI: 10.1145/3377812.3390799


Fluid intelligence doesn’t matter! effects of code examples on the usability of crypto APIs

作者: Mindermann, Kai and Wagner, Stefan
关键词: example code, intelligence, security, usability

Abstract

Context: Programmers frequently look for the code of previously solved problems that they can adapt for their own problem. Despite existing example code on the web, on sites like Stack Overflow, cryptographic Application Programming Interfaces (APIs) are commonly misused. Little is known about what makes examples helpful for developers in using crypto APIs. Analogical problem solving is a psychological theory that investigates how people use known solutions to solve new problems. There is evidence that the capacity to reason and solve novel problems, a.k.a. Fluid Intelligence (Gf), and structurally and procedurally similar solutions support problem solving. Aim: Our goal is to understand whether similarity and Gf also have an effect in the context of using cryptographic APIs with the help of code examples. Method: We conducted a controlled experiment with 76 student participants developing with or without procedurally similar examples, using one of two Java crypto libraries, and measured the participants’ Gf as well as the effect on usability (effectiveness, efficiency, satisfaction) and security bugs. Results: We observed a strong effect of code examples with high procedural similarity on all dependent variables. Fluid intelligence (Gf) had no effect. It also made no difference which library the participants used. Conclusions: To have a positive effect in a development task, example code must be highly similar to a concrete solution, rather than abstract and generic.

DOI: 10.1145/3377812.3390892


Semantic analysis of issues on Google play and Twitter

作者: Yadav, Aman and Fard, Fatemeh H.
关键词: Twitter, app review analysis, bug reports, semantic analysis

Abstract

Mobile app users post their opinions about apps, report bugs, or request features on various platforms, the main one being app stores. Previous research suggests that Twitter should be used as an additional resource for receiving users’ feedback, as app users tweet about different issues. Although classification and review summarization methods have been developed previously for each platform separately, manual investigation of reviews or tweets is still required to identify the similar or different points discussed on the App Store or Twitter. In this paper, we propose a framework to automatically study the differences and similarities among app reviews from the Google Play Store and tweets by using the semantics of the words. The results from several experiments, compared with expert evaluation, confirm that it can be applied to identify the similarities or differences among the extracted topics, n-grams, and users’ comments.

DOI: 10.1145/3377812.3390790


Summary-guided incremental symbolic execution

作者: Yi, Qiuping and Wen, Junye and Yang, Guowei
关键词: No keywords

Abstract

Symbolic execution is a powerful technique for systematically exploring program paths, but scaling symbolic execution to practical programs remains challenging. State-of-the-art techniques face the challenge of efficiently exploring incremental behaviors, especially for highly coupled programs with complex control and data dependencies. In this paper, we present a novel approach for incremental symbolic execution based on an iteration loop between path exploration and path-suffix summarization. On one hand, the explored paths are summarized to enable more precise identification of affected paths; on the other hand, the summary guides path exploration to prune paths that have no incremental behaviors. We implemented a prototype of our approach and conducted experiments on a set of real-world applications. The results show that it is efficient and effective in exploring incremental behaviors.

DOI: 10.1145/3377812.3390895


Elite developers’ activities at open source ecosystem level

作者: Wang, Zhendong and Feng, Yang and Wang, Yi and Jones, James A. and Redmiles, David
关键词: activity, elite developers, open source ecosystem, project outcomes

Abstract

OSS ecosystems promote code reuse and knowledge sharing across the projects within them. An ecosystem’s developers often develop similar activity patterns, which might impact project outcomes in an ecosystem-specific way. Since elite developers play critical roles in most OSS projects, investigating their behaviors at the ecosystem level becomes urgent. Thus, we aim to investigate elite developers’ activities and their relationships with project outcomes (productivity and quality). We design a large-scale empirical study which characterizes elite developers’ activity profiles and identifies the relationships between their effort allocations and project outcomes across five ecosystems. Our current results and findings reveal that elite developers in each ecosystem do behave in ecosystem-specific ways. Further, we find that the elites’ effort allocations on different activity categories are potentially correlated with project outcomes.

DOI: 10.1145/3377812.3390894


GUI-focused overviews of mobile development videos

作者: Alahmadi, Mohammad and Malkadi, Abdulkarim and Haiduc, Sonia
关键词: deep learning, mobile development, programming video tutorials, software documentation, video mining

Abstract

The need for mobile applications and mobile programming is increasing due to the continuous rise in the pervasiveness of mobile devices. Developers often refer to video programming tutorials to learn more about mobile programming topics. To find the right video to watch, developers typically skim over several videos, looking at their title, description, and video content in order to determine if they are relevant to their information needs. Unfortunately, the title and description do not always provide an accurate overview, and skimming over videos is time-consuming and can lead to missing important information. We propose a novel approach that locates and extracts the GUI screens showcased in a video tutorial, then selects and displays the most representative ones to provide a GUI-focused overview of the video. We believe this overview can be used by developers as an additional source of information for determining if a video contains the information they need. To evaluate our approach, we performed an empirical study on iOS and Android programming screencasts which investigates the accuracy of our automated GUI extraction. The results reveal that our approach can detect and extract GUI screens with an accuracy of 94%.

DOI: 10.1145/3377812.3390900


Improving automated program repair using two-layer tree-based neural networks

作者: Li, Yi and Wang, Shaohua and Nguyen, Tien N.
关键词: automated program repair, deep learning

Abstract

We present DLFix, a two-layer tree-based model learning bug-fixing code changes and their surrounding code context to improve Automated Program Repair (APR). The first layer learns the surrounding code context of a fix and uses it as weights for the second layer, which learns the bug-fixing code transformation. Our empirical results on Defects4J show that DLFix can fix 30 bugs and that its results are comparable and complementary to the best-performing pattern-based APR tools. Furthermore, DLFix can fix 2.5 times more bugs than the best-performing deep learning baseline.

DOI: 10.1145/3377812.3390896


An empirical study on the characteristics of question-answering process on developer forums

作者: Li, Yi and Wang, Shaohua and Nguyen, Tien N.
关键词: API documentation, question answering

Abstract

Developer forums are among the most popular and useful Q&A websites on API usage. The analysis of API forums can be a critical step towards automated question-and-answer approaches. In this poster, we empirically study three API forums (Twitter, eBay, and AdWords) to investigate the characteristics of the question-answering process. We observe that more than 60% of the posts on all forums were answered with API method names or documentation. More than 85% of the questions were answered by API development teams, and the answers from API development teams drew fewer follow-up questions. Our results provide empirical evidence for future work on building automated solutions to answer developer questions on API forums.

DOI: 10.1145/3377812.3390897


Towards understanding and fixing upstream merge induced conflicts in divergent forks: an industrial case study

作者: Sung, Chungha and Lahiri, Shuvendu K. and Kaufman, Mike and Choudhury, Pallavi and Wolk, Jessica and Wang, Chao
关键词: No keywords

Abstract

Divergent forks are a common practice in open-source software development to perform long-term, independent and diverging development on top of a popular source repository. However, keeping such divergent downstream forks in sync with the upstream source evolution poses engineering challenges in terms of frequent merge conflicts. In this work, we conduct the first industrial case study of frequent merges from upstream and the resulting merge conflicts, in the context of Microsoft Edge development. The study consists of two parts. First, we describe the nature of merge conflicts that arise due to merges from upstream. Second, we investigate the feasibility of automatically fixing a class of merge conflicts related to build breaks that consume a significant amount of developer time to root-cause and fix. Towards this end, we have implemented a tool MrgBldBrkFixer and evaluate it on three months of real Microsoft Edge Beta development data, and report encouraging results.

DOI: 10.1145/3377812.3390800


Importance-driven deep learning system testing

作者: Gerasimou, Simos and Eniser, Hasan Ferit and Sen, Alper and Cakan, Alper
关键词: No keywords

Abstract

Deep Learning (DL) systems are key enablers for engineering intelligent applications. Nevertheless, using DL systems in safety- and security-critical applications requires providing testing evidence for their dependable operation. We introduce DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC enables establishing a layer-wise functional understanding of the importance of DL system components and using this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems and across multiple DL datasets demonstrates the usefulness and effectiveness of DeepImportance.

DOI: 10.1145/3377812.3390793


Refactor4Green: a game for novice programmers to learn code smells

作者: Agrahari, Vartika and Chimalakonda, Sridhar
关键词: code smells, energy-efficiency, game, refactoring

Abstract

The rise in awareness of sustainable software has led to a focus on energy efficiency and consideration of code smells during software development. This eventually requires software engineering teachers to focus on topics such as code smells in their software engineering courses to bring awareness among students on the impact of code smells and bad design choices, not just for the software but also for the environment. Thus, we propose a desktop game named Refactor4Green to teach code smells and refactoring to novice programmers. The core idea of the game is to introduce code smells with refactoring choices through the theme of a green environment. We conducted a preliminary study with university students and got positive feedback from 83.06% of the participants.

DOI: 10.1145/3377812.3390792


Industry Agile practices in large-scale capstone projects

作者: Schneider, Jean-Guy and Eklund, Peter W. and Lee, Kevin and Chen, Feifei and Cain, Andrew and Abdelrazek, Mohamed
关键词: Agile software development, capstone projects, software engineering education

Abstract

To give students as authentic a learning experience as possible, many software-focused degrees incorporate team-based capstone projects in the final year of study. Designing capstone projects, however, is not a trivial undertaking, and a number of constraints need to be considered, especially when it comes to defining learning outcomes, choosing clients and projects, providing guidance to students, creating an effective project “support infrastructure”, and measuring student outcomes. To address these challenges, we propose a novel, scalable model for managing capstone projects, called ACE, that adapts Spotify’s Squads and Tribes organization to an educational setting. We present our motivation, the key components of the model, and its adoption, and report preliminary observations.

DOI: 10.1145/3377812.3390794


Real-world ethics for self-driving cars

作者: Holstein, Tobias and Dodig-Crnkovic, Gordana and Pelliccione, Patrizio
关键词: autonomous cars, challenges, decision making, ethics, self-driving cars, social aspects, software engineering, trolley problem

Abstract

Ethical and social problems of the emerging technology of self-driving cars can best be addressed through an applied engineering-ethical approach. Currently, however, social and ethical problems are typically presented in terms of an idealized, unsolvable decision-making problem: the so-called trolley problem. Instead, we propose that ethical analysis should focus on the ethics of complex real-world engineering problems. As software plays a crucial role in the control of self-driving cars, software engineering solutions should handle actual ethical and social considerations. We take a closer look at the regulative instruments, standards, design, and implementation of components, systems, and services, and we present practical social and ethical challenges that must be met in the ecology of the socio-technological system of self-driving cars, which implies novel expectations for software engineering in the automotive industry.

DOI: 10.1145/3377812.3390801




目录
  1. Learning-to-rank vs ranking-to-learn: strategies for regression testing in continuous integration
    1. Abstract
  • A cost-efficient approach to building in continuous integration
    1. Abstract
  • Practical fault detection in puppet programs
    1. Abstract
  • Learning from, understanding, and supporting DevOps artifacts for docker
    1. Abstract
  • Adapting requirements models to varying environments
    1. Abstract
  • Comparing formal tools for system design: a judgment study
    1. Abstract
  • Debugging inputs
    1. Abstract
  • Causal testing: understanding defects’ root causes
    1. Abstract
  • Impact analysis of cross-project bugs on software ecosystems
    1. Abstract
  • Taming behavioral backward incompatibilities via cross-project testing and analysis
    1. Abstract
  • Watchman: monitoring dependency conflicts for Python library ecosystem
    1. Abstract
  • One size does not fit all: a grounded theory and online survey study of developer preferences for security warning types
    1. Abstract
  • Schr"{o
    1. Abstract
  • How software practitioners use informal local meetups to share software engineering knowledge
    1. Abstract
  • Predicting developers’ negative feelings about code review
    1. Abstract
  • Near-duplicate detection in web app model inference
    1. Abstract
  • Extracting taint specifications for JavaScript libraries
    1. Abstract
  • SLACC: simion-based language agnostic code clones
    1. Abstract
  • Finding client-side business flow tampering vulnerabilities
    1. Abstract
  • Securing unsafe rust programs with XRust
    1. Abstract
  • Is rust used safely by software developers?
    1. Abstract
  • Burn after reading: a shadow stack with microsecond-level runtime rerandomization for protecting return addresses
    1. Abstract
  • SAVER: scalable, precise, and safe memory-error repair
    1. Abstract
  • Revealing injection vulnerabilities by leveraging existing tests
    1. Abstract
  • RoScript: a visual script driven truly non-intrusive robotic testing system for touch screen applications
    1. Abstract
  • Translating video recordings of mobile app usages into replayable scenarios
    1. Abstract
  • Unblind your apps: predicting natural-language labels for mobile GUI components by deep learning
    1. Abstract
  • SLEMI: equivalence modulo input (EMI) based mutation of CPS models for finding compiler bugs in Simulink
    1. Abstract
  • DeepBillboard: systematic physical-world testing of autonomous driving systems
    1. Abstract
  • Misbehaviour prediction for autonomous driving systems
    1. Abstract
  • Approximation-refinement testing of compute-intensive cyber-physical models: an approach based on system identification
    1. Abstract
  • A comprehensive study of autonomous vehicle bugs
    1. Abstract
  • Studying the use of Java logging utilities in the wild
    1. Abstract
  • A study on the prevalence of human values in software engineering publications, 2015 – 2018
    1. Abstract
  • Explaining pair programming session dynamics from knowledge gaps
    1. Abstract
  • Engineering gender-inclusivity into software: ten teams’ tales from the trenches
    1. Abstract
  • How has forking changed in the last 20 years? a study of hard forks on GitHub
    1. Abstract
  • Multiple-entry testing of Android applications by constructing activity launching contexts
    1. Abstract
  • ComboDroid: generating high-quality test inputs for Android apps via use case combinations
    1. Abstract
  • Time-travel testing of Android apps
    1. Abstract
  • HeteroRefactor: refactoring for heterogeneous computing with FPGA
    1. Abstract
  • HARP: holistic analysis for refactoring Python-based analytics programs
    1. Abstract
  • CC2Vec: distributed representations of code changes
    1. Abstract
  • Empirical review of automated analysis tools on 47,587 Ethereum smart contracts
    1. Abstract
  • Gap between theory and practice: an empirical study of security patches in solidity
    1. Abstract
  • An investigation of cross-project learning in online just-in-time software defect prediction
    1. Abstract
  • Understanding the automated parameter optimization on transfer learning for cross-project defect prediction: an empirical study
    1. Abstract
  • Software visualization and deep transfer learning for effective software defect prediction
    1. Abstract
  • Software documentation: the practitioners’ perspective
    1. Abstract
  • DLFix: context-based code transformation learning for automated program repair
    1. Abstract
  • On the efficiency of test suite based program repair: A Systematic Assessment of 16 Automated Repair Systems for Java Programs
    1. Abstract
  • Caspar: extracting and synthesizing user stories of problems from app reviews
    1. Abstract
  • Detection of hidden feature requests from massive chat messages via deep siamese network
    1. Abstract
  • A tale from the trenches: cognitive biases and software development
    1. Abstract
  • Recognizing developers’ emotions while programming
    1. Abstract
  • Neurological divide: an fMRI study of prose and code writing
    1. Abstract
  • Here we go again: why is it difficult for developers to learn another programming language?
    1. Abstract
  • Importance-driven deep learning system testing
    1. Abstract
  • ReluDiff: differential verification of deep neural networks
    1. Abstract
  • Dissector: input validation for deep learning applications by crossing-layer dissection
    1. Abstract
  • Towards characterizing adversarial defects of deep learning software from the lens of uncertainty
    1. Abstract
  • Gang of eight: a defect taxonomy for infrastructure as code scripts
    1. Abstract
  • MemLock: memory usage guided fuzzing
    1. Abstract
  • sFuzz: an efficient adaptive fuzzer for solidity smart contracts
    1. Abstract
  • Targeted greybox fuzzing with static lookahead analysis
    1. Abstract
  • Planning for untangling: predicting the difficulty of merge conflicts
    1. Abstract
  • Conquering the extensional scalability problem for value-flow analysis frameworks
    1. Abstract
  • Tailoring programs for static analysis via program transformation
    1. Abstract
  • Pipelining bottom-up data flow analysis
    1. Abstract
  • A novel approach to tracing safety requirements and state-based design models
    1. Abstract
  • Establishing multilevel test-to-code traceability links
    1. Abstract
  • Improving the effectiveness of traceability link recovery using hierarchical bayesian networks
    1. Abstract
  • How Android developers handle evolution-induced API compatibility issues: a large-scale study
    1. Abstract
  • An empirical study on API parameter rules
    1. Abstract
  • When APIs are intentionally bypassed: an exploratory study of API workarounds
    1. Abstract
  • Demystify official API usage directives with crowdsourced API misuse scenarios, erroneous code examples and patches
    1. Abstract
  • Simulee: detecting CUDA synchronization bugs via memory-access modeling
    1. Abstract
  • White-box fairness testing through adversarial sampling
    1. Abstract
  • Structure-invariant testing for machine translation
    1. Abstract
  • Automatic testing and improvement of machine translation
    1. Abstract
  • TRADER: trace divergence analysis and embedding regulation for debugging recurrent neural networks
    1. Abstract
  • Typestate-guided fuzzer for discovering use-after-free vulnerabilities
    1. Abstract
  • JVM fuzzing for JIT-induced side-channel detection
    1. Abstract
  • Ankou: guiding grey-box fuzzing towards combinatorial difference
    1. Abstract
  • BCFA: bespoke control flow analysis for CFA at scale
    1. Abstract
  • On the recall of static call graph construction in practice
    1. Abstract
  • Heaps’n leaks: how heap snapshots improve Android taint analysis
    1. Abstract
  • Big code != big vocabulary: open-vocabulary models for source code
    1. Abstract
  • Improving data scientist efficiency with provenance
    1. Abstract
  • Managing data constraints in database-backed web applications
    1. Abstract
  • Taxonomy of real faults in deep learning systems
    1. Abstract
  • Testing DNN image classifiers for confusion & bias errors
    1. Abstract
  • Repairing deep neural networks: fix patterns and challenges
    1. Abstract
  • Fuzz testing based data augmentation to improve robustness of deep neural networks
    1. Abstract
  • An empirical study on program failures of deep learning jobs
    1. Abstract
  • Primers or reminders? the effects of existing review comments on code review
    1. Abstract
  • Mitigating turnover with code review recommendation: balancing expertise, workload, and knowledge distribution
    1. Abstract
  • How do companies collaborate in open source ecosystems? an empirical study of OpenStack
    1. Abstract
  • How to not get rich: an empirical study of donations in open source
    1. Abstract
  • Scaling open source communities: an empirical study of the Linux kernel
    1. Abstract
  • SpecuSym: speculative symbolic execution for cache timing leak detection
    1. Abstract
  • Symbolic verification of message passing interface programs
    1. Abstract
  • Efficient generation of error-inducing floating-point inputs via symbolic execution
    1. Abstract
  • HyDiff: hybrid differential software analysis
    1. Abstract
  • Seenomaly: vision-based linting of GUI animation effects against design-don’t guidelines
    1. Abstract
  • Low-overhead deadlock prediction
    1. Abstract
  • An empirical assessment of security risks of global Android banking apps
    1. Abstract
  • Accessibility issues in Android apps: state of affairs, sentiments, and ways forward
    1. Abstract
  • Collaborative bug finding for Android apps
    1. Abstract
  • POSIT: simultaneously tagging natural and programming languages
    1. Abstract
  • CPC: automatically classifying and propagating natural language comments via program analysis
    1. Abstract
  • Suggesting natural method names to check name consistencies
    1. Abstract
  • Retrieval-based neural source code summarization
    1. Abstract
  • On learning meaningful assert statements for unit test cases
    1. Abstract
  • Quickly generating diverse valid test inputs with reinforcement learning
    1. Abstract
  • An evidence-based inquiry into the use of grey literature in software engineering
    1. Abstract
  • Towards the use of the readily available tests from the release pipeline as performance tests: are we there yet?
    1. Abstract
  • Verifying object construction
    1. Abstract
  • Automatically testing string solvers
    1. Abstract
  • A study on the lifecycle of flaky tests
    1. Abstract
  • Testing file system implementations on layered models
    1. Abstract
  • Co-evolving code with evolving metamodels
    1. Abstract
  • Lazy product discovery in huge configuration spaces
    1. Abstract
  • Reducing run-time adaptation space via analysis of possible utility bounds
    1. Abstract
  • Context-aware in-process crowdworker recommendation
    1. Abstract
  • A large-scale empirical study on vulnerability distribution within projects and the lessons learned
    1. Abstract
  • Unsuccessful story about few-shot malware family classification and Siamese network to the rescue
    1. Abstract
  • How does misconfiguration of analytic services compromise mobile privacy?
    1. Abstract
  • Interpreting cloud computer vision pain-points: a mining study of stack overflow
    1. Abstract
  • SLEMI: finding Simulink compiler bugs through equivalence modulo input (EMI)
    1. Abstract
  • PROMISE: high-level mission specification for multiple robots
    1. Abstract
  • SMRL: a metamorphic security testing tool for web systems
    1. Abstract
  • WasmView: visual testing for WebAssembly applications
    1. Abstract
  • AppTestMigrator: a tool for automated test migration for Android apps
    1. Abstract
  • Seraph: enabling cross-platform security analysis for EVM and WASM smart contracts
    1. Abstract
  • The SmartSHARK ecosystem for software repository mining
    1. Abstract
  • DeepMutation: a neural mutation tool
    1. Abstract
  • TimeTracer: a tool for back in time traceability replaying
    1. Abstract
  • SimilarAPI: mining analogical APIs for library migration
    1. Abstract
  • FeatureNET: diversity-driven generation of deep learning models
    1. Abstract
  • EvalDNN: a toolbox for evaluating deep neural network models
    1. Abstract
  • FuRong: fusing report of automated Android testing on multi-devices
    1. Abstract
  • Phoenix: a tool for automated data-driven synthesis of repairs for static analysis violations
    1. Abstract
  • VITALSE: visualizing eye tracking and biometric data
    1. Abstract
  • BigTest: a symbolic execution based systematic test generation tool for Apache Spark
    1. Abstract
  • PG-KLEE: trading soundness for coverage
    1. Abstract
  • RTj: a Java framework for detecting and refactoring rotten green test cases
    1. Abstract
  • GeekyNote: a technical documentation tool with coverage, backtracking, traces, and couplings
    1. Abstract
  • DroidMutator: an effective mutation analysis tool for Android applications
    1. Abstract
  • An SLR-tool: search process in practice: a tool to conduct and manage systematic literature review (SLR)
    1. Abstract
  • Nimbus: improving the developer experience for serverless applications
    1. Abstract
  • mCoq: mutation analysis for Coq verification projects
    1. Abstract
  • MPI-SV: a symbolic verifier for MPI programs
    1. Abstract
  • DCO analyzer: local controllability and observability analysis and enforcement of distributed test scenarios
    1. Abstract
  • Uncertainty-guided testing and robustness enhancement for deep learning systems
    1. Abstract
  • Evaluation of brain activity while pair programming
    1. Abstract
  • Machine translation testing via pathological invariance
    1. Abstract
  • Automatic generation of Simulink models to find bugs in a cyber-physical system tool chain using deep learning
    1. Abstract
  • Playing with your project data in scrum retrospectives
    1. Abstract
  • An empirical study of the first contributions of developers to open source projects on GitHub
    1. Abstract
  • Stress testing SMT solvers via type-aware mutation
    1. Abstract
  • The role of egocentric bias in undergraduate Agile software development teams
    1. Abstract
  • Studying and suggesting logging locations in code blocks
    1. Abstract
  • Exploring the relationship between dockerfile quality and project characteristics
    1. Abstract
  • Hanging by the thread: an empirical study of donations in open source
    1. Abstract
  • An automated framework for gaming platform to test multiple games
    1. Abstract
  • Improving bug detection and fixing via code representation learning
    1. Abstract
  • Automated analysis of inter-parameter dependencies in web APIs
    1. Abstract
  • Detection and mitigation of JIT-induced side channels
    1. Abstract
  • Does fixing bug increase robustness in deep learning?
    1. Abstract
  • An empirical study on the evolution of test smell
    1. Abstract
  • Efficient test execution in end to end testing: resource optimization in end to end testing through a smart resource characterization and orchestration
    1. Abstract
  • Towards automated migration for blockchain-based decentralized application
    1. Abstract
  • Skill-based engineering in industrial automation domain: skills modeling and orchestration
    1. Abstract
  • Scalable and approximate program dependence analysis
    1. Abstract
  • The effects of required security on software development effort
    1. Abstract
  • Towards greener Android application development
    1. Abstract
  • Towards DO-178C certification of adaptive learning UAV agents designed with a cognitive architecture
    1. Abstract
  • Bridging the divide between API users and API developers by mining public code repositories
    1. Abstract
  • Refactoring operations grounded in manual code changes
    1. Abstract
  • A composed technical debt identification methodology to predict software vulnerabilities
    1. Abstract
  • Variability aware requirements reuse analysis
    1. Abstract
  • KNOCAP: capturing and delivering important design bits in whiteboard design meetings
    1. Abstract
  • Towards providing automated supports to developers on writing logging statements
    1. Abstract
  • AI-driven web API testing
    1. Abstract
  • Performance regression detection in DevOps
    1. Abstract
  • Formalization and analysis of quantitative attributes of distributed systems
    1. Abstract
  • Bridging fault localisation and defect prediction
    1. Abstract
  • Improving students’ testing practices
    1. Abstract
  • The sustainability of quality in free and open source software
    1. Abstract
  • Understanding software changes: extracting, classifying, and presenting fine-grained source code changes
    1. Abstract
  • Search-based test generation for Android apps
    1. Abstract
  • Extracting archival-quality information from software-related chats
    1. Abstract
  • Software startups in growth phase SE practices adopted to SEE
    1. Abstract
  • Towards better technical debt detection with NLP and machine learning methods
    1. Abstract
  • Towards data integrity in Cassandra database applications using conceptual models
    1. Abstract
  • Experiential learning in computing accessibility education
    1. Abstract
  • Large-scale patch recommendation at Alibaba
    1. Abstract
  • Do preparatory programming lab sessions contribute to even work distribution in student teams?
    1. Abstract
  • A practical, collaborative approach for modeling big data analytics application requirements
    1. Abstract
  • A mixed methods research agenda to identify undergraduate misconceptions in software engineering, lecturers’ handling, and didactical implications
    1. Abstract
  • Understanding DevOps education with grounded theory
    1. Abstract
  • Understanding and handling alert storm for online service systems
    1. Abstract
  • Factors influencing software engineering career choice of Andean indigenous
    1. Abstract
  • ProvBuild: improving data scientist efficiency with provenance
    1. Abstract
  • How has forking changed in the last 20 years? a study of hard forks on GitHub
    1. Abstract
  • Scaling application-level dynamic taint analysis to enterprise-scale distributed systems
    1. Abstract
  • Evolutionary hot-spots in software systems
    1. Abstract
  • Clairvoyance: cross-contract static analysis for detecting practical reentrancy vulnerabilities in smart contracts
    1. Abstract
  • Towards automatic assessment of object-oriented programs
    1. Abstract
  • Bugine: a bug report recommendation system for Android apps
    1. Abstract
  • Identifying compatibility-related APIs by exploring biased distribution in Android apps
    1. Abstract
  • Lean kanban in an industrial context: a success story
    1. Abstract
  • Preliminary findings on FOSS dependencies and security: a qualitative study on developers’ attitudes and experience
    1. Abstract
  • What disconnects practitioner belief and empirical evidence?
    1. Abstract
  • Restoring reproducibility of Jupyter notebooks
    1. Abstract
  • Identification of cultural influences on requirements engineering activities
    1. Abstract
  • An exploratory study on improving automated issue triage with attached screenshots
    1. Abstract
  • Open-vocabulary models for source code
    1. Abstract
  • Building a theory of software teams organization in a continuous delivery context
    1. Abstract
  • An intelligent tool for combatting contract cheating behaviour by facilitating scalable student-tutor discussions
    1. Abstract
  • Debugging inputs
    1. Abstract
  • Managing data constraints in database-backed web applications
    1. Abstract
  • Testing DNN image classifiers for confusion & bias errors
    1. Abstract
  • Fluid intelligence doesn’t matter! effects of code examples on the usability of crypto APIs
    1. Abstract
  • Semantic analysis of issues on Google play and Twitter
    1. Abstract
  • Summary-guided incremental symbolic execution
    1. Abstract
  • Elite developers’ activities at open source ecosystem level
    1. Abstract
  • GUI-focused overviews of mobile development videos
    1. Abstract
  • Improving automated program repair using two-layer tree-based neural networks
    1. Abstract
  • An empirical study on the characteristics of question-answering process on developer forums
    1. Abstract
  • Towards understanding and fixing upstream merge induced conflicts in divergent forks: an industrial case study
    1. Abstract
  • Importance-driven deep learning system testing
    1. Abstract
  • Refactor4Green: a game for novice programmers to learn code smells
    1. Abstract
  • Industry Agile practices in large-scale capstone projects
    1. Abstract
  • Real-world ethics for self-driving cars
    1. Abstract