ICSE 2023 | 逸翎清晗🌈

CodeS: Towards Code Model Generalization Under Distribution Shift

Authors: Hu, Qiang and Guo, Yuejun and Xie, Xiaofei and Cordy, Maxime and Papadakis, Mike and Ma, Lei and Traon, Yves Le
Keywords: distribution shift, source code learning

Abstract

Distribution shift has been a longstanding challenge for the reliable deployment of deep learning (DL) models due to unexpected accuracy degradation. Although DL has become a driving force for large-scale source code analysis in the big code era, limited progress has been made on distribution shift analysis and benchmarking for source code tasks. To fill this gap, this paper proposes CodeS, a distribution shift benchmark dataset for source code learning. Specifically, CodeS supports two programming languages (Java and Python) and five shift types (task, programmer, time-stamp, token, and concrete syntax tree). Extensive experiments based on CodeS reveal that 1) out-of-distribution detectors from other domains (e.g., computer vision) do not generalize to source code, 2) all code classification models suffer from distribution shifts, 3) representation-based shifts have a higher impact on the model than others, and 4) pre-trained bimodal models are relatively more resistant to distribution shifts.
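As a concrete illustration of the kind of out-of-distribution detector the first finding refers to, a standard baseline from computer vision is maximum softmax probability (MSP): flag inputs on which the classifier is insufficiently confident. The sketch below is illustrative only (the logits and threshold are made up), not one of the detectors evaluated in CodeS:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    # Maximum softmax probability: high for confident (likely in-distribution)
    # predictions, lower for uncertain (potentially shifted) inputs.
    return max(softmax(logits))

def is_out_of_distribution(logits, threshold=0.7):
    # Flag an input as OOD when confidence falls below a threshold
    # tuned on held-out in-distribution data.
    return msp_score(logits) < threshold

in_dist = [4.0, 0.5, 0.2]   # peaked logits -> high MSP
shifted = [1.0, 0.9, 1.1]   # flat logits   -> low MSP
```

The paper's point is precisely that such confidence-based heuristics, which work reasonably on image data, fail to transfer to source code models.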

DOI: 10.1109/ICSE-NIER58687.2023.00007


Towards Using Few-Shot Prompt Learning for Automating Model Completion

Authors: Chaaben, Meriem Ben and Burgueño, Lola and Sahraoui, Houari
Keywords: model completion, domain modeling, prompt learning, few-shot learning, language models

Abstract

We propose a simple yet novel approach to improving completion in domain modeling activities. Our approach exploits the power of large language models by using few-shot prompt learning, without the need to train or fine-tune those models on large datasets, which are scarce in this field. We implemented our approach and tested it on the completion of static and dynamic domain diagrams. Our initial evaluation shows that such an approach is effective and can be integrated in different ways into the modeling activities.
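Few-shot prompt learning of this kind boils down to assembling a handful of solved examples ahead of the query. A minimal sketch, using a hypothetical textual encoding of partial domain models (not the paper's actual prompt format):

```python
def build_fewshot_prompt(examples, query):
    """Assemble a few-shot prompt: each example pairs a partial domain
    model with its completion, followed by the partial model to complete."""
    parts = []
    for partial, completion in examples:
        parts.append(f"Partial model: {partial}\nCompletion: {completion}")
    parts.append(f"Partial model: {query}\nCompletion:")
    return "\n\n".join(parts)

# Invented toy examples: class pairs and the association that completes them.
examples = [
    ("class Library; class Book", "Library 1--* Book"),
    ("class Order; class Item", "Order 1--* Item"),
]
prompt = build_fewshot_prompt(examples, "class Team; class Player")
```

The prompt string would then be sent to a large language model, whose continuation after the final `Completion:` is taken as the suggested model element.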

DOI: 10.1109/ICSE-NIER58687.2023.00008


Anti-Patterns (Smells) in Temporal Specifications

Authors: Ma’ayan, Dor and Maoz, Shahar and Ringert, Jan Oliver
Keywords: none

Abstract

Temporal specifications are essential inputs for verification and synthesis. Despite their importance, temporal specifications are challenging to write, which might limit their use by software engineers. To this day, almost no quality attributes of temporal specifications have been defined and investigated. Our work takes a first step toward exploring and improving the quality of temporal specifications by proposing a preliminary catalog of anti-patterns (a.k.a. smells). We base the catalog on our experience in developing and teaching temporal specifications for verification and synthesis. In addition, we examined publicly available specification repositories and relevant literature. Finally, we outline our future plans for a better understanding of what constitutes high-quality temporal specifications and the development of tools that will help engineers write them.

DOI: 10.1109/ICSE-NIER58687.2023.00009


Interpersonal Trust in OSS: Exploring Dimensions of Trust in GitHub Pull Requests

Authors: Sajadi, Amirali and Damevski, Kostadin and Chatterjee, Preetha
Keywords: pull requests, open source software, trust

Abstract

Interpersonal trust plays a crucial role in facilitating collaborative tasks, such as software development. While previous research recognizes the significance of trust in an organizational setting, there is a lack of understanding of how trust is exhibited in OSS distributed teams, where direct, in-person communication is absent. To foster trust and collaboration in OSS teams, we need to understand what trust is and how it is exhibited in written developer communications (e.g., pull requests, chats). In this paper, we first investigate various dimensions of trust to identify the ways trusting behavior can be observed in OSS. Next, we sample a set of 100 GitHub pull requests from Apache Software Foundation (ASF) projects to analyze and demonstrate how each dimension of trust can be exhibited. Our findings provide preliminary insights into cues that might be helpful to automatically assess team dynamics and establish interpersonal trust in OSS teams, leading to successful and sustainable OSS.

DOI: 10.1109/ICSE-NIER58687.2023.00010


The Risk-Taking Software Engineer: A Framed Portrait

Authors: Graf-Vlachy, Lorenz
Keywords: big five, five-factor model, personality, framing, risk-taking

Abstract

Background: Risk-taking is prevalent in a host of activities performed by software engineers on a daily basis, yet there is scant research on it. Aims and Method: We study if software engineers’ risk-taking is affected by framing effects and by software engineers’ personality. To this end, we perform a survey experiment with 124 software engineers. Results: We find that framing substantially affects their risk-taking. None of the “Big Five” personality traits are related to risk-taking in software engineers after correcting for multiple testing. Conclusions: Software engineers and their managers must be aware of framing effects and account for them properly.

DOI: 10.1109/ICSE-NIER58687.2023.00011


MLTEing Models: Negotiating, Evaluating, and Documenting Model and System Qualities

Authors: Maffey, Katherine R. and Dotterrer, Kyle and Niemann, Jennifer and Cruickshank, Iain and Lewis, Grace A. and Kästner, Christian
Keywords: responsible AI, machine learning evaluation, test and evaluation, machine learning

Abstract

Many organizations seek to ensure that machine learning (ML) and artificial intelligence (AI) systems work as intended in production but currently do not have a cohesive methodology in place to do so. To fill this gap, we propose MLTE (Machine Learning Test and Evaluation, colloquially referred to as “melt”), a framework and implementation to evaluate ML models and systems. The framework compiles state-of-the-art evaluation techniques into an organizational process for interdisciplinary teams, including model developers, software engineers, system owners, and other stakeholders. MLTE tooling supports this process by providing a domain-specific language that teams can use to express model requirements, an infrastructure to define, generate, and collect ML evaluation metrics, and the means to communicate results.
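A requirements-negotiation process like the one described can be pictured as a table of metric bounds checked against collected measurements. The sketch below is a loose illustration with invented metric names and thresholds, not MLTE's actual domain-specific language:

```python
# Hypothetical requirement spec: metric name -> (check, bound).
# In a real MLTE-style workflow, these would be negotiated by the team.
spec = {
    "accuracy":     (lambda v, b: v >= b, 0.90),   # model must be accurate
    "latency_ms":   (lambda v, b: v <= b, 50.0),   # and fast enough
    "fairness_gap": (lambda v, b: v <= b, 0.05),   # and fair across groups
}

def evaluate(spec, measurements):
    # Return a per-requirement pass/fail report for the collected metrics.
    report = {}
    for name, (check, bound) in spec.items():
        report[name] = check(measurements[name], bound)
    return report

results = evaluate(spec, {"accuracy": 0.93,
                          "latency_ms": 61.0,
                          "fairness_gap": 0.02})
```

A report of this shape gives all stakeholders, not just model developers, a shared artifact to discuss which system qualities are met and which are not.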

DOI: 10.1109/ICSE-NIER58687.2023.00012


Handling Communication via APIs for Microservices

Authors: Kanvar, Vini and Jain, Ridhi and Tamilselvam, Srikanth
Keywords: communication, JSON, microservice, monolith

Abstract

Enterprises, in their journey to the cloud, want to decompose their monolith applications into microservices to maximize cloud benefits. Current research focuses a lot on how to partition the monolith into smaller clusters that perform well across standard metrics like coupling and cohesion. However, little research has been done on taking the partitions, identifying the dependencies between the microservices, exploring ways to further reduce those dependencies, and making appropriate code changes to enable robust communication without changing the application behaviour. In this work, we discuss the challenges with the conventional technique of communication using JSON and propose an alternative way of ID-passing via APIs. We also devise an algorithm to reduce the number of APIs. For this, we construct subgraphs of methods and their associated variables in each class and relocate them to their more functionally aligned microservices. Our quantitative and qualitative studies on five public Java applications clearly demonstrate that our refactored microservices using IDs have decidedly better time and memory complexities than JSON. Our automation reduces 40–60% of the manual refactoring efforts.
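The contrast between the two communication styles can be sketched in a few lines: with JSON-passing the whole object is serialized across the service boundary, while with ID-passing only the identifier crosses and the owning service answers the specific question. All names below are invented for illustration:

```python
import json

# Hypothetical in-memory store owned by a "customer" microservice.
CUSTOMERS = {42: {"id": 42, "name": "Ada", "orders": [101, 102]}}

def get_customer_json(customer_id):
    # JSON-passing: the full object crosses the service boundary and
    # must be parsed (and kept consistent) by the caller.
    return json.dumps(CUSTOMERS[customer_id])

def get_customer_name(customer_id):
    # ID-passing: only the ID crosses; the owning service returns the
    # specific, typed answer, avoiding whole-object serialization.
    return CUSTOMERS[customer_id]["name"]

payload = get_customer_json(42)   # large payload, caller-side parsing
name = get_customer_name(42)      # small, direct answer
```

Exposing one fine-grained API per question can explode the API count, which is why the paper pairs ID-passing with an algorithm that relocates methods to reduce the number of required APIs.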

DOI: 10.1109/ICSE-NIER58687.2023.00013


Iterative Assessment and Improvement of DNN Operational Accuracy

Authors: Guerriero, Antonio and Pietrantuono, Roberto and Russo, Stefano
Keywords: accuracy improvement, accuracy assessment, deep neural networks

Abstract

Deep Neural Networks (DNN) are nowadays largely adopted in many application domains thanks to their human-like, or even superhuman, performance in specific tasks. However, due to unpredictable/unconsidered operating conditions, unexpected failures show up in the field, making the performance of a DNN in operation very different from the one estimated prior to release. In the life cycle of DNN systems, the assessment of accuracy is typically addressed in two ways: offline, via sampling of operational inputs, or online, via pseudo-oracles. The former is considered more expensive due to the need for manual labeling of the sampled inputs. The latter is automatic but less accurate. We believe that emerging iterative industrial-strength life cycle models for Machine Learning systems, like MLOps, offer the possibility to leverage inputs observed in operation not only to provide faithful estimates of a DNN's accuracy, but also to improve it through remodeling/retraining actions. We propose DAIC (DNN Assessment and Improvement Cycle), an approach which combines “low-cost” online pseudo-oracles and “high-cost” offline sampling techniques to estimate and improve the operational accuracy of a DNN across the iterations of its life cycle. Preliminary results show the benefits of combining the two approaches and integrating them in the DNN life cycle.
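The core idea of blending the two assessment sources can be sketched as a simple weighted combination of a cheap, noisy online estimate with a costly, accurate offline one. This is only a toy illustration under an invented blend rule, not the estimator DAIC actually uses:

```python
def combined_accuracy(pseudo_labels, true_sample, weight=0.5):
    """Blend a cheap online estimate (fraction of operational inputs on
    which the model agrees with a pseudo-oracle) with a costly offline
    estimate (accuracy on a small manually labeled sample). The fixed
    blend weight is illustrative only."""
    online = sum(pseudo_labels) / len(pseudo_labels)   # 1 = oracle agrees
    offline = sum(true_sample) / len(true_sample)      # 1 = label correct
    return weight * online + (1 - weight) * offline

# 80% pseudo-oracle agreement, 70% accuracy on the labeled sample.
estimate = combined_accuracy([1] * 8 + [0] * 2, [1] * 7 + [0] * 3)
```

In an MLOps loop, such an estimate would be recomputed every iteration, and the inputs that drove it down would feed the next retraining round.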

DOI: 10.1109/ICSE-NIER58687.2023.00014


Toward Gaze-Assisted Developer Tools

Authors: Kuang, Peng and Söderberg, Emma and Niehorster, Diederick C. and Höst, Martin
Keywords: mapping study, software engineering, developer tools, gaze behavior, eye tracking

Abstract

Many crucial activities in software development are linked to gaze and can potentially benefit from gaze-assisted developer tools. However, despite the maturity of eye trackers and the potential for such tools, we see very few studies of practitioners. Here, we present a systematic mapping study to examine recent developments in the field with a focus on the experimental setup of eye-tracking studies in software engineering research. We identify two gaps regarding studies of practitioners in realistic settings and three challenges in existing experimental setups. We present six recommendations for how to steer the research community toward gaze-assisted developer tools that can benefit practitioners.

DOI: 10.1109/ICSE-NIER58687.2023.00015


Under the Bridge: Trolling and the Challenges of Recruiting Software Developers for Empirical Research Studies

Authors: Kokinda, Ella and Moster, Makayla and Dominic, James and Rodeghero, Paige
Keywords: research methods, software engineering research, meta-research, software development

Abstract

Much of software engineering research focuses on tools, algorithms, and optimization of software. Recently, we, as a community, have come to acknowledge that there is a gap in meta-research and in addressing the human factors in software engineering research. Through meta-research, we aim to deepen our understanding of online participant recruitment and human-subjects software engineering research. In this paper, we motivate the need to consider the unique challenges that human studies pose in software engineering research. We present challenges faced by our research team across several distinct research studies, describe how they affected the research, and motivate how, as researchers, we can address these challenges. We present results from a pilot study and categorize the issues faced into three broad categories: participant recruitment, community engagement, and data poisoning. We further discuss how we can address these challenges and outline the benefits a full study could provide to the software engineering research community.

DOI: 10.1109/ICSE-NIER58687.2023.00016


On ML-Based Program Translation: Perils and Promises

Authors: Malyala, Aniketh and Zhou, Katelyn and Ray, Baishakhi and Chakraborty, Saikat
Keywords: program transformation, code translation, code generation

Abstract

With the advent of new and advanced programming languages, it becomes imperative to migrate legacy software to new programming languages. Unsupervised Machine Learning-based Program Translation could play an essential role in such migration, even without a sufficiently sizeable reliable corpus of parallel source code. However, these translators are far from perfect due to their statistical nature. This work investigates unsupervised program translators and where and why they fail. With in-depth error analysis of such failures, we have identified that the cases where such translators fail follow a few particular patterns. With this insight, we develop a rule-based program mutation engine, which pre-processes the input code if the input follows specific patterns and post-processes the output if the output follows certain patterns. We show that our code processing tool, in conjunction with the program translator, can form a hybrid program translator and significantly improve the state of the art. In the future, we envision an end-to-end program translation tool where programming domain knowledge can be embedded into an ML-based translation pipeline using pre- and post-processing steps.
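The pre-process/translate/post-process pipeline can be sketched with toy rules. Everything here is hypothetical: the rewrite patterns are invented, and the "translator" is a stand-in string transformation, not a real ML model:

```python
import re

def preprocess(java_src):
    # Hypothetical rule: expand compound assignment, a pattern our
    # imaginary translator is assumed to mistranslate.
    return re.sub(r"(\w+)\s*\+=\s*", r"\1 = \1 + ", java_src)

def fake_translate(java_src):
    # Stand-in for an unsupervised ML translator (Java -> Python):
    # drop type keywords and turn statement separators into newlines.
    return java_src.replace("int ", "").replace("; ", "\n").replace(";", "")

def postprocess(py_src):
    # Hypothetical rule: strip stray braces the translator leaves behind.
    return py_src.replace("{", "").replace("}", "").strip()

hybrid = postprocess(fake_translate(preprocess("int x = 1; x += 2;")))
```

The hybrid output here is executable Python (`x = 1` then `x = x + 2`), whereas the untreated input pattern would have confused the translator; that is the division of labor between deterministic rules and the statistical model.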

DOI: 10.1109/ICSE-NIER58687.2023.00017


Reasoning-Based Software Testing

Authors: Giamattei, Luca and Pietrantuono, Roberto and Russo, Stefano
Keywords: software testing, causal reasoning

Abstract

With software systems becoming increasingly pervasive and autonomous, our ability to test for their quality is severely challenged. Many systems are called to operate in uncertain and highly changing environments, and are often required to make intelligent decisions by themselves. This easily results in an intractable state space to explore at testing time. State-of-the-art techniques try to keep pace, e.g., by augmenting the tester's intuition with some form of (explicit or implicit) learning from observations to search this space efficiently. For instance, they exploit historical data to drive the search (e.g., ML-driven testing) or the test execution data itself (e.g., adaptive or search-based testing). Despite the indubitable advances, the need to smarten the search of such a huge space remains pressing. We introduce Reasoning-Based Software Testing (RBST), a new way of thinking about the testing problem as a causal reasoning task. Compared to mere intuition-based or state-of-the-art learning-based strategies, we claim that causal reasoning more naturally emulates the process that a human would follow to “smartly” search the space. RBST aims to mimic and amplify, with the power of computation, this ability. The conceptual leap can pave the way for a new trend of techniques, which can be variously instantiated from the proposed framework by exploiting the numerous tools for causal discovery and inference. Preliminary results reported in this paper are promising.

DOI: 10.1109/ICSE-NIER58687.2023.00018


Safe-DS: A Domain Specific Language to Make Data Science Safe

Authors: Reimann, Lars and Kniesel-Wünsche, Günter
Keywords: domain specific language, schema types, refined types, static safety, machine learning, data science

Abstract

Due to the long runtime of Data Science (DS) pipelines, even small programming mistakes can be very costly if they are not detected statically. However, even basic static type checking of DS pipelines is difficult because most are written in Python. Static typing is available in Python only via external linters. These require static type annotations for parameters or results of functions, which many DS libraries do not provide. In this paper, we show how the wealth of Python DS libraries can be used in a statically safe way via Safe-DS, a domain specific language (DSL) for DS. Safe-DS catches conventional type errors plus errors related to range restrictions, data manipulation, and call order of functions, going well beyond the abilities of current Python linters. Python libraries are integrated into Safe-DS via a stub language for specifying the interface of their declarations, and an API-Editor that is able to extract type information from the code and documentation of Python libraries and automatically generate suitable stubs. Moreover, Safe-DS complements textual DS pipelines with a graphical representation that eases safe development by preventing syntax errors. The seamless synchronization of the textual and graphical views lets developers always choose the one best suited for their skills and current task. We think that Safe-DS can make DS development easier, faster, and more reliable, significantly reducing development costs.
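To make the "range restriction" class of errors concrete, the check below enforces at runtime what Safe-DS aims to verify statically before the pipeline ever runs. The decorator, function, and parameter names are invented for illustration:

```python
def in_range(param, low, high):
    """Decorator enforcing a range restriction on one keyword parameter;
    a runtime sketch of a check that Safe-DS performs statically."""
    def wrap(fn):
        def inner(**kwargs):
            value = kwargs[param]
            if not (low <= value <= high):
                raise ValueError(f"{param}={value} outside [{low}, {high}]")
            return fn(**kwargs)
        return inner
    return wrap

@in_range("test_fraction", 0.0, 1.0)
def split_dataset(rows, test_fraction):
    # Hypothetical pipeline step: split rows into train and test parts.
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

train, test = split_dataset(rows=list(range(10)), test_fraction=0.2)
```

The payoff of doing this statically is that `test_fraction=1.5` would be rejected at edit time, not hours into a long-running pipeline.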

DOI: 10.1109/ICSE-NIER58687.2023.00019


Rapid Development of Compositional AI

Authors: Martie, Lee and Rosenberg, Jessie and Demers, Veronique and Zhang, Gaoyuan and Bhardwaj, Onkar and Henning, John and Prasad, Aditya and Stallone, Matt and Lee, Ja Young and Yip, Lucy and Adesina, Damilola and Paikari, Elahe and Resendiz, Oscar and Shaw, Sarah and Cox, David
Keywords: framework, artificial intelligence, compositional, agile, rapid development

Abstract

Compositional AI systems, which combine multiple artificial intelligence components together with other application components to solve a larger problem, have no known pattern of development and are often approached in a bespoke and ad hoc style. This makes development slower and harder to reuse for future applications. To support the full rapid development cycle of compositional AI applications, we have developed a novel framework called (Bee)* (written as a regular expression and pronounced as “beestar”). We illustrate how (Bee)* supports building integrated, scalable, and interactive compositional AI applications with a simplified developer experience.

DOI: 10.1109/ICSE-NIER58687.2023.00020


A Novel and Pragmatic Scenario Modeling Framework with Verification-in-the-Loop for Autonomous Driving Systems

Authors: Du, Dehui and Li, Bo and Zheng, Chenghang and Zhang, Xinyuan
Keywords: UPPAAL-SMC, scenario simulation, domain specific modeling language, scenario modeling, ADS

Abstract

Scenario modeling for Autonomous Driving Systems (ADS) enables scenario-based simulation and verification, which are critical for the development of safe ADS. However, with the increasing complexity and uncertainty of ADS, it becomes increasingly challenging to manually model driving scenarios and conduct verification analysis. To tackle these challenges, we propose a novel and pragmatic framework for scenario modeling, simulation, and verification. Its novelty lies in being a verification-in-the-loop scenario modeling framework. The scenario modeling language, with formal semantics, is proposed based on the domain knowledge of ADS; it facilitates scenario verification to analyze the safety of scenario models. Moreover, scenario simulation is implemented on top of the scenario executor. Compared with existing works, our framework can simplify the description of scenarios in a non-programming, user-friendly manner, model the stochastic behavior of vehicles, support safety verification of scenario models with UPPAAL-SMC, and generate executable scenarios for open-source simulators such as CARLA. To preliminarily demonstrate the effectiveness and feasibility of our approach, we build a prototype tool and apply our approach to several typical ADS scenarios.

DOI: 10.1109/ICSE-NIER58687.2023.00021


Towards Human-Centred Crowd Computing: Software for Better Use of Computational Resources

Authors: Fernando, Niroshinie and Arora, Chetan and Loke, Seng W. and Alam, Lubna and La Macchia, Stephen and Graesser, Helen
Keywords: requirements engineering, mobile applications, crowd computing

Abstract

Internet-connected smart devices are increasing at an exponential rate. These powerful devices have created a yet-untapped pool of idle resources that can be utilised, among others, for processing data in resource-depleted environments. The idea of bringing together a pool of smart devices for “crowd computing” (CC) has been studied in the recent past from an infrastructural feasibility perspective. However, for the CC paradigm to be successful, numerous socio-technical and software engineering (SE) factors, specifically requirements engineering (RE)-related factors, are at play and have not been investigated in the literature. In this paper, we motivate the SE-related aspects of CC and the ideas for implementing the mobile apps required for CC scenarios. We present the results of a preliminary study on understanding the human aspects, the incentives that motivate users, and CC app requirements, and we present our future development plan in this relatively new field of research for SE applications.

DOI: 10.1109/ICSE-NIER58687.2023.00022


Auto-Logging: AI-Centred Logging Instrumentation

Authors: Bogatinovski, Jasmin and Kao, Odej
Keywords: AIOps, logging, software engineering

Abstract

Logging in software development plays a crucial role in bug fixing, maintaining the code, and operating the application. Logs are hints created by human software developers that aim to help developers and operators identify root causes for application bugs or other types of misbehaviour. They also serve as a bridge between the Devs and the Ops, allowing the exchange of information. The rise of the DevOps paradigm with CI/CD pipelines led to a significantly higher number of deployments per month and consequently increased logging requirements. In response, AI-enabled methods for IT operation (AIOps) have been introduced to automate testing and run-time fault tolerance to a certain extent. However, using logs tailored for human understanding to train (automatic) AI methods poses an ill-defined problem: AI algorithms need no hints but structured, precise, and indicative data. Until now, AIOps researchers have adapted AI algorithms to the properties of the existing human-centred data (e.g., log sentiment), which are not always trivial to model. By pointing out this discrepancy, we envision an alternative approach: logging can be adapted such that the produced logs are better tailored towards the strengths of AI-enabled methods. In response, in this vision paper, we introduce auto-logging, the idea of automatically inserting log instructions into code such that they better suit AI-enabled methods as the end consumers of logs.
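A minimal mechanical sketch of automatic log placement: rewrite a function's syntax tree so that every function entry emits a structured, machine-oriented record. This is a toy using Python's `ast` module, not the paper's envisioned tooling:

```python
import ast

class AutoLogger(ast.NodeTransformer):
    """Insert a structured log instruction at the entry of every function:
    a toy version of automatic, AI-oriented log placement."""
    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # also instrument nested functions
        stmt = ast.parse(f"_LOG.append(('enter', '{node.name}'))").body[0]
        node.body.insert(0, stmt)
        return node

source = "def add(a, b):\n    return a + b\n"
tree = AutoLogger().visit(ast.parse(source))
ast.fix_missing_locations(tree)

namespace = {"_LOG": []}
exec(compile(tree, "<auto-logged>", "exec"), namespace)
result = namespace["add"](2, 3)
entries = namespace["_LOG"]
```

The instrumented function behaves as before but leaves a structured trace (`('enter', 'add')`) that a downstream AIOps model can consume without natural-language parsing.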

DOI: 10.1109/ICSE-NIER58687.2023.00023


Towards Supporting Emotion Awareness in Retrospective Meetings

Authors: Grassi, Daniela and Lanubile, Filippo and Novielli, Nicole and Serebrenik, Alexander
Keywords: visualization, biometric sensors, retrospective meetings, agile teams, emotion awareness

Abstract

Emotion awareness is a key antecedent to team effectiveness and the use of biometrics can help software developers in gaining awareness of emotions at the individual and team level. In this paper, we propose an approach to include emotional feedback in agile retrospective meetings as a proxy to identify developers’ feelings in association with the activity performed by the team. As a proof of concept, we developed an emotion visualization tool that provides an integrated visualization of self-reported emotions, activities, and biometrics. We run a pilot study to evaluate our approach with the agile retrospective meetings of a software engineering capstone project. The preliminary findings suggest that integrated emotion visualization can be useful to inform discussion and reflection around the potential causes of unhappiness, thus triggering actionable insights that could enhance team productivity and improve collaboration.

DOI: 10.1109/ICSE-NIER58687.2023.00024


Test-Driven Development Benefits beyond Design Quality: Flow State and Developer Experience

Authors: Calais, Pedro and Franzini, Lissa
Keywords: flow state, developer experience, TDD

Abstract

Test-driven development (TDD) is a coding technique that combines design and testing in an iterative and incremental fashion. It prescribes that tests written before the production code help the developer to find good interfaces and to evolve the design safely and incrementally. Improvements in the design of code produced by the test-driven development approach have been extensively evaluated in the literature; in this research, we focus on seeking explanations for the benefits of TDD in another dimension which we believe has been undervalued: developer experience. We identified that there is a natural connection between the TDD approach and flow state, a well-known mental state characterized by total immersion, focus, and involvement in a task that promotes increased enjoyment and productivity. We present evidence that the continuous stream of mini-scoped, short-lived red-green-refactor cycles of TDD frames the development task as a structure that creates the preconditions reported by neuroscience research to produce flow state, namely (1) clear goals, (2) unambiguous feedback, (3) challenge-skill balance, and (4) sense of control. Our work contributes to increasing the understanding of why adopting practices such as TDD can benefit the software development process as a whole and can support its adoption in software development projects.
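One red-green-refactor cycle can be shown in a dozen lines. The example function (`slugify`) is invented; the point is the ordering: failing test first, minimal implementation, then restructuring under the test's protection, which supplies the clear goal and unambiguous feedback the abstract mentions:

```python
import unittest

# Red: the test is written first, against code that does not exist yet.
class TestSlugify(unittest.TestCase):
    def test_spaces_become_dashes(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

# Green: the minimal implementation that makes the test pass.
def slugify(title):
    return title.lower().replace(" ", "-")

# Refactor: with the green test as a safety net, restructure freely
# (this version also collapses repeated whitespace).
def slugify(title):
    return "-".join(title.lower().split())

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSlugify)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each cycle is small enough to finish in minutes, which is what gives TDD its steady rhythm of goal, feedback, and sense of control.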

DOI: 10.1109/ICSE-NIER58687.2023.00025


Performance Analysis with Bayesian Inference

Authors: Couderc, Noric and Reichenbach, Christoph and Söderberg, Emma
Keywords: ANOVA, performance analysis, bayesian inference

Abstract

Statistics are part of any empirical science, and performance analysis is no exception. However, for non-statisticians, picking the right statistical tool to answer a research question can be challenging; each statistical tool comes with a set of assumptions, and it is not clear to researchers what happens when those assumptions are violated. Bayesian statistics offers a framework with more flexibility and with explicit assumptions. In this paper, we present a method to analyse benchmark results using Bayesian inference. We demonstrate how to perform a Bayesian analysis of variance (ANOVA) to estimate what factors matter most for performance, and describe how to investigate what factors affect the impact of optimizations. We find the Bayesian model more flexible, and the Bayesian ANOVA’s output easier to interpret.
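The flavor of Bayesian estimation the paper advocates can be illustrated with the simplest possible case: a conjugate normal-normal model of a benchmark configuration's true mean runtime, where a weak prior is pulled toward the observed data in proportion to the number of measurements. This is a hand-rolled sketch with invented priors and data, far simpler than a full Bayesian ANOVA:

```python
def posterior_mean(samples, prior_mean=0.0, prior_var=100.0, noise_var=1.0):
    """Posterior mean of a group's true runtime under a normal-normal
    model with known noise variance: a precision-weighted average of
    the prior and the sample mean. Priors here are illustrative."""
    n = len(samples)
    sample_mean = sum(samples) / n
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + n * sample_mean / noise_var) / precision

fast = posterior_mean([10.1, 9.9, 10.0, 10.2])   # runtimes of config A (ms)
slow = posterior_mean([12.0, 11.8, 12.2, 12.1])  # runtimes of config B (ms)
```

A Bayesian ANOVA extends this idea to several factors at once, yielding full posterior distributions over each factor's contribution rather than a single p-value, which is what makes the output easier to interpret.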

DOI: 10.1109/ICSE-NIER58687.2023.00026


Judging Adam: Studying the Performance of Optimization Methods on ML4SE Tasks

Authors: Pasechnyuk, Dmitry and Prazdnichnykh, Anton and Evtikhiev, Mikhail and Bryksin, Timofey
Keywords: none

Abstract

Solving a problem with a deep learning model requires researchers to optimize the loss function with a certain optimization method. The research community has developed more than a hundred different optimizers, yet there is scarce data on optimizer performance across tasks. In particular, none of the existing benchmarks test the performance of optimizers on source code-related problems. However, existing benchmark data indicates that certain optimizers may be more efficient for particular domains. In this work, we test the performance of various optimizers on deep learning models for source code and find that the choice of an optimizer can have a significant impact on model quality, with up to two-fold score differences between some of the relatively well-performing optimizers. We also find that the RAdam optimizer (and its modification with the Lookahead envelope) almost always performs well on the tasks we consider. Our findings show a need for a more extensive study of optimizers in code-related tasks, and indicate that the ML4SE community should consider using RAdam instead of Adam as the default optimizer for code-related deep learning tasks.
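Why the optimizer matters at all can be seen on a toy problem: plain gradient descent and heavy-ball momentum (one ingredient of Adam-family methods) take visibly different trajectories to the same minimum, and their behavior depends on hyperparameters in different ways. This is a didactic sketch on a one-dimensional quadratic, not a benchmark of RAdam or Adam:

```python
def gradient_descent(grad, x0, lr=0.1, steps=300):
    # Plain gradient descent: step against the gradient.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def momentum(grad, x0, lr=0.1, beta=0.9, steps=300):
    # Heavy-ball momentum: accumulate a velocity term, one ingredient
    # of Adam-family optimizers.
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)
        x -= lr * v
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2 * (x - 3).
grad = lambda x: 2.0 * (x - 3.0)
x_gd = gradient_descent(grad, 10.0)
x_mom = momentum(grad, 10.0)
```

On high-dimensional, non-convex code-model losses these dynamical differences compound, which is how the paper's two-fold quality gaps between optimizers can arise.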

DOI: 10.1109/ICSE-NIER58687.2023.00027


Continuously Accelerating Research

Authors: Barr, Earl and Bell, Jonathan and Hilton, Michael and Mechtaev, Sergey and Timperley, Christopher
Keywords: containers, scientific software, continuous integration, artifact evaluation, reproducibility

Abstract

Science is facing a software reproducibility crisis. Software powers experimentation and fuels insights, yielding new scientific contributions. Yet research software is often difficult for other researchers to reproducibly run. Beyond reproduction, research software that is truly reusable will speed up science by allowing other researchers to easily build upon and extend prior work. As software engineering researchers, we believe that it is our duty to create tools and processes that instill reproducibility, reusability, and extensibility into research software. This paper outlines a vision for a community infrastructure that will bring the benefits of continuous integration to scientists developing research software. To persuade researchers to adopt this infrastructure, we will appeal to their self-interest by making it easier for them to develop and evaluate research prototypes. Building better research software is a complex socio-technical problem that requires stakeholders to join forces to solve it for the software engineering community and the greater scientific community. This vision paper outlines an agenda for realizing a world where the reproducibility and reusability barriers in research software are lifted, continuously accelerating research.

DOI: 10.1109/ICSE-NIER58687.2023.00028


An Alternative to Cells for Selective Execution of Data Science Pipelines

Authors: Reimann, Lars and Kniesel-Wünsche, Günter
Keywords: machine learning, data science, usability, notebook

Abstract

Data scientists often use notebooks to develop Data Science (DS) pipelines, particularly since they allow them to selectively execute parts of the pipeline. However, notebooks for DS have many well-known flaws, of which we focus on the following in this paper: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g. listing the columns of a tabular dataset). (2) While users are allowed to execute cells in any order, not every ordering is correct, because a cell can depend on declarations from other cells. (3) After making changes to a cell, this cell and all cells that depend on changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Since cells are the smallest unit of execution, code that is unaffected by changes can inadvertently be re-executed. To solve these issues, we propose to replace cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for variables with actions fitting their type (like listing columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis to ensure dependencies between variables are respected and results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of making errors.
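The data-flow analysis at the heart of the proposal amounts to tracking which values read which others, then invalidating everything downstream of a change. A minimal sketch with an invented pipeline (the value names are hypothetical):

```python
# Dependency graph of a hypothetical pipeline: value -> values it reads.
deps = {
    "raw":      [],
    "cleaned":  ["raw"],
    "features": ["cleaned"],
    "report":   ["cleaned"],
    "model":    ["features"],
}

def affected(changed, deps):
    """All values that must be recomputed after `changed` changes:
    everything reachable from it in the reversed dependency graph."""
    rev = {k: [] for k in deps}
    for node, reads in deps.items():
        for r in reads:
            rev[r].append(node)
    stale, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for succ in rev[node]:
            if succ not in stale:
                stale.add(succ)
                stack.append(succ)
    return stale

stale = affected("cleaned", deps)
```

Note that `raw` is not in the stale set: unlike cell-based re-execution, only genuinely affected values are recomputed, which is exactly flaw (5) solved.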

DOI: 10.1109/ICSE-NIER58687.2023.00029


Assurance Case Development as Data: A Manifesto

Authors: Menghi, Claudio and Viger, Torin and Di Sandro, Alessio and Rees, Chris and Joyce, Jeff and Chechik, Marsha
Keywords: recommendations, data analysis, assurance cases, safety

Abstract

Safety problems can be costly and catastrophic. Engineers typically rely on assurance cases to ensure their systems are adequately safe. Building safe software systems requires engineers to iteratively design, analyze, and refine assurance cases until sufficient safety evidence is identified. Assurance case development is typically manual, time-consuming, and far from straightforward. This paper presents a manifesto for our forward-looking idea: using assurance cases as data. We argue that engineers produce a lot of data during the assurance case development process, and such data can be collected and used to effectively improve this process. Therefore, in this manifesto, we propose to monitor assurance case development activities, treat assurance cases as data, and learn suggestions that help safety engineers design safer systems.

DOI: 10.1109/ICSE-NIER58687.2023.00030


How Does Quality Deviate in Stable Releases by Backporting?

作者: Tasnim, Jarin and Chakroborti, Debasish and Roy, Chanchal K. and Schneider, Kevin A.
关键词: software maintenance, software engineering, backporting, code quality

Abstract

Software undergoes continuous evolution over its life cycle to fix bugs and adopt enhanced features. However, many industrial users are reluctant to upgrade to the latest version, preferring the stability and familiarity of the release they are using. This boosts the need to propagate change patches from state-of-the-art versions back to older software versions. In industrial settings, this phenomenon is frequently supported by ‘Backporting’, since the intent of backward patch propagation is principally to sustain older releases, and the contribution does not count toward the upstream repository. However, it is yet unknown whether backporting can pose a credible threat to stable releases. In this study, we empirically investigate backports to reveal the evolution trend of code entities through maintenance and pinpoint how backports pull stable releases into a weak spectrum. The breakdown shows that code entities often undergo gradual transformation in size, complexity, and coupling due to consecutive commits on them. However, the number of outlier quality degradations is far from insignificant in this context, which calls for further investigation into why and when they may occur. Moreover, we observed that vulnerable change transmission often materializes alongside quality degradation. Understanding these issues and consequences is crucial for effectively supporting the backporting process for stable release maintenance.

DOI: 10.1109/ICSE-NIER58687.2023.00031


Understanding Inconsistency in Azure Cosmos DB with TLA+

作者: Hackett, Finn and Rowe, Joshua and Kuppe, Markus Alexander
关键词: cloud computing, formal methods, model checking

Abstract

Beyond implementation correctness of a distributed system, it is equally important to understand exactly what users should expect to see from that system. Even if the system itself works as designed, insufficient understanding of its user-observable semantics can cause bugs in its dependencies. By focusing a formal specification effort on precisely defining the expected user-observable behaviors of the Azure Cosmos DB service at Microsoft, we were able to write a formal specification of the database that was significantly smaller and conceptually simpler than any other specification of Cosmos DB, while representing a wider range of valid user-observable behaviors than existing more detailed specifications. Many of the additional behaviors we documented were previously poorly understood outside of the Cosmos DB development team, even informally, leading to data consistency errors in Microsoft products that depend on it. Using this specification, we were able to raise two key issues in Cosmos DB’s public-facing documentation, which have since been addressed. We were also able to offer a fundamental solution to a previous high-impact outage within another Azure service that depends on Cosmos DB.
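To make "user-observable semantics" concrete, here is a small generic sketch (not drawn from the Cosmos DB specification, which is written in TLA+; the trace format and version model are assumptions): checking whether one client session's observed reads satisfy monotonic reads, a typical user-facing consistency guarantee, assuming versions are totally ordered integers.

```python
# Illustrative sketch: validating one user-observable consistency property,
# monotonic reads, over a single client session's observed trace.

def is_monotonic_reads(trace):
    """trace: list of (key, version) reads observed by one client session."""
    last_seen = {}
    for key, version in trace:
        if version < last_seen.get(key, version):
            return False          # the session observed a value "go backwards"
        last_seen[key] = version
    return True

ok  = [("x", 1), ("y", 3), ("x", 2)]
bad = [("x", 2), ("x", 1)]        # x regresses: violates monotonic reads
print(is_monotonic_reads(ok), is_monotonic_reads(bad))  # True False
```

A formal specification generalizes this idea from one property on one trace to the full set of behaviors users may observe.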

DOI: 10.1109/ICSE-SEIP58684.2023.00006


Scaling Web API Integrations

作者: Chari, Guido and Sheffer, Brandon and Branavan, S. R. K. and D’Ippolito, Nicolás
关键词: No keywords

Abstract

In ASAPP, a company that offers AI solutions to enterprise customers, internal services consume data from our customers’ web APIs. Implementing and maintaining integrations between our customers’ APIs and internal services is a major effort for the company. In this paper, we present a scalable approach for integrating web APIs in enterprise software that is lightweight and semi-automatic. It leverages a combination of Ontology-Based Data Access architectures (OBDA), a Domain Specific Language (DSL) called IBL, Natural Language Processing (NLP) models, and Automated Planning techniques. The OBDA architecture decouples our platform from our customers’ APIs via an ontology that acts as a single internal data access point. IBL is a functional and graphical DSL that enables domain experts to implement integrations, even if they don’t have software development expertise. To reduce the effort of manually writing the IBL code, an NLP model suggests correspondences from each web API to the ontology. Given the API, ontology, and selected mappings for a set of desired fields from the ontology, we define an Automated Planning problem. The resulting policy is finally fed to a code synthesizer that generates the appropriate IBL method implementing the desired integration. This approach has been in production in ASAPP for 2 years with more than 300 integrations already implemented. Results indicate a ≈ 50% reduction in effort due to implementing integrations with IBL. Preliminary results on the IBL automatic code generation show an encouraging further ≈ 25% reduction so far.

DOI: 10.1109/ICSE-SEIP58684.2023.00007


DAppHunter: Identifying Inconsistent Behaviors of Blockchain-Based Decentralized Applications

作者: Zhou, Jianfei and Jiang, Tianxing and Wang, Haijun and Wu, Meng and Chen, Ting
关键词: blockchain, smart contract, DApp testing, inconsistent behavior

Abstract

A blockchain-based decentralized application (DApp) refers to an application typically using web pages or mobile applications as the front-end and smart contracts as the back-end. The front-end of the DApp helps users generate transactions and send them to the user’s blockchain wallet. After the user signs and confirms the transaction using the blockchain wallet, the transaction will invoke the smart contract of the DApp. However, users bear the following risks when using DApps because of potential inconsistent behaviors. First, the DApp front-end may generate incorrect transactions inconsistent with users’ intentions. Second, the smart contract may misbehave when executing the transactions. Inconsistent behaviors of DApps not only lead to user confusion but also cause significant financial losses. In this paper, we propose a novel approach to identify inconsistent behaviors of DApps on EVM-compatible blockchains by contrasting the behaviors of DApps derived from the front-end, blockchain wallet, and smart contracts, respectively. We implemented our approach in a prototype named DAppHunter. We have applied DAppHunter to 92 real-world DApps of Ethereum and Binance Smart Chain and successfully identified 37 DApps with inconsistent behaviors. We confirmed that 35 of them are scam DApps and that over 5 million blockchain addresses are at risk of becoming victims of these inconsistent DApps.
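The core contrasting step can be sketched as follows (the transaction fields, addresses, and the `find_inconsistencies` helper are hypothetical, not DAppHunter's actual representation): compare the transaction the front-end hands to the wallet against the action the user intended, and flag any deviating field.

```python
# Illustrative sketch: contrast a user's intended action with the transaction
# a DApp front-end actually produced, field by field.

def find_inconsistencies(intended, wallet_tx,
                         fields=("to", "value", "selector")):
    """Return the fields where the wallet transaction deviates from intent."""
    return [f for f in fields if intended.get(f) != wallet_tx.get(f)]

intended  = {"to": "0xDEX",  "value": 10, "selector": "swap"}
wallet_tx = {"to": "0xSCAM", "value": 10, "selector": "swap"}

# The front-end silently redirected funds to a different recipient.
print(find_inconsistencies(intended, wallet_tx))  # ['to']
```

The real system must additionally recover the user's intent from the UI and decode the on-chain call, which is where most of the engineering effort lies.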

DOI: 10.1109/ICSE-SEIP58684.2023.00008


Evolutionary Approach for Concurrency Testing of Ripple Blockchain Consensus Algorithm

作者: van Meerten, Martijn and Ozkan, Burcu Kulahcioglu and Panichella, Annibale
关键词: ripple, blockchains, distributed systems, consensus, concurrency, evolutionary algorithms, software testing

Abstract

Blockchain systems are prone to concurrency bugs due to the nondeterminism in the delivery order of messages between the distributed nodes. These bugs are hard to detect since they can only be triggered by a specific order or timing of concurrent events in the execution. Systematic concurrency testing techniques, which explore all possible delivery orderings of messages to uncover concurrency bugs, are not scalable to large distributed systems such as blockchains. Random concurrency testing methods search for bugs in a randomly generated set of executions and offer a practical testing method. In this paper, we investigate the effectiveness of random concurrency testing on blockchain systems using a case study on the XRP Ledger of the Ripple blockchain, which maintains one of the most popular cryptocurrencies in the market today. We test the Ripple consensus algorithm of the XRP Ledger by exploring different delivery orderings of consensus protocol messages. Moreover, we design an evolutionary algorithm to guide the random test case generation toward certain system behaviors to discover concurrency bugs more efficiently. Our case study shows that random concurrency testing is effective at detecting concurrency bugs in blockchains, and the evolutionary approach for test generation improves test efficiency. Our experiments could successfully detect the bugs we seeded in the Ripple source code. Moreover, we discovered a previously unknown concurrency bug in the production implementation of Ripple.
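The evolutionary idea can be sketched in miniature (the message names, fitness function, and operators are invented for illustration; the paper's fitness targets real protocol behaviors): candidates are delivery orderings of protocol messages, and the search is steered toward rarely exercised interleavings.

```python
import random

# Toy sketch of evolutionary schedule search over message delivery orderings.
# Fitness here rewards orderings that deliver hypothetical "validation"
# messages before "proposal" ones, a stand-in for a behavioral coverage goal.

MESSAGES = ["p1", "p2", "v1", "v2"]

def fitness(order):
    # count validation messages delivered before each proposal message
    return sum(order.index(v) < order.index(p)
               for v in ("v1", "v2") for p in ("p1", "p2"))

def evolve(generations=30, pop_size=8, rng=None):
    rng = rng or random.Random(0)       # fixed seed for reproducibility
    pop = [rng.sample(MESSAGES, len(MESSAGES)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        for parent in survivors:                  # mutate: swap two positions
            child = parent[:]
            i, j = rng.sample(range(len(child)), 2)
            child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

In the actual work, the fitness function is derived from observed system behaviors during test execution rather than a fixed target ordering.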

DOI: 10.1109/ICSE-SEIP58684.2023.00009


A Model for Understanding and Reducing Developer Burnout

作者: Trinkenreich, Bianca and Stol, Klaas-Jan and Steinmacher, Igor and Gerosa, Marco A. and Sarma, Anita and Lara, Marcelo and Feathers, Michael and Ross, Nicholas and Bishop, Kevin
关键词: job burnout, work satisfaction, culture, belonging, inclusiveness

Abstract

Job burnout is a type of work-related stress associated with a state of physical or emotional exhaustion that also involves a sense of reduced accomplishment and loss of personal identity. Burnout can affect one’s physical and mental health, has become a leading industry concern, and can result in high workforce turnover. Through an empirical study at Globant, a large multi-national company, we created a theoretical model to evaluate the complex interplay among organizational culture, work satisfaction, and team climate, and how they impact developer burnout. We conducted a survey of developers in software delivery teams (n=3,281) to test our model and analyzed the data using structural equation modeling, moderation, and multi-group analysis. Our results show that Organizational Culture, Climate for Learning, Sense of Belonging, and Inclusiveness are positively associated with Work Satisfaction, which in turn is associated with Reduced Burnout. Our model, generated through a large-scale survey, can guide organizations in how to reduce workforce burnout by creating a climate for learning, inclusiveness in teams, and a generative organizational culture where new ideas are welcome, information is actively sought, and bad news can be shared without fear.

DOI: 10.1109/ICSE-SEIP58684.2023.00010


A Model-Based, Quality Attribute-Guided Architecture Re-Design Process at Google

作者: Jia, Qin and Cai, Yuanfang and \c{C
关键词: software architecture, software modeling, quality attribute

Abstract

Communicating and justifying design decisions are difficult, especially when the architecture design has to evolve. In this paper, we report our experiences of using formal but lightweight design models to communicate, justify, and analyze the quality trade-offs of an architecture revision plan for Monarch, a large-scale legacy system from Google. We started from a few critical user scenarios and their associated quality attribute scenarios, which makes these models lightweight and concise, expressing high-level abstractions only. We also separated static views from dynamic views so that each diagram can be precise and suitable for analyzing different types of quality attributes. The combination of scenarios, quality attributes, and lightweight modeling was well accepted by the team as an effective way to analyze and communicate the tradeoffs. A few days after we presented and shared this process, two new projects within the Monarch team adopted component and sequence diagrams in their design documents, and two other product areas within Google started to learn and adopt the process as well. Our experience indicates that these architecture modeling and analysis techniques can be integrated into the software development process to communicate and assess features, quality attributes, and design decisions continuously and iteratively.

DOI: 10.1109/ICSE-SEIP58684.2023.00011


An Empirical Comparison on the Results of Different Clone Detection Setups for C-Based Projects

作者: Zhou, Yan and Chen, Jinfu and Shi, Yong and Chen, Boyuan and Jiang, Zhen Ming (Jack)
关键词: code clone, clone detection

Abstract

Code clones have been used in many different software maintenance and evaluation tasks in practice (e.g., change proportion and evolution, refactoring, and vulnerability management). There are many clone detection techniques (e.g., text-based, token-based, and AST-based) which can detect code clones not only at the source code level but also at the level of compiled artifacts (e.g., IR or binary). Unfortunately, few studies thoroughly compare the results of various clone detection setups (i.e., different clone detection techniques applied to different artifacts), especially for C-based projects. Therefore, in this paper, we conduct a systematic study to compare the effectiveness of six different code clone detection setups. Each setup applies a representative of one of three clone detection techniques, token-based (SourcererCC), AST-based (NiCad), or text-based (MSFinder), either at the source code level or at the LLVM IR level. We conduct our experiments on five C-based open-source systems: Apache, Python, PostgreSQL, FFmpeg, and the Linux kernel. Experimental results show that the AST-based technique outperforms the token-based and text-based techniques, and that setups performed at the IR level generally yield higher performance than those performed at the source code level. The AST-based technique applied at the IR level yields the highest performance overall, with an F-score of 84%. However, no single setup can detect all the clones. Through manual qualitative analysis, we identified ten reasons why certain clones cannot be detected at the IR level or at the source code level, and two reasons why one of the techniques fails to detect clones. Our findings highlight the usefulness of conducting clone detection under different setups. Furthermore, this study also motivates the need for more application-oriented clone comparison studies.

DOI: 10.1109/ICSE-SEIP58684.2023.00012


Daisy: Effective Fuzz Driver Synthesis with Object Usage Sequence Analysis

作者: Zhang, Mingrui and Zhou, Chijin and Liu, Jianzhong and Wang, Mingzhe and Liang, Jie and Zhu, Juan and Jiang, Yu
关键词: No keywords

Abstract

Fuzzing is increasingly used in industrial settings for vulnerability detection due to its scalability and effectiveness. Libraries require driver programs to feed the fuzzer-generated inputs into library-provided interfaces. Writing such drivers manually is tedious and error-prone, thus greatly hindering the widespread use of fuzzing in practical situations. Previous attempts at automatic driver synthesis perform static analysis on the libraries and their consumers. However, a lack of dynamic object usage information renders them ineffective at generating interface function calls with correct parameters and meaningful sequences. This severely limits fuzzing’s bug-finding capabilities and can produce faulty drivers. In this paper, we propose Daisy, a driver synthesis framework, which extracts dynamic object usage sequences of library consumers to synthesize significantly more effective drivers. Daisy uses the following two steps to synthesize a fuzz driver for a library. First, it models each object’s behaviors into an object usage sequence during the execution of its consumers. Next, it merges all the extracted sequences and constructs a series of interface calls with valid object usages based on the merged sequence. We implemented Daisy and evaluated its effectiveness on real-world libraries selected from both the Android Open Source Project (AOSP) and Google’s FuzzBench. Daisy’s synthesized drivers significantly outperform drivers produced by other state-of-the-art fuzz driver synthesizers. In addition, on applying Daisy to the latest versions of those extensively-fuzzed real-world libraries of the benchmark, e.g., libaom and freetype2, we also found 9 previously-unknown bugs with 3 CVEs assigned.
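The sequence-merging step can be sketched as an order-preserving merge (the call names and the topological-merge strategy are assumptions for illustration, not Daisy's algorithm): combine the method-call orders observed on an object across several consumers so the synthesized driver respects every observed ordering.

```python
# Illustrative sketch: merge per-consumer object usage sequences into one
# driver call sequence that respects every observed before/after relation.
# Assumes the observed orders are mutually consistent (acyclic).

def merge_sequences(sequences):
    calls, after = [], {}
    for seq in sequences:
        for i, a in enumerate(seq):
            if a not in calls:
                calls.append(a)
                after[a] = set()
            for b in seq[i + 1:]:
                after[a].add(b)          # a must come before b
    merged = []
    while len(merged) < len(calls):
        ready = [c for c in calls if c not in merged and
                 not any(c in after[p] for p in calls if p not in merged)]
        if not ready:                    # conflicting orders: bail out
            merged += [c for c in calls if c not in merged]
            break
        merged.append(ready[0])
    return merged

observed = [  # hypothetical call traces from two library consumers
    ["create", "set_option", "decode", "destroy"],
    ["create", "decode", "get_frame", "destroy"],
]
print(merge_sequences(observed))
# ['create', 'set_option', 'decode', 'get_frame', 'destroy']
```

The real framework also has to recover valid parameter values for each call, which is what the dynamic analysis of consumers provides.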

DOI: 10.1109/ICSE-SEIP58684.2023.00013


Challenges in Adopting Artificial Intelligence Based User Input Verification Framework in Reporting Software Systems

作者: Kim, Dong Jae and Locke, Steve and Chen, Tse-Hsun (Peter) and Toma, Andrei and Sporea, Steve and Weinkam, Laura and Sajedi, Sarah
关键词: user input, testing, experience report

Abstract

Artificial intelligence is driving new industrial solutions for challenging problems once considered impossible. Many large-scale companies use AI to identify opportunities to improve business processes and products. Despite the promise and perils of AI, many traditional software systems (e.g., taxation or reporting) are implemented without AI in mind. Adopting AI-based capabilities in such software can be challenging due to a lack of resources and uncertainties in requirements. This paper documents our experience working with our industry partner on adopting AI capabilities in enterprise software. The enterprise software receives and processes thousands of user inputs with different configuration settings daily, which makes manual user input verification infeasible. To assist our industry partner, we design and integrate an AI-based input verification framework into the software. However, during the design and integration of the framework, we encounter many challenges that range from the requirement engineering process to the development, adoption, and verification process. We discuss the challenges we encountered and their corresponding solutions while working with our industrial partner to integrate the AI-based input verification framework into their non-AI software. Our experience report may provide valuable insight to practitioners and researchers on better integrating AI-based capabilities with existing software systems.

DOI: 10.1109/ICSE-SEIP58684.2023.00014


Scalable Compositional Static Taint Analysis for Sensitive Data Tracing on Industrial Micro-Services

作者: Zhong, Zexin and Liu, Jiangchao and Wu, Diyu and Di, Peng and Sui, Yulei and Liu, Alex X. and Lui, John C. S.
关键词: program analysis, taint analysis, micro-services

Abstract

In recent years, there has been an increasing demand for sensitive data tracing in industrial microservices, ranging from governance changes and data breach detection to data consistency validation. As an information tracking technique, taint analysis is widely used to address these demands. This paper aims to share our experience in developing a scalable static taint analyzer for sensitive data tracing in large-scale industrial microservices. Although several taint analyzers have been proposed for Java applications, our experiments show that existing approaches are inefficient and/or ineffective (in terms of low recall/precision rates) for analyzing large-scale industrial microservices. Instead, we present CFTaint, a compositional field-based taint analyzer, to address these challenges for popular microservices running in industrial Fintech applications. CFTaint improves scalability by using fast compositional function summaries, which capture the data propagation of each function during the on-the-fly taint analysis. CFTaint also uses a novel field-based algorithm to analyze taint propagation based on specified sensitive fields to reduce false negatives. Our field-based algorithm maximizes the soundness of our approach even when the taint tracking is performed on an unsound call graph. Furthermore, we also propose an efficient code transformation method to model the behaviours of containers, which allows our analysis to trace data propagation in a container environment. Experiments on numerous production microservices demonstrate the high recall (96.09%) and precision (93.51% for tracing sensitive data) of CFTaint with low analysis time (121.73 seconds).
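Summary-based taint propagation can be sketched as follows (the summary format, function names, and sink convention are invented for this example, not CFTaint's actual design): each function gets a summary of where its inputs flow, so callers propagate taint without re-analyzing function bodies.

```python
# Illustrative sketch of compositional, summary-based taint propagation.
# Hypothetical summaries: {function: {arg index: set of tainted outputs}},
# where "ret" means the return value and "SINK" marks a sensitive sink.

SUMMARIES = {
    "concat":   {0: {"ret"}, 1: {"ret"}},  # both args flow into the result
    "mask":     {0: set()},                # sanitizer: taint does not survive
    "log_sink": {0: {"SINK"}},             # arg 0 reaches a sensitive sink
}

def propagate(calls, tainted_vars):
    """calls: list of (function, [arg vars], result var). Returns sink hits."""
    tainted, hits = set(tainted_vars), []
    for func, args, result in calls:
        summary = SUMMARIES[func]
        for i, arg in enumerate(args):
            if arg in tainted:
                outs = summary.get(i, set())
                if "SINK" in outs:
                    hits.append((func, arg))
                if "ret" in outs:
                    tainted.add(result)
    return hits

program = [
    ("concat",   ["card_no", "suffix"], "msg"),   # msg becomes tainted
    ("mask",     ["msg"],               "safe"),  # safe is clean
    ("log_sink", ["msg"],               "_"),     # tainted data hits the sink
    ("log_sink", ["safe"],              "_"),     # clean: no report
]
print(propagate(program, {"card_no"}))  # [('log_sink', 'msg')]
```

The scalability win comes from computing each summary once and reusing it at every call site.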

DOI: 10.1109/ICSE-SEIP58684.2023.00015


Simulation-Driven Automated End-to-End Test and Oracle Inference

作者: Tuli, Shreshth and Bojarczuk, Kinga and Gucevska, Natalija and Harman, Mark and Wang, Xiao-Yu and Wright, Graham
关键词: automated test design, oracle problem, automated oracle inference, test automation, safety testing, integrity testing

Abstract

This is the first work to report on inferential testing at scale in industry. Specifically, it reports the experience of automated testing of integrity systems at Meta. We built an internal tool called ALPACAS for automated inference of end-to-end integrity tests. Integrity tests are designed to keep users safe online by checking that interventions take place when harmful behaviour occurs on a platform. ALPACAS infers not only the test input, but also the oracle, by observing production interventions to prevent harmful behaviour. This approach allows Meta to automate the process of generating integrity tests for its platforms, such as Facebook and Instagram, which consist of hundreds of millions of lines of production code. We outline the design and deployment of ALPACAS, and report results for its coverage, number of tests produced at each stage of the test inference process, and their pass rates. Specifically, we demonstrate that using ALPACAS significantly improves coverage from a manual test design for the particular aspect of integrity end-to-end testing it was applied to. Further, from a pool of 3 million data points, ALPACAS automatically yields 39 production-ready end-to-end integrity tests. We also report that the ALPACAS-inferred test suite enjoys exceptionally low flakiness for end-to-end testing with its average in-production pass rate of 99.84%.
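The oracle-inference idea can be sketched very simply (ALPACAS's internals are not public; the log format and field names here are hypothetical): each logged harmful input that triggered a production intervention becomes an end-to-end test whose oracle is that the same intervention happens again.

```python
# Illustrative sketch: infer end-to-end tests, including their oracles, from
# observed production interventions.

def infer_tests(production_log):
    """production_log: list of (content, intervention_or_None) observations."""
    return [
        {"input": content, "expected_intervention": action}
        for content, action in production_log
        if action is not None   # only observed interventions yield oracles
    ]

log = [
    ("spam link",   "remove_post"),
    ("normal post", None),
    ("abusive msg", "warn_user"),
]
tests = infer_tests(log)
print(tests)
```

The hard parts in practice, which this sketch omits, are filtering the 3 million data points down to stable, reproducible test candidates and keeping pass rates high.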

DOI: 10.1109/ICSE-SEIP58684.2023.00016


StreamAI: Dealing with Challenges of Continual Learning Systems for Serving AI in Production

作者: Barry, Mariam and Bifet, Albert and Billy, Jean-Luc
关键词: StreamAI, AI, challenges, production, serving, online learning, MLOps, streaming data, industry, banking

Abstract

How can we build, deploy, update, and maintain dynamic models that continuously learn from streaming data? This paper covers the industrialization aspects of these questions in production systems. In today’s fast-changing environments, organizations face the crucial challenge of performing predictive analytics online from big data and deploying Artificial Intelligence models at scale. Applications include cyber-security, cloud infrastructure, social networks, and financial markets. Online learning models that learn continuously and adapt to potentially evolving data distributions have demonstrated their efficiency for big data stream learning. However, the challenges of deploying and maintaining such models in production (serving) have stalled their adoption. In this paper, we first categorize key challenges faced by R&D, MLOps, and governance teams in deploying automated and self-training AI models in production. Next, we highlight the challenges related to stream-based online machine-learning systems. Finally, we propose StreamAI, a technology-agnostic architecture to deal with the MLOps journey (learning, serving, maintenance) of online models in production. We conclude with open research questions for AI, MLOps, and software engineering to bridge the gaps between industry needs and research-oriented development.

DOI: 10.1109/ICSE-SEIP58684.2023.00017


CONAN: Diagnosing Batch Failures for Cloud Systems

作者: Li, Liqun and Zhang, Xu and He, Shilin and Kang, Yu and Zhang, Hongyu and Ma, Minghua and Dang, Yingnong and Xu, Zhangwei and Rajmohan, Saravan and Lin, Qingwei and Zhang, Dongmei
关键词: No keywords

Abstract

Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests, VMs, nodes, etc.), resulting in degraded service availability and performance. Manual investigation over a large volume of high-dimensional telemetry data (e.g., logs, traces, and metrics) is labor-intensive and time-consuming, like finding a needle in a haystack. Meanwhile, existing proposed approaches are usually tailored for specific scenarios, which hinders their applications in diverse scenarios. According to our experience with Azure and Microsoft 365 - two world-leading cloud systems, when batch failures happen, the procedure of finding the root cause can be abstracted as looking for contrast patterns by comparing two groups of instances, such as failed vs. succeeded, slow vs. normal, or during vs. before an anomaly. We thus propose CONAN, an efficient and flexible framework that can automatically extract contrast patterns from contextual data. CONAN has been successfully integrated into multiple diagnostic tools for various products, which proves its usefulness in diagnosing real-world batch failures.
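The "contrast pattern" idea can be sketched concretely (the attributes, scoring rule, and threshold are made up for this example; CONAN's actual mining algorithm is more sophisticated): score each attribute value by how much more often it appears among failed instances than among succeeded ones.

```python
# Illustrative sketch of contrast-pattern mining: surface attribute values
# overrepresented in the failed group versus the succeeded group.

def contrast_patterns(failed, succeeded, min_gap=0.5):
    """Each instance is a dict of attribute -> value; returns scored patterns."""
    def freq(group, pair):
        return sum(inst.get(pair[0]) == pair[1] for inst in group) / len(group)

    pairs = {(k, v) for inst in failed for k, v in inst.items()}
    scored = [(pair, freq(failed, pair) - freq(succeeded, pair))
              for pair in pairs]
    return sorted([(p, round(gap, 2)) for p, gap in scored if gap >= min_gap],
                  key=lambda x: -x[1])

failed = [
    {"region": "eu", "sku": "v2"},
    {"region": "eu", "sku": "v2"},
    {"region": "us", "sku": "v2"},
]
succeeded = [
    {"region": "eu", "sku": "v1"},
    {"region": "us", "sku": "v1"},
    {"region": "us", "sku": "v2"},
]
print(contrast_patterns(failed, succeeded))
# [(('sku', 'v2'), 0.67)] -- failures concentrate on sku v2
```

The same comparison generalizes to slow vs. normal instances or to during vs. before an anomaly, which is what makes the abstraction flexible across scenarios.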

DOI: 10.1109/ICSE-SEIP58684.2023.00018


Please Fix This Mutant: How Do Developers Resolve Mutants Surfaced during Code Review?

作者: Petrović
关键词: mutation testing, test efficacy, code quality, code review, mutant resolution

Abstract

Mutation testing has been demonstrated to motivate developers to write more tests when presented with undetected, actionable mutants. To facilitate this effect, modern mutation systems aim to generate and surface only actionable mutants—few in number but highly valuable to the developer. This requires a deeper understanding of the extent to which developers resolve surfaced mutants and how: If they decide not to resolve an undetected mutant, why not? On the other hand, if they do resolve a mutant, do they simply add a test that detects it, or do they also improve the code? In order to answer these questions, we compiled and analyzed a dataset of 1,538 merge requests with corresponding mutants surfaced during the code review phase. Our analysis reveals that determining whether a mutant is indeed resolved during code review is actually a non-trivial problem: for 64% of mutants, the mutated code changes as the merge request evolves, requiring dedicated techniques to precisely resurface the same mutants and to discover which of them remain unresolved after a code change. Overall, our analysis demonstrates that 38% of all surfaced mutants are resolved via code changes or test additions. Out of all mutants that are endorsed by a reviewer, 60% are resolved and result in additional tests, code refactorings, and improved documentation. Unresolved, yet endorsed, mutants stem from developers questioning the value of adding tests for surfaced mutants, later resolving mutants in deferred code changes (atomicity of merge requests), and false positives (mutants being resolved by tests not considered when creating the mutants, e.g., in integration test suites).

DOI: 10.1109/ICSE-SEIP58684.2023.00019


Using Large-Scale Heterogeneous Graph Representation Learning for Code Review Recommendations at Microsoft

作者: Zhang, Jiyang and Maddila, Chandra and Bairi, Ram and Bird, Christian and Raizada, Ujjwal and Agrawal, Apoorva and Jhawar, Yamini and Herzig, Kim and van Deursen, Arie
关键词: No keywords

Abstract

Code review is an integral part of any mature software development process, and identifying the best reviewer for a code change is a well-accepted problem within the software engineering community. Selecting a reviewer who lacks expertise and understanding can slow development or result in more defects. To date, most reviewer recommendation systems rely primarily on historical file change and review information; those who changed or reviewed a file in the past are the best positioned to review in the future. We posit that while these approaches are able to identify and suggest qualified reviewers, they may be blind to reviewers who have the needed expertise and have simply never interacted with the changed files before. Fortunately, at Microsoft, we have a wealth of work artifacts across many repositories that can yield valuable information about our developers. To address the aforementioned problem, we present Coral, a novel approach to reviewer recommendation that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests (PRs), work items, etc.) and their relationships in modern source code management systems. We employ a graph convolutional neural network on this graph and train it on two and a half years of history on 332 repositories within Microsoft. We show that Coral is able to model the manual history of reviewer selection remarkably well. Further, based on an extensive user study, we demonstrate that this approach identifies relevant and qualified reviewers who traditional reviewer recommenders miss, and that these developers desire to be included in the review process. Finally, we find that “classical” reviewer recommendation systems perform better on smaller (in terms of developers) software projects while Coral excels on larger projects, suggesting that there is “no one model to rule them all.”

DOI: 10.1109/ICSE-SEIP58684.2023.00020


Widget Detection-Based Testing for Industrial Mobile Games

作者: Wu, Xiongfei and Ye, Jiaming and Chen, Ke and Xie, Xiaofei and Hu, Yujing and Huang, Ruochen and Ma, Lei and Zhao, Jianjun
关键词: mobile game testing, GUI testing, GUI detection, software quality assurance

Abstract

The fast advances in mobile hardware and widespread smartphone usage have fueled the growth of global mobile gaming in the past decade. As a result, the need for quality assurance of mobile games has become increasingly pressing. While general-purpose testing methods have been developed for mobile applications, they struggle when applied to mobile games due to the unique characteristics of mobile games, such as dynamic loading and stunning visual effects. There is a growing industrial demand for automated testing techniques with high compatibility (compatible with various resolutions and platforms) and non-intrusive characteristics (without packaging external modules into the source code, e.g., POCO). To fulfill these demands, in this paper, we introduce our experience in adopting the widget detection-based testing technique WDTest for mobile games at NetEase Games. To this end, we have constructed by far the largest graphical user interface (GUI) dataset for mobile games and conducted comprehensive evaluations of the performance of state-of-the-art widget detection techniques in the context of mobile gaming. We leverage widget detection techniques to develop WDTest, which performs automated testing using only screenshots as input. Our evaluation shows that WDTest outperforms the widely used tool Monkey, achieving three times more coverage of unique UIs in gaming scenarios. Our further experiments demonstrate that WDTest can be applied to general mobile applications without additional fine-tuning. Furthermore, we conducted a thorough survey at NetEase Games to gain a comprehensive understanding of widget detection-based testing techniques and identify challenges in industrial mobile game testing. The results show that testers are overall satisfied with the compatibility testing aspect of widget detection-based testing, but less so with functionality testing.
This survey also highlights several unique characteristics of mobile games, providing valuable insights for future research directions.

DOI: 10.1109/ICSE-SEIP58684.2023.00021


Towards More Effective AI-Assisted Programming: A Systematic Design Exploration to Improve Visual Studio IntelliCode’s User Experience

作者: Vaithilingam, Priyan and Glassman, Elena L. and Groenwegen, Peter and Gulwani, Sumit and Henley, Austin Z. and Malpani, Rohan and Pugh, David and Radhakrishna, Arjun and Soares, Gustavo and Wang, Joey and Yim, Aaron
关键词: inline-suggestion, AI-suggestion, refactoring, iterative-refinement, code-completion

Abstract

AI-driven code editor extensions such as Visual Studio IntelliCode and Github CoPilot have become extremely popular. These tools recommend inserting chunks of code, with the lines to be inserted presented inline at the current cursor location as gray text. In contrast to their popularity, other AI-driven code recommendation tools that suggest code changes (as opposed to code completions) have remained woefully underused. We conducted lab studies at Microsoft to understand this disparity and found one major cause: discoverability. Code change suggestions are hard to surface through bold, inline interfaces and hence, developers often do not even notice them. Towards a systematic understanding of code change interfaces, we performed a thorough design exploration for various categories of code changes: additive single-line changes, single-line changes, and multi-line changes. Overall, we explored 19 designs through a series of 7 laboratory studies involving 61 programmers and distilled our findings into a set of 5 design principles. To validate our results, we built and deployed a new version of IntelliCode with two of our new inline interfaces in Microsoft Visual Studio 2022 and found that they lead to a significant increase in usage of the corresponding tools.

DOI: 10.1109/ICSE-SEIP58684.2023.00022


Code Librarian: A Software Package Recommendation System

作者: Tao, Lili and Cazan, Alexandru-Petre and Ibraimoski, Senad and Moran, Sean
关键词: artificial intelligence, software engineering, recommender systems

Abstract

The use of packaged libraries can significantly shorten the software development life cycle by improving the quality and readability of code. In this paper, we present Code Librarian, a recommendation engine for open source libraries. A candidate library package is recommended for a given context if: 1) it has been frequently used with the imported libraries in the program; 2) it has similar functionality to the imported libraries in the program; 3) it has similar functionality to the developer’s implementation; and 4) it can be used efficiently in the context of the provided code. We apply a state-of-the-art CodeBERT-based model to analyse the context of the source code and deliver relevant library recommendations to users.

DOI: 10.1109/ICSE-SEIP58684.2023.00023


DocToModel: Automated Authoring of Models from Diverse Requirements Specification Documents

作者: Rajbhoj, Asha and Nistala, Padmalata and Kulkarni, Vinay and Soni, Shivani and Pathan, Ajim
关键词: meta-model, automated model authoring, model extraction, document parser, NLP, meta-model pattern, pattern interpreter

Abstract

The early stages of the Software Development Life Cycle (SDLC), namely requirements elicitation and requirements analysis, have remained document-centric in the industry for market-driven, complex, large-scale business applications and products. The documentation typically runs into hundreds of Natural Language (NL) text documents, which requirements engineers need to sift through looking for the relevant information and also keep in sync over time - a time-consuming and error-prone activity. Much of this difficulty can be overcome if the information is available in a structured form that is amenable to automated processing. Purposive models offer a way out. However, for easy adoption by industry practitioners, these models must be populated from NL text documents in a largely automated manner. This task is characterized by high variability, with documents containing different information conforming to different structures and styles. As a result, purposive information extractors need to be developed for each project/product. Moreover, being an open-ended space, there is no upper bound on the number of information extractors that need to be developed. To overcome this difficulty, we propose a document-structure-agnostic and meta-model-agnostic tool, DocToModel, for the automated authoring of models from NL text documents. It provides a pattern mapping language to specify a mapping of structured and unstructured document information to meta-model elements, and a pattern interpreter to automate model authoring. The configurable and extensible architecture of DocToModel makes it generic and amenable to easy repurposing for other NL documents. This paper describes the approach and illustrates its utility and efficacy on multiple real-world case studies.

DOI: 10.1109/ICSE-SEIP58684.2023.00024


Investigating a NASA Cyclomatic Complexity Policy on Maintenance Risk of a Critical System

作者: Port, Dan and Taber, Bill and Huang, LiQuo
关键词: cyclomatic complexity, software maintenance, defect proneness, defect repair effort

Abstract

Monte, a mission-critical system used by NASA for the navigation and design of deep space missions, has been used in over 40 missions over the last 18 years. Its continuous, reliable operation is considered critical to the operation of over 18 ongoing missions. Recently, Monte was escalated to safety-critical software and became subject to NASA Software Assurance and Software Safety Standard requirements. One of these requirements mandates a policy that the cyclomatic complexity (CC) of safety-critical components be under 16, or that a technically detailed explanation be given as to why it cannot or should not be lower. Conformance to this requirement would be costly, and we had doubts about its benefit and efficacy in managing our maintenance risk (defect proneness and defect repair effort). The requirement was not substantiated either empirically or in principle, and guidance in the literature on using CC as an indicator of maintenance risk is limited and often speculative, with contradictory or non-definitive empirical results. This led us to rigorously investigate the impact the CC policy would have on the practical management of maintenance risk for Monte. The effect of CC on maintenance risk for Monte is explored using a variety of statistical methods and machine learning, with the aim of providing objective empirical evidence to support our decision as to what extent we will conform to the NASA CC policy of maintaining a CC under 16. This paper presents the conclusions and insights gained from this investigation. Practical questions related to the policy are addressed, such as: Does higher CC associate with higher defect proneness? With more effort to repair defects? If so, is there a CC threshold after which action should be taken to reduce defect proneness and repair effort? We conclude that the impacts of CC on maintenance risk are, with high confidence, consistent with the risk management expectations of the NASA policy.
We can quantify the benefit and weigh it against the cost to inform our decision on conforming to the policy. Furthermore, we gained insight into how CC affects maintenance risk and why it is a useful indicator for maintenance risk management.
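For readers unfamiliar with the metric behind the policy, McCabe's cyclomatic complexity can be computed from a control-flow graph as CC = E - N + 2P (edges, nodes, connected components). The sketch below is purely illustrative; the toy CFG is hypothetical and has no connection to Monte or the paper's analysis:

```python
def cyclomatic_complexity(edges, num_nodes, num_components=1):
    """McCabe's cyclomatic complexity: CC = E - N + 2P."""
    return len(edges) - num_nodes + 2 * num_components

# Toy control-flow graph for a function with one if/else and one loop.
edges = [
    ("entry", "cond"),
    ("cond", "then"), ("cond", "else"),  # if/else branch
    ("then", "join"), ("else", "join"),
    ("join", "loop"),
    ("loop", "loop"),                    # loop back-edge
    ("loop", "exit"),
]
nodes = {"entry", "cond", "then", "else", "join", "loop", "exit"}

cc = cyclomatic_complexity(edges, len(nodes))
print(cc)  # E=8, N=7, P=1 -> CC = 8 - 7 + 2 = 3
```

A function like this one, with CC = 3, would sit comfortably under the NASA policy's threshold of 16.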

DOI: 10.1109/ICSE-SEIP58684.2023.00025


Aegis: Attribution of Control Plane Change Impact across Layers and Components for Cloud Systems

作者: Yan, Xiaohan and Hsieh, Ken and Liyanage, Yasitha and Ma, Minghua and Chintalapati, Murali and Lin, Qingwei and Dang, Yingnong and Zhang, Dongmei
关键词: safe deployment, regression detection, impact assessment, counterfactual analysis, cloud computing

Abstract

Modern cloud control plane infrastructure, like that of Microsoft Azure, has evolved into a complex system serving customer needs for diverse types of services and adequate cloud-based resources. On such an interconnected system, implementing a change in one component can have an impact on other components, even across different hierarchical computing layers. As a result of the complexity and interconnected nature of cloud-based services, it is challenging to correctly attribute service quality degradation to a control plane change, to infer causality between the two, and to mitigate any negative impact. In this paper, we present Aegis, an end-to-end analytical service for attributing control plane change impact across computing layers and service components in large-scale real-world cloud systems. Aegis processes and correlates service health signals and control plane changes across components to construct the most probable causal relationship. At its core, Aegis leverages a domain-knowledge-driven correlation algorithm to attribute platform signals to changes, and a counterfactual projection model to quantify the impact of control plane changes on customers. Aegis can mitigate the impact of bad changes by alerting the service team and recommending that the bad ones be paused. Since Aegis’ inception in the Azure Control Plane 12 months ago, it has caught several bad changes across service components and layers, and promptly paused them to guard the quality of service. Aegis achieves precision and recall around 80% on real-world control plane deployments.

DOI: 10.1109/ICSE-SEIP58684.2023.00026


An Empirical Study on Change-Induced Incidents of Online Service Systems

作者: Wu, Yifan and Chai, Bingxu and Li, Ying and Liu, Bingchang and Li, Jianguo and Yang, Yong and Jiang, Wei
关键词: incident, change management, empirical study, online service system

Abstract

Although dedicated efforts have been devoted to ensuring the service quality of online service systems, these systems still suffer from incidents due to various causes, which lead to user dissatisfaction and economic loss. Change is the most disruptive yet unavoidable maintenance event in online service systems, and among all possible causes it is one of the leading causes of incidents. To enforce changes with minimized negative impact, change management has been widely applied in industry. However, change-induced incidents are still happening. Most empirical studies involving change-induced incidents are limited to one specific type of incident-inducing change. Moreover, the characteristics of change-induced incidents and the challenges of change management have not been studied. To fill this knowledge gap, this paper presents the first empirical study on change-induced incidents of online service systems. 161 real change-induced incidents were collected from a large-scale online service system at Ant Group over two years. By manually examining their post-mortem reports, we clarify the severity of change-induced incidents and analyze their characteristics in terms of change types, root causes, and mitigation strategies. Furthermore, we identify a series of vital challenges of change management in practice and point out several practical implications for researchers and engineers. We believe our work can help practitioners understand change-induced incidents and provide inspiration and guidance for engineers and researchers to improve change management.

DOI: 10.1109/ICSE-SEIP58684.2023.00027


Fulfilling Industrial Needs for Consistency among Engineering Artifacts

作者: Marchezan, Luciano and Assun\c{c
关键词: model-driven engineering, consistency checking, trace generation, consistency flexibility

Abstract

Maintaining the consistency of engineering artifacts is a challenge faced by several engineering companies. This is more evident when the engineering artifacts are created using different tools and have different formats. This is the context of a company that builds agricultural machines, where components are developed using a decentralized iterative process. In this study, we present an approach developed in collaboration with an industry partner to address the issues and requirements of a real engineering scenario. These issues include the manual execution of consistency checking, without guidelines that formalize the activity. Furthermore, the industry partner aims at a flexible solution that can be applied without significantly disrupting the current development process. The proposed approach applies consistency rules (CRs) to automatically detect inconsistencies and provide feedback to engineers in real time. The approach also allows the customization of the CRs, giving flexibility to how consistency checking is applied. The feasibility of our approach is demonstrated in this industrial scenario, with a discussion of how the issues were addressed and the limitations of the current solution. We also perform a scalability evaluation showing that the approach can be applied to large systems (up to 21,061 elements) in a reasonable amount of time, taking less than 0.25 milliseconds to apply a CR in the worst cases.

DOI: 10.1109/ICSE-SEIP58684.2023.00028


TraceArk: Towards Actionable Performance Anomaly Alerting for Online Service Systems

作者: Zeng, Zhengran and Zhang, Yuqun and Xu, Yong and Ma, Minghua and Qiao, Bo and Zou, Wentao and Chen, Qingjun and Zhang, Meng and Zhang, Xu and Zhang, Hongyu and Gao, Xuedong and Fan, Hao and Rajmohan, Saravan and Lin, Qingwei and Zhang, Dongmei
关键词: No keywords

Abstract

Performance anomaly alerting based on trace data plays an important role in assuring the quality of online service systems. However, engineers find that many anomalies reported by existing techniques are not of interest to them and warrant no further action. For a large-scale online service with hundreds of different microservices, current methods either fire many false alarms by applying simple thresholds to temporal metrics (e.g., latency), or run complex end-to-end deep learning models with limited interpretability. Engineers often find it difficult to understand why anomalies are reported, which hinders follow-up actions. In this paper, we propose TraceArk, an actionable anomaly alerting approach. More specifically, we design an anomaly evaluation model that extracts service-impact-related anomalous features. A small amount of engineer experience (i.e., feedback) is also incorporated to learn the actionable anomaly alerting model. Comprehensive experiments on a real dataset from the Microsoft Exchange service and an anomaly injection dataset collected from an open-source project demonstrate that TraceArk significantly outperforms existing state-of-the-art approaches, improving F1 by 50.47% and 20.34% on the two datasets, respectively. Furthermore, TraceArk has been running stably in a real production environment for four months, showing a 2.3x improvement in precision over the previous approach. TraceArk also provides interpretable alerting details for engineers to take further action.

DOI: 10.1109/ICSE-SEIP58684.2023.00029


You Don’t Know Search: Helping Users Find Code by Automatically Evaluating Alternative Queries

作者: van Tonder, Rijnard
关键词: No keywords

Abstract

Tens of thousands of engineers use Sourcegraph day-to-day to search for code and rely on it to make progress on software development tasks. We face a key challenge in designing a query language that accommodates the needs of a broad spectrum of users. Our experience shows that users express different and often contradictory preferences for how queries should be interpreted. These preferences stem from users with differing usage contexts, technical experience, and implicit expectations from using prior tools. At the same time, designing a code search query language poses unique challenges because it intersects traditional search engines and full-fledged programming languages. For example, code search queries adopt certain syntactic conventions in the interest of simplicity and terseness but invariably risk encoding implicit semantics that are ambiguous at face-value (a single space in a query could mean three or more semantically different things depending on surrounding terms). Users often need to disambiguate intent with additional syntax so that a query expresses what they actually want to search. This need to disambiguate is one of the primary frustrations we’ve seen users experience with writing search queries in the last three years. We share our observations that lead us to a fresh perspective where code search behavior can straddle seemingly ambiguous queries. We develop Automated Query Evaluation (AQE), a new technique that automatically generates and adaptively runs alternative query interpretations in frustration-prone conditions. We evaluate AQE with an A/B test across more than 10,000 unique users on our publicly-available code search instance. Our main result shows that relative to the control group, users are on average 22% more likely to click on a search result at all on any given day when AQE is active. 
We share our technique, learnings, and implementation that made it possible for a substantial number of users to now see and click on results that they would not have seen otherwise.

DOI: 10.1109/ICSE-SEIP58684.2023.00030


CFG2VEC: Hierarchical Graph Neural Network for Cross-Architectural Software Reverse Engineering

作者: Yu, Shih-Yuan and Achamyeleh, Yonatan Gizachew and Wang, Chonghan and Kocheturov, Anton and Eisen, Patrick and Faruque, Mohammad Abdullah Al
关键词: software reverse engineering, binary analysis, cross-architecture, machine learning, graph neural network

Abstract

Mission-critical embedded software is critical to our society’s infrastructure but can be subject to new security vulnerabilities as technology advances. When security issues arise, Reverse Engineers (REs) use Software Reverse Engineering (SRE) tools to analyze vulnerable binaries. However, existing tools have limited support, and REs undergo a time-consuming, costly, and error-prone process that requires experience and expertise to understand the behaviors of software and its vulnerabilities. To improve these tools, we propose cfg2vec, a hierarchical Graph Neural Network (GNN) based approach. To represent binaries, we propose a novel Graph-of-Graph (GoG) representation, combining the information of control-flow and function-call graphs. cfg2vec learns how to represent each binary function compiled for various CPU architectures, utilizing a hierarchical GNN and a siamese-network-based supervised learning architecture. We evaluate cfg2vec’s capability of predicting function names from stripped binaries. Our results show that cfg2vec outperforms the state-of-the-art by 24.54% in predicting function names and can achieve 51.84% better results given more training data. Additionally, cfg2vec consistently outperforms the state-of-the-art across all CPU architectures, while the baseline requires multiple training runs to achieve similar performance. More importantly, our results demonstrate that cfg2vec can tackle binaries built for unseen CPU architectures, indicating that our approach can generalize the learned knowledge. Lastly, we demonstrate its practicality by implementing it as a Ghidra plugin used while resolving DARPA Assured MicroPatching (AMP) challenges.

DOI: 10.1109/ICSE-SEIP58684.2023.00031


Do Software Security Practices Yield Fewer Vulnerabilities?

作者: Zahan, Nusrat and Shohan, Shohanuzzaman and Harris, Dan and Williams, Laurie
关键词: No keywords

Abstract

Due to the ever-increasing number of security breaches, practitioners are motivated to produce more secure software. In the United States, the White House Office released a memorandum on Executive Order (EO) 14028 that mandates organizations provide self-attestation of the use of secure software development practices. The OpenSSF Scorecard project allows practitioners to measure the use of software security practices automatically. However, little research has been done to determine whether the use of security practices improves package security, and particularly which security practices have the biggest impact on security outcomes. The goal of this study is to assist practitioners and researchers in making informed decisions on which security practices to adopt through the development of models relating software security practice scores to security vulnerability counts. To that end, we developed five supervised machine learning models for npm and PyPI packages using the OpenSSF Scorecard security practice scores and aggregate security scores as predictors and the number of externally-reported vulnerabilities as the target variable. Our models found that four security practices (Maintained, Code Review, Branch Protection, and Security Policy) were the most important practices influencing vulnerability count. However, we had low R² (ranging from 9% to 12%) when we tested the models’ ability to predict vulnerability counts. Additionally, we observed that the number of reported vulnerabilities increased rather than decreased as the aggregate security score of the packages increased. Both findings indicate that additional factors may influence package vulnerability counts. Other factors, such as the scarcity of vulnerability data, the gap between the time security practices are implemented and the time vulnerabilities are detected, and the need for more adequate scripts to detect security practices, may prevent data-driven studies from showing that a practice aids in reducing externally-reported vulnerabilities. We suggest that vulnerability count and security score data be refined so that these measures can be used to provide actionable guidance on security practices.

DOI: 10.1109/ICSE-SEIP58684.2023.00032


A/B Integrations: 7 Lessons Learned from Enabling A/B Testing as a Product Feature

作者: Fabijan, Aleksander and Dmitriev, Pavel and Arai, Benjamin and Drake, Andy and Kohlmeier, Sebastian and Kwong, April
关键词: A/B testing, A/B integrations, platform design

Abstract

A/B tests are the gold standard for evaluating product changes. At Microsoft, for example, we run tens of thousands of A/B tests every year to understand how users respond to new designs, new features, bug fixes, or any other ideas we might have on what will deliver value to users. In addition to testing product changes, however, A/B testing is starting to gain momentum as a differentiating feature of platforms or products whose primary purpose may not be A/B testing. As we describe in this paper, organizations such as Azure PlayFab and Outreach have integrated experimentation platforms and offer A/B testing to their customers as one of the many features in their product portfolio. In this paper, based on multiple case studies, we present the lessons learned from enabling A/B integrations - integrating A/B testing into software products. We enrich each of the learnings with a motivating example, share the trade-offs made along this journey, and provide recommendations for practitioners. Our learnings are most applicable to engineering teams developing experimentation platforms, integrators considering embedding A/B testing into their products, and researchers working in the A/B testing domain.

DOI: 10.1109/ICSE-SEIP58684.2023.00033


Long-Term Static Analysis Rule Quality Monitoring Using True Negatives

作者: Luo, Linghui and Mukherjee, Rajdeep and Tripp, Omer and Sch"{a
关键词: software security, static analysis

Abstract

Static application security testing (SAST) tools have found broad adoption in modern software development workflows. These tools employ a variety of static analysis rules to generate recommendations on how to improve the code of an application. Every recommendation consumes the time of the engineer investigating it, so it is important to measure how useful these rules are in the long term. But what is a good metric for monitoring rule quality over time? Counting the number of recommendations rewards noisy rules and ignores developers’ reactions. Measuring fix rate is not ideal either, because it overemphasizes rules that are easy to fix. In this paper, we report on an experiment where we use the frequency of true negatives to quantify whether developers are able to learn a static analysis rule. We consider a static analysis rule ideal if its recommendations are not only addressed, but also internalized by the developer in a way that prevents the bug from recurring. That is, the rule contributes to code quality not only at present, but also in the future. We measure how often developers produce true negatives, that is, code changes that are relevant to a rule but do not trigger a recommendation, and we compare the true-negative rate against other metrics. Our results show that measuring true negatives provides insights that cannot be obtained from metrics such as fix rate or developer feedback.
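As a rough illustration of the metric the abstract describes, a per-rule true-negative rate can be computed as the fraction of rule-relevant code changes that did not trigger a recommendation. The function below is a hypothetical sketch under that reading, not the paper's implementation:

```python
def true_negative_rate(relevant_changes, triggered_changes):
    """Fraction of rule-relevant code changes that did NOT trigger a
    recommendation (illustrative definition, names are hypothetical).

    relevant_changes: count of code changes relevant to the rule.
    triggered_changes: count of those changes that fired a recommendation.
    """
    if relevant_changes == 0:
        return None  # no signal for this rule yet
    return (relevant_changes - triggered_changes) / relevant_changes

# Of 40 changes touching code relevant to a rule, only 10 fired it:
print(true_negative_rate(40, 10))  # 0.75
```

A high rate under this definition would suggest developers have internalized the rule; a low rate would suggest the same mistake keeps recurring.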

DOI: 10.1109/ICSE-SEIP58684.2023.00034


A Language-Agnostic Framework for Mining Static Analysis Rules from Code Changes

作者: Effendi, Sedick David Baker and \c{C
关键词: static analysis, mining software repository, program synthesis, coding best practices, clustering

Abstract

Static analysis tools detect a wide range of code defects, including code quality issues, security vulnerabilities, operational risks, and best-practice violations. Creating and maintaining a set of high-quality static analysis rules that detect misuses of popular libraries and SDKs across multiple languages is challenging. One mechanism for inferring static analysis rules is to leverage frequently occurring bug-fix code changes in the wild, committed by multiple developers across different software repositories. The intuition is that code changes following a common pattern correspond to recurring mistakes, from which derived best practices are likely to be of high value and accepted by the community. Automating the process of mining and clustering code changes enables a scalable mechanism to source and generate best-practice rules. From a coverage standpoint, the rules are derived from real-world code changes, which ensures that popular libraries and application domains are accounted for. In this paper, we present a language-agnostic framework for mining and clustering code changes from software repositories using a graph-based representation dubbed MU (μ). Unlike language-specific ASTs, the MU representation generalizes across languages by modeling programs at a higher semantic level, which enables grouping of code changes that are semantically similar yet syntactically distinct. We have mined a total of 62 high-quality static analysis rules across Java, JavaScript, and Python from fewer than 600 code change clusters. These cover multiple libraries, including the AWS Java and Python SDKs, as well as libraries like pandas, React, Android libraries, JSON parsing libraries, and many more. These rules are integrated into a cloud-based static analyzer, Amazon CodeGuru Reviewer.
Developers have accepted 73% of recommendations from these rules during code review, which signifies the value of these rules in helping improve developer productivity, making code secure, and improving code hygiene.

DOI: 10.1109/ICSE-SEIP58684.2023.00035


Designing for Cognitive Diversity: Improving the GitHub Experience for Newcomers

作者: Santos, Italo and Pimentel, Jo~{a
关键词: human-computer interaction, cognitive styles, human factors, diversity and inclusion, open source

Abstract

Social coding platforms such as GitHub have become de facto environments for collaborative programming and open source. When these platforms do not support specific cognitive styles, they create barriers to programming for some populations. Research shows that the cognitive styles typically favored by women are often unsupported, creating barriers to entry for women newcomers. In this paper, we use the GenderMag method to evaluate GitHub to find cognitive-style-specific inclusivity bugs. We redesigned the “buggy” GitHub features through a web browser plugin, which we evaluated through a between-subjects experiment (n=75). Our results indicate that the changes to the interface improve users’ performance and self-efficacy, mainly for individuals with cognitive styles more common to women. Our results can inspire designers of social coding platforms and software engineering tools to produce more inclusive development environments. Diversity is an important aspect of society. One form of diversity is cognitive diversity - differences in cognitive styles - which helps generate a diversity of thoughts. Unfortunately, software tools often do not support different cognitive styles (e.g., learning styles), disproportionately impacting those whose styles are not supported. These individuals pay a cognitive “tax” each time they use the tools. In this work, we found “inclusivity bugs” in GitHub, a social coding platform. We then redesigned these buggy features and evaluated them with users. Our results show that the redesign makes it easier for the group of individuals whose cognitive styles were unsupported in the original design, with the percentage of completed tasks rising from 67% to 95% for this group.

DOI: 10.1109/ICSE-SEIS58686.2023.00007


Security Thinking in Online Freelance Software Development

作者: Rauf, Irum and Petre, Marian and Tun, Thein and Lopez, Tamara and Nuseibeh, Bashar
关键词: societal challenges of secure software development, software development in society, developer, payment for security, freelance software development

Abstract

Online freelance software development (OFSD) is a significant part of the software industry and a thriving online economy; a recent survey by Stack Overflow reported that nearly 15% of developers are independent contractors, freelancers, or self-employed. Although security is an important quality requirement for the social sustainability of software, existing studies have shown differences in the way security issues are handled by developers working in OFSD compared to those working in organisational environments. This paper investigates the security culture of OFSD developers and identifies significant themes in how security is conceived, practiced, and compensated. Based on in-depth interviews with 20 freelance (FL) developers, we report that (a) security thinking is evident in descriptions of their work, (b) security thinking manifests in different ways within OFSD practice, and (c) the dynamics of the freelance development ecosystem influence financial investment in secure development. Our findings help to explain why insecure software development is evident in freelance development, and they contribute toward developing security interventions that are tailored to the needs of freelance software developers. General Summary: Online freelance software development (OFSD) is a significant part of the software industry and a thriving online economy. Although security is an important quality requirement for the social sustainability of software, existing studies have shown differences in the way security issues are handled by developers working in OFSD compared to those working in organisational environments. Based on in-depth interviews with 20 freelance developers, this paper investigates the security culture of OFSD developers and identifies significant themes in how security is conceived, practiced, and compensated.

DOI: 10.1109/ICSE-SEIS58686.2023.00008


Fundamentalists, Integrationists, & Transformationists: An Empirical Theory of Men Software Engineers’ Orientations in Gender Inequalities

作者: Wang, Yi and Zhang, Xinyue and Wang, Wei
关键词: No keywords

Abstract

Professional software development is an occupation dominated by men software engineers. Recently, researchers have been investigating gender inequality in the software development profession from multiple perspectives. While most of the extant literature on gender inequality in software development takes the perspective of female developers, the vast majority of the software development workforce - men software engineers - receives much less attention. In this article, we report on a qualitative interview study aimed at developing empirical understandings of men software engineers’ orientations in gender inequalities. Through analyzing interview narratives from 21 professional men software engineers, a rich set of findings emerged, enabling us to build a theoretical framework characterizing three different types of orientations in gender inequalities. In our theory, men software engineers’ orientations in gender issues can be conceptualized as fundamentalist, integrationist, and transformationist. We introduce each type’s characteristics from three aspects of individuals’ social cognition: belief, attitude, and action. We further discuss the implications and the limitations of the study.

DOI: 10.1109/ICSE-SEIS58686.2023.00009


Draw a Software Engineer Test - An Investigation into Children’s Perceptions of Software Engineering Profession

作者: Cutrupi, Claudia Maria and Zanardi, Irene and Jaccheri, Letizia and Landoni, Monica
关键词: coding, drawing, primary school students, draw-a-computer-scientist test, software engineering, children’s drawings, gender stereotypes

Abstract

Context: The gender gap particularly affects the software engineering community, as both academia and industry are dominated by men. The literature reports how the lack of women is a consequence of gender stereotypes around certain figures that begin in the early stages of education, affecting children’s perceptions of the roles they can play across scientific fields. Objective: In this study, we asked children to draw a software engineer in order to collect their perceptions and check whether gender stereotypes still persist. Methods: We asked a total of 371 children to draw a person who works in the software engineering field. We analyzed the drawings based on a set of parameters extracted from the literature and inspected the results through a cross-sectional study. Results: Children agreed on their representations of a software engineer: 51% drew a man and 44% drew a woman, while 5% drew a non-recognizable figure. The main differences emerged when the data were grouped by age and gender: only 23% of eleven-year-old girls drew a woman software engineer, while 54% drew a man, and in 23% of cases the gender was non-recognizable. Conclusion: The findings reveal a favorable gender balance in children’s perceptions of software engineering. Children seem more willing to recognize diversity, an improvement compared with what was reported in previous studies. Children’s perceptions of technology may have become more accessible as a result of the COVID-19 situation. These findings may draw positive comparisons with the current gender gap in software engineering, encouraging future developments.

DOI: 10.1109/ICSE-SEIS58686.2023.00010


Benefits and Limitations of Remote Work to LGBTQIA+ Software Professionals

作者: de Souza Santos, Ronnie and de Magalhães
关键词: LGBTQIA+, software professionals, inclusion, diversity, equity, EDI

Abstract

Background. The mass transition to remote work amid the COVID-19 pandemic profoundly affected software professionals, who abruptly shifted into ostensibly temporary home offices. The effects of this transition on these professionals are complex, depending on the particularities of the context and individuals. Recent studies advocate for remote structures to create opportunities for many equity-deserving groups; however, remote work can also be challenging for some individuals, such as women and individuals with disabilities. As the discussions on equity, diversity, and inclusion increase in software engineering, it is important to explore the realities and perspectives of different equity-deserving groups to develop strategies that can support them post-pandemic. Objective. This study aims to investigate the effects of remote work on LGBTQIA+ software professionals. Method. Grounded theory methodology was applied based on information collected from two main sources: a survey questionnaire with a sample of 57 LGBTQIA+ software professionals and nine follow-up interviews with individuals from this sample. This sample included professionals of different genders, ethnicities, sexual orientations, and levels of experience. Consistent with grounded theory methodology, the process of data collection and analysis was conducted iteratively using three stages of coding: line-by-line, focused, and theoretical. Member checking was used to validate the findings obtained from interpreting the experiences commented on by LGBTQIA+ software professionals. Findings. Our findings demonstrate that (1) remote work benefits LGBTQIA+ people by increasing security and visibility; (2) remote work harms LGBTQIA+ software professionals through isolation and invisibility; (3) the benefits outweigh the drawbacks; (4) the drawbacks can be mitigated by supportive measures developed by software companies. Conclusion. 
This paper investigated how remote work can affect LGBTQIA+ software professionals and presented a set of recommendations on how software companies can address the benefits and limitations associated with this work model. In summary, we concluded that remote work is crucial in increasing diversity and inclusion in the software industry. Remote work is here to stay. There is no denying it, as some software professionals would rather quit their jobs than return to the office full-time. Therefore, software companies want to understand how the remote working model can be used successfully without causing major issues. The problem is that the effects of remote work are complex because they depend on individual and group characteristics that require careful evaluation. In this scenario, one thing has been extremely positive: remote work is helping to increase diversity in software engineering by fostering new opportunities and better work conditions for individuals from equity-deserving groups, for instance, LGBTQIA+ software professionals. The software industry is overly homogeneous: most of the professionals who work in this area are heterosexual men (a reflection of the university courses on computer science and software engineering), but diversity can only be good for an area that strongly depends on creativity and innovation. What better way to innovate than putting several individuals from different backgrounds and with various experiences to work together? Remote work plays an important role in improving equity, diversity, and inclusion in the software industry. In this paper, we discuss how remote work is affecting software professionals from the LGBTQIA+ community and provide a list of recommendations to support software companies in dealing with this work model.

DOI: 10.1109/ICSE-SEIS58686.2023.00011


Hackathons as Inclusive Spaces for Prototyping Software in Open Social Innovation with NGOs

作者: Gama, Kiev and Valença
关键词: non-governmental organization, hackathons, representation, inclusion, diversity

Abstract

Non-governmental Organizations (NGOs) usually have limited resources that prevent them from investing in software-based innovation. Sometimes hackathons are used as a resource to crowdsource software for NGOs, but often the resulting projects are not usable or not carried forward. These events are not seen as a good option in Software Engineering for social good (i.e., software focused on social change) since they are too short to allow an understanding of the social context of the target institution. Taking that limitation into account, after performing 6 months of ethnography to understand the social context of an NGO, by identifying user needs and eliciting requirements, we organized an inclusive hackathon to address two specific challenges identified in that organization. This paper presents an experience report in the context of an interdisciplinary project with researchers from the Psychology, Design, and Computer Science domains, where the goal is to propose and apply an Open Social Innovation process focused on innovative digital solutions in the context of an NGO from Brazil that supports socially vulnerable people living with HIV/AIDS. Non-governmental Organizations (NGOs) often lack the resources to invest in software innovation. Hackathons have been used to obtain software for NGOs, but the resulting projects are often not useful or not continued. These events are not seen as the best way to make software for social change, since they are too short to understand the context of an NGO. To tackle this limitation, researchers in psychology, design, and computer science worked together for 6 months to understand the needs of an NGO that supports people living with HIV/AIDS. After that, two specific problems were prioritized and a hackathon was conducted to gather solutions proposed for those problems. The goal of the project was to find digital solutions for the NGO through an Open Social Innovation process.

DOI: 10.1109/ICSE-SEIS58686.2023.00012


Along the Margins: Marginalized Communities’ Ethical Concerns about Social Platforms

作者: Olson, Lauren and Guzmán
关键词: feedback, reddit, software, ethics, communities, marginalized

Abstract

In this paper, we identified marginalized communities’ ethical concerns about social platforms. We performed this identification because recent platform malfeasance indicates that software teams prioritize shareholder concerns over user concerns. Additionally, these platform shortcomings often have devastating effects on marginalized populations. We first scraped 586 marginalized communities’ subreddits, aggregated a dataset of their social platform mentions, and manually annotated mentions of ethical concerns in these data. We subsequently analyzed trends in the manually annotated data and tested the extent to which ethical concerns can be automatically classified by means of natural language processing (NLP). We found that marginalized communities’ ethical concerns predominantly revolve around discrimination and misrepresentation, and reveal deficiencies in current software development practices. As such, researchers and developers could use our work to further investigate these concerns and rectify current software flaws. In this paper, we identified marginalized communities’ ethical concerns about social platforms. We did this because recent platform wrongdoing indicates that software teams prioritize profit over user concerns. Additionally, these platform shortcomings often have devastating effects on marginalized populations. To accomplish this, we collected Reddit posts from marginalized communities’ subreddits where users mention social media platforms. Then, we labeled whether posts contained mentions of ethical concerns, like privacy or misinformation. Finally, we established trends within the resulting data and used artificial intelligence (AI) to find these ethical concerns automatically. We discovered that marginalized communities’ ethical concerns revolve around discrimination and misrepresentation, among other problems, and reveal deficiencies in current social platforms. 
As such, researchers and software engineers could use our work to further investigate these concerns and rectify present software flaws.
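As a minimal illustration of the kind of labeling step this study describes, a keyword-based first pass over posts might look like the sketch below. The categories echo those named in the abstract, but the keyword lists and function are illustrative assumptions, not the authors’ actual codebook or NLP classifiers.

```python
# Hypothetical keyword-based tagging of ethical concerns in posts.
# Categories follow the abstract (discrimination, misrepresentation,
# privacy, misinformation); keyword lists are assumptions for illustration.
CONCERN_KEYWORDS = {
    "discrimination": ["discriminat", "banned for being"],
    "misrepresentation": ["stereotype", "misrepresent"],
    "privacy": ["privacy", "tracking"],
    "misinformation": ["misinformation", "fake news"],
}

def tag_concerns(post: str) -> set:
    """Return the set of concern categories whose keywords appear in the post."""
    text = post.lower()
    return {cat for cat, kws in CONCERN_KEYWORDS.items()
            if any(kw in text for kw in kws)}
```

In practice, such a pass would only bootstrap manual annotation; the paper’s automatic classification relies on trained NLP models rather than fixed keywords.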

DOI: 10.1109/ICSE-SEIS58686.2023.00013


Do Users Act Equitably? Understanding User Bias Through a Large In-Person Study

作者: Liu, Yang and Moses, Heather and Sternefeld, Mark and Malachowsky, Samuel and Krutz, Daniel E.
关键词: computing accessibility, computing education, accessibility education

Abstract

Inequitable software is a common problem. Bias may be caused by developers, or even software users. As a society, it is crucial that we understand and identify the causes and implications of software bias from both users and the software itself. To address the problems of inequitable software, it is essential that we inform and motivate the next generation of software developers regarding bias and its adverse impacts. However, research shows that there is a lack of easily adoptable ethics-focused educational material to support this effort. To address the problem of inequitable software, we created an easily adoptable, self-contained experiential activity that is designed to foster student interest in software ethics, with a specific emphasis on AI/ML bias. This activity involves participants selecting fictitious teammates based solely on their appearance. The participant then experiences bias either against themselves or a teammate by the activity’s fictitious AI. The created lab was then utilized in this study involving 173 real-world users (ages 18-51+) to better understand user bias. The primary findings of our study include: I) Participants from minority ethnic groups have stronger feelings regarding being impacted by inequitable software/AI, II) Participants with higher interest in AI/ML hold a stronger belief in the priority of unbiased software, III) Users do not act in an equitable manner, as avatars with ‘dark’ skin color are less likely to be selected, and IV) Participants from different demographic groups exhibit similar behavioral bias. The created experiential lab activity may be executed using only a browser and internet connection, and is publicly available on our project website: https://all.rit.edu. Inequitable software is a significant problem in today’s society. Unfortunately, there is a lack of easily adoptable experiential educational materials that educators and practitioners can use to demonstrate and understand the impacts of inequitable software. 
To address this challenge, we developed a hosted educational experiential activity to provide a mechanism for instructors, students, and practitioners to experience the adverse impacts of biased software firsthand. We then utilized this activity as the basis for a large in-person study involving 173 real-world users to better understand user bias. Our hosted, experiential activity demonstrates the adverse impacts of inequitable software and bias. This is accomplished by inflicting bias against the participant and fictitious teammates in a simple tic-tac-toe game. The activity may be adopted using only a web browser and internet connection, and is available on the project website: https://all.rit.edu. To better understand user bias, the created experiential activity was utilized in a large in-person study. The primary findings of this study are that: participants from minority ethnicity demographics have stronger feelings towards inequitable software/AI; participants with higher interest in machine learning and AI have higher beliefs in the prioritization of unbiased software; and users do not act in an equitable manner, choosing teammates of color at a disproportionately low rate. This work benefits educators by providing an easily adoptable experiential educational activity that they may use to demonstrate the adverse impacts of inequitable software in the classroom. The activity may also be used to foster discussions pertaining to this foundational and essential topic. Researchers will benefit from this work through an increased understanding of inequitable decisions made by users.

DOI: 10.1109/ICSE-SEIS58686.2023.00014


Developing Software for Low Socio-Economic End Users: Lessons Learned from A Case Study of Fisherfolk Communities in Bangladesh

作者: Kanij, Tanjila and Anwar, Misita and Oliver, Gillian and Hossain, Md. Khalid
关键词: co-creation, tacit knowledge, fisherfolk

Abstract

As part of a large Information and Communication Technology for Development (ICT4D) programme, we conducted a number of research projects to empower fisherfolk in Bangladesh. Due to their low socio-economic status, low levels of digital as well as general literacy, and many other similar factors, fisherfolk are very diverse as end users of software. It was important to understand their characteristics before designing any software for them. We started with exploratory research, reviewing the literature and hearing from experts who work closely with fisherfolk. Based on the identification of some of the challenges, we designed a prototype software for tacit knowledge transfer among captains of boats. We conducted a number of focus groups with the captains of the boats and adopted a co-creation process where the functionality and usability of the prototype software were decided by the end users themselves. From our experience of working with this diverse group of end users, we propose specific recommendations for future software development for end users with a low socio-economic background. As part of a large Information and Communication Technology for Development (ICT4D) programme, we conducted exploratory research to develop software for fisherfolk in Bangladesh. Due to their low socio-economic status, low levels of digital as well as general literacy, and many other similar factors, it was important to understand the unique characteristics of the end users before designing any software for them. We started with reviewing the literature and hearing from experts who work closely with fisherfolk. We identified that senior captains of boats possess important knowledge that is not systematically being transferred to apprentice captains. We conducted a number of focus groups with the captains of the boats and adopted a co-creation process to design prototype software for record-keeping and sharing of tacit knowledge among captains. 
We report our experience and propose specific recommendations for future software development for end users with a low socio-economic background.

DOI: 10.1109/ICSE-SEIS58686.2023.00015


Walking Down the Road to Independent Mobility: An Adaptive Route Training System for the Cognitively Impaired

作者: Rink, Konstantin and Gruschka, Tristan and Palsbrö
关键词: inclusive design, cognitive impairments, route training

Abstract

In this paper we describe the design and development of a route training system for individuals with cognitive impairments (CIs) living in residential care facilities. Learning to move autonomously in public spaces is a fundamental skill for people with CI, who face several challenges to independently and safely move around. Yet, exploring opportunities for route training support, especially in residential settings, has received very little attention. To explore these opportunities, we followed a design and development process based on inclusive design practices that considered the organisational context and aimed at involving people with CI in the software design. To ensure our solution addressed the identified needs and abilities of this heterogeneous population, we further framed the route training definition as a design process that is enacted by the system, making the trainer and user co-creators of a personalised training. In this paper we report on the needs and challenges for mobility training in residential settings and introduce the design and formative evaluation of the route training system, concluding with reflections and considerations on our methodological approach. Learning to navigate public spaces without assistance is important for people with cognitive impairments (CIs). It can help them overcome challenges to independently and safely reach places in their daily lives. Yet, the use of technology for route learning has not been fully explored in research, especially for people with CIs living in residential care. In this article, we describe the process of developing a route training system that explores the use of technology support. In our research, we tried different ways to involve people with CIs in the design of the system, which is seen as a more inclusive approach to designing solutions. We identified that people with CIs have different needs and abilities when it comes to route learning and related skills. 
For this reason, our solution focuses on ways to personalise the training, making sure people with CIs are involved in the personalisation, so that it fits their needs, abilities and learning progress.

DOI: 10.1109/ICSE-SEIS58686.2023.00016


Diversity Awareness in Software Engineering Participant Research

作者: Dutta, Riya and Costa, Diego Elias and Shihab, Emad and Tajmel, Tanja
关键词: ICSE, diversity awareness, participant studies, content analysis, diversity

Abstract

Diversity and inclusion are necessary prerequisites for shaping technological innovation that benefits society as a whole. A common indicator of diversity consideration is the representation of different social groups among software engineering (SE) researchers, developers, and students. However, this does not necessarily entail that diversity is considered in the SE research itself. In our study, we examine how diversity is embedded in SE research, particularly research that involves participant studies. To this end, we have selected 79 research papers containing 105 participant studies spanning three years of ICSE technical tracks. Using a content analytical approach, we identified how SE researchers report the various diversity categories of their study participants and investigated: 1) the extent to which participants are described, 2) what diversity categories are commonly reported, and 3) the function diversity serves in the SE studies. We identified 12 different diversity categories reported in SE participant studies. Our results demonstrate that even though most SE studies report on the diversity of participants, SE research often emphasizes professional diversity data, such as occupation and work experience, over social diversity data, such as gender or location of the participants. Furthermore, our results show that participant diversity is seldom analyzed or reflected upon when SE researchers discuss their study results, outcomes, or limitations. To help researchers self-assess their study diversity awareness, we propose a diversity awareness model and guidelines that SE researchers can apply to their research. With this study, we hope to shed light on a new approach to tackling the diversity and inclusion crisis in the SE field. Incorporating diversity considerations in research, development, and innovation has become an increasingly important topic. 
It is a well-known fact that diverse teams produce better outcomes, whereas a lack of diversity might result in biased and discriminatory technologies. Therefore, the inclusion of diverse stakeholders is considered paramount for the creation of an ethical and socially responsible future. With this study, we aim to contribute to the conversation on how EDI (equity, diversity, inclusion) can be implemented in Software Engineering (SE) research. In our study, we focus on SE research that includes research participants, since this is an evident opportunity to consider diversity, and we investigate to what extent and with what purpose SE researchers consider and report diversity in their research papers. Our results demonstrate that only a few studies do not consider diversity at all; however, the examined studies differ greatly in the range of their consideration and reporting of diversity. From these outcomes, we conclude that diversity awareness differs among SE researchers. Finally, we propose a model of diversity awareness for participant studies as a tool to support SE researchers in reflecting on diversity and incorporating it systematically in their research.

DOI: 10.1109/ICSE-SEIS58686.2023.00017


Harmful Terms in Computing: Towards Widespread Detection and Correction

作者: Winchester, Hana and Boyd, Alicia E. and Johnson, Brittany
关键词: harmful terminology, inclusive speech, software engineering, bias

Abstract

Modern-day software development and use is a product of decades of advancement and evolution. Over time, as new technologies and concepts emerged, so did new terminology to describe and discuss them. Most terminology used in computing is harmless; however, some terms are rooted in historically discriminatory, and potentially harmful, language. While the landscape of individuals who develop technology has diversified over the years, this terminology has become a normalized part of modern software development and computing jargon. Despite organizations such as the ACM raising awareness of the potential harm certain terms can do and companies like GitHub working to change the systemic use of harmful terms in computing, it is still not clear what the landscape of harmful terminology in computing really is and how we can support the widespread detection and correction of harmful terminology in computing artifacts. To this end, we conducted a review of existing work and efforts at curating, detecting, and removing harmful terminology in computing. Combining and building on these prior efforts, we produce an extensible database of what we define as harmful terminology in computing and describe an open source proof-of-concept tool for detecting and replacing harmful computing-related terminology.
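A detect-and-replace tool of the kind the paper proposes could, in its simplest form, look like the sketch below. The term-to-replacement entries are well-known community examples (e.g., GitHub’s master-to-main rename and the allowlist/blocklist alternatives), not the paper’s curated database, and the code is an illustrative assumption rather than the authors’ tool.

```python
import re

# Illustrative entries only; a real, extensible database (as the paper
# proposes) would be larger and curated. Maps harmful term -> suggestion.
REPLACEMENTS = {
    "whitelist": "allowlist",
    "blacklist": "blocklist",
    "master": "main",
    "slave": "replica",
}

# One compiled pattern matching any listed term as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(REPLACEMENTS) + r")\b", re.IGNORECASE)

def flag_terms(text: str) -> list:
    """List flagged terms found in the text, lowercased, in order of appearance."""
    return [m.group(1).lower() for m in _PATTERN.finditer(text)]

def suggest_rewrite(text: str) -> str:
    """Replace each flagged term with its suggested alternative."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1).lower()], text)
```

The word-boundary anchors keep the pattern from flagging substrings inside unrelated words (e.g., "remastered"), a design choice any real detector would need to make explicitly.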

DOI: 10.1109/ICSE-SEIS58686.2023.00018


Metamorphic Testing and Debugging of Tax Preparation Software

作者: Tizpaz-Niari, Saeid and Monjezi, Verya and Wagner, Morgan and Darian, Shiva and Reed, Krystia and Trivedi, Ashutosh
关键词: No keywords

Abstract

This paper presents a data-driven debugging framework to improve the trustworthiness of US tax preparation software systems. Given the legal implications of bugs in such software on its users, ensuring the compliance and trustworthiness of tax preparation software is of paramount importance. The key barriers in developing debugging aids for tax preparation systems are the unavailability of explicit specifications and the difficulty of obtaining oracles. We posit that, since US tax law adheres to the legal doctrine of precedent, the specifications about the outcome of tax preparation software for an individual taxpayer must be viewed in comparison with individuals that are deemed similar. Consequently, these specifications are naturally available as properties requiring that the software provide similar outputs for similar inputs. Inspired by the metamorphic testing paradigm, we dub these relations metamorphic relations, as they relate to structurally modified inputs. In collaboration with legal and tax experts, we explicated metamorphic relations for a set of challenging properties from various US Internal Revenue Service (IRS) publications, including Form 1040 (U.S. Individual Income Tax Return), Publication 596 (Earned Income Tax Credit), Schedule 8812 (Qualifying Children and Other Dependents), and Form 8863 (Education Credits). While we focus on an open-source tax preparation software for our case study, the proposed framework can be readily extended to other commercial software. We develop a randomized test-case generation strategy to systematically validate the correctness of tax preparation software guided by metamorphic relations. We further aid this test-case generation by visually explaining the behavior of the software on suspicious instances using easy-to-interpret decision-tree models. 
Our tool uncovered several accountability bugs of varying severity, ranging from non-robust behavior in corner cases (unreliable behavior when tax returns are close to zero) to missing eligibility conditions in updated versions of the software.
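As a rough illustration of the metamorphic approach, the sketch below checks one plausible relation, that adding a qualifying child (all else equal) should never decrease a credit, against a toy credit function fed with randomized inputs. Both the relation and the numbers are assumptions for illustration, not real IRS rules or the authors’ tool.

```python
import random

# A toy stand-in for the system under test; in the paper, this would be an
# actual tax preparation program. The formula below is an illustrative
# assumption, not a real tax rule.
def toy_credit(income: float, num_children: int) -> float:
    base = max(0.0, 6000.0 - 0.1 * income)
    return base + 1000.0 * min(num_children, 3)

def check_relation(compute, trials: int = 100, seed: int = 0) -> list:
    """Metamorphic relation (assumed for illustration): adding one qualifying
    child, all else equal, should never decrease the computed credit.
    Randomly generates taxpayer inputs and returns any violating inputs."""
    rng = random.Random(seed)
    violations = []
    for _ in range(trials):
        income = rng.uniform(0.0, 50000.0)
        kids = rng.randint(0, 2)
        if compute(income, kids + 1) < compute(income, kids):
            violations.append((income, kids))
    return violations
```

A seeded defect, such as a version that wrongly zeroes the credit once a taxpayer has three or more children, would surface as a non-empty violation list without needing any explicit output oracle.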

DOI: 10.1109/ICSE-SEIS58686.2023.00019


Treat Societally Impactful Scientific Insights as Open-Source Software Artifacts

作者: Liem, Cynthia C. S. and Demetriou, Andrew M.
关键词: responsible research practice, transdisciplinary research, open source, software engineering, open science

Abstract

So far, the relationship between open science and software engineering expertise has largely focused on the open release of software engineering research insights and reproducible artifacts, in the form of open-access papers, open data, and open-source tools and libraries. In this position paper, we draw attention to another perspective: scientific insight itself is a complex and collaborative artifact under continuous development and in need of continuous quality assurance, and as such, has many parallels to software artifacts. Considering current calls for more open, collaborative and reproducible science; increasing demands for public accountability on matters of scientific integrity and credibility; methodological challenges coming with transdisciplinary science; political and communication tensions when scientific insight on societally relevant topics is to be translated to policy; and struggles to incentivize and reward academics who truly want to move into these directions beyond traditional publishing habits and cultures, we make the parallels between the emerging open science requirements and concepts already well-known in (open-source) software engineering research more explicit. We argue that the societal impact of software engineering expertise can reach far beyond the software engineering research community, and call upon the community members to pro-actively help driving the necessary systems and cultural changes towards more open and accountable research.

DOI: 10.1109/ICSE-SEIS58686.2023.00020


Contradicting Motivations in Civic Tech Software Development: Analysis of a Grassroots Project

作者: Knutas, Antti and Siemon, Dominik and Tylosky, Natasha and Maccani, Giovanni
关键词: contradictions, case study, activity theory, motivations, software development, software engineering, civic tech

Abstract

Grassroots civic tech, or software for social change, is an emerging practice where people create and then use software to create positive change in their community. In this interpretive case study, we apply Engeström’s activity theory to analyse the contradicting motivations in a grassroots civic tech project.

DOI: 10.1109/ICSE-SEIS58686.2023.00021


Software Engineering for Smart Things in Public Spaces: Initial Insights and Challenges

作者: Batool, Amna and Loke, Seng W. and Fernando, Niroshinie and Kua, Jonathan
关键词: human-device interaction, IoT, supermarket, socio-ethical policy, smart things, smart devices

Abstract

Software engineering for mobile applications has its own challenges, different from when we engineer software just for desktop environments. With the emergence of smart things (including smart everyday objects embedded with connectivity, computational ability, sensors, and sometimes actuators; urban robots such as delivery and cleaning robots; smart street lighting; smart vehicles; smart park benches; and so on) not just within the home but in public spaces, there is a need to consider software engineering challenges for software on such things. Human-centred software engineering and work on ethical behaviours in smart things will need to come together, even as we continue to understand what it takes to effectively develop software (and systems) for such emerging devices. In order to demonstrate how software (and systems) for intelligent devices in public places might be developed, findings from a quantitative survey we performed are discussed in this study. The survey was designed such that the questions focused on the socio-ethical behaviours of smart devices when interacting with people in public places. The survey was based on a supermarket scenario about which the participants had to answer the different questions in the questionnaire. A total of 250 participants took the survey, of whom 60 finished it in full; the complete replies have been examined and analysed in this paper. To determine how people feel about employing smart technology in public places, a variety of smart devices, including robots, smart cameras, smart speakers, and smart trolleys, are used in the survey questions. According to the findings, more than 80 percent of respondents consider it important for smart devices to be socially aware and ethical in public places. This paper examines the results of a survey conducted to explore whether smart devices such as robots or smart cameras can be deployed in public areas. 
Respondents answered survey questions asking whether they believe smart robots, smart carts, or other smart devices should be present in the supermarket. The survey’s questions are constructed in such a manner that participants are asked to imagine themselves as either a customer shopping for groceries at a store or a manager running the business and dealing with a friendly robot. This survey was created with the intention of thinking carefully about how intelligent software systems may be designed, from the standpoint of software engineering, for public settings. Later in this article, the survey findings and insights are discussed.

DOI: 10.1109/ICSE-SEIS58686.2023.00022


A Novel Approach to Improving the Digital Literacy of Older Adults

作者: Vaswani, Mehr and Balasubramaniam, Dharini and Boyd, Kenneth
关键词: andragogy, digital divide, continuing education, digital literacy

Abstract

Digitalisation offers opportunities for older adults (OA) to retain an active role in their lives and alleviate the ‘burden of care’ associated with ageing. Yet, digital engagement is consistently cited to be inversely related to age. Although interventions to enhance the digital literacy of OA through formal in-person training have emerged, there has been little evaluation of their effectiveness. This paper presents some insights into the attitudes, needs and challenges of OA in becoming digitally literate. We conducted preliminary user studies with OA and a survey of younger adults (YA) to understand their role as an informal support system. Based on these insights, we propose an innovative approach to digital literacy training for OA by combining a senior-friendly learning management system with informal inter-generational learning support. A preliminary evaluation of the system yielded positive feedback and indicates the need for a more extensive exploration of the digital experiences and requirements of OA and the influence of social support systems in their digital engagement process.

DOI: 10.1109/ICSE-SEIS58686.2023.00023


Values@Runtime: An Adaptive Framework for Operationalising Values

作者: Bennaceur, Amel and Hassett, Diane and Nuseibeh, Bashar and Zisman, Andrea
关键词: reflection, recommendations, adaptation, operationalisation, values@Runtime

Abstract

We present an adaptive framework to assist users in making more value-sensitive decisions during their (runtime) use of software. The framework enables users to (i) represent, instantiate, and monitor their values and behaviour; (ii) understand mismatches between stated values and their observed behaviour; and (iii) recommend ways to align users’ values and behaviour. We built a values shopping basket tool to illustrate and demonstrate the adaptive framework in the food consumption domain, a sector that is rich in values and regularly undergoes reflection and debate. Society is pondering the values it cherishes, and users increasingly find themselves reflecting on which values are important to them. With software playing a crucial role in society and having a significant impact on how we live, the way we engineer and use software must take into account those values. In this paper we present a framework to support users to articulate, measure, and reflect on their values as they interact with software systems. The rationale is that users gain better understanding of their values as they experience, reflect and learn about them, when making decisions mediated by software. We demonstrate our framework through a values shopping basket prototype that enables users to specify, reflect, and make value-sensitive decisions during food purchase.
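The mismatch step (ii) described above can be sketched as a comparison between a user’s stated value priorities and the support those values receive in observed behaviour. The value names, the [0, 1] scoring, and the tolerance threshold below are all assumptions for illustration, not the framework’s actual model.

```python
# Hypothetical sketch of mismatch detection between stated values and
# behaviour-derived scores. Both dicts map value name -> score in [0, 1];
# how observed scores are derived from behaviour is out of scope here.
def value_mismatches(stated: dict, observed: dict, tol: float = 0.2) -> dict:
    """Return each value whose observed support falls short of the stated
    priority by more than `tol`, mapped to the size of the gap."""
    return {
        v: round(stated[v] - observed.get(v, 0.0), 2)
        for v in stated
        if stated[v] - observed.get(v, 0.0) > tol
    }
```

In a values shopping basket, for instance, a user who rates sustainability highly but whose purchases rarely support it would see that gap surfaced, feeding the recommendation step (iii).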

DOI: 10.1109/ICSE-SEIS58686.2023.00024


Gender Representation Among Contributors to Open-Source Infrastructure: An Analysis of 20 Package Manager Ecosystems

作者: Qiu, Huilian Sophie and Zhao, Zihe H and Yu, Tielin Katy and Wang, Justin and Ma, Alexander and Fang, Hongbo and Dabbish, Laura and Vasilescu, Bogdan
关键词: gender diversity, open-source software

Abstract

While the severe underrepresentation of women and non-binary people in open source is widely recognized, there is little empirical data on how the situation has changed over time and which subcommunities have been more effective at reducing the gender imbalance. To obtain a clearer picture of gender representation in open source, we compiled and synthesized existing empirical data from the literature, and computed historical trends in the representation of women across 20 open-source ecosystems. While inherently limited by the ability of automatic name-based gender inference to capture true gender identities at an individual level, our census still provides valuable population-level insights. Across all and in most ecosystems, we observed a promising upward trend in the percentage of women among code contributors over time, but also high variation in the percentage of women contributors across ecosystems. We also found that, in most ecosystems, women withdraw earlier from open-source participation than men.

The representation of women and non-binary people has been extremely low in the open-source software community: most of the statistics reported by prior studies are below 10%. However, the majority of prior work was based on subsamples instead of the entire population. Our work started with a review of the gender distributions reported in the literature. We then provided an overview of the gender distribution in 20 of the largest open-source ecosystems, grouped by package manager (e.g., npm and PyPI), and investigated its change over time. Moreover, we analyzed the turnover rate between men and women contributors.

DOI: 10.1109/ICSE-SEIS58686.2023.00025


Workplace Discrimination in Software Engineering: Where We Stand Today

作者: Zhao, Xin and Young, Riley
关键词: survey, software professional, workplace discrimination

Abstract

Context: Discrimination within the workplace negatively impacts employees across the board and has been studied in various fields, such as wage-earning workplaces, healthcare, and social media. However, considerable work is still needed to gain a deeper understanding of workplace discrimination in software engineering. Objective: The research objective is to gain deeper insights into the causes, forms, and effects of workplace discrimination toward software professionals, thus providing insights to reduce workplace discrimination. Method: We applied an empirical investigation to reach our goal, collecting 97 complete responses to our online survey, which included a set of open-ended, close-ended, and scale questions. Results: We found that most discriminatory actions happened more than once, and most were carried out by colleagues in daily work, causing negative mental and physical effects. Conclusion: This paper provides an understanding of the causes, forms, and effects of workplace discrimination and discusses concrete suggestions to help reduce workplace discrimination in the software engineering field.

Workplace discrimination has detrimental effects: it frustrates employees, diminishes morale, aggravates structural inequities in the labor market, and eventually hurts economic growth and dynamism. Although anti-discrimination laws have been legislated, calling for fairer and more equitable working environments, workplace discrimination remains a pervasive issue in today’s society. Although research communities have placed a strong emphasis on this issue, there are still considerable gaps in workplace discrimination research in software engineering. To fill this gap, we surveyed employees in software engineering with diverse demographic backgrounds about the causes, forms, and effects of their experiences with workplace discrimination. Our study reveals that age, gender, and race are contributing factors to experiencing or witnessing workplace discrimination in software engineering. We also found that many participants chose to “let it go” instead of engaging in effective communication with colleagues or Human Resources. Finally, respondents provided feedback on how companies can decrease discrimination and create a more positive work environment, describing several concrete actions employers can take to reduce discrimination and create a healthier, more equitable, and thriving workplace for every software professional.

DOI: 10.1109/ICSE-SEIS58686.2023.00026


Cognitive Reflection in Software Verification and Testing

作者: Buffardi, Kevin
关键词: software development, cognitive bias, accuracy, cognitive reflection test, reflection, cognition, verification, unit testing

Abstract

Verifying whether code meets its specifications requires critical thinking and analysis. However, cognitive biases may influence how well software engineers verify and test code. In this paper, I explore cognitive reflection and its association with accurately verifying code via both manual inspection and unit testing. In a two-phase exploratory study of Software Engineering undergraduate students (n=140), I examined their performance on Cognitive Reflection Tests (CRT), manual verification of function implementations, and the accuracy of unit test suites. The first phase found no relationship between CRT and unit test accuracy. However, the higher a student’s CRT score, the more likely they were to reject a defective implementation when inspecting it manually (p<0.0001, 95% CI 1.56-4.50h). The second phase replicated the outcomes on an alternate version of the CRT and on a different unit test assignment, revealing a positive correlation between CRT and recognizing defects manually (ρ=0.478, p<0.01). I conclude that cognitive reflection is associated with software engineering students’ aptitude for identifying defects, but is not associated with their affirmation of algorithms without defects, nor with the accuracy of their unit tests.

DOI: 10.1109/ICSE-SEET58685.2023.00006


Designing for Real People: Teaching Agility through User-Centric Service Design

作者: Chatley, Robert and Field, Tony and Wheelhouse, Mark and Runcie, Carolyn and Grinyer, Clive and de Leon, Nick
关键词: service design, agile development, education

Abstract

We present the design and evolution of a project-based course - Designing for Real People - that aims to teach agile software development through an unwavering focus on the user, rather than emphasising the processes and tools often associated with a method like Scrum. This module is the result of a fruitful collaboration between a Computer Science Department, bringing knowledge and skills in the software engineering aspects, and the Service Design group of a neighbouring Art College, with expertise in user research and user experience design. We present the details of the current structure, content and assessment strategies developed for the module, as well as the principles behind its design. The core theme of the course is gathering and responding to feedback, and so here we present how this has been applied to the design of the module itself, with lessons learned, and improvements made over time. By reflecting on our own work, we aim to provide recommendations that may aid others considering how to teach these topics.

DOI: 10.1109/ICSE-SEET58685.2023.00007


Open Design Case Study - A Crowdsourcing Effort to Curate Software Design Case Studies

作者: Chong, Chun Yong and Kang, Eunsuk and Shaw, Mary
关键词: case studies, software design, software engineering education

Abstract

Case study-based learning has been successfully integrated into various courses, including software engineering education. In software design courses, the use of case studies often entails sharing accounts of real successful or failed software development projects. Using real-world case studies allows educators to reinforce the applicability and usefulness of fundamental design concepts, relate the importance of evaluating design trade-offs with respect to stakeholders’ requirements, and highlight the importance of upfront design, which students who lack industrial experience tend to overlook. However, the use of real-world case studies is not straightforward because (1) there is a lack of open-source repositories for real software design case studies and (2) even when case studies are available, they are often reported without a standardized format, which may hinder the alignment between the case and the desired learning outcomes. To address the lack of software design case studies for educational purposes, we propose the idea of Open Design Case Study, a repository to crowdsource, curate, and recruit other educators to contribute case studies for teaching software design courses. The platform will also allow educators and students to share, brainstorm, and discuss design solutions based on case studies shared publicly on the repository.

DOI: 10.1109/ICSE-SEET58685.2023.00008


Do the Test Smells Assertion Roulette and Eager Test Impact Students’ Troubleshooting and Debugging Capabilities?

作者: Aljedaani, Wajdi and Mkaouer, Mohamed Wiem and Peruma, Anthony and Ludi, Stephanie
关键词: software testing, computer science education, software engineering education, unit testing, test smells

Abstract

To ensure the quality of a software system, developers perform an activity known as unit testing, where they write code (known as test cases) that verifies the individual software units that make up the system. Like production code, test cases are subject to bad programming practices, known as test smells, that hurt maintenance activities. An essential part of most maintenance activities is program comprehension, which involves developers reading the code to understand its behavior to fix issues or update features. In this study, we conduct a controlled experiment with 96 undergraduate computer science students to investigate the impact of two common types of test smells, namely Assertion Roulette and Eager Test, on a student’s ability to debug and troubleshoot test case failures. Our findings show that students take longer to correct errors in production code when smells are present in their associated test cases, especially Assertion Roulette. We envision our findings supporting academia in better equipping students with the knowledge and resources in writing and maintaining high-quality test cases. Our experimental materials are available online.
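To make the two smells concrete, here is a minimal, illustrative sketch using Python's `unittest`; the class under test and all names are invented, and the study's own materials are not reproduced here.

```python
import unittest

# A toy class under test; names are illustrative, not from the study.
class ShoppingCart:
    def __init__(self):
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return sum(price for _, price in self.items)

class SmellyTests(unittest.TestCase):
    def test_cart(self):
        cart = ShoppingCart()
        cart.add("apple", 2)
        cart.add("pear", 3)
        # Eager Test: one test method exercises add() and total() together,
        # so a failure does not point at a single behavior.
        # Assertion Roulette: several assertions with no explanatory message,
        # so a failing run does not say which expectation was violated.
        self.assertEqual(len(cart.items), 2)
        self.assertEqual(cart.total(), 5)
        self.assertEqual(cart.items[0][0], "apple")

class CleanerTests(unittest.TestCase):
    def test_total_sums_item_prices(self):
        cart = ShoppingCart()
        cart.add("apple", 2)
        cart.add("pear", 3)
        self.assertEqual(cart.total(), 5, "total() should sum item prices")
```

Both suites pass as written; the difference only surfaces when a test fails and the smell obscures which expectation broke and why.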

DOI: 10.1109/ICSE-SEET58685.2023.00009


Training for Security: Planning the Use of a SAT in the Development Pipeline of Web Apps

作者: Nocera, Sabato and Romano, Simone and Francese, Rita and Scanniello, Giuseppe
关键词: web app, static analysis tool, software security

Abstract

We designed a prospective empirical investigation to study our STW (Software Technologies for the Web) course with respect to the training of bachelor students in the context of software security when developing e-commerce Web apps. To that end, we devised the following steps: (i) studying the state of the students enrolled in the STW course in the a.y. (academic year) 2021–22; (ii) defining a training plan for the a.y. 2022–23; and (iii) enacting the plan and measuring the differences (if any) between the students of the a.y. 2021–22 and 2022–23. In this idea paper, we present the results of the first two steps, as well as the evaluation strategy for the proposed training plan. We observed that security concerns are widespread in the code of the Web apps developed by the students of the STW course (a.y. 2021–22). Therefore, we plan (second step) to ask the students of the STW course (a.y. 2022–23) to use a Static Analysis Tool (SAT) in their development pipeline to detect security concerns.

DOI: 10.1109/ICSE-SEET58685.2023.00010


Are You Cloud-Certified? Preparing Computing Undergraduates for Cloud Certification with Experiential Learning

作者: Ouh, Eng Lieh and Gan, Benjamin Kok Siew
关键词: undergraduate, experiential learning, cloud certification, cloud computing

Abstract

Cloud Computing skills have been increasing in demand. Many software engineers are learning these skills and taking cloud certification examinations to be job competitive. Preparing undergraduates to be cloud-certified remains challenging as cloud computing is a relatively new topic in the computing curriculum, and many of these certifications require working experience. In this paper, we report our experiences designing a course with experiential learning to prepare our computing undergraduates to take the cloud certification. We adopt a university project-based experiential learning framework to engage industry partners who provide project requirements for students to develop cloud solutions and an experiential risk learning model to design the course contents. We prepare these students to take on the Amazon Web Services Solution Architect - Associate (AWS-SAA) while doing the course. We do this over 3 semester terms and report our findings before and after our design with experiential learning. We are motivated by the students’ average 93% passing rates over the terms. Even when the certification is taken out of the graded components, we still see an encouraging 89% participation rate. The quantitative feedback shows increased ratings across the survey questions compared to before experiential learning. We acknowledge concerns about the students’ heavy workload and increased administrative efforts for the faculty members. We summarise our approach with actionable weekly topics, activities and takeaways. We hope this experience report can help other educators design cloud computing content and certifications for computing students in software engineering.

DOI: 10.1109/ICSE-SEET58685.2023.00011


Improving Assessment of Programming Pattern Knowledge through Code Editing and Revision

作者: Nurollahian, Sara and Rafferty, Anna N. and Wiese, Eliane
关键词: code structure, code quality, code readability, code refactoring, programming patterns and anti-patterns, code editing, code revising, code writing

Abstract

How well do code-writing tasks measure students’ knowledge of programming patterns and anti-patterns? How can we assess this knowledge more accurately? To explore these questions, we surveyed 328 intermediate CS students and measured their performance on different types of tasks, including writing code, editing someone else’s code, and, if applicable, revising their own alternatively-structured code. Our tasks targeted returning a Boolean expression and using unique code within an if and else. We found that code writing sometimes under-estimated student knowledge. For tasks targeting returning a Boolean expression, over 55% of students who initially wrote with non-expert structure successfully revised to expert structure when prompted - even though the prompt did not include guidance on how to improve their code. Further, over 25% of students who initially wrote non-expert code could properly edit someone else’s non-expert code to expert structure. These results show that non-expert code is not a reliable indicator of deep misconceptions about the structure of expert code. Finally, although code writing is correlated with code editing, the relationship is weak: a model with code writing as the sole predictor of code editing explains less than 15% of the variance. Model accuracy improves when we include additional predictors that reflect other facets of knowledge, namely the identification of expert code and selection of expert code as more readable than non-expert code. Together, these results indicate that a combination of code writing, revising, editing, and identification tasks can provide a more accurate assessment of student knowledge of programming patterns than code writing alone.
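The two targeted patterns can be sketched with small examples (shown in Python with invented function names; the study's actual tasks are not reproduced here): returning a Boolean expression directly instead of via if/else, and keeping only branch-specific code inside an if and else.

```python
# Illustrative sketches of the two structures the tasks targeted; the
# function names and the shipping example are invented.

# Non-expert structure: an if/else that returns Boolean literals.
def is_positive_verbose(x):
    if x > 0:
        return True
    else:
        return False

# Expert structure: return the Boolean expression directly.
def is_positive(x):
    return x > 0

# Non-expert structure: the same code duplicated in both branches.
def shipping_cost_verbose(weight, express):
    if express:
        base = weight * 2.0   # duplicated computation
        return base + 10.0
    else:
        base = weight * 2.0   # duplicated computation
        return base

# Expert structure: shared code hoisted out, only unique code in branches.
def shipping_cost(weight, express):
    base = weight * 2.0
    if express:
        return base + 10.0
    return base
```

Both versions of each pair behave identically; the expert forms are simply shorter and keep a single point of change for the shared logic.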

DOI: 10.1109/ICSE-SEET58685.2023.00012


Speak, Memory! Analyzing Historical Accidents to Sensitize Software Testing Novices

作者: Silvis-Cividjian, Natalia and Hager, Fritz
关键词: Therac-25, history of computing, witness accounts, STAMP, safety science, accident investigations, soft skills, assignments, software testing education

Abstract

Accidents tend to be traumatic events that one would rather forget than remember. Software testing novices at the Vrije Universiteit Amsterdam, on the contrary, rewind the past and learn how to safeguard the future. In this paper we present FAIL, a rather unconventional assignment in which students methodically investigate 13 historical software-related accidents, varying from the Ariane-5 rocket explosion to the Knight Capital trading glitch. A novel aspect is that software testing students use STAMP, a modern systems-theory-based accident causality model, and have the opportunity to interview a witness of the famous Therac-25 radiation overexposures. A recent deployment to 96 CS graduates received positive evaluations. We learned that even a lightweight, yet systematic, investigation of failures (1) motivates students by sensitizing them to the consequences of suboptimal testing, and (2) reveals key soft skills testers need to prevent disasters, such as defensive pessimism and a strong backbone. Other, more subtle benefits of the proposed approach include (3) real, rather than artificial, case studies, which increase a teacher’s credibility, and (4) extraordinary test scenarios students will always remember. These results invite software engineering educators to include safety assessment elements in their curricula, and call on witnesses of software-related accidents to break the silence and share memories. Future work includes crafting a repository of heritage artifacts (narratives, videos, witness testimonies and physical replicas) to reproduce historical software-related accidents and making it available to interested educators. Our hope is that motivated professionals will emerge, better prepared to engineer the safe software-intensive systems we all rely on.

DOI: 10.1109/ICSE-SEET58685.2023.00013


Software Startup within a University - Producing Industry-Ready Graduates

作者: Tenhunen, Saara and M"{a
关键词: internal startup, software engineering education

Abstract

Previous research has demonstrated that preparing students for life in software engineering is not a trivial task. Authentic learning experiences are challenging to provide, and there are gaps between what students have done at the university and what they are expected to master when getting into the industry after graduation. To address this challenge, we present a novel way of teaching industry-relevant skills in a university-led internal software startup called Software Development Academy (SDA). In addition to describing the SDA concept in detail, we have investigated what educational aspects characterise SDA and how it compares to capstone projects. The questions are answered based on 15 semi-structured interviews with alumni of SDA. Working with production-quality software and having a wide range of responsibilities were perceived as the most integral aspects of SDA and provided students with a comprehensive skill set for the future.

DOI: 10.1109/ICSE-SEET58685.2023.00014


Teaching MLOps in Higher Education through Project-Based Learning

作者: Lanubile, Filippo and Martínez-Fernández, Silverio
关键词: reproducibility, model deployment, software engineering for AI, data science, machine learning

Abstract

Building and maintaining production-grade ML-enabled components is a complex endeavor that goes beyond the current approach of academic education, which focuses on optimizing ML model performance in the lab. In this paper, we present a project-based learning approach to teaching MLOps, focused on demonstrating and gaining experience with emerging practices and tools to automate the construction of ML-enabled components. We examine the design of a course based on this approach, including laboratory sessions that cover the end-to-end ML component life cycle, from model building to production deployment. Moreover, we report on preliminary results from the first edition of the course. During the present year, an updated version of the same course is being delivered in two independent universities; the related learning outcomes will be evaluated to analyze the effectiveness of project-based learning for this specific subject.

DOI: 10.1109/ICSE-SEET58685.2023.00015


Software Resurrection: Discovering Programming Pearls by Showing Modernity to Historical Software

作者: Dutta, Abhishek
关键词: No keywords

Abstract

Reading computer program code and documentation written by others is, we are told, one of the best ways to learn the art of writing readable, intelligible and maintainable code and documentation. The software resurrection exercise, introduced in this paper, requires a motivated learner to compile and test a historical release (e.g., one that is 20 years old) of a well-maintained and widely adopted open-source software package on a modern hardware and software platform. The learner develops fixes for the issues encountered during compilation and testing of the software on a platform that could not have been foreseen at the time of its release. The exercise concludes with writing a critique, which provides an opportunity to reflect critically on the experience of maintaining the historical software. An illustrative example of the software resurrection exercise, pursued on a version of the SQLite database engine released 20 years ago, shows that software engineering principles (or programming pearls) emerge during the reflective learning cycle of the exercise. The concept of software resurrection has the potential to lay foundations for a lifelong willingness to explore and learn from existing code and documentation.

DOI: 10.1109/ICSE-SEET58685.2023.00016


Teaching Computer Science Students to Communicate Scientific Findings More Effectively

作者: Wyrich, Marvin and Wagner, Stefan
关键词: soft skills, education, training, presentation, science communication

Abstract

Science communication forms the bridge between computer science researchers and their target audience. Researchers who can effectively draw attention to their research findings and communicate them comprehensibly not only help their target audience to actually learn something, but also benefit themselves from the increased visibility of their work and person. However, the necessary skills for good science communication must also be taught, and this has so far been neglected in the field of software engineering education. We therefore designed and implemented a science communication seminar for bachelor students of computer science curricula. Students take the position of a researcher who, shortly after publication, is faced with having to draw attention to the paper and effectively communicate the contents of the paper to one or more target audiences. Based on this scenario, each student develops a communication strategy for an already published software engineering research paper and tests the resulting ideas with the other seminar participants. We explain our design decisions for the seminar, and combine our experiences with responses to a participant survey into lessons learned. With this experience report, we intend to motivate and enable other lecturers to offer a similar seminar at their university. Collectively, university lecturers can prepare the next generation of computer science researchers to not only be experts in their field, but also to communicate research findings more effectively.

DOI: 10.1109/ICSE-SEET58685.2023.00017


The ABC of Pair Programming: Gender-Dependent Attitude, Behavior and Code of Young Learners

作者: Graßl, Isabella
关键词: pair programming, gender, scratch

Abstract

Young learners are increasingly introduced to programming, and one of the main challenges for educators is to achieve learning success while also creating enthusiasm. As it is particularly difficult to achieve this enthusiasm initially in young females, prior work has identified gender-specific differences in the programming behavior of young learners. Since pair programming, which turns programming into a more sociable activity, has been proposed as an approach to support programming education, in this paper we aim to investigate whether similar gender-specific characteristics can also be observed during pair programming. Therefore, we designed a gender-neutral introductory Scratch programming course tailored for integrating pair programming principles, and conducted it with a total of 139 students aged between 8 and 14 years. To identify gender-dependent differences and similarities, we measure the attitude towards programming and the course setting, observe the behavior of the students while programming, and analyze the code of the programs for different gender combinations. Overall, our study demonstrates that pair programming is well suited for young learners and results in a positive attitude. While the resulting programs are similar in quality and complexity independent of gender, differences are evident when it comes to compliance with pair programming roles, the exploration of code, and the creative customization of programs. These findings contribute to an in-depth understanding of social and technical gender specifics of pair programming, and provide educators with resources and guidance for implementing gender-sensitive pair programming in the classroom.

DOI: 10.1109/ICSE-SEET58685.2023.00018


Engaging Girls in Computer Science: Do Single-Gender Interdisciplinary Classes Help?

作者: Marquardt, Kai and Wagner, Ingo and Happe, Lucia
关键词: introductory courses, data science, gender inclusive, e-learning, k-12, interest, women, interdisciplinary, computer science education, diversity

Abstract

Computing-driven innovation cannot reach its full potential if only a fraction of the population is involved. Without girls and their non-stereotypical contribution, the innovation potential is severely limited. In computer science (CS) and software engineering (SE), the gender gap persists without any positive trend. Many girls find it challenging to identify with the subject of CS. However, we can capitalize on their interests and create environments for girls through interdisciplinary subcultures to spark and foster enthusiasm for CS. This paper presents and discusses the results of an intervention in which we applied a novel interdisciplinary online course in data science to get girls excited about CS and programming by contributing to the grand goal of solving colony collapse disorder from biology and geoecology. The results show the potential of such programs to get girls excited about programming, but also important implications in terms of the learning environment. The startling results show that girls from single-gender classes (SGCs) are significantly more open to CS-related topics and that the intervention evoked significantly more positive feelings in them than in girls from mixed-gender classes (MGCs). The findings highlight the importance of how CS-related topics are introduced in school and the crucial impact of the learning environment to meet the requirements of truly gender-inclusive education.

DOI: 10.1109/ICSE-SEET58685.2023.00019


BARA: A Dynamic State-Based Serious Game for Teaching Requirements Elicitation

作者: Liu, Yu and Li, Tong and Huang, Zheqing and Yang, Zhen
关键词: goal-based design, empirical study, serious game, requirements elicitation

Abstract

Teaching requirements elicitation to students who lack practical experience is challenging, as they usually cannot appreciate its difficulty. Several recent studies have reported experiences of teaching requirements elicitation with a serious game. However, in these games, the fictitious characters have not been carefully designed to reflect real scenarios. For example, they always respond in the same way no matter how many times a learner interacts with them. Moreover, most existing serious games contain only one specific scenario and cannot be easily extended to cover various cases. In this paper, we design and implement a dynamic state-based serious game (BARA) for teaching requirements elicitation, which can realistically simulate real-world scenarios and automatically record learners’ actions for assessment. Specifically, we model fictitious characters’ behaviors using finite-state machines in order to precisely characterize the dynamic states of stakeholders. We also developed an easy-to-use editor for non-programmers to design fictitious characters and thus construct various simulated scenarios. Finally, BARA records learners’ actions during the game, based on which we can gain an in-depth understanding of learners’ performance and our teaching effectiveness. We evaluated BARA with 60 participants using a simulated scenario. The results show that most participants were immersed in BARA and could reasonably complete the requirements elicitation task within the simulated scenario.
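As a minimal sketch of the finite-state-machine idea (all states, events, and replies below are invented for illustration and are not taken from BARA), a stakeholder whose answers depend on dialogue history might look like this:

```python
# A minimal finite-state machine for one fictitious stakeholder. All states,
# events, and replies are invented; BARA's actual character models and
# editor are not reproduced here.
class Stakeholder:
    # (current state, event) -> (next state, reply)
    TRANSITIONS = {
        ("neutral", "greet"): ("open", "Hello, what would you like to know?"),
        ("open", "ask_goal"): ("engaged", "We need faster order processing."),
        ("engaged", "ask_detail"): ("engaged", "Orders pile up at checkout."),
    }

    def __init__(self):
        self.state = "neutral"
        self.times_asked = {}

    def interact(self, event):
        # Track repeated questions so the character does not echo the same
        # canned answer forever -- the 'dynamic state' the abstract calls for.
        n = self.times_asked.get(event, 0)
        self.times_asked[event] = n + 1
        key = (self.state, event)
        if key not in self.TRANSITIONS:
            return "I have already answered that." if n else "I cannot help with that."
        next_state, reply = self.TRANSITIONS[key]
        self.state = next_state
        return reply if n == 0 else "As I said: " + reply
```

Because replies depend on both the current state and how often an event was triggered, asking the same question twice yields a different response, unlike the static characters the abstract criticizes.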

DOI: 10.1109/ICSE-SEET58685.2023.00020


A Theorem Proving Approach to Programming Language Semantics

作者: Roy, Subhajit
关键词: No keywords

Abstract

The semantics of programming languages is one of the core topics in computer science. This topic is formalism-heavy and requires the student to attempt numerous proofs to gain a deep understanding. We argue that modern theorem provers are excellent aids to teaching and understanding programming language semantics. As pen-and-paper proofs get automated via the theorem prover, it enables an experiment-driven strategy for exploring this topic. This article provides an encoding of the semantics of the While language in the most popular styles (operational, denotational, and axiomatic) within the F* proof assistant. We show that once the program and its semantics are encoded, modern proof assistants can prove interesting properties of language features with minimal human assistance. We believe that teaching programming languages via proof assistants will not only provide a more concrete understanding of this topic but also prepare future programming language researchers to use theorem provers as fundamental tools in their research and not as an afterthought.
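As a rough illustration of the big-step (natural) operational style for While (sketched here in plain Python rather than F*, with an invented tuple-based AST encoding; the paper's actual F* development is not reproduced):

```python
# A big-step operational semantics for a tiny While language. Arithmetic and
# Boolean expressions are plain Python lambdas over the state for brevity.
def exec_stmt(stmt, state):
    """Evaluate a statement under `state` (a dict), returning the new state."""
    kind = stmt[0]
    if kind == "skip":
        return state
    if kind == "assign":                      # ("assign", var, expr)
        _, var, expr = stmt
        new = dict(state)
        new[var] = expr(state)
        return new
    if kind == "seq":                         # ("seq", s1, s2)
        _, s1, s2 = stmt
        return exec_stmt(s2, exec_stmt(s1, state))
    if kind == "if":                          # ("if", cond, s_then, s_else)
        _, cond, s_then, s_else = stmt
        return exec_stmt(s_then if cond(state) else s_else, state)
    if kind == "while":                       # ("while", cond, body)
        _, cond, body = stmt
        if not cond(state):
            return state
        # unfold one iteration: run the body, then re-run the loop
        return exec_stmt(stmt, exec_stmt(body, state))
    raise ValueError(f"unknown statement: {kind}")

# factorial in While: y := 1; while x > 0 do (y := y*x; x := x-1)
prog = ("seq",
        ("assign", "y", lambda s: 1),
        ("while", lambda s: s["x"] > 0,
         ("seq",
          ("assign", "y", lambda s: s["y"] * s["x"]),
          ("assign", "x", lambda s: s["x"] - 1))))
```

Running `exec_stmt(prog, {"x": 5})` yields `{'x': 0, 'y': 120}`; in a proof assistant, the same inference rules become definitions one can prove theorems about (e.g., determinism of evaluation).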

DOI: 10.1109/ICSE-SEET58685.2023.00021


Overcoming Challenges in DevOps Education through Teaching Methods

作者: Ferino, Samuel and Fernandes, Marcelo and Cirilo, Elder and Agnez, Lucas and Batista, Bruno and Kulesza, Uir'{a
关键词: mixed methods, challenges, teaching methods, DevOps

Abstract

DevOps is a set of practices that deals with coordination between development and operations teams and ensures the rapid and reliable new software releases that are essential in industry. DevOps education takes on the vital task of preparing new professionals in these practices using appropriate teaching methods. However, there are insufficient studies investigating teaching methods in DevOps. We performed an interview-based analysis to identify teaching methods and their relationship to DevOps educational challenges. Our findings show that project-based learning and collaborative learning are emerging as the most relevant teaching methods.

DOI: 10.1109/ICSE-SEET58685.2023.00022


On the Use of Static Analysis to Engage Students with Software Quality Improvement: An Experience with PMD

作者: AlOmar, Eman Abdullah and AlOmar, Salma Abdullah and Mkaouer, Mohamed Wiem
关键词: quality, education, static analysis tool

Abstract

Static analysis tools are frequently used to scan source code and detect deviations from project coding guidelines. Given their importance, linters are often introduced in classrooms to educate students on how to detect and potentially avoid these code anti-patterns. However, little is known about their effectiveness in raising students’ awareness, given that these linters tend to generate a large number of false positives. To increase awareness of potential coding issues that violate coding standards, in this paper we reflect on our experience teaching the use of static analysis and evaluate its effectiveness in helping students improve software quality. This paper discusses the results of a classroom experiment, conducted over three academic semesters, involving 65 submissions that carried out code review activities covering 690 PMD rules. The results of the quantitative and qualitative analysis show that the presence of a set of PMD quality issues influences whether issues are accepted or rejected, that design- and best-practices-related categories take longer to resolve, and that students acknowledge the potential of using static analysis tools during code review. Through this experiment, code review can become a vital part of the computing curriculum. We envision our findings enabling educators to support students with code review strategies in order to raise students’ awareness of static analysis tools and scaffold their coding skills.

DOI: 10.1109/ICSE-SEET58685.2023.00023


GradeStyle: GitHub-Integrated and Automated Assessment of Java Code Style

作者: Iddon, Callum and Giacaman, Nasser and Terragni, Valerio
关键词: Java programming language, GitHub, automated marking, programming courses, code style, computing education

Abstract

Every programming language has its own style conventions and best practices, which help developers to write readable and maintainable code. Learning code style is an essential skill that every professional software engineer should master. As such, students should develop good habits for code style early on, when they start learning how to program. Unfortunately, manually assessing students’ code with timely and detailed feedback is often infeasible, and professional static analysis tools are unsuitable for educational contexts. This paper presents GradeStyle, a tool for automatically assessing the code style of Java assignments. GradeStyle automatically checks for violations of some of the most important Google Java Style conventions, and Java best practices. Students receive a report with a code style mark, a list of violations, and their source code locations. GradeStyle nicely integrates with GitHub and GitHub Classroom, and can be configured to provide continuous feedback every time a student pushes new code. We adopted our tool in a second-year software engineering programming course with 327 students and observed consistent improvements in the code style of their assignments.

DOI: 10.1109/ICSE-SEET58685.2023.00024


Persona-Based Assessment of Software Engineering Student Research Projects: An Experience Report

作者: Arora, Chetan and Tubino, Laura and Cain, Andrew and Lee, Kevin and Malhotra, Vasudha
关键词: portfolio-based assessment, research projects, software engineering education

Abstract

Students enrolled in software engineering degrees are generally required to undertake a research project in their final year, through which they demonstrate the ability to conduct research, communicate outcomes, and build in-depth expertise in an area. Assessment in these projects typically involves evaluating the product of their research via a thesis or a similar artifact. However, this misses a range of other factors that go into producing successful software engineers and researchers. Incorporating aspects such as process, attitudes, project complexity, and supervision support into the assessment can provide a more holistic evaluation that is likely to better align with the intended learning outcomes. In this paper, we present our experience of adopting an innovative assessment approach to enhance learning outcomes and research performance in our software engineering research projects. We adopted a task-oriented approach to portfolio assessment that incorporates student personas, frequent formative feedback, delayed summative grading, and standards-aligned outcomes-based assessment. We report on our continuous improvement journey in adapting tasks and criteria to address the challenges of assessing student research projects. Our lessons learnt demonstrate the value of personas in guiding the development of holistic rubrics, giving meaning to grades, and focusing staff and student attention on attitudes and skills rather than on the product alone.

DOI: 10.1109/ICSE-SEET58685.2023.00025


Exposing Software Engineering Students to Stressful Projects: Does Diversity Matter?

作者: Graß
关键词: team work, diversity, project management, software engineering education

Abstract

Software development teams have to face stress caused by deadlines, staff turnover, or individual differences in commitment, expertise, and time zones. While students are typically taught the theory of software project management, their exposure to such stress factors is usually limited. However, preparing students for the stress they will have to endure once they work in project teams is important for their own sake, as well as for the sake of team performance in the face of stress. Team performance has been linked to the diversity of software development teams, but little is known about how diversity influences the stress experienced in teams. In order to shed light on this aspect, we provided students with the opportunity to experience first-hand the basics of project management in self-organizing teams, and studied the impact of six diversity dimensions on team performance, coping with stressors, and positive perceived learning effects. Three controlled experiments at two universities with a total of 65 participants suggest that social background impacts the perceived stressors the most, while age and work experience have the highest impact on perceived learning. Most diversity dimensions have a medium correlation with the quality of work, yet no significant relation to team performance. This lays the foundation for improving students’ training for software engineering teamwork based on their diversity-related needs and for creating diversity-sensitive awareness among educators, employers, and researchers.

DOI: 10.1109/ICSE-SEET58685.2023.00026


“Software is the Easy Part of Software Engineering” - Lessons and Experiences from A Large-Scale, Multi-Team Capstone Course

作者: Li, Ze Shi and Arony, Nowshin Nawar and Devathasan, Kezia and Damian, Daniela
关键词: software engineering education, scrum, agile software development, capstone, software engineering

Abstract

Capstone courses in undergraduate software engineering are a critical final milestone for students. These courses allow students to create a software solution and demonstrate the knowledge they have accumulated in their degrees. However, a typical capstone project team is small, containing no more than five students, and functions independently of other teams. To better reflect real-world software development and meet industry demands, in this paper we introduce our novel capstone course. Each student was assigned to a large-scale, multi-team “company” of up to 20 students to collaboratively build software. Students placed in a company gained first-hand experience with multi-team coordination, integration, communication, agile methods, and teamwork while building a microservices-based project. Furthermore, each company was required to implement plug-and-play so that its services would be compatible with those of another company, thereby sharing common APIs. Through developing the product in autonomous sub-teams, the students enhanced not only their technical abilities but also soft skills such as communication and coordination. More importantly, experiencing the challenges that arose from the multi-team project trained students to recognize the pitfalls and advantages of organizational culture. Among the many lessons learned from this course, students came to appreciate the critical importance of building team trust. We provide detailed information about our course structure and lessons learned, and propose recommendations for other universities and programs. Our work concerns educators interested in launching similar capstone projects so that students in other institutions can reap the benefits of large-scale, multi-team development.

DOI: 10.1109/ICSE-SEET58685.2023.00027


Attribution-Based Personas in Virtual Software Engineering Education

作者: Madhi, Klaudia and Reimer, Lara Marie and Jonas, Stephan
关键词: virtual project management, teaching agile software development, distributed teams, attribution theory

Abstract

The COVID-19 pandemic and the consequent introduction of virtual collaboration have introduced educators to unexpected situations and challenges. One of these challenges is social distance, which minimizes knowledge of another person’s character, and leaves room for misconceptions. Perceptions of a person’s personality are also referred to as dispositional attributions and, when misplaced, impact the educator-student dynamics. This paper studies dispositional attributions exhibited by software engineering educators in higher education and aims to raise awareness of potential misconceptions affecting the educator-student relationship caused by the virtual setting. We performed an exploratory case study in a practical university course with twelve distributed software engineering teams, each led by one or two educators. The course was conducted entirely virtually during the COVID-19 pandemic. The research process included discovering, categorizing, and modeling attribution-based personas, followed by qualitative and quantitative research methods of semi-structured interviews and survey questionnaires. These personas represent the subjects of potential misconceptions and encapsulate typical behaviors and attributions. Our research created seven personas: the Unprofessional, Ego is the Enemy, The Detached, the Loner, the Underperformer, Hiding but not Seeking, and Distraction Monster. These personas differ primarily in terms of character traits and motivation attributed to them. The results provide evidence that the virtual setting of the course can lead to several dispositional attributions. Educators in virtual software engineering settings should be aware of these attributions and their potential impact on the educator-student relationship.

DOI: 10.1109/ICSE-SEET58685.2023.00028


Leveraging Diversity in Software Engineering Education through Community Engaged Learning and a Supportive Network

作者: Arony, Nowshin Nawar and Devathasan, Kezia and Li, Ze Shi and Damian, Daniela
关键词: design thinking, experiential learning, software engineering education, diversity and inclusion

Abstract

While a lack of diversity is a longstanding problem in computer science and engineering, universities and organizations continue to look for solutions to this issue. Among the first of its kind, we launched INSPIRE: STEM for Social Impact, a program at the University of Victoria, Canada, that aims to motivate and empower students from underrepresented groups in computer science and engineering to develop digital solutions for societally impactful projects by engaging in experiential learning projects with identified community partners. The twenty-four students in the program came from diverse backgrounds in terms of academic areas of study, genders, ethnicities, and levels of technical and educational experience. Working with six community partners, these students spent four months learning and developing solutions for a societal and/or environmental problem with potential for local and global impact. Our experiences indicate that working in a diverse team with real clients on pressing issues produces a sense of competence, relatedness, and autonomy, which are the basis of self-determination theory. Due to the unique structure of this program, the three principles of self-determination theory emerged through different experiences, ultimately motivating the students to build a network of like-minded people. Such a network is profoundly important in empowering students to succeed and, ultimately, to remain in software engineering fields. We address the diversity problem by providing diverse, underrepresented students with a safe and like-minded environment where they can learn and realize their full potential. Hence, in this paper, we describe the program design, experiences, and lessons learned from this approach. We also provide recommendations for universities and organizations that may want to adapt our approach.

DOI: 10.1109/ICSE-SEET58685.2023.00029


Improving Grading Outcomes in Software Engineering Projects through Automated Contributions Summaries

作者: Presler-Marshall, Kai and Heckman, Sarah and Stolee, Kathryn T.
关键词: software engineering teams, program analysis, grading consistency

Abstract

Teaming is a key aspect of most professional software engineering positions, and consequently, team-based learning (TBL) features heavily in many undergraduate computer science (CS) and software engineering programs. However, while TBL offers many pedagogical benefits, it is not without challenges. One such challenge is assessment, as the course teaching staff must be able to accurately identify individual students’ contributions to both encourage and reward participation. In this paper, we study improvements to grading practices in the context of a CS1.5 introductory software engineering course, where assessing individual students’ contributions to weekly lab assignments is done manually by teaching assistants (TAs). We explore the impact of presenting TAs with automated summaries of individual student contributions to their team’s GitHub repository. To do so, we propose a novel algorithm and implement a tool based on it, AutoVCS. We measure the impact on grading in terms of grading speed, grading consistency, and TA satisfaction. We evaluate our algorithm, as implemented in AutoVCS, in a controlled experimental study on Java-based lab assignments from a recent offering of NC State University’s CS1.5 course. We find that our automated summaries help TAs grade more consistently and provide students with more actionable feedback. Although TAs grade no faster using automated summaries, they nonetheless strongly prefer grading with their support rather than without. We conclude with recommendations for future work on improving consistency in contribution grading for student software engineering teams.

DOI: 10.1109/ICSE-SEET58685.2023.00030


Analyzing the Quality of Submissions in Online Programming Courses

作者: Tigina, Maria and Birillo, Anastasiia and Golubev, Yaroslav and Keuning, Hieke and Vyahhi, Nikolay and Bryksin, Timofey
关键词: large-scale analysis, refactoring, learning programming, MOOC, code quality, programming education

Abstract

Programming education should aim to provide students with a broad range of skills that they will later use while developing software. An important aspect in this is their ability to write code that is not only correct but also of high quality. Unfortunately, this is difficult to control in the setting of a massive open online course. In this paper, we carry out an analysis of the code quality of submissions from JetBrains Academy — a platform for studying programming in an industry-like project-based setting with an embedded code quality assessment tool called Hyperstyle. We analyzed more than a million Java submissions and more than 1.3 million Python submissions, studied the most prevalent types of code quality issues and the dynamics of how students fix them. We provide several case studies of different issues, as well as an analysis of why certain issues remain unfixed even after several attempts. Also, we studied abnormally long sequences of submissions, in which students attempted to fix code quality issues after passing the task. Our results point the way towards the improvement of online courses, such as making sure that the task itself does not incentivize students to write code poorly.

DOI: 10.1109/ICSE-SEET58685.2023.00031


A Metric for Measuring Software Engineering Post-Graduate Outcomes

作者: Breaux, Travis D. and Moritz, Jennifer
关键词: career outcomes, education, software engineering

Abstract

Professional software engineering (SE) degree programs provide students with the education and skills needed to enter a new SE career, or take on increasing responsibility within their current career. An important metric for evaluating such programs is the impact that completing the program has on postgraduate, career outcomes. Apart from hiring rates and median salaries, this is challenging to measure, because alumni survey response rates are frequently low, and without alumni feedback, insight into individual career advancement after graduation is difficult to observe. In this paper, we propose a new metric, called Career Velocity, that measures the impact of a degree program on alumni promotion into senior positions. The metric requires tracing alumni directory information, consisting of a person’s full name, degree name, and graduation year, to public data that includes employment histories, before computing the number of months prior to promotion into a senior SE position. The metric was developed and evaluated on a mix of six degree programs, including undergraduate and graduate computer science, software engineering and data science programs. The metric was further evaluated by assessing the impact of a graduate’s number of months of industry experience prior to graduation. The results suggest that, independent of prior industry experience, specialized education that targets advancement in a specific career class, e.g., software engineering, leads to faster career progression than general education.

DOI: 10.1109/ICSE-SEET58685.2023.00032


Using Focus to Personalise Learning and Feedback in Software Engineering Education

作者: Modi, Bansri Amish and Cain, Andrew and Wood-Bradley, Guy and Renzella, Jake
关键词: formative feedback, rubrics, assessment

Abstract

Learning can be greatly enhanced by effective feedback. Traditional assessment approaches in higher education often result in feedback being used to justify the marks awarded, and this feedback is often disregarded once the assessment is complete. In this paper, we explore the idea of incorporating a focus mechanism to connect feedback between assessment tasks and units, discuss how this can be applied to enhance software engineering education, and present results from several staff focus groups exploring the idea. The focus groups discussed the model, its application within software engineering units, and its limitations, with staff helping to co-create enhancements to the model by discussing experiences, sharing opinions, and providing insights on assessment within their units. Results indicate that staff believe the changes will benefit their teaching, and they highlighted several opportunities for this initiative to encourage students to take a more holistic view of their studies. The main challenges identified were staff workload and complexity for students, which must be addressed in implementing this idea.

DOI: 10.1109/ICSE-SEET58685.2023.00033


Shaping a Tool for Developing Computing Students’ Professional Identity - Industry Perspectives

作者: Tubino, Laura and Morgan, Kerri and Wood-Bradley, Guy and Cain, Andrew
关键词: cultural fit, holistic education, professional capability, professional identity

Abstract

Obtaining employment is a major aim for many students completing a computing degree. However, students often fail to develop a comprehensive plan to achieve this goal, as they have insufficient awareness of what is required to become a professional. In addition, computing degrees and curricula generally focus on the necessary computing knowledge and skills, often ignoring development around identity, belonging to a community of practice, and connecting with professional role models - components necessary to build a viable professional identity. This paper explores ideas on how to broaden the perspective of students undertaking computing degrees, helping them understand the broader picture of their education, beyond coursework units, that is needed to ensure their successful transition to industry. Literature on professional identity is used to inform the initial design concepts for a tool, DreamBig, that aims to support the development of an emerging professional identity for students undertaking computing degrees. A focus group with industry representatives was used to test the concept. The findings of this study highlight the value of this initiative and indicate the importance of a big-picture, holistic view of professional development, with a particular focus on the social dimension, for computing students and graduates.

DOI: 10.1109/ICSE-SEET58685.2023.00034


REFERENT: Transformer-Based Feedback Generation Using Assignment Information for Programming Course

作者: Heo, Jinseok and Jeong, Hohyeon and Choi, Dongwook and Lee, Eunseok
关键词: assignment information, transfer learning, transformer, automated feedback generation, programming assignment

Abstract

Students require feedback on programming assignments to improve their programming skills. Automated feedback generation (AFG) techniques provide feedback, in the form of corrected submissions, for incorrect student submissions in programming courses. However, these techniques are limited in that they rely on the availability of correct submissions as references for generating feedback. When correct submissions are not available, they resort to mutation operators, which can lead to a search-space explosion. In this work, we propose REFERENT, a Transformer-based feedback generation technique that uses assignment information. REFERENT applies transfer learning to a pre-trained model using students’ submission histories from past assignments. To generate assignment-related feedback, we use the title, tags, assignment description, and test cases as assignment information. REFERENT can generate feedback without a reference program under limited resources. We conducted a preliminary study to confirm the effectiveness of REFERENT and the feasibility of using assignment information. REFERENT generated feedback for 32.7% of incorrect submissions without reference programs, and its performance increased to 50.7% when reference programs were used. We also examined whether submission history, assignment information, and repair knowledge from open-source software help generate feedback.

DOI: 10.1109/ICSE-SEET58685.2023.00035


PCR-Chain: Partial Code Reuse Assisted by Hierarchical Chaining of Prompts on Frozen Copilot

作者: Huang, Qing and Zhu, Jiahui and Li, Zhilong and Xing, Zhenchang and Wang, Changjing and Xu, Xiwei
关键词: hierarchical prompts, AI chain, frozen copilot, pre-trained language model, in-context learning

Abstract

API documentation, technical blogs, and programming Q&A sites contain a large amount of partial code that can be reused in programming tasks. However, due to unresolved simple names and last-mile syntax errors, such partial code is frequently not compilable. To facilitate partial code reuse, we develop PCR-Chain, which resolves FQNs and fixes last-mile syntax errors in partial code based on a giant pre-trained code model (e.g., Copilot). Methodologically, PCR-Chain is backed by an underlying global-level prompt architecture (which combines three design ideas: hierarchical task breakdown; prompt composition, including sequential and conditional structures; and a mix of prompt-based AI and non-AI units) and local-level prompt design. Technically, PCR-Chain employs in-context learning rather than supervised fine-tuning with gradient updates on downstream task data. This approach enables the frozen, giant pre-trained code model to learn the desired behavior for a specific task through behavior-describing prompts and imitate it to complete the task. Experimental results show that PCR-Chain automatically resolves FQNs and fixes last-mile syntax errors in 50 partial code samples collected from Stack Overflow with high success rates, without requiring any program analysis. The correct execution of the unit, module, and PCR-Chain demonstrates the effectiveness of the prompt design, prompt composition, and prompt architecture. Website: https://github.com/SE-qinghuang/PCR-Chain. Demo video: https://youtu.be/6HGRNdc2_JE
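The prompt-composition idea in this abstract (sequential and conditional units mixing AI and non-AI steps) can be sketched as a tiny chain. All unit names and the toy string transformations below are hypothetical stand-ins, not PCR-Chain's actual prompts or implementation; a real AI unit would query a frozen code model rather than do a string replace.

```python
# Toy sketch of an "AI chain" with sequential and conditional units.
# Hypothetical stand-ins only; not PCR-Chain's actual implementation.

def resolve_fqns(code: str) -> str:
    """Stand-in for a prompt-based (AI) unit that rewrites simple names
    into fully qualified names."""
    return code.replace("Date", "java.util.Date")

def ends_complete(code: str) -> bool:
    """Guard for the conditional unit: is the last statement terminated?"""
    return code.rstrip().endswith(";")

def fix_last_mile(code: str) -> str:
    """Stand-in for a non-AI unit that repairs a last-mile syntax error."""
    return code if ends_complete(code) else code.rstrip() + ";"

def run_chain(partial_code: str) -> str:
    """Sequential composition, with the fixer run only when needed."""
    code = resolve_fqns(partial_code)   # sequential AI unit
    if not ends_complete(code):         # conditional non-AI unit
        code = fix_last_mile(code)
    return code

result = run_chain("Date d = new Date()")
```

Chaining small, single-purpose units like this is what lets each prompt stay simple while the overall pipeline handles both name resolution and syntax repair.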

DOI: 10.1109/ICSE-Companion58688.2023.00013


JAttack: Java JIT Testing Using Template Programs

作者: Zang, Zhiqiang and Yu, Fu-Yao and Wiatrek, Nathan and Gligoric, Milos and Shi, August
关键词: templates, compiler, program generation, test generation, testing

Abstract

We present JAttack, a framework that enables compiler testing using templates. JAttack allows compiler developers to write a template program that describes a set of concrete programs to be used to test compilers. Such a template-based approach leverages developers’ intuition about testing compilers by allowing them to write a template program in the host programming language (Java) that contains a basic program structure while providing an opportunity to express variants of specific language constructs in holes. Each hole, written in a domain-specific language embedded in the host language, is used to construct an extended abstract syntax tree (eAST), which defines the search space of a language construct, e.g., a set of numbers, expressions, or statements. JAttack executes the template program to fill every hole by randomly choosing a number, expression, or statement within the search space defined by the hole, and it generates concrete programs with all holes filled. We used JAttack to test Java just-in-time (JIT) compilers, and we have found seven critical bugs in the Oracle JDK JIT compiler. Oracle developers confirmed and fixed all seven bugs, five of which were previously unknown, including two CVEs (Common Vulnerabilities and Exposures). JAttack blends developers’ intuition via templates with random testing to detect bugs in compilers. The demo video for JAttack can be found at https://www.youtube.com/watch?v=meCFPxucqk4.
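To give a feel for the template-with-holes idea described in this abstract, here is a minimal sketch in Python. JAttack itself embeds holes in Java source via a DSL; the template, hole names, and search spaces below are invented for illustration only.

```python
import random

# Minimal illustration of template-based program generation: a template
# program with named holes, where each hole defines a search space of
# candidate fragments. Filling every hole yields one concrete program.
# (Invented names; JAttack's real holes are a DSL embedded in Java.)

TEMPLATE = "int x = {lhs} {op} {rhs}; if (x > {bound}) {{ x = 0; }}"

SEARCH_SPACES = {
    "lhs": [str(n) for n in range(-2, 3)],   # hole over a set of numbers
    "rhs": [str(n) for n in range(-2, 3)],
    "op": ["+", "-", "*"],                   # hole over binary operators
    "bound": ["0", "1", "100"],
}

def generate(template: str, spaces: dict, rng: random.Random) -> str:
    """Fill each hole by randomly picking from its search space."""
    return template.format(
        **{name: rng.choice(opts) for name, opts in spaces.items()}
    )

rng = random.Random(42)
programs = {generate(TEMPLATE, SEARCH_SPACES, rng) for _ in range(100)}
```

Each generated string stands in for one concrete test program that would then be compiled and executed by the compiler under test.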

DOI: 10.1109/ICSE-Companion58688.2023.00014


ActionsRemaker: Reproducing GitHub Actions

作者: Zhu, Hao-Nan and Guan, Kevin Z. and Furth, Robert M. and Rubio-González
关键词: software reproducibility, software build, software mining, GitHub actions, CI/CD

Abstract

Mining Continuous Integration and Continuous Delivery (CI/CD) has enabled new research opportunities for the software engineering (SE) research community. However, it remains a challenge to reproduce CI/CD build processes, which is crucial for several areas of research within SE such as fault localization and repair. In this paper, we present ActionsRemaker, a reproducer for GitHub Actions builds. We describe the challenges on reproducing GitHub Actions builds and the design of ActionsRemaker. Evaluation of ActionsRemaker demonstrates its ability to reproduce fail-pass pairs: of 180 pairs from 67 repositories, 130 (72.2%) from 43 repositories are reproducible. We also discuss reasons for unreproducibility. ActionsRemaker is publicly available at https://github.com/bugswarm/actions-remaker, and a demo of the tool can be found at https://youtu.be/flblSqoxeAk.

DOI: 10.1109/ICSE-Companion58688.2023.00015


HOME: Heard-of Based Formal Modeling and Verification Environment for Consensus Protocols

作者: Zhai, Shumao and Li, Xiaozhou and Ge, Ning
关键词: SAT, formal verification, heard-of modeling language, byzantine fault tolerant, consensus protocol

Abstract

Consensus protocols play an important role in ensuring the reliability of distributed systems, and how to formally model and verify them is an active research topic. Due to limited verification performance, existing approaches can usually only verify consensus algorithms with a small number of processes. The Heard-Of (HO) model performs well in formal verification. However, existing works only support HO modeling for Crash Fault Tolerant (CFT) protocols and rely on SMT-based verification; they cannot model Byzantine Fault Tolerant (BFT) protocols, nor can they support SAT solving. This paper designs and implements an HO-based formal modeling and verification environment (HOME) for consensus protocols. We developed a modeling tool that supports HOML (the HO modeling language) for formally modeling threshold-guarded distributed BFT protocols. We establish a complete formal verification process from HOML to SAT/SMT solving to improve verification performance. HOME integrates HOML’s translator and SAT/SMT solvers, which can facilitate the design of consensus protocols and help discover safety issues. The evaluation results show that HOME supports the modeling and verification of various consensus protocols, and that SAT solving can effectively improve verification performance. Repo: https://github.com/tempAcc000/HOME. Video: https://www.youtube.com/watch?v=ZiaVLs-VGwE

DOI: 10.1109/ICSE-Companion58688.2023.00016


CoVeriTeam Service: Verification as a Service

作者: Beyer, Dirk and Kanav, Sudeep and Wachowitz, Henrik
关键词: continuous integration, API, web service, verification tools, automatic verification, software verification, incremental verification, tool development, cooperative verification

Abstract

The research community has developed numerous tools for solving verification problems, but a common web interface for executing them has been missing. This means users have to install and execute each new tool (version) on their local machines. We propose CoVeriTeam Service to make it easy for verification researchers to experiment with new verification tools. CoVeriTeam has already unified the command-line interface and reduced the burden by taking care of tool installation and isolated execution. The new web service additionally enables tool developers to make their tools accessible on the web, and users to include verification tools in their workflows. There are already further applications of our service: the 2023 competitions on software verification and testing used the service for their integration testing, and we propose to use CoVeriTeam Service for incremental verification as part of a continuous-integration process. Demonstration video: https://youtu.be/0Ao0ZogSu1U. Demonstration service: https://coveriteam-service.sosy-lab.org

DOI: 10.1109/ICSE-Companion58688.2023.00017


Proofster: Automated Formal Verification

作者: Agrawal, Arpan and First, Emily and Kaufman, Zhanna and Reichel, Tom and Zhang, Shizhuo and Zhou, Timothy and Sanchez-Stern, Alex and Ringer, Talia and Brun, Yuriy
关键词: No keywords

Abstract

Formal verification is an effective but extremely work-intensive method of improving software quality. Verifying the correctness of software systems often requires significantly more effort than implementing them in the first place, despite the existence of proof assistants, such as Coq, aiding the process. Recent work has aimed to fully automate the synthesis of formal verification proofs, but little tool support exists for practitioners. This paper presents Proofster, a web-based tool aimed at assisting developers with the formal verification process via proof synthesis. Proofster inputs a Coq theorem specifying a property of a software system and attempts to automatically synthesize a formal proof of the correctness of that property. When it is unable to produce a proof, Proofster outputs the proof-space search tree its synthesis explored, which can guide the developer to provide a hint to enable Proofster to synthesize the proof. Proofster runs online at https://proofster.cs.umass.edu/ and a video demonstrating Proofster is available at https://youtu.be/xQAi661RfwI/.

DOI: 10.1109/ICSE-Companion58688.2023.00018


作者: Zhang, Jiashuo and Li, Yue and Gao, Jianbo and Guan, Zhi and Chen, Zhong
关键词: vulnerability detection, software analysis, digital signature, smart contract

Abstract

Ethereum smart contracts enable developers to enforce access control policies for critical functions using built-in signature verification interfaces, i.e., ecrecover. However, due to the lack of best practices for these interfaces, improper verifications commonly exist in deployed smart contracts, creating the potential for unauthorized access and financial losses. Even worse, this attack surface is ignored by both developers and existing smart contract security analyzers. In this paper, we take a close look at signature-related vulnerabilities and demystify them with a clear classification and characterization. We present Siguard, the first automatic tool to detect these vulnerabilities in real-world smart contracts. Specifically, Siguard explores signature-related paths in a smart contract and extracts data dependencies based on symbolic execution and taint analysis. Then, it conducts vulnerability detection via a systematic search for violations of standard patterns, including EIP-712 and EIP-2621. A preliminary evaluation validated the efficacy of Siguard by reporting previously unknown vulnerabilities in deployed smart contracts on Ethereum. A video of Siguard is available at https://youtu.be/xXAEhqXWOu0.

DOI: 10.1109/ICSE-Companion58688.2023.00019


RM2DM: A Tool for Automatic Generation of OO Design Models from Requirements Models

作者: Tian, Zhen and Yang, Yilong and Cheng, Sheng
关键词: model transformation, requirements, design model

Abstract

Enterprise information systems focus on the complex business logic of collecting, filtering, processing, and distributing data to improve productivity and service in our daily lives. The successful development of enterprise information systems is a labor-intensive activity in software engineering, requiring sophisticated human effort for requirements validation and system design. Our previous work RM2PT helps achieve a validated requirements model by automatically generating prototypes from requirements models to support incremental and rapid requirements validation. In this paper, we present a tool named RM2DM to further ease system development by automatically generating an OO (Object-Oriented) design model of an enterprise information system from the validated requirements model. We evaluate the tool through four case studies. The experimental results show that all class diagram classes and 93.8% of sequence diagram messages can be correctly generated within 10 seconds. Overall, the results were satisfactory. The proposed approach can be further extended and applied for system development in industry. The tool can be downloaded at http://rm2pt.com/advs/rm2dm, and a demo video showcasing its features is at https://www.youtube.com/watch?v=lrs57CjzmU8

DOI: 10.1109/ICSE-Companion58688.2023.00020


RexStepper: A Reference Debugger for JavaScript Regular Expressions

作者: Almeida, Luís
关键词: debuggers, regular expressions, JavaScript

Abstract

Regular expressions are notoriously difficult to get right, with developers often having to resort to trial-and-error approaches. Even so, little attention has been given by the research community to the development of effective debugging tools for regular expressions. We present RexStepper, a reference debugger for troubleshooting JavaScript regular expressions in the browser. RexStepper is implemented on top of RexRef, a trusted reference implementation of JavaScript (ECMAScript 5) regular expressions, which works by transpiling the given regular expression to a JavaScript function that recognises its expansions. We demonstrate the usefulness of RexStepper by successfully using it to troubleshoot a benchmark of 18 faulty regular expressions obtained from the Stack Overflow and Stack Exchange websites.
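The transpilation idea, turning a regular expression into a function that recognises its language, can be sketched for a tiny subset (single-character literals and `c*`). Python is used here purely for illustration; this is an assumed simplification, not RexRef's actual JavaScript output:

```python
def compile_regex(pattern):
    """Transpile a tiny regex subset (literals and 'c*') into a full-match
    recogniser function, illustrating the regex-to-function idea."""
    # tokenize: "ab*c" -> ["a", "b*", "c"]
    tokens, i = [], 0
    while i < len(pattern):
        if i + 1 < len(pattern) and pattern[i + 1] == "*":
            tokens.append(pattern[i] + "*")
            i += 2
        else:
            tokens.append(pattern[i])
            i += 1

    def match(toks, text):
        if not toks:
            return text == ""
        head, rest = toks[0], toks[1:]
        if head.endswith("*"):
            c, n = head[0], 0
            while True:                 # backtrack over 0..n copies of c
                if match(rest, text[n:]):
                    return True
                if n < len(text) and text[n] == c:
                    n += 1
                else:
                    return False
        return bool(text) and text[0] == head and match(rest, text[1:])

    return lambda text: match(tokens, text)

recognise = compile_regex("ab*c")       # recogniser for /^ab*c$/
```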

DOI: 10.1109/ICSE-Companion58688.2023.00021


iTrace-Toolkit: A Pipeline for Analyzing Eye-Tracking Data of Software Engineering Studies

作者: Behler, Joshua and Weston, Praxis and Guarnera, Drew T. and Sharif, Bonita and Maletic, Jonathan I.
关键词: pipeline, fixations, empirical studies, eye tracking

Abstract

iTrace is a community eye-tracking infrastructure that enables conducting eye-tracking studies within an Integrated Development Environment (IDE). It consists of a set of tools for gathering eye-tracking data on large, real software projects within an IDE during studies on source code. Once the raw eye-tracking data is collected, processing is necessary before it can be used for analysis. Rather than provide the raw data for researchers to analyze with their own customized scripts, we introduce iTrace-Toolkit, a suite of tools that assists with combining the data files generated by iTrace and its IDE plugins (namely Visual Studio, Atom, and Eclipse). iTrace-Toolkit also provides the crucial mapping of the valid raw eye-tracking data to source code tokens and finally generates fixations (an important metric in eye-tracking for comprehension) using three commonly used algorithms based on the distance and velocity of eye movements. iTrace-Toolkit keeps track of all participant data and tasks during a given study and produces a complete, lightweight database of the raw, mapped, and fixation data that is standardized and ready to be used by statistical tools. A simple GUI is provided for quick access to filter the data after an eye-tracking study. iTrace-Toolkit also allows for the export of the data, or a subset of it, to text formats for further statistical processing. YouTube Video: https://www.youtube.com/watch?v=9i2OsOANh8w

DOI: 10.1109/ICSE-Companion58688.2023.00022


SoapOperaTG: A Tool for System Knowledge Graph Based Soap Opera Test Generation

作者: Su, Yanqi and Han, Zheming and Xing, Zhenchang and Xu, Xiwei and Zhu, Liming and Lu, Qinghua
关键词: user tasks and failures, exploratory testing, knowledge graph

Abstract

Exploratory testing is an effective approach for system-level testing from the end user’s perspective, widely practiced and appreciated in the software industry. Although many concrete principles and guidelines for performing exploratory testing have been proposed, there are no effective tools for the automatic generation of exploratory test scenarios (a.k.a. soap opera tests). In this paper, we propose a tool named SoapOperaTG for automatic soap opera test generation by leveraging the scenario and oracle knowledge in bug reports. We first construct a system knowledge graph (KG) of user tasks and failures from the preconditions, steps to reproduce (S2Rs), expected behavior (EB), and observed behavior (OB) in bug reports. Then, we create soap opera tests by combining the scenarios of relevant bugs based on the system knowledge graph. SoapOperaTG is implemented as a web tool to present the generated test scenarios. In our user study, 5 users found 18 bugs in Mozilla Firefox (a mature, well-maintained software system) in 2 hours using SoapOperaTG, while the control group found only 5 bugs based on the recommended similar bugs. SoapOperaTG can be found at https://github.com/SuYanqi/SYS-KG. A demo video can be found at https://youtu.be/xcXmY8qGDSc.

DOI: 10.1109/ICSE-Companion58688.2023.00023


GUI Testing to the Power of Parallel Q-Learning

作者: Mobilio, Marco and Clerissi, Diego and Denaro, Giovanni and Mariani, Leonardo
关键词: GUI testing, web testing, Q-learning

Abstract

Q-learning is an attractive option for GUI testing, allowing for sophisticated test generation strategies that learn and exploit effective GUI interactions. However, learning comprehensive models requires long test sessions. This issue is exacerbated by the needs of both testers, who might want to run multiple testing sessions to fine-tune the test strategy to their applications under test, and researchers, who might want to experiment with multiple alternative approaches. To address these concerns, this paper presents GTPQL, a testing tool that supports GUI testing with a parallel deployment of Q-learning, and that can be flexibly configured and extended with multiple state-space abstractions and Q-learning variants.
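The learning step underlying such tools is the tabular Q-learning update. A minimal sketch follows; the GUI states, action names, and reward shaping here are invented for illustration and are not GTPQL's API:

```python
ACTIONS = ["click_login", "type_text", "open_menu"]   # hypothetical GUI actions

def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

q = {}
# Reward reaching a new GUI state, one common shaping choice for test
# generation (an assumption here, not necessarily GTPQL's reward).
q_update(q, "home", "open_menu", 1.0, "menu")
q_update(q, "home", "open_menu", 1.0, "menu")
```

Repeating the update converges the value of `("home", "open_menu")` toward the discounted return of that interaction.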

DOI: 10.1109/ICSE-Companion58688.2023.00024


A Multi-Faceted Vulnerability Searching Website Powered by Aspect-Level Vulnerability Knowledge Graph

作者: Sun, Jiamou and Xing, Zhenchang and Lu, Qinghua and Xu, Xiwei and Zhu, Liming
关键词: No keywords

Abstract

Vulnerabilities can cause damage to users. With heavy dependencies among software, it is particularly important to select dependent libraries safely and maintain software security in a targeted manner, which requires a deep understanding of the potential weaknesses of third-party libraries. Current vulnerability advisories only support coarse, description-based vulnerability searching, which cannot cater to the needs of in-depth investigation and understanding of vulnerabilities. Driven by these real needs, we propose an aspect-level vulnerability knowledge graph integrating diversified vulnerability key-aspect information from heterogeneous vulnerability databases. Based on the knowledge graph, we implement a multi-faceted vulnerability searching website for acquiring statistics and details of vulnerabilities. Our use cases demonstrate the usefulness of our knowledge graph and website for software security. Demo Video: https://youtu.be/vYSy7MYIU48 Source Code: https://github.com/sjmsjmdsg/Multi_faceted_Web Website: see GitHub repository.

DOI: 10.1109/ICSE-Companion58688.2023.00025


作者: Chen, Jialuo and Sun, Youcheng and Wang, Jingyi and Cheng, Peng and Ma, Xingjun
关键词: No keywords

Abstract

Deep learning (DL) models have become one of the most valuable assets in modern society, and the most complex ones require millions of dollars for model development. As a result, unauthorized duplication or reproduction of DL models can lead to copyright infringement and cause huge economic losses to model owners. In this work, we present DeepJudge, a testing framework for DL copyright protection. DeepJudge quantitatively tests the similarities between two DL models: a victim model and a suspect model. It leverages a diverse set of testing metrics and efficient test case generation algorithms to produce a chain of supporting evidence to help determine whether a suspect model is a copy of the victim model. Our experiments confirm the effectiveness of DeepJudge under typical model copyright infringement scenarios. The tool has been made publicly available at https://github.com/Testing4AI/DeepJudge. A demo video can be found at https://www.youtube.com/watch?v=LhNeo615YOE.
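The core idea of a similarity-based metric can be sketched as follows. The "models" below are toy functions invented for illustration; DeepJudge's actual metrics (over neuron activations, output probabilities, etc.) are richer than this single output-distance measure:

```python
def output_distance(victim, suspect, probes):
    """Average L1 distance between two models' outputs on shared probe inputs.
    A distance far below that of independently trained models is evidence
    (one metric among several, not proof) that the suspect copies the victim."""
    return sum(abs(victim(x) - suspect(x)) for x in probes) / len(probes)

# Toy stand-ins for real models (hypothetical):
victim      = lambda x: 2 * x + 1
copycat     = lambda x: 2 * x + 1    # behaves identically to the victim
independent = lambda x: -x           # unrelated behaviour

probes = [0.0, 1.0, 2.0]             # in practice: generated test cases
d_copy = output_distance(victim, copycat, probes)
d_indep = output_distance(victim, independent, probes)
```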

DOI: 10.1109/ICSE-Companion58688.2023.00026


DeepCrime: From Real Faults to Mutation Testing Tool for Deep Learning

作者: Humbatova, Nargiz and Jahangirova, Gunel and Tonella, Paolo
关键词: real faults, mutation testing, deep learning

Abstract

The recent advance of Deep Learning (DL), driven by its human-competitive performance in complex and often safety-critical tasks, reveals many gaps in the testing of DL systems. A number of DL-specific testing approaches exist, yet none has offered the possibility of simulating the occurrence of real DL faults for the mutation testing of DL systems. We propose 35 and implement 24 mutation operators that were systematically extracted from existing studies on real DL faults. Our evaluation shows that the proposed operators produce non-redundant, killable, and non-trivial mutations while being more sensitive to changes in the quality of test data than existing mutation testing approaches. A video demonstration is available at: https://youtu.be/WOvuPaXH6Jk
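The notion of "killing" a DL mutant can be sketched with a deliberately simplified criterion. The numbers and the fixed-margin rule below are assumptions for illustration; DL mutation testing such as DeepCrime typically uses a statistical test over accuracies from repeated trainings rather than a fixed margin:

```python
def is_killed(original_accuracies, mutant_accuracies, margin=0.05):
    """A mutant counts as killed when its mean test accuracy falls below the
    original model's mean by more than a margin (simplified criterion)."""
    orig = sum(original_accuracies) / len(original_accuracies)
    mut = sum(mutant_accuracies) / len(mutant_accuracies)
    return orig - mut > margin

# Accuracies from repeated trainings (made-up numbers):
original = [0.91, 0.90, 0.92]
harmful_mutant = [0.70, 0.68, 0.71]   # e.g., a "delete training data" operator
trivial_mutant = [0.90, 0.91, 0.90]   # change too small to matter
```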

DOI: 10.1109/ICSE-Companion58688.2023.00027


Cerberus: A Program Repair Framework

作者: Shariffdeen, Ridwan and Mirchev, Martin and Noller, Yannic and Roychoudhury, Abhik
关键词: repair platform, automated program repair

Abstract

Automated Program Repair (APR) represents a suite of emerging technologies that attempt to automatically fix bugs and vulnerabilities in programs. APR is a rapidly growing field with new tools and benchmarks being added frequently, yet a language-agnostic repair framework is not available. We introduce Cerberus, a program repair framework integrating 20 program repair tools and 9 repair benchmarks, coexisting in the same framework. Cerberus is capable of executing a diverse set of program repair tasks using a multitude of program repair tools and benchmarks. Video: https://www.youtube.com/watch?v=bYtShpsGL68

DOI: 10.1109/ICSE-Companion58688.2023.00028


TSVD4J: Thread-Safety Violation Detection for Java

作者: Rahman, Shanto and Li, Chengpeng and Shi, August
关键词: No keywords

Abstract

Concurrency bugs are difficult to detect and debug. One class of concurrency bugs is thread-safety violations, where multiple threads access a thread-unsafe data structure at the same time, resulting in unexpected behavior. Prior work proposed an approach, TSVD, to detect thread-safety violations. TSVD injects delays at API calls that read/write specific thread-unsafe data structures, tracking through the delays whether multiple threads can overlap in their accesses to the same data structure, revealing potential thread-safety violations. We additionally enhance the TSVD approach to also consider read/write operations on object fields. We implement the TSVD approach in Java in our tool TSVD4J. TSVD4J can be integrated as a Maven plugin that can be included in any Maven-based application. Our evaluation on 12 applications shows that TSVD4J can detect 55 pairs of code locations accessing the same shared data structure across multiple threads, representing potential thread-safety violations. We find that the added tracking of field accesses contributed the most to detecting these pairs. TSVD4J also detects more such pairs than the existing tool RV-Predict. The demo video for TSVD4J is available at https://www.youtube.com/watch?v=-wSMzlj5cMY.
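The delay-injection mechanism can be sketched in a few lines. This is a Python illustration of the idea (TSVD4J itself instruments Java bytecode); the probe, structure id, and writer functions are invented for the example:

```python
import threading
import time

overlaps = []      # thread pairs observed inside the same structure at once
_active = {}       # structure id -> names of threads currently accessing it
_reg = threading.Lock()

def probed_access(structure_id, op, delay=0.2):
    """TSVD-style probe: record entry, inject a delay at the call site to
    widen the race window, run the operation, and report any other thread
    that was inside the same (thread-unsafe) structure meanwhile."""
    me = threading.current_thread().name
    with _reg:
        others = set(_active.setdefault(structure_id, set()))
        _active[structure_id].add(me)
    time.sleep(delay)                  # the injected delay
    op()                               # the actual read/write
    with _reg:
        _active[structure_id].discard(me)
    for other in others:
        overlaps.append(tuple(sorted((me, other))))

shared = {}                            # stand-in for a thread-unsafe structure
def writer(k):
    probed_access("shared-dict", lambda: shared.__setitem__(k, k))

t1 = threading.Thread(target=writer, args=(1,), name="t1")
t2 = threading.Thread(target=writer, args=(2,), name="t2")
t1.start(); t2.start(); t1.join(); t2.join()
```

Because both writers enter the probe within the injected delay window, the probe reports the pair as a potential thread-safety violation.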

DOI: 10.1109/ICSE-Companion58688.2023.00029


Avgust: A Tool for Generating Usage-Based Tests from Videos of App Executions

作者: Talebipour, Saghar and Park, Hyojae and Baral, Kesina and Yee, Leon and Khan, Safwat Ali and Moran, Kevin and Brun, Yuriy and Medvidovic, Nenad and Zhao, Yixue
关键词: AI/ML, test generation, mobile testing, UI understanding, mobile application

Abstract

Creating UI tests for mobile applications is a difficult and time-consuming task. As such, a considerable amount of work has been carried out to automate the generation of mobile tests, largely focused on the goals of maximizing code coverage or finding crashes. However, comparatively fewer automated techniques have been proposed to generate a highly sought-after type of test: usage-based tests. These tests exercise targeted app functionalities for activities such as regression testing. In this paper, we present the Avgust tool for automating the construction of usage-based tests for mobile apps. Avgust learns usage patterns from videos of app executions collected by beta testers or crowd-workers, translates these into an app-agnostic state-machine encoding, and then uses this encoding to generate new test cases for an unseen target app. We evaluated Avgust on 374 videos of use cases from 18 popular apps and found that it can successfully exercise the desired usage in 69% of the tests. Avgust is an open-source tool available at https://github.com/felicitia/UsageTesting-Repo/tree/demo. A video illustrating the capabilities of Avgust can be found at: https://youtu.be/LPICxVd0YAg.

DOI: 10.1109/ICSE-Companion58688.2023.00030


DeepLog: Deep-Learning-Based Log Recommendation

作者: Zhang, Yang and Chang, Xiaosong and Fang, Lining and Lu, Yifan
关键词: similarity analysis, recommendation, deep learning, log location

Abstract

Log recommendation plays a vital role in analyzing run-time issues, including anomaly detection, performance monitoring, and security evaluation. However, existing deep-learning-based approaches for log recommendation suffer from insufficient features and low F1 scores. To this end, this paper proposes a prototype called DeepLog to recommend log locations based on a deep learning model. DeepLog parses the source code into an abstract syntax tree and then converts each method into a block hierarchical tree, from which DeepLog extracts both semantic and syntactic features. By doing this, we construct a dataset with more than 110K samples. DeepLog employs a double-branched neural network model to recommend log locations. We evaluate the effectiveness of DeepLog by answering four research questions. The experimental results demonstrate that it can recommend 8,725 logs for 23 projects and that the F1 of DeepLog is 28.17% higher than that of existing approaches, improving on the state of the art.

DOI: 10.1109/ICSE-Companion58688.2023.00031


ShellFusion: An Answer Generator for Shell Programming Tasks via Knowledge Fusion

作者: Chen, Zhongqi and Zhang, Neng and Si, Pengyue and Chen, Qinde and Liu, Chao and Zheng, Zibin
关键词: knowledge fusion, answer generation, shell programming

Abstract

Shell programming is widely used to accomplish various tasks in Unix and Linux platforms. However, the large number of shell commands available, e.g., 50,000+ commands are documented in the Ubuntu Manual Pages (MPs), makes it a big challenge for programmers to find appropriate commands for a task. Although there are some tutorials (e.g., TLDR) with examples manually created to address the challenge, the tutorials only cover a limited number of frequently used commands for shell beginners and provide limited support for users to search commands by a task. In this paper, we introduce a novel web-based tool, ShellFusion, which can automatically generate comprehensive answers (including relevant commands, scripts, and explanations) for shell programming tasks by fusing multi-source knowledge mined from Q&A posts, Ubuntu MPs, and TLDR tutorials. Our evaluation on 434 shell programming tasks shows that ShellFusion significantly outperforms the state-of-the-art approaches by at least 179.6% in terms of MRR@K and MAP@K. A user study conducted with 20 shell programmers further shows that ShellFusion can help users address programming tasks more efficiently and accurately. ShellFusion Tool: http://shellfusion.cn/ Demo Video: https://youtu.be/P0YJzpKBmnA
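The MRR@K metric used in the evaluation can be computed with a short sketch. The task names, commands, and relevance judgments below are made up for illustration:

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean Reciprocal Rank@k: average over queries of 1/rank of the first
    relevant item within the top-k results (0 if none appears)."""
    total = 0.0
    for query, ranking in rankings.items():
        for i, item in enumerate(ranking[:k], start=1):
            if item in relevant[query]:
                total += 1.0 / i
                break
    return total / len(rankings)

# Made-up commands and relevance judgments for two tasks:
rankings = {"task1": ["cmdA", "cmdB"], "task2": ["cmdC", "cmdD"]}
relevant = {"task1": {"cmdB"}, "task2": {"cmdX"}}
score = mrr_at_k(rankings, relevant)
```

Here the first relevant command for task1 appears at rank 2 (reciprocal rank 0.5) and task2 has none in the top-k, giving a mean of 0.25.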

DOI: 10.1109/ICSE-Companion58688.2023.00032


AIRepair: A Repair Platform for Neural Networks

作者: Song, Xidan and Sun, Youcheng and Mustafa, Mustafa A. and Cordeiro, Lucas C.
关键词: No keywords

Abstract

We present AIRepair, a platform for repairing neural networks. It features the integration of existing network repair tools. Based on AIRepair, one can run different repair methods on the same model, thus enabling the fair comparison of different repair techniques. In this paper, we evaluate AIRepair with five recent repair methods on popular deep-learning datasets and models. Our evaluation confirms the utility of AIRepair, by comparing and analyzing the results from different repair techniques. A demonstration is available at https://youtu.be/UkKw5neeWhw.

DOI: 10.1109/ICSE-Companion58688.2023.00033


RIdiom: Automatically Refactoring Non-Idiomatic Python Code with Pythonic Idioms

作者: Zhang, Zejun and Xing, Zhenchang and Xu, Xiwei and Zhu, Liming
关键词: abstract syntax grammar, code refactoring, Python idioms

Abstract

Pythonic idioms are widely adopted in the Python community because of their advantages such as conciseness and performance. However, when Python programmers use pythonic idioms, they face many challenges such as being unaware of certain pythonic idioms or not knowing how to use them properly. Based on an analysis of 7,638 Python repositories on GitHub, we find that non-idiomatic Python code that can be refactored with pythonic idioms occurs frequently and widely. Unfortunately, there is no tool to automatically refactor such non-idiomatic code into idiomatic code. In this paper, we design and implement a tool called RIdiom to make Python code idiomatic with nine pythonic idioms. Python developers can not only refactor projects easily via a visual interface of the PyCharm plugin but also refactor projects using the command line without relying on an integrated development environment. We test and review over 4,115 refactorings applied to 1,065 Python projects from GitHub, and submit 90 pull requests for the 90 randomly sampled refactorings to 84 projects. These evaluations confirm the high accuracy, practicality, and usefulness of our refactoring tool on real-world Python code. Demo Tool: https://github.com/idiomaticrefactoring/RIdiom Demo Video: https://youtu.be/KG-nXGR8DIA
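The kind of rewrite involved can be illustrated with the list-comprehension idiom, one widely cited pythonic idiom (this generic before/after pair is our own example, not RIdiom's output):

```python
# Non-idiomatic: building a list with an explicit loop and append.
def squares_loop(xs):
    out = []
    for x in xs:
        out.append(x * x)
    return out

# Idiomatic: the equivalent list-comprehension form such a refactoring targets.
def squares_idiomatic(xs):
    return [x * x for x in xs]
```

An automatic refactoring must preserve behavior, so both forms return identical results on every input.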

DOI: 10.1109/ICSE-Companion58688.2023.00034


Seldonian Toolkit: Building Software with Safe and Fair Machine Learning

作者: Hoag, Austin and Kostas, James E. and Silva, Bruno Castro da and Thomas, Philip S. and Brun, Yuriy
关键词: No keywords

Abstract

We present the Seldonian Toolkit, which enables software engineers to integrate provably safe and fair machine learning algorithms into their systems. Software systems that use data and machine learning are routinely deployed in a wide range of settings, including medical applications, autonomous vehicles, the criminal justice system, and hiring processes. These systems, however, can produce unsafe and unfair behavior, such as suggesting potentially fatal medical treatments, making racist or sexist predictions, or facilitating radicalization and polarization. To reduce these undesirable behaviors, software engineers need the ability to easily integrate their machine-learning-based systems with domain-specific safety and fairness requirements defined by domain experts, such as doctors and hiring managers. The Seldonian Toolkit provides special machine learning algorithms that enable software engineers to incorporate such expert-defined requirements of safety and fairness into their systems, while provably guaranteeing those requirements will be satisfied. A video demonstrating the Seldonian Toolkit is available at https://youtu.be/wHR-hDm9jX4/.

DOI: 10.1109/ICSE-Companion58688.2023.00035


What Would You Do? An Ethical AI Quiz

作者: Teo, Wei and Teoh, Ze and Arabi, Dayang Abang and Aboushadi, Morad and Lai, Khairenn and Ng, Zhe and Pant, Aastha and Hoda, Rashina and Tantithamthavorn, Chakkrit and Turhan, Burak
关键词: ethical AI quiz, self-assessment tools, AI practitioners, AI ethics, ethics

Abstract

The resurgence of Artificial Intelligence (AI) has been accompanied by a rise in ethical issues. AI practitioners either face challenges in making ethical choices when designing AI-based systems or are not aware of such challenges in the first place. Increasing the level of awareness and understanding of the perceptions of those who develop AI systems is a critical step toward mitigating ethical issues in AI development. Motivated by these challenges and needs, and by the lack of engaging approaches to address them, we developed an interactive, scenario-based ethical AI quiz. It allows AI practitioners, including software engineers who develop AI systems, to self-assess their awareness and perceptions about AI ethics. The experience of taking the quiz, and the feedback it provides, will help AI practitioners understand the gap areas and improve their overall ethical practice in everyday development scenarios. To demonstrate these expected outcomes and the relevance of our tool, we also share a preliminary user study. The video demo can be found at https://zenodo.org/record/7601169#.Y9xgA-xBxhF.

DOI: 10.1109/ICSE-Companion58688.2023.00036


A Web-Based Tool for Using Storyboard of Android Apps

作者: Zhang, Yuxin and Chen, Sen and Fan, Lingling
关键词: Android app, app review, GUI exploration, app exploration, storyboard

Abstract

The development team usually makes painstaking efforts to review and analyze many existing apps with similar purposes, for tasks such as competitive analysis, design recommendation, code generation, and app testing. To assist different roles in doing these tasks, in our prior work, two advanced approaches (i.e., StoryDroid and StoryDistiller) were proposed to automatically generate the storyboards for Android apps with rich features such as UI pages, UI components, layout code, and logic code. These approaches both aim at exploring and parsing as many app pages as possible but give little consideration to the presentation and interpretability of the results for different users such as PMs, designers, and developers. To improve usability and scalability, this paper presents a web-based offline tool, named StoryDroid+, which provides an operation-friendly platform for using storyboards and helps different stakeholders (e.g., designers, product managers, developers) explore and understand apps from different perspectives through rich visual pages. The tool and datasets are available at: https://github.com/tjusenchen/StoryDroid and the demonstration video can be found at: https://youtu.be/prszxRdkdYU.

DOI: 10.1109/ICSE-Companion58688.2023.00037


InputGen: A Tool for Automatic Generation of Prototype Inputs to Support Rapid Requirements Validation

作者: Chang, Shuanglong and Gao, Juntao and Yang, Yilong
关键词: automatically generate, input data, requirements validation, requirements model, prototype

Abstract

Prototyping is an effective and efficient way to validate requirements and avoid introducing errors in the early stages of software development. Our previous work RM2PT can automatically generate prototypes from requirements models to support incremental and rapid requirements validation. Although stakeholders can validate requirements by executing the system operations of the generated prototype, the input parameters of system operations still need to be typed manually by the stakeholders. Unlike in software testing, the input of a system operation must be valid and reasonable to the stakeholders under the specific scenario of a use case. This is usually hard to achieve for stakeholders who have little knowledge of, and concern for, the state and interface of the target system. In this paper, we propose a tool named InputGen to automatically refactor and enhance the prototype generated by RM2PT. The enhanced prototype can automatically generate valid input data for the system operations during requirements validation. In addition, the enhanced prototype provides an external interface to load initial data from an external file, which can save the administrator the time of modeling the data functionality. We demonstrate that the enhanced prototype can improve requirements validation efficiency by 13.77 times over the originally generated prototype from RM2PT. Overall, the results were satisfactory. The proposed tool can be further extended and applied to requirements validation in the software industry. The tool can be downloaded at https://rm2pt.com/advs/inputgen, and a demo video showcasing its features is at https://youtu.be/iR_ojHyzDvQ

DOI: 10.1109/ICSE-Companion58688.2023.00038


FlaPy: Mining Flaky Python Tests at Scale

作者: Gruber, Martin and Fraser, Gordon
关键词: flakiness detection, flaky tests

Abstract

Flaky tests obstruct software development, and studying and proposing mitigations against them has therefore become an important focus of software engineering research. To conduct sound investigations on test flakiness, it is crucial to have large, diverse, and unbiased datasets of flaky tests. A common method to build such datasets is by rerunning the test suites of selected projects multiple times and checking for tests that produce different outcomes. While using this technique on a single project is mostly straightforward, applying it to a large and diverse set of projects raises several implementation challenges such as (1) isolating the test executions, (2) supporting multiple build mechanisms, (3) achieving feasible run times on large datasets, and (4) analyzing and presenting the test outcomes. To address these challenges we introduce FlaPy, a framework for researchers to mine flaky tests in a given or automatically sampled set of Python projects by rerunning their test suites. FlaPy isolates the test executions using containerization and fresh execution environments to simulate real-world CI conditions and to achieve accurate results. By supporting multiple dependency installation strategies, it promotes diversity among the studied projects. FlaPy supports parallelizing the test executions using SLURM, making it feasible to scan thousands of projects for test flakiness. Finally, FlaPy analyzes the test outcomes to determine which tests are flaky and depicts the results in a concise table. A demo video of FlaPy is available at https://youtu.be/ejy-be-FvDY
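The basic detection step of a rerun-based flakiness miner can be sketched in a few lines (the test names and outcome model below are invented for the example; FlaPy additionally handles containerized execution, dependency installation, and SLURM parallelization):

```python
import random

def find_flaky(tests, reruns=10):
    """Rerun each test several times and flag those whose outcomes differ
    across runs, which is the core detection step of rerun-based mining."""
    flaky = set()
    for name, test in tests.items():
        outcomes = {test() for _ in range(reruns)}
        if len(outcomes) > 1:
            flaky.add(name)
    return flaky

rng = random.Random(0)   # seeded so this toy demo is reproducible
tests = {
    "test_stable": lambda: "pass",
    "test_flaky": lambda: "pass" if rng.random() < 0.5 else "fail",
}
flaky = find_flaky(tests)
```

A real miner must also isolate runs from each other, since shared state between reruns can mask or fabricate flakiness, which is why FlaPy uses fresh containerized environments.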

DOI: 10.1109/ICSE-Companion58688.2023.00039


TECHSUMBOT: A Stack Overflow Answer Summarization Tool for Technical Query

作者: Yang, Chengran and Xu, Bowen and Liu, Jiakun and Lo, David
关键词: question retrieval, summarization

Abstract

Stack Overflow is a popular platform for developers to seek solutions to programming-related problems. However, prior studies identified that developers may suffer from the redundant, useless, and incomplete information retrieved by the Stack Overflow search engine. To help developers better utilize Stack Overflow knowledge, researchers proposed tools to summarize answers to a Stack Overflow question. However, existing tools use hand-crafted features to assess the usefulness of each answer sentence and fail to remove semantically redundant information in the result. Besides, existing tools only focus on a certain programming language and cannot retrieve up-to-date, newly posted knowledge from Stack Overflow. In this paper, we propose TechSumBot, an automatic answer summary generation tool for technical problems. Given a question, TechSumBot first retrieves answers using the Stack Overflow search engine; then TechSumBot 1) ranks each answer sentence based on its usefulness, 2) estimates the centrality of each sentence with respect to all candidates, and 3) removes semantically redundant information. Finally, TechSumBot returns the top 5 ranked answer sentences as the answer summary. We implement TechSumBot in the form of a search engine website. To evaluate TechSumBot both automatically and manually, we construct the first Stack Overflow multi-answer summarization benchmark and design a manual evaluation study to assess the effectiveness of TechSumBot and state-of-the-art baselines from the NLP and SE domains. Both results indicate that the summaries generated by TechSumBot are more diverse, useful, and similar to the ground-truth summaries. Tool Link: www.techsumbot.com Video Link: https://youtube.com/watch?v=ozuJOp_vILM Replication Package: https://github.com/TechSumBot/TechSumBot
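The redundancy-removal step can be sketched as greedy selection with a word-overlap filter. This is a simplified stand-in (Jaccard over surface words, with made-up sentences); TechSumBot's actual step works on semantic representations:

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def select_summary(ranked_sentences, k=2, redundancy_threshold=0.5):
    """Greedy redundancy removal: walk sentences in usefulness order and skip
    any that overlaps too much with a sentence already picked."""
    picked = []
    for sentence in ranked_sentences:
        if all(jaccard(sentence, p) < redundancy_threshold for p in picked):
            picked.append(sentence)
        if len(picked) == k:
            break
    return picked

ranked = [
    "use the map function to transform a list",
    "the map function can transform a list",    # near-duplicate of the first
    "list comprehensions are often clearer",
]
summary = select_summary(ranked)
```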

DOI: 10.1109/ICSE-Companion58688.2023.00040


Randomized Differential Testing of RDF Stores

作者: Yang, Rui and Zheng, Yingying and Tang, Lei and Dou, Wensheng and Wang, Wei and Wei, Jun
关键词: SPARQL, differential testing, RDF store

Abstract

As a special kind of graph database system, RDF stores have been widely used in many applications, e.g., knowledge graphs and the semantic web. RDF stores utilize SPARQL as their standardized query language to store and retrieve RDF graphs. Incorrect implementations of RDF stores can introduce logic bugs that cause RDF stores to return incorrect query results. These logic bugs can lead to severe consequences and are likely to go unnoticed by developers. However, no available tools can detect logic bugs in RDF stores. In this paper, we propose RD2, a Randomized Differential testing approach for RDF stores, to reveal discrepancies among RDF stores, which indicate potential logic bugs. The core idea of RD2 is to build an equivalent RDF graph for multiple RDF stores and verify whether they return the same query result for a given SPARQL query. Guided by the SPARQL syntax and the generated RDF graph, we automatically generate syntactically valid SPARQL queries, which return non-empty query results with high probability. We further unify the formats of SPARQL query results from different RDF stores and find discrepancies among them. We evaluate RD2 on three popular and widely-used RDF stores. In total, we have detected 5 logic bugs in them. A video demonstration of RD2 is available at https://youtu.be/da7XlsdbRR4.
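The comparison step can be sketched as a generic differential-testing harness. The in-memory "stores" and query names below are toys invented for the example, not actual RDF engines; note the order-insensitive normalization, which mirrors unifying result formats before comparing:

```python
def differential_test(stores, queries):
    """Run each query on all stores, normalise results (order-insensitive),
    and report queries on which the stores disagree: candidate logic bugs."""
    discrepancies = []
    for q in queries:
        results = {name: frozenset(run(q)) for name, run in stores.items()}
        if len(set(results.values())) > 1:
            discrepancies.append((q, results))
    return discrepancies

# Toy stand-ins for real RDF stores: each maps a "query" to result rows.
store_a = {"q1": [(1,), (2,)], "q2": [(3,)]}
store_b = {"q1": [(2,), (1,)], "q2": []}      # q2 differs: a candidate bug
stores = {"storeA": store_a.get, "storeB": store_b.get}

found = differential_test(stores, ["q1", "q2"])
```

Here q1 agrees once row order is normalised away, while q2 is flagged because one store returns a row the other misses.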

DOI: 10.1109/ICSE-Companion58688.2023.00041


CryptOpt: Automatic Optimization of Straightline Code

作者: Kuepper, Joel and Wu, David and Erbsen, Andres and Gross, Jason and Conoly, Owen and Sun, Chuyue and Tian, Samuel and Chlipala, Adam and Chuengsatiansup, Chitchanok and Genkin, Daniel and Wagner, Markus and Yarom, Yuval
关键词: elliptic curve cryptography, local search, search based software engineering, automatic performance optimization

Abstract

Manual engineering of high-performance implementations typically consumes many resources and requires in-depth knowledge of the hardware. Compilers try to address these problems; however, they are limited by design in what they can do. To address this, we present CryptOpt, an automatic optimizer for long stretches of straightline code. Experimental results across eight hardware platforms show that CryptOpt achieves a speedup factor of up to 2.56 over current off-the-shelf compilers.
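The style of search behind such optimizers can be sketched as randomised hill climbing over candidate programs. The "cost" below (inversions of a permutation) is a toy stand-in for measured running time of straightline code, and all names are invented for the example:

```python
import random

def local_search(start, mutate, cost, steps=200, seed=0):
    """Randomised hill climbing, as used by search-based optimisers:
    keep a mutation only if it does not worsen the measured cost."""
    rng = random.Random(seed)
    best, best_cost = start, cost(start)
    for _ in range(steps):
        trial = mutate(best, rng)
        trial_cost = cost(trial)
        if trial_cost <= best_cost:
            best, best_cost = trial, trial_cost
    return best, best_cost

# Toy objective (hypothetical): reorder "instructions" to minimise inversions,
# playing the role of a benchmarked running time.
def cost(seq):
    return sum(1 for i in range(len(seq))
                 for j in range(i + 1, len(seq)) if seq[i] > seq[j])

def mutate(seq, rng):
    s = list(seq)
    i, j = rng.randrange(len(s)), rng.randrange(len(s))
    s[i], s[j] = s[j], s[i]   # swap two "instructions"
    return s

best, best_cost = local_search([5, 3, 1, 4, 2], mutate, cost)
```

Real optimizers of this kind measure actual execution time on the target hardware as the cost, which is what lets them beat a compiler's static cost model.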

DOI: 10.1109/ICSE-Companion58688.2023.00042


WirelessDT: A Digital Twin Platform for Real-Time Evaluation of Wireless Software Applications

作者: Lai, Zhongzheng and Yuan, Dong and Chen, Huaming and Zhang, Yu and Bao, Wei
关键词: emulation tool, wireless signal emulation, wireless software evaluation, digital twin

Abstract

Wireless technology has become one of the most important parts of our daily routine. Besides being used for communication, the wireless signal has been applied to various Wireless Software Applications (WSAs). The signal fluctuation caused by the measurement system or environmental dynamics can significantly influence WSAs’ performance, making it challenging to evaluate WSAs in real-world scenarios. To overcome these challenges, we propose WirelessDT, a wireless digital twin platform, using digital twin and real-time ray tracing technologies to emulate wireless signal propagation and generate emulation data for real-time WSA evaluation. In this demonstration, we evaluate a wireless indoor localisation mobile application with two typical prediction algorithms: 1) Kalman Filter-based Trilateration and 2) Deep Recurrent Neural Network, as a case study to demonstrate the capabilities of WirelessDT. The source code is available at https://github.com/codelzz/WirelessDT, and the demonstration video is available at https://youtu.be/9Kl-3jgMBUA.
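The first baseline algorithm above, trilateration, can be sketched without the Kalman-filter smoothing step. The closed-form linearization below is a standard textbook formulation for three 2D anchors, not WirelessDT's code:

```python
def trilaterate(anchors, dists):
    # anchors: [(x1, y1), (x2, y2), (x3, y3)]; dists: measured ranges to each.
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # Subtracting the first circle equation from the other two removes the
    # quadratic terms, leaving a 2x2 linear system A [x, y]^T = b.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1 ** 2 - d2 ** 2 + x2 ** 2 - x1 ** 2 + y2 ** 2 - y1 ** 2
    b2 = d1 ** 2 - d3 ** 2 + x3 ** 2 - x1 ** 2 + y3 ** 2 - y1 ** 2
    det = a11 * a22 - a12 * a21  # non-zero iff anchors are not collinear
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)
```

In practice the measured ranges fluctuate, which is exactly where a Kalman filter over successive position estimates earns its keep.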

DOI: 10.1109/ICSE-Companion58688.2023.00043


MROS: A Framework for Robot Self-Adaptation

作者: Silva, Gustavo Rezende and Garcia, Nadia Hammoudeh and Bozhinoski, Darko and Deshpande, Harshavardhan and Oviedo, Mario Garzon and Wasowski, Andrzej and Montero, Mariano Ramírez
关键词: robotics, MROS, metacontrol, self-adaptation, self-adaptive systems

Abstract

Self-adaptation can be used in robotics to increase system robustness and reliability. This work describes the Metacontrol method for self-adaptation in robotics. Particularly, it details how the MROS (Metacontrol for ROS Systems) framework implements and packages Metacontrol, and it demonstrates how MROS can be applied in a navigation scenario where a mobile robot navigates a factory floor. Video: https://www.youtube.com/watch?v=ISe9aMskJuE

DOI: 10.1109/ICSE-Companion58688.2023.00044


Task Context: A Tool for Predicting Code Context Models for Software Development Tasks

作者: Wang, Yifeng and Lin, Yuhang and Wan, Zhiyuan and Yang, Xiaohu
关键词: context prediction, interaction, task, context models

Abstract

A code context model consists of code elements and their relations relevant to a development task. Previous studies found that the explicit formation of code context models can benefit software development practices, e.g., code navigation and searching. However, little focus has been put on how to proactively form code context models. In this paper, we propose a tool named Task Context for predicting code context models and implement it as an Eclipse plug-in. Task Context uses the abstract topological patterns of how developers investigate structurally connected code elements when performing tasks. The tool captures the code elements navigated and searched by a developer to construct an initial code context model. The tool then applies abstract topological patterns with the initial code context model as input and recommends code elements up to 3 steps away in the code structure from the initial code context model. The experimental results indicate that our approach can predict code context models effectively, with a significantly higher F-measure than the state-of-the-art (0.57 over 0.23 on average). Furthermore, the user study suggests that our tool can help practitioners complete development tasks faster and more often as compared to the standard Eclipse mechanism. Demo video: https://youtu.be/3yEPh6uvHI8. Repository: https://github.com/icsoft-zju/Task_Context
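The "up to 3 steps away in the code structure" expansion is essentially a bounded breadth-first search from the initial context model. A minimal sketch, where the adjacency-map representation of the code structure graph is an assumption and this is not Task Context's implementation:

```python
from collections import deque

def recommend(graph, seed, max_steps=3):
    # graph: {code element: [structurally connected code elements]}
    # seed: the initial code context model (elements the developer
    # navigated or searched). Returns new elements reachable within
    # max_steps structural hops of any seed element.
    seen = set(seed)
    frontier = deque((elem, 0) for elem in seed)
    recommendations = []
    while frontier:
        elem, depth = frontier.popleft()
        if depth == max_steps:
            continue  # do not expand beyond the step budget
        for nxt in graph.get(elem, []):
            if nxt not in seen:
                seen.add(nxt)
                recommendations.append(nxt)
                frontier.append((nxt, depth + 1))
    return recommendations
```

The real tool additionally ranks candidates by learned topological patterns; the BFS only bounds the candidate set.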

DOI: 10.1109/ICSE-Companion58688.2023.00045


pytest-Inline: An Inline Testing Tool for Python

作者: Liu, Yu and Thurston, Zachary and Han, Alan and Nie, Pengyu and Gligoric, Milos and Legunsen, Owolabi
关键词: pytest, Python, software testing, inline tests

Abstract

We present pytest-inline, the first inline testing framework for Python. We recently proposed inline tests to make it easier to test individual program statements. But, there is no framework-level support for developers to write inline tests in Python. To fill this gap, we design and implement pytest-inline as a plugin for pytest, the most popular Python testing framework. Using pytest-inline, a developer can write an inline test by assigning test inputs to variables in a target statement and specifying the expected test output. Then, pytest-inline runs each inline test and fails if the target statement’s output does not match the expected output. In this paper, we describe our design of pytest-inline, the testing features that it provides, and the intended use cases. Our evaluation on inline tests that we wrote for 80 target statements from 31 open-source Python projects shows that using pytest-inline incurs negligible overhead, at 0.012x. pytest-inline is integrated into the pytest-dev organization, and a video demo is at https://www.youtube.com/watch?v=pZgiAxR_uJg.
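The core idea, binding test inputs to variables in a target statement and checking its output, can be illustrated with a tiny stand-in helper. Note that this is not pytest-inline's actual API, just a minimal sketch of the concept:

```python
class InlineTest:
    # Minimal stand-in for the inline-test idea: bind test inputs for a
    # target statement, execute the statement, then assert on its output.
    def __init__(self):
        self.env = {}

    def given(self, name, value):
        # Assign a test input to a variable used by the target statement.
        self.env[name] = value
        return self

    def check_eq(self, stmt, var, expected):
        # Run the target statement with the given inputs and compare the
        # named output variable against the expected value.
        local = dict(self.env)
        exec(stmt, {}, local)
        assert local[var] == expected, f"{var}={local[var]!r}, expected {expected!r}"
        return self

# Inline test for the target statement `y = x.strip().lower()`:
InlineTest().given("x", "  Hello ").check_eq("y = x.strip().lower()", "y", "hello")
```

The real plugin integrates with pytest's collection and reporting, so inline tests run and fail like ordinary test cases.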

DOI: 10.1109/ICSE-Companion58688.2023.00046


DaMAT: A Data-Driven Mutation Analysis Tool

作者: Viganò
关键词: mutation analysis, CPS

Abstract

We present DaMAT, a tool that implements data-driven mutation analysis. In contrast to traditional code-driven mutation analysis tools, it mutates (i.e., modifies) the data exchanged by components instead of the source code of the software under test. Such an approach helps ensure that test suites appropriately exercise component interoperability, which is essential for safety-critical cyber-physical systems. A user-provided fault model drives the mutation process. We have successfully evaluated DaMAT on software controlling a microsatellite and a set of libraries used in deployed CubeSats. A demo video of DaMAT is available at https://youtu.be/s5M52xWCj84
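Data-driven mutation as described, mutating exchanged data under a user-provided fault model and scoring which mutants the test suite kills, can be sketched generically. The message shape and fault-model encoding below are hypothetical, not DaMAT's format:

```python
def mutate_message(msg, fault_model):
    # Apply a user-provided fault model to one message exchanged between
    # components. fault_model: {field name: callable(old value) -> faulty value}
    mutated = dict(msg)
    for field, fault in fault_model.items():
        if field in mutated:
            mutated[field] = fault(mutated[field])
    return mutated

def mutation_score(test_suite, messages, fault_model):
    # A mutant is "killed" if at least one test fails (returns False)
    # when fed the mutated data instead of the original.
    killed = 0
    for msg in messages:
        mutant = mutate_message(msg, fault_model)
        if any(not test(mutant) for test in test_suite):
            killed += 1
    return killed / len(messages)
```

A low score signals that the tests never inspect the mutated fields, i.e., component interoperability is under-exercised.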

DOI: 10.1109/ICSE-Companion58688.2023.00047


Burt: A Chatbot for Interactive Bug Reporting

作者: Song, Yang and Mahmud, Junayed and De Silva, Nadeeshan and Zhou, Ying and Chaparro, Oscar and Moran, Kevin and Marcus, Andrian and Poshyvanyk, Denys
关键词: No keywords

Abstract

This paper introduces Burt, a web-based chatbot for interactive reporting of Android app bugs. Burt is designed to assist Android app end-users in reporting high-quality defect information using an interactive interface. Burt guides the users in reporting essential bug report elements, i.e., the observed behavior, expected behavior, and the steps to reproduce the bug. It verifies the quality of the text written by the user and provides instant feedback. In addition, Burt provides graphical suggestions that the users can choose as alternatives to textual descriptions.We empirically evaluated Burt, asking end-users to report bugs from six Android apps. The reporters found that Burt’s guidance and automated suggestions and clarifications are useful and Burt is easy to use. Burt is an open-source tool, available at github.com/sea-lab-wm/burt/tree/tool-demo.A video showing the full capabilities of Burt can be found at https://youtu.be/SyfOXpHYGRo.

DOI: 10.1109/ICSE-Companion58688.2023.00048


Patchmatch: A Tool for Locating Patches of Open Source Project Vulnerabilities

作者: Shen, Kedi and Zhang, Yun and Bao, Lingfeng and Wan, Zhiyuan and Li, Zhuorong and Wu, Minghui
关键词: manage tool, model application, vulnerability

Abstract

With the rapid development of open source projects, the continuous emergence of vulnerabilities brings great challenges to project security. Security patches are one of the best ways to deal with vulnerabilities, but they are currently not well applied. Although sites such as CVE/NVD provide information about vulnerabilities, many of the vulnerabilities they disclose are not accompanied by security patches, which makes it difficult for developers to apply patches. In this work, we first propose a ranking method that extracts multidimensional features from the auxiliary information in CVE/NVD. Going a step further, we propose VCmatch, a model that mines semantic information from vulnerability descriptions and code commit messages and achieves a good recall rate and cross-project applicability. On this basis, we build Patchmatch, a tool that helps developers quickly locate patches. Given a vulnerability, Patchmatch can identify the implicit patches among the commits in the code repository. Patchmatch also provides a visual web page for information statistics and a display web page that helps developers manage the various kinds of information in the code repository. A demo video of Patchmatch is at https://www.youtube.com/watch?v=nOBSMFtZV8A. Patchmatch is available at https://github.com/Sklud1456/patchmatch.
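Ranking candidate commits by how well their messages match a vulnerability description can be illustrated with simple token overlap. This crude Jaccard scoring is only a stand-in for VCmatch's learned semantic matching:

```python
def tokens(text):
    # Crude tokenization: lowercase, whitespace-split, deduplicated.
    return set(text.lower().split())

def rank_commits(vuln_description, commits):
    # commits: [(commit_id, commit_message)]. Rank commits by Jaccard
    # similarity between the vulnerability description and each message;
    # the top-ranked commits are the patch candidates shown to developers.
    desc = tokens(vuln_description)
    def score(message):
        t = tokens(message)
        union = desc | t
        return len(desc & t) / len(union) if union else 0.0
    return sorted(commits, key=lambda c: score(c[1]), reverse=True)
```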

DOI: 10.1109/ICSE-Companion58688.2023.00049


LicenseRec: Knowledge Based Open Source License Recommendation for OSS Projects

作者: Xu, Weiwei and Wu, Xin and He, Runzhi and Zhou, Minghui
关键词: open source license recommendation, open source license, open source software

Abstract

Open source license is a prerequisite for open source software, which regulates the use, modification, redistribution, and attribution of the software. Open source license is crucial to the community development and commercial interests of an OSS project, yet choosing a proper license from hundreds of licenses remains challenging. Tools assisting developers to understand the terms and pick the right license have been emerging, while inferring license compatibility on the dependency tree and satisfying the complex needs of developers are beyond the capability of most of them. Thus we propose LicenseRec, an open source license recommendation tool that helps to bridge the gap. LicenseRec performs fine-grained license compatibility checks on OSS projects’ code and dependencies, and assists developers in choosing the optimal license through an interactive wizard with guidelines covering three aspects: personal open source style, business pattern, and community development. The usefulness of LicenseRec is confirmed by the consistent positive feedback from 10 software developers with academic and industrial backgrounds. Our tool is accessible at https://licenserec.com and a video showcasing the tool is available at https://video.licenserec.com.
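A dependency compatibility check of the kind LicenseRec performs can be sketched with a compatibility table. The table below is deliberately simplified and partly hypothetical; real license compatibility involves far more licenses, directionality subtleties, and linking exceptions:

```python
# Simplified, illustrative table: which dependency licenses may be combined
# into a project released under the key license (NOT legal advice).
COMPATIBLE_DEPS = {
    "MIT": {"MIT", "BSD-3-Clause"},
    "Apache-2.0": {"MIT", "BSD-3-Clause", "Apache-2.0"},
    "GPL-3.0": {"MIT", "BSD-3-Clause", "Apache-2.0", "GPL-3.0"},
}

def check_project(project_license, dependency_licenses):
    # Return the dependency licenses that conflict with the project license,
    # i.e., the findings a recommendation wizard would surface to the user.
    allowed = COMPATIBLE_DEPS.get(project_license, set())
    return [lic for lic in dependency_licenses if lic not in allowed]
```

Because compatibility is not symmetric (a GPL-3.0 project may use MIT code, but not vice versa), the table is keyed by the project's license rather than being a flat pairwise set.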

DOI: 10.1109/ICSE-Companion58688.2023.00050


Detecting Scattered and Tangled Quality Concerns in Source Code to Aid Maintenance and Evolution Tasks

作者: Krasniqi, Rrezarta
关键词: software maintenance, scattered quality concerns, tangled quality concerns, quality bugs, quality concerns

Abstract

Quality concerns, such as reliability, security, and usability concerns, among others, are typically well-defined and prioritized at the requirement level with the set goal of achieving high quality, robust, user-friendly, and trustworthy systems. However, quality concerns are challenging to address at the implementation level. Often they are scattered across multiple modules in the codebase. In other instances, they are tangled with functional ones within a single module. Reasoning about quality concerns and their interactions with functional ones while being hindered by the effects of scattered and tangled code can only yield more unseen problems. For example, developers can inadvertently introduce new bugs or wrongly implement new features that deviate from original system requirement specifications. The goal of this thesis is twofold. First, we aim to detect quality concerns implemented at code level to differentiate them from functional ones when they are scattered across the codebase. Second, we aim to untangle quality concerns from unrelated changes to gain a detailed knowledge about the history of specific quality changes. This knowledge is crucial to support consistency between requirements and design and to verify architecture conformance. From the practical stance, developers could gain a breadth of understanding about quality concerns and their relations with other artifacts. Thus, with more confidence, they could perform code modifications, improve module traceability, and provide a better holistic assessment of change impact analysis.

DOI: 10.1109/ICSE-Companion58688.2023.00051


Complementing Secure Code Review with Automated Program Analysis

作者: Charoenwet, Wachiraphan
关键词: automated program analysis, code review assistant, secure code review, modern code review

Abstract

Code review is an important activity in the software engineering process to reduce software defects before the production phase. It is crucial that software defects are identified as soon as they are introduced because their impact can be amplified if they are discovered in the later stages. However, previous studies have observed that a vulnerability, a software weakness that could be exploited by an attacker, can slip through the code review process because of the limited resources and security awareness of reviewers. Approaches such as automated program analysis have been recommended to address this problem. Yet, the capability of automated program analysis to augment human reviewers on security aspects remains unclear. This research project aims to investigate to what extent, and how, different automated program analysis approaches can complement human reviewers in the code review process.

DOI: 10.1109/ICSE-Companion58688.2023.00052


Automating Code Review

作者: Tufano, Rosalia
关键词: deep learning, code review

Abstract

Code reviews are popular in both industrial and open source projects. The benefits of code reviews are widely recognized and include better code quality and lower likelihood of introducing bugs. However, code review comes at the cost of spending developers’ time on reviewing their teammates’ code. The goal of this research is to investigate the possibility of using Deep Learning (DL) to automate specific code review tasks. We started by training vanilla Transformer models to learn code changes performed by developers during real code review activities. This gives the models the possibility to automatically (i) revise the code submitted for review without any input from the reviewer; and (ii) implement changes required to address a specific reviewer’s comment. While the preliminary results were encouraging, in this first work we tested DL models in rather simple code review scenarios, substantially simplifying the targeted problem. This was also due to the choices we made when designing both the technique and the experiments. Thus, in a subsequent work, we exploited a pre-trained Text-To-Text-Transfer-Transformer (T5) to overcome some of these limitations and experiment with DL models for code review automation in more realistic and challenging scenarios. The achieved results show the improvements brought by T5 both in terms of applicability (i.e., scenarios in which it can be applied) and performance. Despite this, we are still far from performance levels making these techniques deployable in practice, thus calling for additional research in this area, as we discuss in our future work agenda.

DOI: 10.1109/ICSE-Companion58688.2023.00053


Static Analysis for Android GDPR Compliance Assurance

作者: Khedkar, Mugdha
关键词: GDPR compliance, data protection and privacy, static analysis

Abstract

Many Android applications collect data from users. When they do, they must protect this collected data according to the current legal frameworks. Such data protection has become even more important since the European Union rolled out the General Data Protection Regulation (GDPR). App developers have limited tool support to reason about data protection throughout their app development process. Although many Android applications state a privacy policy, privacy policy compliance checks are currently manual, expensive, and prone to error. One of the major challenges in privacy audits is the significant gap between legal privacy statements (in English text) and technical measures that Android apps use to protect their user’s privacy. In this thesis, we will explore to what extent we can use static analysis to answer important questions regarding data protection. Our main goal is to design a tool based approach that aids app developers and auditors in ensuring data protection in Android applications, based on automated static program analysis.

DOI: 10.1109/ICSE-Companion58688.2023.00054


Incident Prevention through Reliable Changes Deployment

作者: Kapel, Eileen
关键词: incident prevention, change risk, traceability, service management, change management, incident management

Abstract

Ensuring the reliability of changes deployment is essential to prevent incidents in businesses that strongly depend on software and services. Incidents should be avoided since they may lead to customer dissatisfaction, financial losses and reputational damage. Currently, the majority of outages are being caused by changes, so we believe there is a need for a higher focus on the risk management pre-change deployment. This paper presents a research plan that proposes a risk management AIOps framework utilising real-world change, CI/CD pipeline and incident data for incident prevention through reliable changes deployment. This research will explore 1) obtaining background information on the current state of practice of service management with a case study on a software-defined business; 2) a risk management AIOps framework that utilises the traces of change, incident and CI/CD pipeline code for predicting the risk of changes deployment; and 3) testing the generalisability of the framework for reducing the risk of change deployment.

DOI: 10.1109/ICSE-Companion58688.2023.00055


Addressing Performance Regressions in DevOps: Can We Escape from System Performance Testing?

作者: Liao, Lizhi
关键词: performance engineering, performance modeling, field testing, performance regression root cause, performance regression

Abstract

Performance regression is an important type of performance issue in software systems. It indicates that the performance of the same features in the new version of the system becomes worse than that of previous versions, such as increased response time or higher resource utilization. In order to prevent performance regressions, current practices often rely on conducting extensive system performance testing before releasing the system into production, based on the testing results. However, faced with a great demand for resources and time to perform system performance testing, it is often challenging to adopt such approaches in the practice of fast-paced development and release cycles, e.g., DevOps. This thesis focuses on addressing software performance regressions in DevOps without relying on expensive system performance tests. More specifically, I first propose a series of approaches to helping developers detect performance regressions and locate their root causes by only utilizing the readily-available operational data when the software system is running in the field and used by real end users. I then leverage small-scale performance testing and architectural modeling to estimate the impact of source code changes on the end-to-end performance of the system in order to detect performance regressions early in the software development phase. Through various case studies on open-source projects and successful adoptions by our industrial research collaborator, we expect that our study will provide helpful insights for researchers and practitioners who are interested in addressing performance regressions in DevOps without expensive system performance testing.
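Detecting a regression from readily available field data can be as simple as comparing response-time medians across versions. The 1.2x threshold below is an arbitrary illustration, not the thesis's actual statistical test:

```python
def detect_regression(baseline_ms, current_ms, threshold=1.2):
    # baseline_ms / current_ms: response-time samples (milliseconds)
    # collected from field operation of the old and new versions.
    # Flag a regression when the median response time grows by more
    # than `threshold`x; medians resist the outliers common in field data.
    def median(samples):
        s = sorted(samples)
        n = len(s)
        return (s[n // 2] + s[(n - 1) // 2]) / 2
    return median(current_ms) > threshold * median(baseline_ms)
```

A production-grade version would use a proper statistical test and account for workload differences between the two measurement windows.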

DOI: 10.1109/ICSE-Companion58688.2023.00056


Toward More Effective Deep Learning-Based Automated Software Vulnerability Prediction, Classification, and Repair

作者: Fu, Michael
关键词: software security, software vulnerability, cybersecurity

Abstract

Software vulnerabilities are prevalent in software systems and the unresolved vulnerable code may cause system failures or serious data breaches. To enhance security and prevent potential cyberattacks on software systems, it is critical to (1) early detect vulnerable code, (2) identify its vulnerability type, and (3) suggest corresponding repairs. Recently, deep learning-based approaches have been proposed to predict those tasks based on source code. In particular, software vulnerability prediction (SVP) detects vulnerable source code; software vulnerability classification (SVC) identifies vulnerability types to explain detected vulnerable programs; neural machine translation (NMT)-based automated vulnerability repair (AVR) generates patches to repair detected vulnerable programs. However, existing SVPs require much effort to inspect their coarse-grained predictions; SVCs encounter an unresolved data imbalance issue; AVRs are still inaccurate. I hypothesize that by addressing the limitations of existing SVPs, SVCs and AVRs, we can improve the accuracy and effectiveness of DL-based approaches for the aforementioned three prediction tasks. To test this hypothesis, I will propose (1) a finer-grained SVP approach that can point out vulnerabilities at the line level; (2) an SVC approach that mitigates the data imbalance issue; (3) NMT-based AVR approaches to address limitations of previous NMT-based approaches. Finally, I propose integrating these novel approaches into an open-source software security framework to promote the adoption of the DL-powered security tool in the industry.

DOI: 10.1109/ICSE-Companion58688.2023.00057


Enhancing Deep Reinforcement Learning with Executable Specifications

作者: Yerushalmi, Raz
关键词: domain expertise, rule-based specifications, scenario-based modeling, deep reinforcement learning, machine learning

Abstract

Deep reinforcement learning (DRL) has become a dominant paradigm for using deep learning to carry out tasks where complex policies are learned for reactive systems. However, these policies are “black-boxes”, e.g., opaque to humans and known to be susceptible to bugs. For example, it is hard — if not impossible — to guarantee that the trained DRL agent adheres to specific safety and fairness properties that may be required. This doctoral dissertation’s first and primary contribution is a novel approach to developing DRL agents, which will improve the DRL training process by pushing the learned policy toward high performance on its main task and compliance with such safety and fairness properties, guaranteeing a high probability of compliance while not compromising the performance of the resulting agent. The approach is realized by incorporating domain-specific knowledge captured as key properties defined by domain experts directly into the DRL optimization process while leveraging behavioral languages that are natural to the domain experts. We have validated the proposed approach by extending the AI-Gym Python framework [1] for training DRL agents and integrating it with the BP-Py framework [2] for specifying scenario-based models [3] in a way that allows scenario objects to affect the training process through reward and cost functions, demonstrating dramatic improvement in the safety and performance of the agent. In addition, we have validated the resulting DRL agents using the Marabou verifier [4], confirming that the resulting agents indeed comply (in full) with the required safety and fairness properties. We have applied the approach, training DRL agents for use cases from network communication and robotic navigation domains, exhibiting strong results. 
A second contribution of this doctoral dissertation is to develop and leverage probabilistic verification methods for deep neural networks to overcome the current scalability limitations of neural network verification technology, limiting the applicability of verification to practical DRL agents. We carried out an initial validation of the concept in the domain of image classification, showing promising results.
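The core idea of folding domain-expert properties into the optimization, rewarding the task while penalizing property violations, can be sketched as a reward-shaping wrapper. The penalty weight and the predicate encoding of properties are hypothetical, not the dissertation's actual scenario-based machinery:

```python
def safety_violations(state, properties):
    # properties: [(name, predicate)] where predicate(state) -> True when
    # the safety/fairness property holds in the current state.
    return [name for name, holds in properties if not holds(state)]

def shaped_reward(task_reward, violations, penalty=10.0):
    # Combine the task reward with a cost term per violated property,
    # pushing the learned policy toward compliance without replacing
    # the task objective (illustrative weights).
    return task_reward - penalty * len(violations)
```

During training, each environment step would call `safety_violations` on the observed state and feed `shaped_reward` back to the DRL agent in place of the raw task reward.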

DOI: 10.1109/ICSE-Companion58688.2023.00058


Boosting Symbolic Execution for Heap-Based Vulnerability Detection and Exploit Generation

作者: Tu, Haoxin
关键词: symbolic execution, automatic exploit generation, vulnerability detection, software reliability, software security

Abstract

Heap-based vulnerabilities such as buffer overflow and use after free are severe flaws in various software systems. Detecting heap-based vulnerabilities and demonstrating their severity via generating exploits for them are of critical importance. Existing symbolic execution-based approaches have shown their potential in the above tasks. However, they still have some fundamental limitations in path exploration, memory modeling, and environment modeling, which significantly impede existing symbolic execution engines from efficiently and effectively detecting and exploiting heap-based vulnerabilities. The objective of this thesis is to design and implement a boosted symbolic execution engine named HeapX to facilitate the automatic detection and exploitation of heap-based vulnerabilities. Specifically, a new path exploration strategy, a new memory model, and a new environment modeling solution are expected to be designed in HeapX, so that the new boosted symbolic execution engine can detect heap-based vulnerabilities and generate working exploits for them more efficiently and effectively.

DOI: 10.1109/ICSE-Companion58688.2023.00059


Automating Code Generation for MDE Using Machine Learning

作者: Xue, Qiaomu
关键词: symbolic machine learning, model-driven engineering, model transformation by example, code generation

Abstract

The overall aim of our research is to improve the techniques for synthesizing code generators in the Model-Driven Engineering (MDE) context. Code generation is one of the main elements of Model-Driven Engineering, involving transformation from specification models to produce executable code. A code generator is designed to reduce the manual program construction work used to implement a software system, but building a code generator itself still currently needs much manual effort. Meanwhile, existing code generators are typically not flexible enough to adjust to changing development requirements and are hard to reuse for different target languages. Therefore, we aim to provide techniques to improve the process of building code generators and make them more reusable. So far, we have surveyed new and traditional approaches to code generation, including projects using AI for program translation, code completion, or program generation. Based on this research we decided to focus on a symbolic machine learning method related to the programming-by-example concept to build code generators. We use this “Code Generation By Example” (CGBE) concept with tree-to-tree structure mappings as the information format. CGBE has good performance in terms of training dataset size and time when applied to learning a UML-to-Java code generator, but further work is needed to extend it to generate different programming languages, to evaluate these cases, and to handle the optimisation of generated code.
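Programming-by-example for code generation can be illustrated at a toy scale: learn a token-level substitution table from aligned source/target example pairs and reuse it on unseen input. This is far simpler than CGBE's tree-to-tree structure mappings, and the UML-to-Java-flavoured examples are hypothetical:

```python
def learn_mapping(examples):
    # Toy "by example" learning: from aligned (source, target) pairs with
    # equal token counts, learn a shared word-level substitution table.
    table = {}
    for src, tgt in examples:
        for s, t in zip(src.split(), tgt.split()):
            table[s] = t
    return table

def generate(table, src):
    # Apply the learned table; unknown tokens pass through unchanged.
    return " ".join(table.get(tok, tok) for tok in src.split())
```

Tree-to-tree mappings generalize this by matching and rewriting subtrees of the model's syntax tree rather than flat token positions.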

DOI: 10.1109/ICSE-Companion58688.2023.00060


Towards Automated Embedded Systems Programming

作者: Yusuf, Imam Nur Bani
关键词: hardware configuration, embedded system, library recommendation, code generation

Abstract

Writing code for embedded systems poses unique challenges due to hardware involvement. Developers often need to learn domain-specific knowledge to write embedded code. Learning such knowledge is time-consuming and hinders developers’ productivity. This paper presents a proposal for an automated code generation approach, specifically designed for embedded systems. The work is composed of three milestones, i.e., understanding the needs of embedded developers by analyzing posts from discussion forums, developing a tool to recommend driver libraries of I/O hardware and generate their interface configurations and usage patterns, and improving the generation accuracy of the prior tool using program analysis techniques. The tool will be evaluated using various metrics from the machine translation, classification, and information retrieval fields.

DOI: 10.1109/ICSE-Companion58688.2023.00061


Assessing Cognitive Load in Software Development with Wearable Sensors

作者: Stolp, Fabian
关键词: software metrics, wearable sensors, cognitive load, code understandability, program comprehension

Abstract

The understandability of source code influences software quality, and being able to measure it could greatly benefit software development and maintenance. There is an ongoing debate about the validity of using software metrics for this purpose. In this context, software developers’ cognitive load during code comprehension is increasingly often investigated. The concept of cognitive load provides information about the usage of mental resources. Previous research has shown that cognitive load is derivable from physiological measurements. This paper proposes using wearable body sensors that can be easily deployed in software development settings and for empirical software engineering research to provide a cognitive perspective on code understandability and software metrics.

DOI: 10.1109/ICSE-Companion58688.2023.00062


Towards Strengthening Software Library Interfaces with Granular and Interactive Type Migrations

作者: Szalay, Richárd
关键词: library interface, strong typing, type safety, static analysis, software refactoring, C++ programming language

Abstract

The interface boundaries of software projects are a crucial perimeter from both a design and security point of view. Design decisions of libraries will inadvertently affect client code, which can neither legally nor technically change the library’s contract. Mistakes allowed by the interface design, such as argument selection defects, can only be caught with existing tools once made. This is made worse in C++, as functions may take parameters whose types are implicitly convertible to one another. Instead, I proposed a proactive step to detect and improve when a library interface exhibits properties that may lead to inadvertent misuse. Actionable fixes for the reports are possible through an interactive type refactoring method. Existing refactorings for type migration required the new types and the migration process to be well-defined in advance; otherwise, the code might be changed to a non-compilable state. This paper summarises my thesis proposal in which the problem of weak function interfaces is detected and solved by the Fictive Types method, which first “colours” the software project with new type information, deferring the complete rewrite until the required interface of the new type is fully explored.

DOI: 10.1109/ICSE-Companion58688.2023.00063


Designing Adaptive Developer-Chatbot Interactions: Context Integration, Experimental Studies, and Levels of Automation

作者: Melo, Glaucia
关键词: interactions, autonomous systems, levels of automation, chatbot, context, software engineering

Abstract

The growing demand for software developers and the increasing development complexity have emphasized the need for support in software engineering projects. This is especially relevant in light of advancements in artificial intelligence, such as conversational systems. A significant contributor to the complexity of software development is the multitude of tools and methods used, creating various contexts in which software developers must operate. Moreover, there has been limited investigation into the interaction between context-based chatbots and software developers through experimental user studies. Assisting software developers in their work becomes essential. In particular, understanding the context surrounding software development and integrating this context into chatbots can lead to novel insight into what software developers expect concerning these human-chatbot interactions and their levels of automation. In my research, I study the design of context-based adaptive interactions between software developers and chatbots to foster solutions and knowledge to support software developers at work.

DOI: 10.1109/ICSE-Companion58688.2023.00064


Towards Machine Learning Guided by Best Practices

作者: Mojica-Hanke, Anamaria
关键词: software engineering, good practices, machine learning

Abstract

Nowadays, machine learning (ML) is being used in software systems with multiple application fields, from medicine to software engineering (SE). On the one hand, the popularity of ML in industry can be seen in the statistics showing its growth and adoption. On the other hand, its popularity can also be seen in research, particularly in SE, where multiple studies on the use of machine learning in software engineering have been published in conferences and journals. At the same time, researchers and practitioners have shown that machine learning has particular challenges and pitfalls. In particular, research has shown that ML-enabled systems have a different development process than traditional software, which also explains some of the challenges of ML applications. To mitigate some of the identified challenges and pitfalls, white and gray literature has proposed sets of recommendations based on the authors’ own experiences and focused on their domains (e.g., biomechanics); however, to the best of our knowledge, there is no guideline focused on the SE community. This thesis aims to address the lack of clear guidelines in the SE community by drawing on possible sources of practices, such as question-and-answer communities and previous research studies. As a result, we will present a set of practices with an SE perspective for researchers and practitioners, including a tool for searching them.

DOI: 10.1109/ICSE-Companion58688.2023.00065


Learning Program Models from Generated Inputs

作者: Mammadov, Tural
关键词: deep learning, reverse engineering, security testing, software testing

Abstract

Recent advances in Machine Learning (ML) show that Neural Machine Translation (NMT) models can mimic program behavior when trained on input-output pairs. Such models can reproduce the functionality of existing programs and serve as quick-to-deploy reverse engineering tools. Still, the problem of automatically learning such predictive and reversible models from programs remains open. This work introduces a generic approach for automated and reversible program behavior modeling. It achieves 94% overall accuracy in both Markdown-to-HTML and HTML-to-Markdown conversion.

DOI: 10.1109/ICSE-Companion58688.2023.00066


Learning Test Input Constraints from Branch Conditions

作者: Bettscheider, Leon
关键词: symbolic execution, context-free grammars, dynamic program analysis, fuzzing, software testing

Abstract

Precise input specifications are the holy grail of blackbox test generation. To test programs that process structured inputs effectively, inputs should match the expected input format; otherwise, they are likely to be rejected during initial input validation and fail to reach the main application logic. While the structure and constraints of widely used data formats such as XML are known, the input constraints imposed by application logic are vast, unstructured, and encoded in branch conditions. Hence, they are rarely specified manually, leaving large parts of the program unexplored by blackbox techniques. We propose to address this issue by dynamically externalizing local constraints and exposing them to system-level test generators, which could combine such constraints with an existing input specification to find global solutions. This could provide a means to explore application logic systematically.
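The core idea of externalizing a branch condition can be sketched in a few lines. The sketch below is a hypothetical illustration (the `constraint` decorator, `valid_header`, and `process` are invented names, not from the paper): a local validation branch is recorded as a predicate that a system-level test generator can query before executing the program, instead of rediscovering it through rejected inputs.

```python
# Invented illustration of externalizing local input constraints.
CONSTRAINTS = []

def constraint(pred):
    """Record a branch condition so test generators can see it."""
    CONSTRAINTS.append(pred)
    return pred

@constraint
def valid_header(s):
    # The kind of application-logic constraint normally buried in a branch.
    return s.startswith("ID:") and len(s) <= 10

def process(s):
    if not valid_header(s):  # the branch that rejects most random inputs
        raise ValueError("bad header")
    return s[3:]

def satisfies_all(s):
    """A generator can pre-filter candidates against the exposed constraints."""
    return all(p(s) for p in CONSTRAINTS)
```

A blackbox generator could combine `satisfies_all` with a format-level specification to produce only inputs that pass initial validation.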

DOI: 10.1109/ICSE-Companion58688.2023.00067


Software Supply Chain Risk Assessment Framework

作者: Zahan, Nusrat
关键词: risk assessment framework, weak link signal, security metrics, software supply chain security

Abstract

Sonatype has recorded an average 700% jump in software supply chain attacks [1], measured by the number of newly-published malicious packages in open-source repositories. The 2022 Synopsys report [2] assessed the software industry’s reliance on open-source software (OSS) and estimated that 97% of applications use OSS and that 78% of their code comes from OSS. Practitioners did not anticipate that the software supply chain would become a deliberate attack vector, or that its risk would keep growing. They are now more aware of supply chain risks and want to know how to detect whether packages implement security practices, and how risky those packages are, so they can make informed decisions when selecting dependencies for their projects. The goal of this research is to aid practitioners in producing more secure software products that are resistant to supply chain attacks, through the identification and evaluation of actionable security metrics that detect risky components in the dependency graph. To achieve this goal, the thesis presents research on evaluating software security metrics in different ecosystems, leveraging software security frameworks, malicious attack vectors, and the OpenSSF Scorecard project to detect the implementation of secure practices and their significance to security outcomes.

DOI: 10.1109/ICSE-Companion58688.2023.00068


A Framework to Communicate Software Engineering Data Effectively with Dashboards

作者: Milani, Alessandra Maciel Paz
关键词: data storytelling, software productivity data, dashboards, information visualization, software engineering

Abstract

Different approaches have been explored to capture Software Engineering (SE) data and to understand which indicators or metrics are essential to observe in this process. However, presenting this data through Information Visualization (InfoVis) systems, such as Dashboards, has yet to be carried out effectively. Dashboard users often face challenges interpreting the essence of the presented information. Moreover, keeping this audience engaged and leading them to act is still an open topic for investigation. Hence, my research investigates how SE data can be communicated to inform and inspire meaningful actions in a software development organization. The expected contributions are threefold: (1) an overview of the current state of how SE data is communicated across the industry; (2) an exploration of InfoVis approaches combined with a set of practices that can be extended to different applications; and (3) a theoretical framework to guide how SE data can be effectively communicated using Dashboards.

DOI: 10.1109/ICSE-Companion58688.2023.00069


Some Investigations of Machine Learning Models for Software Defects

作者: Bhutamapuram, Umamaheswara Sharma
关键词: feasibility study, performance measures, software reliability, software defect severity prediction, cross-project defect prediction, software defect prediction

Abstract

Software defect prediction (SDP) and software defect severity prediction (SDSP) models alleviate the burden on testers by providing an automatic assessment of a newly-developed program in a short amount of time. Research on defect prediction and defect severity prediction has primarily focused on proposing classification frameworks or addressing challenges in developing prediction models; however, a primary yet significant gap in the literature is interpreting the predictions in terms of project objectives. Furthermore, the literature indicates that these models have poor predictive performance. In this thesis, we investigate the use of a diversity-based ensemble learning mechanism for the cross-project defect prediction (CPDP) task and of self-training semi-supervised learning for software defect severity prediction, respectively, to obtain better prediction performance. We also propose a few project-specific performance measures to interpret the predictions in terms of project objectives (such as reductions in expenditure, time, and failure chances). Through empirical analysis, we observe that (1) the diversity-based ensemble learning mechanism improves prediction performance in terms of both the traditional and the proposed measures, and (2) the self-training semi-supervised learning model has a positive impact on predicting the severity of a defective module. Once a potential prediction model is developed, any software organisation may utilise its services. How can an organisation demonstrate its trust in the developed prediction model? To this end, we investigate the feasibility of SDP models in real-world testing environments by providing proofs using probabilistic bounds. The proofs show that even if the prediction model has a low failure probability, the probability of obtaining fewer failures in SDP-tested software than in similar but manually tested software is still exponentially small. This result may help researchers in SDP avoid proposing prediction models that are infeasible in practice.

DOI: 10.1109/ICSE-Companion58688.2023.00070


Evolutionary Computation and Reinforcement Learning for Cyber-Physical System Design

作者: Lu, Chengjie
关键词: uncertainty, reinforcement learning, evolutionary computation, cyber-physical system

Abstract

Cyber-physical systems (CPSs) are designed to integrate computation and physical processes by constantly interacting with the physical environment. The complexity and uncertainty of the environment often give rise to unpredictable situations, placing high demands on the dynamic adaptability of CPSs. Further, as the environment evolves, the CPS needs to constantly evolve itself to adapt to the changing environment. This paper presents a research plan that aims to develop a novel framework to address CPS design challenges under uncertain environments. We propose to utilize evolutionary computation and reinforcement learning techniques to design control policies that can adapt to the dynamic changes and uncertainties of the environment. Further, novel testing and evaluation approaches that can generate test cases while adapting to dynamic changes in the system and the environment will be explored.

DOI: 10.1109/ICSE-Companion58688.2023.00071


Grammar-Based String Refinement Types

作者: Zhu, Fengmin
关键词: constraint solving, subtyping, type checking, context-free grammars, refinement types

Abstract

Programmers use strings to represent varieties of data that contain internal structure or syntax. However, existing mainstream programming languages do not provide users with a means to further narrow down the set of valid values for a string. An invalid string input may cause runtime errors or even severe security vulnerabilities. To address this, this paper presents a Ph.D. research proposal on the type checking of grammar-based string refinement types: fine-grained types that specify the set of valid string values via a grammar. The string refinement type system uses subtyping to capture the inclusion relation between the languages of grammars. Building on that, we follow a well-known bidirectional type checking framework to combine the checking and inference of string refinement types into one. Evaluations on real-world codebases will be conducted to measure the practicality of this approach.
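The subtyping check described above reduces to language inclusion between grammars; at its simplest, membership of a single string in a grammar’s language can be decided with the classic CYK algorithm. The sketch below is an illustrative membership test (not the paper’s type checker), assuming a grammar in Chomsky normal form; the grammar encoding and names are invented here:

```python
# Illustrative CYK membership test: the runtime analogue of the check a
# grammar-based string refinement type would discharge statically.
def cyk_member(word, start, unary, binary):
    """Does `word` belong to the language of the CNF grammar?

    unary:  dict mapping terminal  -> set of nonterminals (A -> a)
    binary: dict mapping (B, C)    -> set of nonterminals (A -> B C)
    """
    n = len(word)
    if n == 0:
        return False
    # table[j-1][i] = nonterminals deriving the substring word[i:i+j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[0][i] = set(unary.get(ch, ()))
    for span in range(2, n + 1):            # substring length
        for i in range(n - span + 1):       # start position
            for split in range(1, span):    # split point
                for b in table[split - 1][i]:
                    for c in table[span - split - 1][i + split]:
                        table[span - 1][i] |= binary.get((b, c), set())
    return start in table[n - 1][0]

# Balanced, non-empty parentheses in CNF:
# S -> L R | L X | S S ;  X -> S R ;  L -> '(' ;  R -> ')'
unary = {'(': {'L'}, ')': {'R'}}
binary = {('L', 'R'): {'S'}, ('L', 'X'): {'S'},
          ('S', 'S'): {'S'}, ('S', 'R'): {'X'}}
```

A type such as `String<Parens>` would then admit exactly the strings for which this membership test succeeds, with inclusion between two such types checked at the grammar level.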

DOI: 10.1109/ICSE-Companion58688.2023.00072


Evaluation of Stakeholder Mapping and Personas for Sustainable Software Development

作者: Ayoola, Bimpe
关键词: software development, personas, stakeholder map, stakeholder, social sustainability, sustainability

Abstract

Sustainable software development is a major challenge in the software engineering industry. Software practitioners lack practical guidance and tools for integrating social sustainability into software development processes. This study proposes stakeholder mapping and the use of sustainability personas as a framework to guide software practitioners in making decisions that support socially sustainable software development. We will evaluate its effectiveness in a randomized controlled experiment with 104 final-year undergraduate computer science students, who will select features to include in the development of a software application. We aim to show how these interventions help improve software practitioners’ perspective on social sustainability in software development.

DOI: 10.1109/ICSE-Companion58688.2023.00073


Improving Automatic C-to-Rust Translation with Static Analysis

作者: Hong, Jaemin
关键词: No keywords

Abstract

While popular in system programming, C has been infamous for its poor language-level safety mechanisms, leading to critical bugs and vulnerabilities. C programs can still have memory and thread bugs despite passing type checking. To resolve this long-standing problem, Rust has been recently developed with rich safety mechanisms, including its notable ownership type system, which prevents memory and thread bugs via type checking. By rewriting legacy C programs in Rust, their developers can discover unknown bugs and avoid adding new ones. However, the adoption of Rust in legacy programs is still limited due to the high cost of manual C-to-Rust translation. Rust’s safe features are semantically different from C’s unsafe features and require programmers to precisely understand the behavior of their programs for correct rewriting. Existing C-to-Rust translators do not relieve this burden because they syntactically translate C features into unsafe Rust features, leaving further refactoring to programmers. In this paper, we tackle the problem of improving the state-of-the-art C-to-Rust translation by automatically replacing unsafe features with safe ones. Specifically, we identify two important unsafe features to replace: lock APIs and output parameters. We show our results on lock APIs and discuss plans for output parameters.

DOI: 10.1109/ICSE-Companion58688.2023.00074


Towards an AI-Centric Requirements Engineering Framework for Trustworthy AI

作者: Ronanki, Krishna
关键词: guidelines, ethical AI, AI co-worker, frameworks, requirements engineering, EU AI act, trustworthy AI

Abstract

Ethical guidelines are an asset for artificial intelligence (AI) development, and conforming to them will soon be a procedural requirement once the EU AI Act is ratified in the European Parliament. However, developers often lack explicit knowledge of how to apply these guidelines during the system development process. A literature review of ethical guidelines from various countries and organizations has revealed inconsistencies in the principles presented and in the terminology used to describe them. This research begins by identifying the limitations of existing ethical AI development frameworks in performing requirements engineering (RE) processes during the development of trustworthy AI. Recommendations to address those limitations will be proposed to make the frameworks more applicable in the RE process and to foster the development of trustworthy AI. This could lead to wider adoption, greater productivity of AI systems, and a reduced workload on humans for non-cognitive tasks. Considering the impact of newer foundation models such as GitHub Copilot and ChatGPT, the vision for this research project is to work towards holistic, operationalisable RE guidelines for the development and implementation of trustworthy AI, not only at the product level but also at the process level.

DOI: 10.1109/ICSE-Companion58688.2023.00075


Cost-Effective Strategies for Building Energy Efficient Mobile Applications

作者: Bangash, Abdul Ali
关键词: mining software repositories, energy estimation, static analysis, mobile application

Abstract

Smartphone users rely on applications to perform various functions on their phones, but these functions can significantly drain the device’s battery. To ensure that an app does not consume unnecessary energy, app developers measure and optimize its energy consumption before releasing it to end users. However, current optimization and measurement techniques have several limitations. Existing energy optimization techniques focus only on refactoring energy-greedy patterns related to system events, such as garbage collection and process switching, and on providing recommendation models for API usage. Although the energy consumption of a single API can vary depending on its configuration, and API events account for 85% of energy consumption in smartphone apps, existing optimization techniques do not provide guidance on how to configure APIs for energy-efficient usage. Moreover, energy measurement techniques are cumbersome because they require developers to generate test cases and execute them on expensive, sophisticated hardware. My thesis argues that we can develop a general methodology that researchers may follow to extract energy-efficiency guidelines for an API, which developers may then use to build energy-efficient apps. Additionally, it argues that we can use static analysis to estimate an app’s energy consumption. Such a methodology would eliminate the need for a physical smartphone and for test case generation and execution. The insights and techniques that my thesis presents are particularly useful within the context of an Integrated Development Environment (IDE) or a Continuous Integration/Continuous Deployment (CI/CD) pipeline, where developers require results within milliseconds. Using our technique, developers would quickly receive warnings about high energy consumption caused by their code modifications, specifically those related to API usage.

DOI: 10.1109/ICSE-Companion58688.2023.00076


Towards Utilizing Natural Language Processing Techniques to Assist in Software Engineering Tasks

作者: Ding, Zishuo
关键词: No keywords

Abstract

Machine learning-based approaches have been widely used to address natural language processing (NLP) problems. Given the similarity between natural language text and source code, researchers have been applying techniques from NLP to code. However, source code and natural language are different by nature; for example, code is highly structured and executable. Thus, directly applying NLP techniques may not be optimal, and how to effectively adapt these techniques to software engineering (SE) tasks remains a challenge. To tackle this challenge, in this dissertation we focus on two research directions: 1) distributed code representations and 2) logging statements, two important intersections between natural language and source code. For distributed code representations, we first discuss the limitations of existing code embedding techniques and then propose a novel approach to learn more generalizable code embeddings in a task-agnostic manner. For logging statements, we first propose an automated deep learning-based approach that generates accurate logging texts by translating the related source code into short textual descriptions. Then, we make the first attempt to comprehensively study the temporal relations between logging statements and their corresponding source code, which we later use to detect issues in logging statements. We anticipate that our study can provide useful suggestions and support to developers in utilizing NLP techniques to assist in SE tasks.

DOI: 10.1109/ICSE-Companion58688.2023.00077


Graph Solver as a Service

作者: Ahmad, Fozail
关键词: software as a service, partial graph models, consistent graph generation, graph solver

Abstract

Graphs can be a key abstraction for formal verification challenges. As such, graph solvers are essential tools for synthesizing scalable domain-specific consistent graph models, which are both realistic and diverse. The main goal of this doctoral research plan is to develop a graph solver framework based on a state-of-the-art graph solver in order to provide a graph solver as a service. We expect this will improve the overall scalability of graph solvers whilst increasing the usage and adoption of such tools. The scalability of the framework will be investigated in several case studies of different complexity.

DOI: 10.1109/ICSE-Companion58688.2023.00078


Toward Automated Tools to Support Ethical GUI Design

作者: Mansur, S M Hasan
关键词: No keywords

Abstract

Past studies have illustrated the prevalence of UI dark patterns: user interfaces that can lead end users toward (unknowingly) taking actions they may not have intended. Such deceptive UI designs can be either intentional (to benefit an online service) or unintentional (through complicit design practices) and can adversely affect end users, for example through oversharing of personal information or financial loss. While significant research progress has been made on dark pattern taxonomies across different software domains, developers and users currently lack guidance to help them recognize, avoid, and navigate these often subtle design motifs. However, automated recognition of dark patterns is a challenging task, as a single type of pattern can be instantiated in many forms, leading to significant variability. This paper presents promising preliminary results toward addressing this problem and proposes a comprehensive research agenda aimed at implementing a developer-facing automated tool to support ethical GUI design.

DOI: 10.1109/ICSE-Companion58688.2023.00079


Domain Specific Languages for Optimisation Modelling

作者: Wijesundara, Sameela Suharshani
关键词: on-boarding domain experts, problem modelling, language workbenches, combinatorial optimisation, domain specific languages, framework

Abstract

Despite significant advances in computational approaches for modelling and solving combinatorial optimisation problems, considerable barriers still prevent domain experts (domain-users/end-users) from adopting these technologies in their decision support systems [1]. We see the lack of involvement of domain experts in the model-defining process, which seriously compromises the models’ flexibility and transparency, as a significant contributor to the barriers between modern optimisation technology and its usage. This research proposal aims to reduce these barriers by introducing Domain Specific Languages (DSLs) that can easily be used by domain experts for modelling optimisation problems (which we refer to as MDSLs). In particular, our aim is to develop a framework for designing these MDSLs and to explore code reuse techniques that can simplify the modelling of problems within the same domain and their associated MDSLs. Further, we will develop a proof-of-concept language workbench specialised for MDSL development.

DOI: 10.1109/ICSE-Companion58688.2023.00080


From Input to Failure: Explaining Program Behavior via Cause-Effect Chains

作者: Smytzek, Marius
关键词: diagnostics, debugging aids, testing and debugging, software engineering, software/software engineering

Abstract

Debugging a fault in a program is an error-prone and resource-intensive process that requires considerable work. My doctoral research aims at supporting developers during this process by integrating test generation as a feedback loop into a novel fault diagnosis to narrow down the causality by validating or disproving suggested hypotheses. I will combine input, output, and state to detect relevant relations for an immersive fault diagnosis. Further, I want to introduce an approach for a targeted test that leverages statistical fault localization to extract oracles based on execution features to identify failing tests.

DOI: 10.1109/ICSE-Companion58688.2023.00081


The Distribution and Disengagement of Women Contributors in Open-Source: 2008–2021

作者: Zhao, Zihe H
关键词: No keywords

Abstract

The underrepresentation of women contributors in the open-source software (OSS) community has been a widely recognized problem. Past research has found that, in OSS collaboration, a gender-diverse team can enhance productivity and reduce community smells [1]–[3]. However, these benefits are hindered when a team lacks gender diversity. To better address this gender imbalance, we need to first understand the overall gender representation.

DOI: 10.1109/ICSE-Companion58688.2023.00082


Path Complexity of Recursive Functions

作者: Pregerson, Eli
关键词: No keywords

Abstract

Path coverage is of critical importance in software testing and verification, and further, path explosion is a well-known challenge for automatic software analysis techniques like symbolic execution [7]. Asymptotic Path Complexity (APC), a code complexity metric developed in my research lab, formalizes the quantitative measurement of path explosion.
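As a rough illustration of what a path-explosion metric quantifies (this is not the APC algorithm itself, and the CFG below is invented), the number of control-flow paths of at most n edges can be counted from the CFG’s adjacency matrix; APC characterizes how this count grows asymptotically with n:

```python
# Illustrative path counting via adjacency-matrix powers: adj^k[i][j] is the
# number of i->j paths with exactly k edges, so summing powers counts paths.
def count_paths(adj, entry, exit_, n):
    """Number of entry->exit paths with at most n edges in a CFG."""
    size = len(adj)
    total = 0
    power = [row[:] for row in adj]  # holds adj^k, starting at k = 1
    for _ in range(n):
        total += power[entry][exit_]
        power = [[sum(power[i][k] * adj[k][j] for k in range(size))
                  for j in range(size)]
                 for i in range(size)]
    return total

# A three-node CFG with a self-loop: entry 0 -> loop node 1 -> exit 2.
adj = [[0, 1, 0],
       [0, 1, 1],
       [0, 0, 0]]
```

For this looping CFG the count grows linearly in n (one extra path per extra loop iteration); nested branching would make it grow exponentially, which is the explosion APC is designed to measure.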

DOI: 10.1109/ICSE-Companion58688.2023.00083


Skill Recommendation for New Contributors in Open-Source Software

作者: Santos, Fabio
关键词: ontology matching, machine learning, open-source software, social network analysis, mining software repositories, skills, labelling

Abstract

Selecting an appropriate task is challenging for newcomers to Open Source Software (OSS) projects. Therefore, researchers and OSS projects have proposed strategies to label tasks (a.k.a. issues). Several approaches relying on machine learning techniques, historical information, and textual analysis have been proposed. However, the results vary, and these approaches are still far from mainstream adoption, possibly because of a lack of good predictors. Inspired by previous research, we advocate that the prediction models might benefit from leveraging social metrics. In this research, we investigate how to assist new contributors in finding a task when onboarding onto a new project. To achieve our goal, we predict the skills needed to solve an open issue by labeling it with the categories of APIs declared in the source code (API-domain labels) that should be updated or implemented. Starting from a case study of one project and an empirical experiment, we found that API-domain labels were relevant when selecting an issue to contribute to. Next, through interviews and a survey, we investigated which strategies maintainers believe communities should adopt to assist new contributors in finding a task. We also studied what maintainers think of new contributors’ strategies for picking a task. We found that maintainers, frequent contributors, and new contributors diverge on the importance of the communities’ and new contributors’ strategies. The ongoing research proceeds in four directions: 1) generalization of the approach, 2) use of conversation data metrics for predictions, 3) demonstration of the approach, and 4) matching contributors’ and tasks’ skills. By addressing the lack of knowledge about the skills tasks require, we hope to assist new contributors in picking tasks with more confidence.

DOI: 10.1109/ICSE-Companion58688.2023.00084


AIGROW: A Feedback-Driven Test Generation Framework for Hardware Model Checkers

作者: Deng, Wenjing
关键词: hardware model checker, test generation

Abstract

This research abstract introduces an effective and efficient approach to automatically generate high-quality hardware model checker benchmarks. The key contribution of this work is to model the input format of hardware model checkers using a tree-based structure named ARTree and build an effective feedback-driven test generation framework based on ARTree named AIGROW. The evaluation shows that AIGROW generates very small but high-quality benchmarks for coverage-oriented and performance-oriented testing and outperforms the existing generation-based testing tools.

DOI: 10.1109/ICSE-Companion58688.2023.00085


Test Scenario Generation for Autonomous Driving Systems with Reinforcement Learning

作者: Lu, Chengjie
关键词: reinforcement learning, critical scenario, autonomous driving system testing

Abstract

We have seen rapid development of autonomous driving systems (ADSs) in recent years. These systems place high requirements on safety and reliability for their mass adoption, and ADS testing is one of the crucial approaches to ensuring their success. To this end, this paper presents RLTester, a novel ADS testing approach that adopts reinforcement learning (RL) to learn critical environment configurations (i.e., test scenarios) of the operating environment of ADSs that could reveal their unsafe behaviors. To generate diverse and critical test scenarios, we defined 142 environment configuration actions and adopted the Time-To-Collision metric to construct the reward function. Our evaluation shows that RLTester discovered a total of 256 collisions, of which 192 are unique, and took on average 11.59 seconds to find each collision. Further, RLTester is effective in generating more diverse test scenarios than a state-of-the-art approach, DeepCollision.
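The Time-To-Collision reward idea can be sketched as follows. The abstract does not give RLTester’s exact formulation, so the horizon and scaling below are invented for illustration: the reward rises as the ego vehicle’s TTC falls toward zero, steering the agent toward critical scenarios.

```python
# Invented TTC-based reward sketch, not RLTester's actual reward function.
import math

def time_to_collision(distance_m, closing_speed_mps):
    """Seconds until collision if both vehicles hold their current speed.
    Infinite when the gap is not closing."""
    if closing_speed_mps <= 0:
        return math.inf
    return distance_m / closing_speed_mps

def reward(distance_m, closing_speed_mps, horizon_s=7.0):
    """Reward in [0, 1]: 0 when TTC exceeds the horizon, 1 at impact.
    The 7-second horizon is an assumed illustrative threshold."""
    ttc = time_to_collision(distance_m, closing_speed_mps)
    if ttc >= horizon_s:
        return 0.0
    return 1.0 - ttc / horizon_s
```

An RL agent maximizing this reward would learn environment configurations (weather, traffic, pedestrian placement, etc.) that drive TTC down, i.e., critical test scenarios.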

DOI: 10.1109/ICSE-Companion58688.2023.00086


GLAD: Neural Predicate Synthesis to Repair Omission Faults

作者: Kang, Sungmin and Yoo, Shin
关键词: debugging, machine learning, program repair

Abstract

Existing template and learning-based Automated Program Repair (APR) tools have successfully found patches for many benchmark faults. However, our analysis of existing results shows that omission faults pose a significant challenge. For template based approaches, omission faults provide no location to apply templates to; for learning based approaches that formulate repair as Neural Machine Translation (NMT), omission faults similarly do not provide faulty code to translate. To address these issues, we propose GLAD, a novel learning-based repair technique that targets if-clause synthesis. GLAD does not require a concrete faulty line as it is based on generative Language Models (LMs) instead of machine translation; consequently, it can repair omission faults. To provide the LM with project-specific information critical to synthesis, we incorporate two components: a type-based grammar that constrains the model, and a dynamic ranking system that evaluates candidate patches using a debugger. Our evaluation shows GLAD is highly orthogonal to existing techniques, correctly fixing 26 Defects4J v1.2 faults that previous NMT-based techniques could not, while maintaining a small runtime cost, underscoring its potential as a lightweight tool to complement existing tools in practice. An inspection of the bugs that GLAD fixes reveals that GLAD can quickly generate expressions that would be challenging for other techniques.

DOI: 10.1109/ICSE-Companion58688.2023.00087


Poster: Distribution-Aware Fairness Test Generation

作者: Rajan, Sai Sathiesh and Soremekun, Ezekiel and Chattopadhyay, Sudipta and Traon, Yves Le
关键词: No keywords

Abstract

This work addresses how to validate group fairness in image recognition software. We propose a distribution-aware fairness testing approach (called DistroFair) that systematically exposes class-level fairness violations in image classifiers via a synergistic combination of out-of-distribution (OOD) testing and semantic-preserving image mutation. DistroFair automatically learns the distribution (e.g., number/orientation) of objects in a set of images and systematically mutates objects in the images to become OOD using three semantic-preserving image mutations - object deletion, object insertion and object rotation. We evaluate DistroFair with two well-known datasets (CityScapes and MS-COCO) and three commercial image recognition software (namely, Amazon Rekognition, Google Cloud Vision and Azure Computer Vision) and find that at least 21% of images generated by DistroFair result in class-level fairness violations. DistroFair is up to 2.3x more effective than the baseline (generation of images within the observed distribution). Finally, we evaluated the semantic validity of our approach via a user study with 81 participants, using 30 real images and 30 corresponding mutated images generated by DistroFair and found that the generated images are 80% as realistic as the original images.
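The distribution-aware mutation idea can be sketched over scene annotations rather than raw pixels. The sketch below is a simplified, invented illustration (not DistroFair’s implementation): learn the in-distribution range of object counts for a class from a corpus, then apply the object-insertion mutation until the scene falls outside that range.

```python
# Invented sketch of distribution-aware mutation on scene annotations.
def learn_count_range(scenes, label):
    """In-distribution [min, max] count of `label` objects across a corpus.
    Each scene is a list of object labels."""
    counts = [sum(1 for obj in scene if obj == label) for scene in scenes]
    return min(counts), max(counts)

def insert_until_ood(scene, label, lo, hi):
    """Object-insertion mutation: add objects until the count is OOD."""
    scene = list(scene)
    while lo <= sum(1 for obj in scene if obj == label) <= hi:
        scene.append(label)
    return scene
```

The mutated scene is then rendered and fed to the image classifier; a class-level fairness violation is flagged when the classifier’s behavior on the OOD scene differs systematically across protected classes.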

DOI: 10.1109/ICSE-Companion58688.2023.00088


Don’t Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems

Authors: Sun, Zhensu and Du, Xiaoning and Song, Fu and Wang, Shangwen and Ni, Mingze and Li, Li
Keywords: No keywords

Abstract

Currently, large pre-trained language models are widely applied in neural code completion systems. Though large code models significantly outperform their smaller counterparts, around 70% of the code completions displayed by Copilot are not accepted by developers. Since these completions are reviewed but not accepted, their contribution to developer productivity is considerably limited. Even worse, considering the high cost of large code models, they represent a huge waste of computing resources and energy. To fill this significant gap, we propose an early-rejection mechanism that turns down low-return prompts by foretelling code completion quality without sending the prompts to the code completion system. Furthermore, we propose a lightweight Transformer-based estimator to demonstrate the feasibility of the mechanism. The experimental results show that the proposed estimator helps save 23.3% of the computational cost of code completion systems, measured in floating-point operations, and that 80.2% of rejected prompts would have led to unhelpful completions.
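
The early-rejection mechanism described above can be sketched as a gate in front of the expensive completion model. This is a toy illustration, not the paper's estimator: `estimate_quality` stands in for the lightweight Transformer-based estimator with an invented heuristic, and `run_large_model` is a hypothetical placeholder for the code completion backend.

```python
# Sketch of an early-rejection gate for a code completion service.
# estimate_quality is a made-up heuristic standing in for the paper's
# lightweight Transformer-based quality estimator.

def estimate_quality(prompt: str) -> float:
    """Toy proxy: longer, code-like prompts score higher (0.0..1.0)."""
    score = min(len(prompt) / 100.0, 1.0)
    if any(tok in prompt for tok in ("def ", "class ", "return ")):
        score += 0.3
    return min(score, 1.0)

def run_large_model(prompt: str) -> str:
    """Hypothetical stand-in for the expensive completion model."""
    return prompt + "  # ...completion..."

def maybe_complete(prompt: str, threshold: float = 0.5):
    """Reject low-return prompts before invoking the expensive model."""
    if estimate_quality(prompt) < threshold:
        return None  # early rejection: the model invocation is saved
    return run_large_model(prompt)
```

In this shape, the savings come from every `None` returned without ever touching the large model; the threshold trades saved compute against occasionally rejecting a prompt that would have produced a helpful completion.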

DOI: 10.1109/ICSE-Companion58688.2023.00089


Closing the Loop for Software Remodularisation - REARRANGE: An Effort Estimation Approach for Software Clustering-Based Remodularisation

Authors: Tan, Alvin Jian Jia and Chong, Chun Yong and Aleti, Aldeida
Keywords: refactoring, software clustering, software remodularisation, effort estimation

Abstract

Software remodularisation through clustering is a common practice to improve internal software quality. However, simply producing clustering results is not enough to realize the benefits of remodularisation: the recommended refactoring operations only have an impact if developers follow through with them, which is often difficult due to the complexity and time-consuming nature of certain refactoring operations.

DOI: 10.1109/ICSE-Companion58688.2023.00090


Revisiting Information Retrieval and Deep Learning Approaches for Code Summarization

Authors: Zhu, Tingwei and Li, Zhong and Pan, Minxue and Shi, Chaoxuan and Zhang, Tian and Pe, Yu and Li, Xuandong
Keywords: No keywords

Abstract

Code summarization refers to the procedure of creating short descriptions that outline the semantics of source code snippets. Existing code summarization approaches can be broadly classified into Information Retrieval (IR)-based and Deep Learning (DL)-based approaches. However, their effectiveness, and especially their strengths and weaknesses, remain largely understudied. Existing evaluations use different benchmarks and metrics, making performance comparisons of these approaches susceptible to bias and potentially yielding misleading results. For example, DL-based approaches typically show better code summarization performance in their original papers [1], [2]. However, Gros et al. [3] report that a naive IR approach can achieve comparable (or even better) performance than the DL-based ones. In addition, some recent work [4], [5] suggests that incorporating IR techniques can improve DL-based approaches. To further advance code summarization techniques, it is critical that we have a good understanding of how IR-based and DL-based approaches perform on different datasets and in terms of different metrics.

Prior works have studied some aspects of code summarization, such as the factors affecting performance evaluation [6] and the importance of data preprocessing [7]. In this paper, we focus on studying IR-based and DL-based code summarization approaches to enhance the understanding and design of more advanced techniques. We first compare the IR-based and DL-based approaches under the same experimental settings and benchmarks, then study their strengths and limitations through quantitative and qualitative analyses. Finally, we propose a simpler but effective strategy to combine IR and DL to further improve code summarization.

Four IR-based approaches and two DL-based approaches are investigated, chosen for representativeness and diversity. For the IR-based approaches, we select three BM25-based approaches (i.e., BM25-spl, BM25-ast, BM25-alpha) and one nearest-neighbor-based approach, NNGen [8], which are often compared as baselines in prior works. They retrieve the most similar code from the database and directly output the corresponding summary. The BM25-based approaches are implemented with Lucene [9] and differ in the code form they take as input: BM25-spl splits the CamelCase and snake_case tokens in the original source code, BM25-ast obtains sequence representations using pre-order Abstract Syntax Tree (AST) traversal, and BM25-alpha keeps only the alphabetic tokens in the code. For the DL-based approaches, we choose the state-of-the-art pre-trained model PLBART [2] and the trained-from-scratch model SiT [1].

We adopt four widely used Java datasets, namely TLC [10], CSN [11], HDC [12], and FCM [13], as our subject datasets. TLC and HDC are method-split datasets, where methods in the same project are randomly split into training/validation/test sets. CSN and FCM are project-split datasets, where examples from the same project exist in only one partition. We further process the four datasets to build cleaner versions by removing examples that have syntax errors, empty method bodies, or too long or too short sequence lengths, and by removing duplicate examples in the validation and test sets. To comprehensively and systematically evaluate the performance of the code summarization approaches, we adopt three widely used metrics, i.e., BLEU (both C-BLEU and S-BLEU are included), ROUGE, and METEOR, in our experiments.

Effectiveness. We conduct a comprehensive comparison of the six studied approaches under exactly the same settings and datasets. Table I shows the experimental results obtained on the four subject datasets in terms of four metrics. Comparing the metrics in Table I, we observe that overall there are large variations in the approach rankings and score gaps when different metrics are used for evaluation. DL-based approaches generally achieve better performance than IR-based approaches in terms of METEOR and ROUGE-L. IR-based approaches achieve comparable or even better C-BLEU scores, but lower S-BLEU scores, than the DL-based approaches. This shows that the choice of metric has a large impact on the evaluation results, and that multiple metrics are needed to evaluate code summarization approaches. Considering different datasets, the pre-trained DL-based approach PLBART performs best among the six approaches studied. On the other hand, we notice that the IR-based approaches, despite their simplicity, also achieve comparable or even better performance, especially on method-split datasets. For example, the C-BLEU scores of BM25-spl on TLC and HDC are the highest among all approaches. Therefore, although DL-based methods usually show better code summarization performance, we should not overlook the capabilities of IR-based methods.

Strengths. To evaluate how the similarity between the training and test code affects the performance of the approaches, we use the Retrieval-Similarity metric, as Rencos [4] did, to measure the token-level similarity between a test code snippet and its most similar training code. Based on this, we examine how the BLEU score of each approach varies as the Retrieval-Similarity value changes on the four subject datasets. Figure 1 shows the results, from which we observe that IR-based approaches perform better than DL-based ones when the Retrieval-Similarity values are higher. Through qualitative analysis of examples with high retrieval similarity, we find that, due to code cloning, similar code snippets have similar summaries, so IR-based approaches tend to perform better on examples with high Retrieval-Similarity values.

Integration. Based on the previous findings, we are motivated to design a simpler integration approach. We propose to use Retrieval-Similarity to decide whether the IR or the DL approach should generate the summary for the input code. Specifically, we first use Lucene to retrieve similar code for the input and compute a Retrieval-Similarity value between them. If the value is higher than a similarity threshold, we directly use the IR summary; otherwise, we let the DL model generate the output. To determine the similarity threshold, we conduct a grid search on the validation set; the similarity achieving the highest metric score on the validation set is used as the final threshold. We choose the best DL model, PLBART, and the best IR approach, BM25-spl, for integration and evaluate our approach on all four cleaned datasets. The effectiveness results are shown in the ‘Ours’ row of Table I. From the table, we can see that our integration is effective and achieves state-of-the-art results: not only does it outperform each single approach, but its scores are higher than all the previous highest scores in our experiments on all metrics across all datasets.

In summary, our study shows that the IR and DL approaches have their own strengths in terms of performance under different metrics and on different datasets. Although IR-based approaches are simpler, they can still achieve comparable or even better performance in some cases, especially in the presence of high-similarity code. Based on these results, we propose a simple integration approach that achieves state-of-the-art results. Our study shows that it is not enough to focus on the DL model alone; taking advantage of IR approaches is a promising direction for improving code summarization. Future work should explore the incorporation of more types of information and more advanced integration methods.
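The similarity-threshold routing described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `token_similarity` is a simple token-overlap proxy for the paper's Retrieval-Similarity, and the retriever and DL model are hypothetical stand-ins (the paper uses Lucene/BM25-spl and PLBART).

```python
# Sketch of threshold-based IR/DL routing for code summarization.
# token_similarity is a simplified proxy for Retrieval-Similarity.

def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-split token sets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def summarize(code, corpus, dl_model, threshold=0.6):
    """corpus: list of (code, summary) pairs; dl_model: code -> summary.

    Retrieve the most similar training code; reuse its summary when the
    similarity clears the threshold, otherwise fall back to the DL model.
    """
    best_code, best_summary = max(
        corpus, key=lambda pair: token_similarity(code, pair[0]))
    if token_similarity(code, best_code) >= threshold:
        return best_summary   # IR path: reuse the retrieved summary
    return dl_model(code)     # DL path: generate a new summary
```

In the paper, the threshold is not fixed at 0.6 but tuned by grid search on the validation set.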

DOI: 10.1109/ICSE-Companion58688.2023.00091


A Control-Theoretic Approach to Auto-Tuning Dynamic Analysis for Distributed Services

Authors: Dhal, Chandan and Fu, Xiaoqin and Cai, Haipeng
Keywords: distributed system, dynamic analysis, control theory, dependence analysis, cost-effectiveness, auto-tuning

Abstract

Traditional dynamic dependence analysis approaches have limited utility for continuously running distributed systems (i.e., distributed services) because of their low cost-effectiveness. A recent technique, Seads, was developed to improve cost-effectiveness by adjusting analysis configurations on the fly using a general Q-learning algorithm. However, Seads is unable to utilize the user budget as fully as needed to push up precision. To overcome this problem, we propose Cadas, an adaptive dynamic dependence analysis framework for distributed services. To realize the adaptation, we explore a control-theoretic method that uses a feedback mechanism to predict optimal analysis configurations. We evaluated Cadas on six real-world Java distributed services, comparing it against Seads as the baseline, and show that Cadas outperforms the baseline in both precision and budget utilization. Our results open a new door for future research on adaptive dynamic program analysis.
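
The feedback mechanism mentioned above can be illustrated with a toy control loop. This is not Cadas itself: the linear cost model, the gain, and the "precision level" knob are all invented for the sketch; it only shows the control-theoretic idea of steering a configuration so that measured cost tracks a user budget.

```python
# Toy proportional-feedback loop in the spirit of control-theoretic
# auto-tuning: nudge an analysis "precision level" until its cost
# matches the user budget. The cost model below is made up.

def analysis_cost(level: float) -> float:
    """Hypothetical cost model: higher precision costs more."""
    return 10.0 * level

def tune(budget: float, level: float = 1.0,
         gain: float = 0.05, steps: int = 50) -> float:
    """Proportional controller: correct the level by gain * error."""
    for _ in range(steps):
        error = budget - analysis_cost(level)   # feedback signal
        level = max(0.0, level + gain * error)  # proportional correction
    return level
```

With these numbers the loop contracts toward the level whose cost equals the budget; a real controller would also have to cope with noisy cost measurements and a nonlinear, time-varying cost model.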

DOI: 10.1109/ICSE-Companion58688.2023.00092


Quantum Software Testing: A Brief Introduction

Authors: Ali, Shaukat and Yue, Tao
Keywords: quantum circuits, quantum software testing, quantum programs

Abstract

Quantum software testing concentrates on testing quantum programs to discover quantum faults cost-effectively. Given its foundations in quantum mechanics, quantum computing performs computations in a significantly different way than classical computing; therefore, quantum software testing also differs from classical software testing. There has been considerable interest in building quantum software testing techniques in the software engineering (SE) community since 2019. Thus, we aim to introduce quantum software testing to the SE community. In particular, we will present the basic foundations of quantum computing and quantum programming as circuits, followed by the current state of the art in quantum software testing. Next, we will present some basic quantum software testing techniques and finally give the research directions that deserve attention from the SE community.

DOI: 10.1109/ICSE-Companion58688.2023.00093


DevCertOps: Strategies to Realize Continuous Delivery of Safe Software in Regulated Domain

Authors: Zeller, Marc
Keywords: continuous delivery, DevOps, agile, safety

Abstract

Traditionally promoted by internet companies, DevOps is increasingly appealing to industries that develop systems with safety-critical functions. Since safety-critical systems must meet regulatory requirements and require specific safety assurance processes in addition to the normal development steps, enabling continuous delivery of software in safety-critical systems requires the automation of the safety assurance process in the delivery pipeline. In this technical briefing, we describe relevant challenges in speeding up the development of safety-critical software-intensive systems from an industrial point of view. Moreover, we outline how to integrate the software/system engineering and safety assurance life-cycles into a so-called DevCertOps concept, realizing continuous safety assurance of safety-critical systems using Model-Based Systems Engineering (MBSE) and model-based safety assurance concepts.

DOI: 10.1109/ICSE-Companion58688.2023.00094


SAIN: A Community-Wide Software Architecture Infrastructure

Authors: Garcia, Joshua and Mirakhorli, Mehdi and Xiao, Lu and Malek, Sam and Kazman, Rick and Cai, Yuanfang and Medvidović, Nenad
Keywords: empirical software engineering, reproducible, software architecture

Abstract

Software architecture is the most important determinant of the functional and non-functional attributes of a system [1]–[3]. Put simply, software systems “live and die” by their architectures [4]. Despite its importance, the architecture of a software system is often not explicitly documented, especially under the Agile methods prevalent in the past decades. Instead, the architecture of a system often becomes hidden in the myriad system implementation details, and gradually decays and accumulates grime, causing significant challenges to long-term evolution and maintenance [5]–[8]. Recovering, understanding, and updating a system’s architecture is an important facet of overcoming this challenge to support the evolution and maintenance of long-lived software systems.

Responding to the above challenge, software architecture research has yielded many different tools and techniques in the past two decades. However, the disjoint research effort and the diverse lab environments in which different tools and techniques are created have impeded technology transfer and reproducible empirical studies in the community. In other words, there is a lack of shared infrastructure with available tools and datasets for systematic synthesis and empirical validation of new or existing techniques. As such, researchers and practitioners in need of cutting-edge tools tend to re-invent or re-implement research infrastructure, or ignore particular research avenues altogether.

DOI: 10.1109/ICSE-Companion58688.2023.00095


Methodology and Guidelines for Evaluating Multi-Objective Search-Based Software Engineering

Authors: Li, Miqing and Chen, Tao
Keywords: quality indicators, multi-objective optimization, search-based software engineering

Abstract

Search-Based Software Engineering (SBSE) has become an increasingly important research paradigm for automating and solving different software engineering tasks. When the considered tasks have more than one objective/criterion to be optimized, they are called multi-objective. In such a scenario, the outcome is typically a set of incomparable solutions (i.e., mutually Pareto-nondominated), and a common question faced by many SBSE practitioners is: how to evaluate the obtained sets using the right methods and indicators in the SBSE context? In this technical briefing, we seek to provide a systematic methodology and guidelines for answering this question. We start off by discussing why formal evaluation methods/indicators are needed for multi-objective optimization problems in general, and present the results of a survey on how they have predominantly been used in SBSE. This is followed by a detailed introduction of representative evaluation methods and quality indicators used in SBSE, including their behaviors and preferences. Along the way, we demonstrate patterns and examples of potentially misleading usages/choices of evaluation methods and quality indicators in the SBSE community, highlighting their consequences. Afterward, we present a systematic methodology that can guide the selection and use of evaluation methods and quality indicators for a given SBSE problem, together with pointers that we hope will spark dialogues about future directions on this important research topic for SBSE. Lastly, we showcase several real-world multi-objective SBSE case studies, in which we demonstrate the consequences of incorrect usage and exemplify the implementation of the guidance provided.
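
The notion of Pareto nondominance underlying the abstract above can be made concrete with a short sketch. Minimization of all objectives is assumed, and the objective vectors in the usage note are invented examples, not data from the briefing.

```python
# Minimal sketch of Pareto dominance and nondominated filtering
# (all objectives minimized).

def dominates(a, b):
    """a dominates b iff a is no worse in every objective
    and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def nondominated(solutions):
    """Return the Pareto front of a list of objective vectors."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]
```

For example, among the vectors (1, 5), (2, 2), (5, 1), and (3, 3), the first three are incomparable to one another, while (3, 3) is dominated by (2, 2); quality indicators such as those surveyed in the briefing are ways of scoring such fronts as a whole.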

DOI: 10.1109/ICSE-Companion58688.2023.00096


Conducting Eye Tracking Studies in Software Engineering - Methodology and Pipeline

Authors: Sharif, Bonita and Begel, Andrew and Maletic, Jonathan I.
Keywords: program comprehension, empirical studies, eye-tracking

Abstract

This ICSE 2023 technical briefing covers state-of-the-art techniques for conducting eye tracking studies in software engineering. It is organized as a hands-on 180-minute briefing broken up into two 85-minute modules with a short break in between. The first module will teach participants the terminology and theories needed to understand eye tracking. The second will engage participants in hands-on group work to collect and analyze eye tracking data through a software pipeline. Our goal is to help participants learn, one-on-one, how to get started using eye tracking to support their own research goals. Our team has been working with eye tracking for over 15 years. The briefing is targeted at researchers, practitioners, and educators. Our eye tracking software infrastructure, iTrace, will be demonstrated in person with several state-of-the-art eye trackers. Eye tracking is gaining a lot of traction in the community, and we want to use the ICSE platform to communicate the current state of the art (including limitations and workarounds) in a highly interactive setting, covering theory, data collection, and the processing pipeline. Sample data and scripts will be made available.

DOI: 10.1109/ICSE-Companion58688.2023.00097


The Landscape of Source Code Representation Learning in AI-Driven Software Engineering Tasks

Authors: Chimalakonda, Sridhar and Das, Debeshee and Mathai, Alex and Tamilselvam, Srikanth and Kumar, Atul
Keywords: machine learning, code representation

Abstract

Appropriate representation of source code and its relevant properties forms the backbone of Artificial Intelligence (AI)/Machine Learning (ML) pipelines for various software engineering (SE) tasks such as code classification, bug prediction, code clone detection, and code summarization. In the literature, researchers have extensively experimented with different kinds of source code representations (syntactic, semantic, integrated, customized) and ML techniques such as pre-trained BERT models. In addition, it is common for researchers to create hand-crafted and customized source code representations for a given SE task. In a 2018 survey [1], Allamanis et al. listed nearly 35 different ways of representing source code, such as Abstract Syntax Trees (ASTs), customized ASTs, Control Flow Graphs (CFGs), Data Flow Graphs (DFGs), and so on, for different SE tasks. The main goal of this tutorial is two-fold: (i) to present an overview of the state of the art in source code representations and corresponding ML pipelines, with an explicit focus on the merits and demerits of each representation, and (ii) to discuss the practical challenges of infusing different code-views into state-of-the-art ML models and future research directions.
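
As a small, self-contained illustration of one syntactic representation mentioned above (not tied to any specific tool from the tutorial), Python's standard-library `ast` module can turn source code into an AST, which is then commonly flattened into a node-type sequence as input for ML models.

```python
# Extract an AST with the stdlib ast module and flatten it into a
# sequence of node-type names, a common input form for code models.

import ast

def ast_node_sequence(source: str):
    """Return node type names from a breadth-first walk of the AST."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]
```

For `"x = 1"` this yields a sequence starting with `Module` and containing `Assign`, `Name`, and `Constant`; richer representations (CFGs, DFGs) require additional analyses that the stdlib does not provide.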

DOI: 10.1109/ICSE-Companion58688.2023.00098


Technical Briefing on Socio-Technical Grounded Theory for Qualitative Data Analysis

Authors: Hoda, Rashina
Keywords: mixed methods, qualitative research, STGT, socio-technical grounded theory, qualitative data analysis

Abstract

Analysis of qualitative data as part of qualitative and mixed-method research studies is becoming increasingly common. Deriving rich and robust findings from qualitative data analysis requires a good understanding of the nature of qualitative data, relevant collection techniques, and the application of robust and systematic qualitative data analysis. This technical briefing will cover these, and work through real examples of qualitative data analysis using socio-technical grounded theory (STGT) in software engineering research studies.

DOI: 10.1109/ICSE-Companion58688.2023.00099


Personalized Action Suggestions in Low-Code Automation Platforms

Authors: Gupta, Saksham and Verbruggen, Gust and Singh, Mukul and Gulwani, Sumit and Le, Vu
Keywords: recommendation system, decoder, prediction, personalization, transformers

Abstract

Automation platforms aim to automate repetitive tasks using workflows, which start with a trigger and then perform a series of actions. However, with many possible actions, the user has to search for the desired action at each step, which hinders the speed of flow development. We propose a personalized transformer model that recommends the next item at each step. This personalization is learned end-to-end from user statistics that are available at inference time. We evaluated our model on workflows from Power Automate users and show that personalization improves top-1 accuracy by 22%. For new users, our model performs similarly to a model trained without personalization.
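
The underlying idea of conditioning next-action suggestions on per-user statistics can be illustrated with a deliberately simplified sketch. Note the gap to the paper: there the personalization is learned end-to-end inside the transformer, whereas the sketch below just blends fixed model scores with a user's action frequencies; all action names, scores, and counts are invented.

```python
# Toy illustration of personalizing next-action scores with per-user
# usage statistics available at inference time (not the paper's
# end-to-end learned mechanism).

def personalize(base_scores, user_counts, weight=0.5):
    """Mix base model scores with the user's normalized action frequencies."""
    total = sum(user_counts.values()) or 1
    return {
        action: (1 - weight) * score
                + weight * user_counts.get(action, 0) / total
        for action, score in base_scores.items()
    }

def top1(scores):
    """Action with the highest score."""
    return max(scores, key=scores.get)
```

With an empty `user_counts` (a new user), the blend degenerates to a scaled copy of the base scores, mirroring the paper's observation that the model behaves like a non-personalized one for new users.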

DOI: 10.1109/ICSE-Companion58688.2023.00100


Automated Feature Document Review via Interpretable Deep Learning

Authors: Ye, Ming and Chen, Yuanfan and Zhang, Xin and He, Jinning and Cao, Jicheng and Liu, Dong and Gao, Jing and Dai, Hailiang and Cheng, Shengyu
Keywords: interpretable deep learning, neural networks, agile methodology, feature documents

Abstract

A feature in the agile methodology is a function of a product that delivers business value and meets stakeholders’ requirements. Developers compile and store the content of features in a structured feature document. Feature documents play a critical role in controlling software development at a macro level. It is therefore important to ensure the quality of feature documents so that defects are not introduced at the outset. Manual review is an effective activity to ensure quality, but it is human-intensive and challenging. In this paper, we propose a feature document review tool that automates the manual review process (quality classification and suggestion generation) based on neural networks and interpretable deep learning. Our goal is to reduce human effort in reviewing feature documents and to prompt authors to craft better feature documents. We evaluated our tool on a real industrial project from ZTE Corporation. The results show that our quality classification model achieved 75.6% precision and 94.4% recall for poor-quality feature documents. For the suggestion generation model, about 70% of the poor-quality feature documents could be improved to the qualified level within three rounds of revision based on the suggestions. User feedback shows that our tool helps users save an average of 15.9% of their time.

DOI: 10.1109/ICSE-Companion58688.2023.00101


Challenges of Testing an Evolving Cancer Registration Support System in Practice

Authors: Laaber, Christoph and Yue, Tao and Ali, Shaukat and Schwitalla, Thomas and Nygård, Jan F.
Keywords: evolution, software testing, cancer registry, healthcare, research challenges

Abstract

The Cancer Registry of Norway (CRN) is a public body responsible for capturing and curating cancer patient data histories to provide a unified access to research data and statistics for doctors, patients, and policymakers. For this purpose, CRN develops and operates a complex, constantly-evolving, and socio-technical software system. Recently, machine learning (ML) algorithms have been introduced into this system to augment the manual decisions made by humans with automated decision support from learned models. To ensure that the system is correct and robust and cancer patients’ data are properly handled and do not violate privacy concerns, automated testing solutions are being developed. In this paper, we share the challenges that we identified when developing automated testing solutions at CRN. Such testing potentially impacts the quality of cancer data for years to come, which is also used by the system’s stakeholders to make critical decisions. The challenges identified are not specific to CRN but are also valid in the context of other healthcare registries. We also provide some details on initial solutions that we are investigating to solve the identified challenges.

DOI: 10.1109/ICSE-Companion58688.2023.00102


Towards Formal Repair and Verification of Industry-Scale Deep Neural Networks

Authors: Munakata, Satoshi and Tokumoto, Susumu and Yamamoto, Koji and Munakata, Kazuki
Keywords: formal method, equivalence, repair, quality assurance, deep neural network

Abstract

There is a strong demand in the industry to update a deep neural network (DNN) as quickly and safely as possible, in a user-driven way, to fix critical prediction failures found in a safety-critical ML-enabled system, enabling continuous quality assurance. DNN repair and equivalence verification (hereinafter called “verification”) with formal methods are promising technologies for this demand because they can guarantee desirable properties, such as repair “locality” (i.e., predictions do not degrade for known cases not subject to repair) and verification “detectability” (i.e., a degraded unknown case is found if it truly exists in the search space). However, the industrial application of these technologies is difficult, mainly due to the computational load of mathematical optimization over a large number of DNN parameters.

In this paper, we describe the challenges and a new solution, with an example, for realizing formal repair and verification of industry-scale DNNs. In this solution, repair and verification target only sparse parameter changes in a particular DNN layer and the space of inputs to that layer (i.e., the feature space). By specializing the mathematical optimization used in repair and verification, the computational load is dramatically reduced. On the other hand, the solution also introduces new challenges regarding the use of the feature space. We show, in a practical and quantifiable way, how to reasonably apply formal repair and verification and the specific challenges and issues involved, especially for industry-scale DNNs for image classification with live-action images.

DOI: 10.1109/ICSE-Companion58688.2023.00103


Can We Knapsack Software Defect Prediction? Nokia 5G Case

Authors: Stradowski, Szymon and Madeyski, Lech
Keywords: Nokia 5G, software development life cycle, continuous integration, software testing, software defect prediction, artificial intelligence

Abstract

As software products become larger and more complex, the test infrastructure needed for quality assurance grows similarly, causing a constant increase in operational and maintenance costs. Although rising in popularity, most Artificial Intelligence (AI) and Machine Learning (ML) Software Defect Prediction (SDP) solutions address singular test phases; the need to address the whole Software Development Life Cycle (SDLC) is rarely explored. Therefore, in this paper, we define the problem of extending the SDP concept to the entire SDLC, as this may be one of the significant next steps for the field. Furthermore, we explore the similarity between the defined challenge and the widely known Multidimensional Knapsack Problem (MKP). We use Nokia’s 5G wireless technology test process to illustrate the proposed concept. The resulting comparison validates the applicability of MKP to optimizing the overall test cycle, which can be similarly relevant to any large-scale industrial software development process.
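
The knapsack framing above can be grounded with the classic formulation: choosing which test activities to fund so that predicted defect-detection value is maximized under a budget. The sketch below shows only the one-dimensional 0/1 special case (the paper considers the multidimensional variant), and the values, costs, and budget are illustrative numbers, not Nokia data.

```python
# Classic 0/1 knapsack dynamic program: pick items (e.g., test
# activities) maximizing total value under a single cost budget.
# The MKP discussed in the paper generalizes this to several budgets.

def knapsack(values, costs, budget):
    """Return the maximum total value achievable within the budget."""
    best = [0] * (budget + 1)            # best[b] = best value with budget b
    for v, c in zip(values, costs):
        for b in range(budget, c - 1, -1):  # iterate downward: each item once
            best[b] = max(best[b], best[b - c] + v)
    return best[budget]
```

For items with values (60, 100, 120) and costs (10, 20, 30) under budget 50, the optimum is 220 (the second and third items); the multidimensional version replaces the single budget array with one dimension per resource constraint.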

DOI: 10.1109/ICSE-Companion58688.2023.00104


Enhancing Maritime Data Standardization and Integrity Using Docker and Blockchain

Authors: Wang, Shuai and Karandikar, Nikita and Knutsen, Knut Erik and Tong, Xiao Gang Tony and Edseth, Tom and Zile, Zealo Xu
Keywords: docker, blockchain, tamperproof, integrity, maritime data

Abstract

Massive maritime data is available nowadays to tackle vital industrial challenges such as autonomous vessels and carbon neutrality. These data can be collected on board vessels and shared with maritime stakeholders (e.g., ship owners and ship operators) for various missions. However, current industrial practice lacks a standardized framework and infrastructure for data collection, sharing, and analysis among different stakeholders. Moreover, data integrity/trust is essential, as malicious tampering by any unexpected party can severely undermine data trust and thus reduce the confidence of data utilization (e.g., of results obtained with AI). To address the above challenges, this paper presents a Docker- and blockchain-based solution named DataSafe to enhance data standardization and integrity. We also discuss several ongoing and potential applications of DataSafe in practice.

DOI: 10.1109/ICSE-Companion58688.2023.00105


Prioritizing Industrial Security Findings in Agile Software Development Projects

Authors: Voggenreiter, Markus and Schöpp, Ulrich
Keywords: prioritization, software engineering, security findings, agile

Abstract

Automating repetitive activities is a key principle in most software development approaches employed in the industry. This implies that security activities and all related processes should be investigated for automation capabilities, particularly the management of security findings and vulnerabilities. Considering the limited time available for each release and the vast flood of findings produced by automated security testing, prioritizing responses to security findings is essential.

In this paper, we present a partially automated process to prioritize security findings in industrial software development projects. We utilize a two-staged calculation process to produce a prioritization score, representing both the finding’s severity and factors such as stakeholder input. This process was evaluated by conducting structured interviews with security professionals while also integrating the approach into ongoing industrial software development projects. The results indicate the potential of the process in terms of usefulness and correctness for agile software development projects.

DOI: 10.1109/ICSE-Companion58688.2023.00106


UnitTestBot: Automated Unit Test Generation for C Code in Integrated Development Environments

Authors: Ivanov, Dmitry and Babushkin, Alexey and Grigoryev, Saveliy and Iatchenii, Pavel and Kalugin, Vladislav and Kichin, Egor and Kulikov, Egor and Misonizhnik, Aleksandr and Mordvinov, Dmitry and Morozov, Sergey and Naumenko, Olga and Pleshakov, Alexey and Ponomarev, Pavel and Shmidt, Svetlana and Utkin, Alexey and Volodin, Vadim and Volynets, Arseniy
Keywords: integrated development environment, KLEE, symbolic execution, automated unit test generation, software testing

Abstract

Symbolic execution (SE) is one of the most promising techniques for automated unit test generation, which is claimed to streamline the testing process and reduce developers’ effort. Symbolic execution engines exist for Java, C, C#, C++, Python, and .NET. The KLEE dynamic symbolic execution engine is one of the most elaborate: it is built on top of the LLVM compiler infrastructure and can automatically generate inputs for C code unit testing. There have been numerous attempts to apply KLEE to real-life software projects, yet industry experience still shows little transfer from research to practice. Extensions to popular integrated development environments (IDEs) are expected to break down this barrier. Since there are not many working tools of this kind, we share our experience of implementing the KLEE-based Visual Studio Code and CLion extensions for generating ready-to-use test cases, UnitTestBot for C code, and describe the challenges we had to overcome. We also share the solutions we came up with: without introducing “new” techniques, we made automated unit test generation truly automated and supplemented it with a simple wizard interface. That was enough to turn an effective but demanding technology into a user-friendly tool that is easy to adopt. Finally, we provide examples of running UnitTestBot on open-source projects as well as Huawei non-public code.

DOI: 10.1109/ICSE-Companion58688.2023.00107


Challenges of Evolving Legacy Software in a Small Team

作者: Owens, Bowie and Lee, Geoffrey and Zhu, Zili and Lo, Thomas
关键词: software maintenance

Abstract

This document describes some of the challenges faced by a small team developing software for the finance industry. Several of these challenges arise from being limited in the practices and tools we can utilize. We identify some pragmatic practices we utilize instead of better (but impractical for us) practices. Finally, we conclude by suggesting areas where research could increase the rigor of the practices we utilize.

DOI: 10.1109/ICSE-Companion58688.2023.00108


Future Software for Life in Trusted Futures

作者: Pink, Sarah
关键词: software engineering and society

Abstract

How will people, other species, software and hardware live together in as yet unknown futures? How can we work towards trusted and safe futures where human values and the environment are supported by emerging technologies? Research demonstrates that human values and everyday life priorities, ethics, routines and activities will shape our possible futures. I will draw on ethnographic research to outline how people anticipate and imagine everyday life futures with emerging technologies in their homes and neighbourhoods, and how technology workers envisage futures in their professional lives. If, as social science research shows, technologies cannot solve human and societal problems, what roles should they play in future life? What are the implications for future software? What values should underpin its design? Where should it be developed? By and in collaboration with whom? What role can software play in generating the circumstances for trusted futures?

DOI: 10.1109/ICSE48619.2023.00010


The Road Toward Dependable AI Based Systems

作者: Tonella, Paolo
关键词: software testing, deep learning, reliability and dependability

Abstract

With the advent of deep learning, AI components have achieved unprecedented performance on complex, human-competitive tasks, such as image, video, text, and audio processing. Hence, they are increasingly integrated into sophisticated software systems, some of which (e.g., autonomous vehicles) are required to deliver certified dependability warranties. In this talk, I will consider the unique features of AI-based systems and of the faults possibly affecting them, in order to revise the testing fundamentals and redefine the overall goal of testing, taking a statistical view on the dependability warranties that can actually be delivered. Then, I will consider the key elements of a revised testing process for AI-based systems, including the test oracle and test input generation problems. I will also introduce the notion of runtime supervision to deal with unexpected error conditions that may occur in the field. Finally, I will identify the future steps that are essential to close the loop from testing to operation, proposing an empirical framework that reconnects the output of testing to its original goals.

DOI: 10.1109/ICSE48619.2023.00011


Software Engineering as the Linchpin of Responsible AI

作者: Zhu, Liming
关键词: responsible AI, ethical AI, trustworthy AI, AI engineering, SE4AI

Abstract

From humanity’s existential risks to safety risks in critical systems to ethical risks, responsible AI, as the saviour, has become a major research challenge with significant real-world consequences. However, achieving responsible AI remains elusive despite the plethora of high-level ethical principles, risk frameworks and progress in algorithmic assurance. In the meantime, software engineering (SE) is being upended by AI, grappling with building system-level quality and alignment from inscrutable machine learning models and code generated from natural language prompts. The upending poses new challenges and opportunities for engineering AI systems responsibly. This talk will share our experiences in helping the industry achieve responsible AI systems by inventing new SE approaches. It will dive into industry challenges (such as risk silos and principle-algorithm gaps) and research challenges (such as lack of requirements, emerging properties and inscrutable systems) and make the point that SE is the linchpin of responsible AI. But SE also requires some fundamental rethinking - shifting from building functions into AI systems to discovering and managing emerging functions from AI systems. Only by doing so can SE take on critical new roles, from understanding human intelligence to building a thriving human-AI symbiosis.

DOI: 10.1109/ICSE48619.2023.00012


One Adapter for All Programming Languages? Adapter Tuning for Code Search and Summarization

作者: Wang, Deze and Chen, Boxing and Li, Shanshan and Luo, Wei and Peng, Shaoliang and Dong, Wei and Liao, Xiangke
关键词: transfer learning, adapter, multilingual task

Abstract

As pre-trained models automate many code intelligence tasks, a widely used paradigm is to fine-tune a model on the task dataset for each programming language. A recent study reported that multilingual fine-tuning benefits a range of tasks and models. However, we find that multilingual fine-tuning leads to performance degradation on the recent models UniXcoder and CodeT5. To alleviate the potential catastrophic forgetting issue in multilingual models, we fix all pre-trained model parameters, insert a parameter-efficient adapter structure, and fine-tune it. Updating only 0.6% of the overall parameters compared to full-model fine-tuning for each programming language, adapter tuning yields consistent improvements on code search and summarization tasks, achieving state-of-the-art results. In addition, we experimentally show its effectiveness in cross-lingual and low-resource scenarios. Multilingual fine-tuning with 200 samples per programming language approaches the results of fine-tuning with the entire dataset on code summarization. Our experiments on three probing tasks show that adapter tuning significantly outperforms full-model fine-tuning and effectively overcomes catastrophic forgetting.

DOI: 10.1109/ICSE48619.2023.00013
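
As a loose illustration of the adapter-tuning recipe in the abstract above (freeze the pre-trained weights, train only a small bottleneck module with a residual connection), here is a dependency-free Python sketch. The layer sizes, the ReLU bottleneck, and the zero initialization are illustrative assumptions, not details taken from the paper:

```python
# Minimal adapter-tuning sketch (hypothetical dimensions, not the
# authors' implementation): the pre-trained projection stays frozen and
# only a small bottleneck adapter with a residual connection is trained.

def linear(x, W, b):
    # y = x @ W + b for a single input vector x (W given as rows).
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*W), b)]

class AdapterLayer:
    def __init__(self, hidden=8, bottleneck=2):
        # Frozen "pre-trained" projection (stand-in for a transformer layer).
        self.W_frozen = [[0.01 * (i + j) for j in range(hidden)] for i in range(hidden)]
        self.b_frozen = [0.0] * hidden
        # Trainable adapter: down-project, nonlinearity, up-project.
        self.W_down = [[0.0] * bottleneck for _ in range(hidden)]
        self.b_down = [0.0] * bottleneck
        self.W_up = [[0.0] * hidden for _ in range(bottleneck)]
        self.b_up = [0.0] * hidden

    def forward(self, x):
        h = linear(x, self.W_frozen, self.b_frozen)
        a = [max(0.0, v) for v in linear(h, self.W_down, self.b_down)]  # ReLU bottleneck
        u = linear(a, self.W_up, self.b_up)
        return [hi + ui for hi, ui in zip(h, u)]  # residual connection

    def param_counts(self):
        count = lambda m: sum(len(row) for row in m)
        frozen = count(self.W_frozen) + len(self.b_frozen)
        trainable = (count(self.W_down) + len(self.b_down)
                     + count(self.W_up) + len(self.b_up))
        return trainable, frozen + trainable
```

Because the up-projection starts at zero, the adapter initially leaves the frozen layer’s output unchanged, a common way to make adapter insertion safe; in a real transformer the frozen backbone is vastly larger than the adapters, which is how the trainable share can drop to fractions of a percent such as the 0.6% reported above.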


CCRep: Learning Code Change Representations via Pre-Trained Code Model and Query Back

作者: Liu, Zhongxin and Tang, Zhijie and Xia, Xin and Yang, Xiaohu
关键词: code change, representation learning, commit message generation, patch correctness assessment, just-in-time defect prediction

Abstract

Representing code changes as numeric feature vectors, i.e., code change representations, is usually an essential step to automate many software engineering tasks related to code changes, e.g., commit message generation and just-in-time defect prediction. Intuitively, the quality of code change representations is crucial for the effectiveness of automated approaches. Prior work on code changes usually designs and evaluates code change representation approaches for a specific task, and little work has investigated code change encoders that can be used and jointly trained on various tasks. To fill this gap, this work proposes a novel Code Change Representation learning approach named CCRep, which can learn to encode code changes as feature vectors for diverse downstream tasks. Specifically, CCRep regards a code change as the combination of its before-change and after-change code, leverages a pre-trained code model to obtain high-quality contextual embeddings of code, and uses a novel mechanism named query back to extract and encode the changed code fragments and make them explicitly interact with the whole code change. To evaluate CCRep and demonstrate its applicability to diverse code-change-related tasks, we apply it to three tasks: commit message generation, patch correctness assessment, and just-in-time defect prediction. Experimental results show that CCRep outperforms the state-of-the-art techniques on each task.

DOI: 10.1109/ICSE48619.2023.00014


Keeping Pace with Ever-Increasing Data: Towards Continual Learning of Code Intelligence Models

作者: Gao, Shuzheng and Zhang, Hongyu and Gao, Cuiyun and Wang, Chaozheng
关键词: No keywords

Abstract

Previous research on code intelligence usually trains a deep learning model on a fixed dataset in an offline manner. However, in real-world scenarios, new code repositories emerge incessantly, and the new knowledge they carry is beneficial for providing up-to-date code intelligence services to developers. In this paper, we aim at the following problem: how can code intelligence models continually learn from ever-increasing data? One major challenge here is catastrophic forgetting, meaning that the model can easily forget knowledge learned from previous datasets when learning from a new dataset. To tackle this challenge, we propose REPEAT, a novel method for continual learning of code intelligence models. Specifically, REPEAT addresses the catastrophic forgetting problem with representative exemplar replay and adaptive parameter regularization. The representative exemplar replay component selects informative and diverse exemplars in each dataset and uses them to retrain the model periodically. The adaptive parameter regularization component recognizes important parameters in the model and adaptively penalizes their changes to preserve previously learned knowledge. We evaluate the proposed approach on three code intelligence tasks: code summarization, software vulnerability detection, and code clone detection. Extensive experiments demonstrate that REPEAT consistently outperforms baseline methods on all tasks. For example, REPEAT improves the conventional fine-tuning method by 1.22, 5.61, and 1.72 on code summarization, vulnerability detection, and clone detection, respectively.

DOI: 10.1109/ICSE48619.2023.00015
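
The two mechanisms the abstract attributes to REPEAT can be mimicked in a few lines of Python. Note that the exemplar-selection heuristic and the per-parameter importance weights below are simplistic stand-ins; the paper’s actual selection and regularization strategies are more involved:

```python
# Toy sketch of continual learning with (1) exemplar replay and
# (2) importance-weighted parameter regularization. All heuristics here
# are illustrative stand-ins, not REPEAT's actual algorithms.

def select_exemplars(dataset, k):
    # Stand-in for "informative and diverse" selection: take evenly
    # spaced items from the sorted dataset rather than a random sample.
    data = sorted(dataset)
    step = max(1, len(data) // k)
    return data[::step][:k]

def regularized_loss(task_loss, params, old_params, importance, lam=1.0):
    # EWC-style penalty: parameters judged important for earlier tasks
    # are pulled back toward the values they held after those tasks.
    penalty = sum(w * (p - q) ** 2
                  for p, q, w in zip(params, old_params, importance))
    return task_loss + lam * penalty

def replay_batch(new_batch, exemplar_buffer):
    # Mix replayed exemplars into every batch drawn from the new dataset.
    return list(new_batch) + list(exemplar_buffer)
```

A training loop in this style would call `select_exemplars` after finishing each dataset, mix the buffer into future batches via `replay_batch`, and minimize `regularized_loss` instead of the plain task loss.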


Detecting JVM JIT Compiler Bugs via Exploring Two-Dimensional Input Spaces

作者: Jia, Haoxiang and Wen, Ming and Xie, Zifan and Guo, Xiaochen and Wu, Rongxin and Sun, Maolin and Chen, Kang and Jin, Hai
关键词: JVM, JIT compiler, JVM testing

Abstract

Java Virtual Machine (JVM) is the fundamental software system that supports the interpretation and execution of Java bytecode. To meet the surging performance demands of increasingly complex and large-scale Java programs, the Just-in-Time (JIT) compiler was introduced to perform sophisticated runtime optimization. However, this inevitably induces various bugs, which are becoming more pervasive over the decades and can often cause significant consequences. To facilitate the design of effective and efficient testing techniques for detecting JIT compiler bugs, we first conduct a preliminary study to understand the characteristics of JIT compiler bugs and the corresponding triggering test cases. Inspired by the empirical findings, we propose JOpFuzzer, a new JVM testing approach with a specific focus on JIT compiler bugs. The main novelty of JOpFuzzer is embodied in three aspects. First, besides generating new seeds, JOpFuzzer also searches for diverse configurations along the new dimension of optimization options. Second, JOpFuzzer learns the correlations between various code features and different optimization options to guide the process of seed mutation and option exploration. Third, it leverages profile data, which reveals program execution information, to guide the fuzzing process. These novelties enable JOpFuzzer to effectively and efficiently explore the two-dimensional input space. Extensive evaluation shows that JOpFuzzer outperforms the state-of-the-art approaches in terms of achieved code coverage. More importantly, it has detected 41 bugs in OpenJDK, and 25 of them have already been confirmed or fixed by the corresponding developers.

DOI: 10.1109/ICSE48619.2023.00016


JITfuzz: Coverage-Guided Fuzzing for JVM Just-in-Time Compilers

作者: Wu, Mingyuan and Lu, Minghai and Cui, Heming and Chen, Junjie and Zhang, Yuqun and Zhang, Lingming
关键词: No keywords

Abstract

As a widely used platform supporting various Java-bytecode-based applications, the Java Virtual Machine (JVM) incurs severe performance loss from its real-time program interpretation mechanism. To tackle this issue, the Just-in-Time (JIT) compiler has been widely adopted to strengthen the efficiency of the JVM. Therefore, effectively and efficiently detecting JIT bugs is critical to ensuring the correctness of the JVM. In this paper, we propose a coverage-guided fuzzing framework, named JITfuzz, to automatically detect JIT bugs. In particular, JITfuzz adopts a set of optimization-activating mutators to trigger typical JIT optimizations, e.g., function inlining and simplification. Meanwhile, since JIT optimizations are closely coupled with program control flow, JITfuzz also adopts mutators to enrich the control flow of target programs. Moreover, JITfuzz proposes a mutator scheduler that iteratively schedules mutators according to coverage updates to maximize the code coverage of the JIT. To evaluate the effectiveness of JITfuzz, we conduct experiments on a benchmark suite of 16 popular JVM-based projects from GitHub. The experimental results suggest that JITfuzz outperforms the state-of-the-art mutation-based and generation-based JVM fuzzers by 27.9% and 18.6%, respectively, in terms of average edge coverage. Furthermore, JITfuzz successfully detects 36 previously unknown bugs (including 23 JIT bugs), 27 of which (including 18 JIT bugs) have been confirmed by the developers.

DOI: 10.1109/ICSE48619.2023.00017


Validating SMT Solvers via Skeleton Enumeration Empowered by Historical Bug-Triggering Inputs

作者: Sun, Maolin and Yang, Yibiao and Wen, Ming and Wang, Yongcong and Zhou, Yuming and Jin, Hai
关键词: SMT solver, fuzzing, skeleton enumeration, association rules, bug detection

Abstract

SMT solvers check the satisfiability of logic formulas over first-order theories and have been utilized in many critical applications, such as software verification, test case generation, and program synthesis. Bugs hidden in SMT solvers can severely mislead those applications and further cause severe consequences. Therefore, ensuring the reliability and robustness of SMT solvers is of critical importance. Although many approaches have been proposed to test SMT solvers, discovering bugs effectively remains a challenge. To tackle this challenge, we conduct an empirical study on the historical bug-triggering formulas in SMT solvers’ bug tracking systems. We observe that the historical bug-triggering formulas contain valuable skeletons (i.e., core structures of formulas) as well as associated atomic formulas that can have a significant impact on a formula’s ability to trigger bugs. Therefore, we propose a novel approach that utilizes the skeletons extracted from historical bug-triggering formulas and enumerates atomic formulas under the guidance of association rules derived from historical formulas. We realized our approach as a practical fuzzing tool, HistFuzz, and conducted extensive testing on the well-known SMT solvers Z3 and cvc5. To date, HistFuzz has found 111 confirmed new bugs in Z3 and cvc5, of which 108 have been fixed by the developers. More notably, 23 of the confirmed bugs are soundness bugs or invalid-model bugs found in the solvers’ default mode, which are essential for SMT solvers. In addition, our experiments demonstrate that HistFuzz outperforms state-of-the-art SMT solver fuzzers in terms of achieved code coverage and effectiveness.

DOI: 10.1109/ICSE48619.2023.00018


Regression Fuzzing for Deep Learning Systems

作者: You, Hanmo and Wang, Zan and Chen, Junjie and Liu, Shuang and Li, Shuochuan
关键词: regression, fuzzing, deep learning

Abstract

Deep learning (DL) systems have been widely used in various domains. Similar to traditional software, DL system evolution may also incur regression faults. To find regression faults between versions of a DL system, we propose a novel regression fuzzing technique called DRFuzz, which generates inputs that trigger diverse regression faults and have high fidelity. To enhance the diversity of the detected regression faults, DRFuzz proposes a diversity-oriented test criterion to explore as many faulty behaviors as possible. DRFuzz then incorporates a GAN model to guarantee the fidelity of the generated test inputs. We conduct an extensive study on four subjects in four regression scenarios of DL systems. The experimental results demonstrate the superiority of DRFuzz over the two compared state-of-the-art approaches, with average improvements of 1,177% and 539% in the number of detected regression faults.

DOI: 10.1109/ICSE48619.2023.00019


Operand-Variation-Oriented Differential Analysis for Fuzzing Binding Calls in PDF Readers

作者: Guo, Suyue and Wan, Xinyu and You, Wei and Liang, Bin and Shi, Wenchang and Zhang, Yiwei and Huang, Jianjun and Zhang, Jian
关键词: binding call, PDF reader, type reasoning, fuzzing

Abstract

Binding calls of embedded scripting engines introduce a serious attack surface in PDF readers. To effectively test binding calls, the knowledge of parameter types is necessary. Unfortunately, due to the absence or incompleteness of documentation and the lack of sufficient samples, automatic type reasoning for binding call parameters is a big challenge. In this paper, we propose a novel operand-variation-oriented differential analysis approach, which automatically extracts features from execution traces as oracles for inferring parameter types. In particular, the parameter types of a binding call are inferred by executing the binding call with different values of different types and investigating which types cause an expected effect on the instruction operands. The inferred type information is used to guide the test generation in fuzzing. Through the evaluation on two popular PDF readers (Adobe Reader and Foxit Reader), we demonstrated the accuracy of our type reasoning method and the effectiveness of the inferred type information for improving fuzzing in both code coverage and vulnerability discovery. We found 38 previously unknown security vulnerabilities, 26 of which were certified with CVE numbers.

DOI: 10.1109/ICSE48619.2023.00020


The Untold Story of Code Refactoring Customizations in Practice

作者: Oliveira, Daniel and Assunção
关键词: refactoring, custom refactoring, refactoring tooling support

Abstract

Refactoring is a common software maintenance practice. The literature defines standard code modifications for each refactoring type and popular IDEs provide refactoring tools aiming to support these standard modifications. However, previous studies indicated that developers either frequently avoid using these tools or end up modifying and even reversing the code automatically refactored by IDEs. Thus, developers are forced to manually apply refactorings, which is cumbersome and error-prone. This means that refactoring support may not be entirely aligned with practical needs. The improvement of tooling support for refactoring in practice requires understanding in what ways developers tailor refactoring modifications. To address this issue, we conduct an analysis of 1,162 refactorings composed of more than 100k program modifications from 13 software projects. The results reveal that developers recurrently apply patterns of additional modifications along with the standard ones, from here on called patterns of customized refactorings. For instance, we found customized refactorings in 80.77% of the Move Method instances observed in the software projects. We also investigated the features of refactoring tools in popular IDEs and observed that most of the customization patterns are not fully supported by them. Additionally, to understand the relevance of these customizations, we conducted a survey with 40 developers about the most frequent customization patterns we found. Developers confirm the relevance of customization patterns and agree that improvements in IDE’s refactoring support are needed. These observations highlight that refactoring guidelines must be updated to reflect typical refactoring customizations. Also, IDE builders can use our results as a basis to enable a more flexible application of automated refactorings. For example, developers should be able to choose which method must handle exceptions when extracting an exception code into a new method.

DOI: 10.1109/ICSE48619.2023.00021


Data Quality for Software Vulnerability Datasets

作者: Croft, Roland and Babar, M. Ali and Kholoosi, M. Mehdi
关键词: software vulnerability, data quality, machine learning

Abstract

The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20–71% of vulnerability labels to be inaccurate in real-world datasets, and 17–99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.

DOI: 10.1109/ICSE48619.2023.00022


Do Code Refactorings Influence the Merge Effort?

作者: Oliveira, André
关键词: software merge, merge effort, refactoring, association rules, data mining

Abstract

In collaborative software development, multiple contributors frequently change the source code in parallel to implement new features, fix bugs, refactor existing code, and make other changes. These simultaneous changes need to be merged into the same version of the source code. However, the merge operation can fail, and developer intervention is required to resolve the conflicts. Studies in the literature show that 10 to 20 percent of all merge attempts result in conflicts that require manual intervention to complete the process. In this paper, we focus on a specific type of change that affects the structure of the source code and has the potential to increase the merge effort: code refactorings. We analyze the relationship between the occurrence of refactorings and the merge effort. To do so, we applied a data mining technique called association rule extraction to find patterns of behavior that allow us to analyze the influence of refactorings on the merge effort. Our experiments extracted association rules from 40,248 merge commits that occurred in 28 popular open-source projects. The results indicate that: (i) the occurrence of refactorings increases the chances of merge effort; (ii) the more refactorings, the greater the chances of effort; (iii) the more refactorings, the greater the effort; and (iv) parallel refactorings further increase both the chances of effort and its intensity. These results may suggest behavioral changes in the way development teams apply refactorings. In addition, they can indicate ways to improve tools that support code merging and tools that recommend refactorings, considering the number of refactorings and merge effort attributes.

DOI: 10.1109/ICSE48619.2023.00023


A Comprehensive Study of Real-World Bugs in Machine Learning Model Optimization

作者: Guan, Hao and Xiao, Ying and Li, Jiaying and Liu, Yepang and Bai, Guangdong
关键词: machine learning, model optimization, bugs

Abstract

Due to the great advances in machine learning (ML) techniques, numerous ML models have been expanding their application domains in recent years. To adapt to resource-constrained platforms such as mobile and Internet of Things (IoT) devices, pre-trained models are often processed to enhance their efficiency and compactness, using optimization techniques such as pruning and quantization. Similar to the optimization process in other complex systems, e.g., program compilers and databases, optimizations for ML models can contain bugs, leading to severe consequences such as system crashes and financial loss. While bugs in the training, compiling, and deployment stages have been extensively studied, there is still a lack of systematic understanding and characterization of model optimization bugs (MOBs). In this work, we conduct the first empirical study to identify and characterize MOBs. We collect a comprehensive dataset containing 371 MOBs from TensorFlow and PyTorch, the most extensively used open-source ML frameworks, covering the entire development time span of their optimizers (May 2019 to August 2022). We then investigate the collected bugs from various perspectives, including their symptoms, root causes, life cycles, detection, and fixes. Our work unveils the status quo of MOBs in the wild and reveals features on which future detection techniques can be based. Our findings also serve as a warning to the developers and users of ML frameworks, and an appeal to our research community to enact dedicated countermeasures.

DOI: 10.1109/ICSE48619.2023.00024


Evaluating the Impact of Experimental Assumptions in Automated Fault Localization

作者: Soremekun, Ezekiel and Kirschner, Lukas and Böhme
关键词: fault localization, program repair, user study

Abstract

Much research on automated program debugging assumes that bug fix locations indicate the faults’ root causes, and that root causes of faults lie within single code elements (statements). It is also often assumed that the number of statements a developer would need to inspect before finding the first faulty statement reflects debugging effort. Although intuitive, these three assumptions are typically used (55% of experiments in surveyed publications make at least one of them) without any consideration of their effects on the debugger’s effectiveness and their potential impact on developers in practice. To deal with this issue, we perform controlled experimentation, split testing in particular, using 352 bugs from 46 open-source C programs, 19 Automated Fault Localization (AFL) techniques (18 statistical debugging formulas and dynamic slicing), two state-of-the-art automated program repair (APR) techniques (GenProg and Angelix), and 76 professional developers. Our results show that these assumptions conceal the difficulty of debugging. They make AFL techniques appear to be (up to 38%) more effective, and make APR tools appear to be (2X) less effective. We also find that most developers (83%) consider these assumptions unsuitable for debuggers and, perhaps worse, that they may inhibit development productivity. The majority (66%) of developers prefer debugging diagnoses without these assumptions twice as much as those with them. Our findings motivate the need to assess debuggers conservatively, i.e., without these assumptions.

DOI: 10.1109/ICSE48619.2023.00025


Locating Framework-Specific Crashing Faults with Compact and Explainable Candidate Set

作者: Yan, Jiwei and Wang, Miaomiao and Liu, Yepang and Yan, Jun and Zhang, Long
关键词: fault localization, framework-specific exception, crash stack trace, android application

Abstract

Nowadays, many applications do not exist independently but rely on various frameworks or libraries. The frequent evolution and complex implementation of framework APIs induce many unexpected post-release crashes. Starting from crash stack traces, existing approaches either perform application-level call graph (CG) tracing or construct datasets with similar crash-fixing records to locate buggy methods. However, these approaches are limited by the completeness of the CG or depend on historical fixing records, and some of them only focus on specific, manually modeled exception types. To achieve effective debugging of complex framework-specific crashes, we propose a code-separation-based locating approach that relies only weakly on CG tracing and does not require any prior knowledge. Our key insight is that a crash trace with its description message can be mapped to a definite exception-thrown point in the framework, whose semantic analysis can help figure out the root causes of the crash-triggering procedure. Thus, we can pre-construct reusable summaries for all framework-specific exceptions to support fault localization in application code. Based on this idea, we design the exception-thrown summary (ETS), which describes both the key variables and key APIs related to exception triggering. We then perform static analysis to automatically compute such summaries and track the key variables and APIs in the application code to obtain ranked buggy candidates. In the scenario of locating Android framework-specific crashing faults, our tool CrashTracker achieves an overall MRR value of 0.91 and outperforms the state-of-the-art tool Anchor with higher precision. It provides a compact candidate set and gives user-friendly reports with explainable reasons for each candidate.

DOI: 10.1109/ICSE48619.2023.00026


PExReport: Automatic Creation of Pruned Executable Cross-Project Failure Reports

作者: Huang, Sunzhou and Wang, Xiaoyin
关键词: cross-project failure, executable failure report, failure reproduction, build tool, build environment, debloating

Abstract

Modern software development extensively depends on existing libraries written by other developer teams from the same or a different organization. When a developer executes the software, the execution trace may cross the boundaries of multiple software products and create cross-project failures (CPFs). Existing studies show that a stand-alone executable failure report may enable the most effective communication, but creating such a report is often challenging due to the complicated interactions among files and dependencies in software ecosystems. In this paper, to solve the CPF report trilemma, we developed PExReport, which automatically creates stand-alone executable CPF reports. PExReport leverages build tools to prune source code and dependencies, and further analyzes the build process to create a pruned build environment for reproducing the CPF. We performed an evaluation on 74 software project issues with 198 CPFs, and the results show that PExReport can create executable CPF reports for 184 out of the 198 test failures in our dataset, with an average reduction of 72.97% in source classes and classes in internal JARs.

DOI: 10.1109/ICSE48619.2023.00027


RAT: A Refactoring-Aware Traceability Model for Bug Localization

作者: Niu, Feifei and Assunção
关键词: bug localization, bug report similarity, code refactoring, traceability, commit history, information retrieval

Abstract

A large number of bug reports are created during the evolution of a software system. Locating the source code files that need to be changed to fix these bugs is a challenging task. Information retrieval-based bug localization techniques do so by correlating bug reports with historical information about the source code (e.g., previously resolved bug reports, commit logs). These techniques have been shown to be efficient and easy to use. However, one flaw that is nearly omnipresent in all these techniques is that they ignore code refactorings. Code refactorings are common during software system evolution, but from the perspective of typical version control systems, they break the code history. For example, a renamed class appears as two separate classes with separate histories. Obviously, this is a problem that affects any technique that leverages code history. This paper proposes a refactoring-aware traceability model to keep track of the code evolution history. With this model, we reconstruct the code history by analyzing the impact of code refactorings to correctly stitch together what would otherwise be a fragmented history. To demonstrate that a refactoring-aware history is indeed beneficial, we investigated three widely adopted bug localization techniques that make use of code history and are important components in existing approaches. Our evaluation on 11 open-source projects shows that taking code refactorings into account significantly improves the results of these bug localization techniques without significant changes to the techniques themselves. The more refactorings are used in a project, the stronger the benefit we observed. Based on our findings, we believe that much of the state of the art leveraging code history should benefit from our work.

DOI: 10.1109/ICSE48619.2023.00028


How Do We Read Formal Claims? Eye-Tracking and the Cognition of Proofs about Algorithms

作者: Ahmad, Hammad and Karas, Zachary and Diaz, Kimberly and Kamil, Amir and Jeannin, Jean-Baptiste and Weimer, Westley
关键词: formalism comprehension, student cognition, eye-tracking, facial behavior analysis, human study

Abstract

Formal methods are used successfully in high-assurance software, but they require rigorous mathematical and logical training that practitioners often lack. As such, integrating formal methods into software has been associated with numerous challenges. While educators have placed emphasis on formalisms in undergraduate theory courses, such courses often struggle with poor student outcomes and satisfaction. In this paper, we present a controlled eye-tracking human study (n = 34) investigating the problem-solving strategies employed by students with different levels of incoming preparation (as assessed by theory coursework taken and pre-screening performance on a proof comprehension task), and how educators can better prepare low-outcome students for the rigorous logical reasoning that is a core part of formal methods in software engineering. Surprisingly, we find that incoming preparation is not a good predictor of student outcomes for formalism comprehension tasks, and that student self-reports are not accurate at identifying factors associated with high outcomes for such tasks. Instead, and importantly, we find that differences in outcomes can be attributed to performance on proofs by induction and recursive algorithms, and that better-performing students exhibit significantly more attention-switching behaviors, a result that has several implications for pedagogy in terms of the design of teaching materials. Our results suggest the need for a substantial pedagogical intervention in core theory courses to better align student outcomes with the objectives of mastering and retaining the material, and thus better preparing students for high-assurance software engineering.

DOI: 10.1109/ICSE48619.2023.00029


Which of My Assumptions are Unnecessary for Realizability and Why Should I Care?

作者: Shalom, Rafi and Maoz, Shahar
关键词: No keywords

Abstract

Specifications for reactive systems synthesis consist of assumptions and guarantees. However, some specifications may include unnecessary assumptions, i.e., assumptions that are not necessary for realizability. While the controllers that are synthesized from such specifications are correct, they are also inflexible and fragile; their executions will satisfy the specification’s guarantees in only very specific environments. In this work we show how to detect unnecessary assumptions, and to transform any realizable specification into a corresponding realizable core specification, one that includes the same guarantees but no unnecessary assumptions. We do this by computing an assumptions core, a locally minimal subset of assumptions that suffices for realizability. Controllers that are synthesized from a core specification are not only correct but, importantly, more general; their executions will satisfy the specification’s guarantees in more environments. We implemented our ideas in the Spectra synthesis environment, and evaluated their impact over different benchmarks from the literature. The evaluation provides evidence for the motivation and significance of our work, by showing (1) that unnecessary assumptions are highly prevalent, (2) that in almost all cases the fully-automated removal of unnecessary assumptions pays off in total synthesis time, and (3) that core specifications induce more general controllers whose reachable state space is larger but whose representation is more memory-efficient.

DOI: 10.1109/ICSE48619.2023.00030


UpCy: Safely Updating Outdated Dependencies

作者: Dann, Andreas and Hermann, Ben and Bodden, Eric
关键词: semantic versioning, library updates, package management, dependency management, software maintenance

Abstract

Recent research has shown that developers hesitate to update dependencies and mistrust automated approaches such as Dependabot, since they are afraid of introducing incompatibilities that break their project. In fact, such approaches only suggest naïve updates.

DOI: 10.1109/ICSE48619.2023.00031


APICad: Augmenting API Misuse Detection through Specifications from Code and Documents

作者: Wang, Xiaoke and Zhao, Lei
关键词: No keywords

Abstract

The use of an API should follow its specifications; otherwise, security can be impacted and functionality damaged. To detect API misuse, we need to know what the API’s specifications are. In addition to being provided manually, current tools usually mine the majority usage in an existing codebase as specifications, or capture specifications from relevant texts written in natural language. However, the former depends on the quality of the codebase itself, while the latter is limited by the irregularity of the text. In this work, we observe that the information carried by code and documents can complement each other. To mitigate the demand for a high-quality codebase and reduce the pressure to capture valid information from texts, we present APICad to detect API misuse bugs in C/C++ by combining the specifications mined from code and documents. On the one hand, we effectively build the contexts for API invocations and mine specifications from them through a frequency-based method. On the other hand, we acquire specifications from documents by using lightweight keyword-based and NLP-assisted techniques. Finally, the combined specifications are generated for bug detection. Experiments show that APICad can handle diverse API usage semantics to deal with different types of API misuse bugs. With the help of APICad, we report 153 new bugs in Curl, Httpd, OpenSSL and the Linux kernel, 145 of which have been confirmed and 126 of which have been fixed with our patches.

DOI: 10.1109/ICSE48619.2023.00032


Compatibility Issue Detection for Android Apps Based on Path-Sensitive Semantic Analysis

作者: Yang, Sen and Chen, Sen and Fan, Lingling and Xu, Sihan and Hui, Zhanwei and Huang, Song
关键词: compatibility detection, android app, path-sensitive analysis, semantic analysis

Abstract

Android API-related compatibility issues have become a severe problem and significant challenge for app developers due to the well-known Android fragmentation issues. To address this problem, many effective approaches such as app-based and API lifetime-based methods have been proposed to identify incompatible API usages. However, due to the various implementations of API usages and different API invoking paths, existing approaches still suffer from a significant weakness: they introduce a massive number of false positives (FPs) and false negatives (FNs). To this end, in this paper, we propose PSDroid, an automated compatibility detection approach for Android apps, which aims to reduce FPs and FNs by overcoming several technical bottlenecks. Firstly, we make substantial efforts to carry out a preliminary study to summarize a set of novel API usages with diverse checking implementations. Secondly, we construct a refined API lifetime database by leveraging a semantic resolving analysis on all existing Android SDK frameworks. Based on the above two key phases, we design and implement a novel path-sensitive semantic approach to effectively and automatically detect incompatibility issues. To demonstrate its performance, we compared PSDroid with five existing approaches (i.e., FicFinder, ACRYL, CIDER, IctAPIFinder, and CID), and the results show that PSDroid outperforms existing tools. We also conducted an in-depth root cause analysis to comprehensively explain the ability of PSDroid to reduce FPs and FNs. Finally, 18/30 reported issues have been confirmed and further fixed by app developers.

DOI: 10.1109/ICSE48619.2023.00033


OSSFP: Precise and Scalable C/C++ Third-Party Library Detection Using Fingerprinting Functions

作者: Wu, Jiahui and Xu, Zhengzi and Tang, Wei and Zhang, Lyuye and Wu, Yueming and Liu, Chengyue and Sun, Kairan and Zhao, Lida and Liu, Yang
关键词: No keywords

Abstract

Third-party libraries (TPLs) are frequently used in software to boost efficiency by avoiding repeated development. However, the massive use of TPLs also brings security threats, since TPLs may introduce bugs and vulnerabilities. Therefore, software composition analysis (SCA) tools have been proposed to detect and manage TPL usage. Unfortunately, due to the presence of common and trivial functions in the bloated feature dataset, existing tools fail to precisely and rapidly identify TPLs in real-world C/C++ projects. To this end, we propose OSSFP, a novel SCA framework for effective and efficient TPL detection in large-scale real-world projects via generating unique fingerprints for open source software. By removing common and trivial functions and keeping only the core functions to build the fingerprint index for each TPL project, OSSFP significantly reduces the database size and accelerates the detection process. It also improves TPL detection accuracy since noise is excluded from the fingerprints. We applied OSSFP to a large dataset containing 23,427 C/C++ repositories, which included 585,683 versions and 90 billion lines of code. The result showed that it could achieve 90.84% recall and 90.34% precision, which outperformed the state-of-the-art tool by 35.31% and 3.71%, respectively. OSSFP took only 0.12 seconds on average to identify all TPLs per project, which was 22 times faster than the other tool. OSSFP has proven to be highly scalable on large-scale datasets.
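
The core idea of the fingerprint index — drop trivial and widely shared functions, then hash the remaining core functions per TPL — can be sketched as follows. This is an illustrative sketch with made-up thresholds (`common_ratio`, `min_tokens`), not OSSFP's actual feature extraction:

```python
import hashlib
from collections import defaultdict

def build_fingerprints(tpl_functions, common_ratio=0.5, min_tokens=5):
    """tpl_functions maps a TPL name to its function bodies (strings).
    Drop trivial (very short) functions and functions shared by many
    TPLs, then hash the surviving core functions into a fingerprint
    index. Thresholds are invented for illustration."""
    seen_in = defaultdict(set)
    for tpl, funcs in tpl_functions.items():
        for f in funcs:
            seen_in[f].add(tpl)
    n_tpls = len(tpl_functions)
    index = {}
    for tpl, funcs in tpl_functions.items():
        core = [f for f in funcs
                if len(f.split()) >= min_tokens                 # not trivial
                and len(seen_in[f]) / n_tpls <= common_ratio]   # not common
        index[tpl] = {hashlib.sha1(f.encode()).hexdigest() for f in core}
    return index

index = build_fingerprints({
    "libA": ["ret x ;", "alpha beta gamma delta epsilon zeta"],
    "libB": ["alpha beta gamma delta epsilon zeta",
             "one two three four five six"],
})
```

In the toy run, the three-token function is discarded as trivial and the function shared by both libraries as common, so only libB retains a core fingerprint.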

DOI: 10.1109/ICSE48619.2023.00034


SmartMark: Software Watermarking Scheme for Smart Contracts

作者: Kim, Taeyoung and Jang, Yunhee and Lee, Chanjong and Koo, Hyungjoon and Kim, Hyoungshick
关键词: smart contract, software watermarking, blockchain, software copyrights

Abstract

A smart contract is a self-executing program on a blockchain that ensures an immutable and transparent agreement without the involvement of intermediaries. Despite its growing popularity on many blockchain platforms like Ethereum, no technical means is available to protect a smart contract from being copied. One promising direction for claiming software ownership is software watermarking. However, applying existing software watermarking techniques is challenging because of the unique properties of a smart contract, such as a code size constraint, non-free execution cost, and no support for dynamic allocation under a virtual machine environment. This paper introduces a novel software watermarking scheme, dubbed SmartMark, aiming to protect the ownership of a smart contract against piracy. SmartMark builds the control flow graph of a target contract's runtime bytecode, and locates a collection of bytes that are randomly elected to represent a watermark. We implement a full-fledged prototype for Ethereum, applying SmartMark to 27,824 unique smart contract bytecodes. Our empirical results demonstrate that SmartMark can effectively embed a watermark into a smart contract and verify its presence, meeting the requirements of credibility and imperceptibility while incurring an acceptable performance degradation. Besides, our security analysis shows that SmartMark is resilient against viable watermarking corruption attacks; e.g., a large number of dummy opcodes are needed to disable a watermark effectively, making an illegitimate smart contract clone uneconomical.
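
The "randomly elected bytes" step can be illustrated with a keyed PRNG: the same secret key reproduces the same positions, which is what lets the owner later verify the watermark. A minimal sketch of the election idea only; SmartMark itself works over the control flow graph of the runtime bytecode rather than raw byte offsets:

```python
import random

def elect_watermark(bytecode, secret_key, length=16):
    """Elect pseudo-random byte positions from the bytecode with a
    keyed PRNG; the bytes at those positions form the watermark.
    Re-running with the same key reproduces the same watermark."""
    rng = random.Random(secret_key)
    positions = sorted(rng.sample(range(len(bytecode)), length))
    return positions, bytes(bytecode[i] for i in positions)

bytecode = bytes(range(200))  # stand-in for contract runtime bytecode
pos, wm = elect_watermark(bytecode, secret_key=42)
# verification: the owner re-derives the identical watermark from the key
verified = elect_watermark(bytecode, secret_key=42) == (pos, wm)
```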

DOI: 10.1109/ICSE48619.2023.00035


Turn the Rudder: A Beacon of Reentrancy Detection for Smart Contracts on Ethereum

作者: Zheng, Zibin and Zhang, Neng and Su, Jianzhong and Zhong, Zhijie and Ye, Mingxi and Chen, Jiachi
关键词: smart contract, reentrancy, empirical study

Abstract

Smart contracts are programs deployed on a blockchain and are immutable once deployed. Reentrancy, one of the most important vulnerabilities in smart contracts, has caused millions of dollars in financial loss. Many reentrancy detection approaches have been proposed. It is necessary to investigate the performance of these approaches to provide useful guidelines for their application. In this work, we conduct a large-scale empirical study on the capability of five well-known or recent reentrancy detection tools, including Mythril and Sailfish. We collect 230,548 verified smart contracts from Etherscan and use the detection tools to analyze 139,424 contracts after deduplication, which results in 21,212 contracts with reentrancy issues. Then, we manually examine the defective functions located by the tools in the contracts. From the examination results, we obtain 34 true positive contracts with reentrancy and 21,178 false positive contracts without reentrancy. We also analyze the causes of the true and false positives. Finally, we evaluate the tools based on the two kinds of contracts. The results show that more than 99.8% of the reentrant contracts detected by the tools are false positives with eight types of causes, and the tools can only detect the reentrancy issues caused by call.value(), 58.8% of which can be revealed by Ethereum’s official IDE, Remix. Furthermore, we collect real-world reentrancy attacks reported in the past two years and find that the tools fail to find any issues in the corresponding contracts. Based on the findings, existing works on reentrancy detection appear to have very limited capability, and researchers should turn the rudder to discover and detect new reentrancy patterns beyond those related to call.value().

DOI: 10.1109/ICSE48619.2023.00036


BSHUNTER: Detecting and Tracing Defects of Bitcoin Scripts

作者: Zheng, Peilin and Luo, Xiapu and Zheng, Zibin
关键词: bitcoin, blockchain, smart contract

Abstract

Supporting the most popular cryptocurrency, the Bitcoin platform allows its transactions to be programmable via its scripts. Defects in Bitcoin scripts will make users lose their bitcoins. However, there are few studies on the defects of Bitcoin scripts. In this paper, we conduct the first systematic investigation of the defects of Bitcoin scripts through three steps: defect definition, defect detection, and exploitation tracing. First, we define six typical defects of scripts in Bitcoin history, namely unbinded-txid, simple-key, useless-sig, uncertain-sig, impossible-key, and never-true. Three are inspired by the community, and three are new from us. Second, we develop a tool to discover Bitcoin scripts with any of the typical defects, based on symbolic execution and enhanced by historical exact scripts. By analyzing all Bitcoin transactions from Oct. 2009 to Aug. 2022, we find that 383,544 transaction outputs are paid to Bitcoin scripts with defects. Their total amount is 3,115.43 BTC, which is around 60 million dollars at present. Third, in order to trace the exploitation of the defects, we instrument the Bitcoin VM to record the traces of the real-world spending transactions of the buggy scripts. We find that 84,130 output scripts are exploited. The implementation and non-harmful datasets are released.

DOI: 10.1109/ICSE48619.2023.00037


Do I Belong? Modeling Sense of Virtual Community Among Linux Kernel Contributors

作者: Trinkenreich, Bianca and Stol, Klaas-Jan and Sarma, Anita and German, Daniel M. and Gerosa, Marco A. and Steinmacher, Igor
关键词: sense of virtual community, belonging, open source, software developers, survey, PLS-SEM

Abstract

The sense of belonging to a community is a basic human need that impacts an individual’s behavior, long-term engagement, and job satisfaction, as revealed by research in disciplines such as psychology, healthcare, and education. Despite much research on how to retain developers in Open Source Software (OSS) projects and other virtual, peer-production communities, there is a paucity of research investigating what might contribute to a sense of belonging in these communities. To that end, we develop a theoretical model that seeks to understand the link between OSS developer motives and a Sense of Virtual Community (SVC). We test the model with a dataset collected in the Linux Kernel developer community (N=225), using structural equation modeling techniques. Our results for this case study show that intrinsic motivations (social or hedonic motives) are positively associated with a sense of virtual community, but living in an authoritarian country and being paid to contribute can reduce the sense of virtual community. Based on these results, we offer suggestions for open source projects to foster a sense of virtual community, with a view to retaining contributors and improving projects’ sustainability.

DOI: 10.1109/ICSE48619.2023.00038


Comparison and Evaluation of Clone Detection Techniques with Different Code Representations

作者: Wang, Yuekun and Ye, Yuhang and Wu, Yueming and Zhang, Weiwei and Xue, Yinxing and Liu, Yang
关键词: clone detection, empirical study, code representation, large scale

Abstract

As one of the bad smells in code, code clones may increase the cost of software maintenance and the risk of vulnerability propagation. In the past two decades, numerous clone detection technologies have been proposed. They can be divided into text-based, token-based, tree-based, and graph-based approaches according to their code representations. Different code representations abstract the code details from different perspectives. However, it is unclear which code representation is more effective in detecting code clones and how to combine different code representations to achieve ideal performance. In this paper, we present an empirical study to compare the clone detection ability of different code representations. Specifically, we reproduce 12 clone detection algorithms and divide them into different groups according to their code representations. After analyzing the empirical results, we find that token and tree representations can perform better than graph representation when detecting simple code clones. However, when the code complexity of a code pair increases, graph representation becomes more effective. To make our findings more practical, we perform manual analysis on open-source projects to seek a possible distribution of different clone types in the open-source community. Through the results, we observe that most clone pairs belong to simple code clones. Based on this observation, we discard heavyweight graph-based clone detection algorithms and conduct combination experiments to find a suitable combination of token-based and tree-based approaches for achieving scalable and effective code clone detection. We develop the suitable combination into a tool called TACC and evaluate it against other state-of-the-art code clone detectors. Experimental results indicate that TACC performs better and has the ability to detect large-scale code clones.

DOI: 10.1109/ICSE48619.2023.00039


Learning Graph-Based Code Representations for Source-Level Functional Similarity Detection

作者: Liu, Jiahao and Zeng, Jun and Wang, Xiang and Liang, Zhenkai
关键词: No keywords

Abstract

Detecting code functional similarity forms the basis of various software engineering tasks. However, the detection is challenging as functionally similar code fragments can be implemented differently, e.g., with irrelevant syntax. Recent studies incorporate program dependencies as semantics to identify syntactically different yet semantically similar programs, but they often focus only on local neighborhoods (e.g., one-hop dependencies), limiting the expressiveness of program semantics in modeling functionalities. In this paper, we present Tailor, which explicitly exploits deep graph-structured code features for functional similarity detection. Given source-level programs, Tailor first represents them as code property graphs (CPGs) — which combine abstract syntax trees, control flow graphs, and data flow graphs — to collectively reason about program syntax and semantics. Then, Tailor learns representations of CPGs by applying a CPG-based neural network (CPGNN) to iteratively propagate information over them. It improves over prior work on code representation learning through a new graph neural network (GNN) tailored to CPG structures instead of the off-the-shelf GNNs used previously. We systematically evaluate Tailor on C and Java programs using two public benchmarks. Experimental results show that Tailor outperforms the state-of-the-art approaches, achieving 99.8% and 99.9% F-scores in code clone detection and 98.3% accuracy in source code classification.

DOI: 10.1109/ICSE48619.2023.00040


The Smelly Eight: An Empirical Study on the Prevalence of Code Smells in Quantum Computing

作者: Chen, Qihong and Câmara, Rúben
关键词: quantum computing, quantum software engineering, empirical study, quantum-specific code smell

Abstract

Quantum Computing (QC) is a fast-growing field that has spurred the emergence of new programming languages and frameworks. Furthermore, the increased availability of computational resources has also contributed to an influx in the development of quantum programs. Given that classical computing and QC differ significantly due to the intrinsic nature of quantum programs, several aspects of QC (e.g., performance, bugs) have been investigated, and novel approaches have been proposed. However, from a purely quantum perspective, maintenance, one of the major steps in a software development life-cycle, has not yet been considered by researchers. In this paper, we fill this gap and investigate the prevalence of code smells in quantum programs as an indicator of maintenance issues. We defined eight quantum-specific smells and validated them through a survey with 35 quantum developers. Since no tool specifically aims to detect quantum smells, we developed a tool called QSmell that supports the proposed quantum-specific smells. Finally, we conducted an empirical investigation to analyze the prevalence of quantum-specific smells in 15 open-source quantum programs. Our results showed that 11 programs (73.33%) contain at least one smell and, on average, a program has three smells. Furthermore, the long circuit smell is the most prevalent, present in 53.33% of the programs.

DOI: 10.1109/ICSE48619.2023.00041


Reachable Coverage: Estimating Saturation in Fuzzing

作者: Liyanage, Danushka and Böhme, Marcel
关键词: No keywords

Abstract

Reachable coverage is the number of code elements in the search space of a fuzzer (i.e., an automatic software testing tool). A fuzzer cannot find bugs in code that is unreachable. Hence, reachable coverage quantifies fuzzer effectiveness. Using static program analysis, we can compute an upper bound on the number of reachable coverage elements, e.g., by extracting the call graph. However, we cannot decide whether a coverage element is reachable in general. If we could precisely determine reachable coverage efficiently, we would have solved the software verification problem. Unfortunately, we cannot approach a given degree of accuracy for the static approximation, either. In this paper, we advocate a statistical perspective on the approximation of the number of elements in the fuzzer’s search space, where accuracy does improve as a function of the analysis runtime. In applied statistics, corresponding estimators have been developed and well established for more than a quarter century. These estimators hold an exciting promise to finally tackle the long-standing challenge of counting reachability. In this paper, we explore the utility of these estimators in the context of fuzzing. Estimates of reachable coverage can be used to measure (a) the amount of untested code, (b) the effectiveness of the testing technique, and (c) the completeness of the ongoing fuzzing campaign (w.r.t. the asymptotic max. achievable coverage). We make all data and our analysis publicly available.
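
The estimators referenced here come from species-richness estimation in applied statistics; a classic example is Chao1, which extrapolates the total number of elements from the counts of elements observed exactly once (`f1`) and exactly twice (`f2`). A minimal sketch of the general idea, not necessarily the paper's exact estimator:

```python
from collections import Counter

def chao1_estimate(element_hits):
    """Chao1 lower-bound estimate of the total number of (reachable)
    coverage elements, given the elements observed so far."""
    freq = Counter(element_hits)                   # hit count per element
    s_obs = len(freq)                              # distinct elements seen
    f1 = sum(1 for c in freq.values() if c == 1)   # seen exactly once
    f2 = sum(1 for c in freq.values() if c == 2)   # seen exactly twice
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # bias-corrected form for campaigns with no doubletons
    return s_obs + f1 * (f1 - 1) / 2

# toy campaign: 6 distinct branches observed, 3 singletons, 2 doubletons
hits = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
estimate = chao1_estimate(hits)  # 6 + 3*3/(2*2) = 8.25
```

Intuitively, many singletons mean the campaign is still discovering new elements, so the estimated asymptote lies well above the observed count; few singletons indicate saturation.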

DOI: 10.1109/ICSE48619.2023.00042


Learning Seed-Adaptive Mutation Strategies for Greybox Fuzzing

作者: Lee, Myungho and Cha, Sooyoung and Oh, Hakjoo
关键词: No keywords

Abstract

In this paper, we present a technique for learning seed-adaptive mutation strategies for fuzzers. The performance of mutation-based fuzzers highly depends on the mutation strategy that specifies the probability distribution for selecting mutation methods. As a result, developing an effective mutation strategy has received much attention recently, and program-adaptive techniques, which observe the behavior of the target program to learn an optimized mutation strategy per program, have become a trending approach to achieve better performance. They, however, still have a major limitation: they disregard the impact of the differing characteristics of seed inputs, which can lead the fuzzer to explore deeper program locations. To address this limitation, we present SeamFuzz, a novel fuzzing technique that automatically captures the characteristics of individual seed inputs and applies different mutation strategies to different seed inputs. By capturing the syntactic and semantic similarities between seed inputs, SeamFuzz clusters them into proper groups and learns an effective mutation strategy tailored for each seed cluster by using a customized Thompson sampling algorithm. Experimental results show that SeamFuzz improves both the path-discovering and bug-finding abilities of state-of-the-art fuzzers on real-world programs.
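
The abstract does not spell out the customized Thompson sampling; a standard Beta-Bernoulli formulation, maintained per seed cluster, would look roughly like this (the operator names and the binary "found new coverage" reward are illustrative assumptions, not SeamFuzz's actual design):

```python
import random

class ClusterMutationScheduler:
    """Beta-Bernoulli Thompson sampling over mutation operators.
    One instance would be kept per seed cluster."""

    def __init__(self, operators):
        # [alpha, beta] pseudo-counts, starting from a uniform prior
        self.stats = {op: [1, 1] for op in operators}

    def pick(self):
        # sample a success probability per operator, take the argmax
        draws = {op: random.betavariate(a, b)
                 for op, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, op, found_new_coverage):
        a, b = self.stats[op]
        self.stats[op] = [a + 1, b] if found_new_coverage else [a, b + 1]

sched = ClusterMutationScheduler(["bitflip", "havoc", "splice"])
chosen = sched.pick()
sched.update(chosen, found_new_coverage=True)
```

Operators that keep yielding new coverage accumulate pseudo-successes and get sampled more often, while still leaving some probability mass for exploration.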

DOI: 10.1109/ICSE48619.2023.00043


Improving Java Deserialization Gadget Chain Mining via Overriding-Guided Object Generation

作者: Cao, Sicong and Sun, Xiaobing and Wu, Xiaoxue and Bo, Lili and Li, Bin and Wu, Rongxin and Liu, Wei and He, Biao and Ouyang, Yu and Li, Jiajia
关键词: java deserialization vulnerability, gadget chain, method overriding, exploit generation

Abstract

Java (de)serialization is prone to security-critical vulnerabilities in which attackers invoke existing methods (gadgets) on the application’s classpath to construct a gadget chain that performs malicious behaviors. Several techniques have been proposed to statically identify suspicious gadget chains and dynamically generate injection objects for fuzzing. However, due to their incomplete support for dynamic program features (e.g., Java runtime polymorphism) and ineffective injection object generation for fuzzing, the existing techniques are still far from satisfactory. In this paper, we first performed an empirical study to investigate the characteristics of Java deserialization vulnerabilities based on our manually collected 86 publicly known gadget chains. The empirical results show that 1) Java deserialization gadgets are usually exploited by abusing runtime polymorphism, which enables attackers to reuse serializable overridden methods; and 2) attackers usually invoke exploitable overridden methods (gadgets) via dynamic binding to generate injection objects for gadget chain construction. Based on our empirical findings, we propose a novel gadget chain mining approach, GCMiner, which captures both explicit and implicit method calls to identify more gadget chains, and adopts an overriding-guided object generation approach to generate valid injection objects for fuzzing. The evaluation results show that GCMiner significantly outperforms the state-of-the-art techniques, and discovers 56 unique gadget chains that cannot be identified by the baseline approaches.

DOI: 10.1109/ICSE48619.2023.00044


Evaluating and Improving Hybrid Fuzzing

作者: Jiang, Ling and Yuan, Hengchen and Wu, Mingyuan and Zhang, Lingming and Zhang, Yuqun
关键词: No keywords

Abstract

To date, various hybrid fuzzers have been proposed for maximal program vulnerability exposure by integrating the power of fuzzing strategies and concolic executors. While the existing hybrid fuzzers have shown their superiority over conventional coverage-guided fuzzers, they seldom follow equivalent evaluation setups, e.g., benchmarks and seed corpora. Thus, there is a pressing need for a comprehensive study on the existing hybrid fuzzers to provide implications and guidance for future research in this area. To this end, in this paper, we conduct the first extensive study on state-of-the-art hybrid fuzzers. Surprisingly, our study shows that the performance of existing hybrid fuzzers may not generalize well to other experimental settings. Meanwhile, their performance advantages over conventional coverage-guided fuzzers are overall limited. In addition, rather than simply upgrading the fuzzing strategies or concolic executors, updating their coordination modes can have a crucial impact on the performance of hybrid fuzzers. Accordingly, we propose CoFuzz to improve the effectiveness of hybrid fuzzers by upgrading their coordination modes. Specifically, based on the baseline hybrid fuzzer QSYM, CoFuzz adopts edge-oriented scheduling to schedule edges for applying concolic execution via an online linear regression model with Stochastic Gradient Descent. It also adopts sampling-augmenting synchronization to derive seeds for applying fuzzing strategies via the interval path abstraction and John walk as well as incrementally updating the model. Our evaluation results indicate that CoFuzz can significantly increase the edge coverage (e.g., 16.31% higher than the best existing hybrid fuzzer in our study) and expose around 2X more unique crashes than all studied hybrid fuzzers. Moreover, CoFuzz successfully detects 37 previously unknown bugs, of which 30 are confirmed with 8 new CVEs and 20 are fixed.

DOI: 10.1109/ICSE48619.2023.00045


Robustification of Behavioral Designs against Environmental Deviations

作者: Zhang, Changjian and Saluja, Tarang and Meira-Góes, Rômulo
关键词: No keywords

Abstract

Modern software systems are deployed in a highly dynamic, uncertain environment. Ideally, a system that is robust should be capable of establishing its most critical requirements even in the presence of possible deviations in the environment. We propose a technique called behavioral robustification, which involves systematically and rigorously improving the robustness of a design against potential deviations. Given behavioral models of a system and its environment, along with a set of user-specified deviations, our robustification method produces a redesign that is capable of satisfying a desired property even when the environment exhibits those deviations. In particular, we describe how the robustification problem can be formulated as a multi-objective optimization problem, where the goal is to restrict the deviating environment from causing a violation of a desired property, while maximizing the amount of existing functionality and minimizing the cost of changes to the original design. We demonstrate the effectiveness of our approach on case studies involving the robustness of an electronic voting machine and safety-critical interfaces.

DOI: 10.1109/ICSE48619.2023.00046


A Qualitative Study on the Implementation Design Decisions of Developers

作者: Liang, Jenny T. and Arab, Maryam and Ko, Minhyuk and Ko, Amy J. and LaToza, Thomas D.
关键词: implementation design decisions, software design

Abstract

Decision-making is a key software engineering skill. Developers constantly make choices throughout the software development process, from requirements to implementation. While prior work has studied developer decision-making, the choices made when selecting which solution to implement in code remain understudied. In this mixed-methods study, we examine the phenomenon where developers select one specific way to implement a behavior in code, given many potential alternatives. We call these decisions implementation design decisions. Our mixed-methods study includes 46 survey responses and 14 semi-structured interviews with professional developers about their decision types, considerations, processes, and expertise for implementation design decisions. We find that implementation design decisions, rather than being a natural outcome from higher levels of design, require constant monitoring of higher-level design choices, such as requirements and architecture. We also show that developers have a consistent general structure to their implementation decision-making process, but no single process is exactly the same. We discuss the implications of our findings on research, education, and practice, including insights on teaching developers how to make implementation design decisions.

DOI: 10.1109/ICSE48619.2023.00047


BFTDETECTOR: Automatic Detection of Business Flow Tampering for Digital Content Service

作者: Kim, I Luk and Wang, Weihang and Kwon, Yonghwi and Zhang, Xiangyu
关键词: JavaScript, business flow tampering, dynamic analysis, vulnerability detection

Abstract

Digital content services provide users with a wide range of content, such as news, articles, or movies, while monetizing their content through various business models and promotional methods. Unfortunately, poorly designed or unprotected business logic can be circumvented by malicious users, which is known as business flow tampering. Such flaws can severely harm the businesses of digital content service providers. In this paper, we propose an automated approach that discovers business flow tampering flaws. Our technique automatically runs a web service to cover different business flows (e.g., a news website with vs. without a subscription paywall) to collect execution traces. We perform differential analysis on the execution traces to identify divergence points that determine how the business flow begins to differ, and then we test to see if the divergence points can be tampered with. We assess our approach against 352 real-world digital content service providers and discover 315 flaws from 204 websites, including TIME, Fortune, and Forbes. Our evaluation result shows that our technique successfully identifies these flaws with low false-positive and false-negative rates of 0.49% and 1.44%, respectively.
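The differential step at the heart of this approach, aligning two execution traces and locating the point where business flows begin to differ, can be sketched in a few lines. The trace format and event names below are illustrative assumptions, not taken from BFTDetector:

```python
def find_divergence_point(trace_a, trace_b):
    """Return the index of the first event where two execution traces
    differ, or None if one trace is a prefix of the other."""
    for i, (ea, eb) in enumerate(zip(trace_a, trace_b)):
        if ea != eb:
            return i
    return None

# Two runs of the same article page: without and with a subscription
# (hypothetical event names for illustration).
paywalled = ["load", "checkSubscription", "blurContent", "showPaywall"]
subscribed = ["load", "checkSubscription", "renderArticle"]
divergence = find_divergence_point(paywalled, subscribed)
```

The event at the divergence index (here, the paywall branch) is the candidate the tool would then test for tamperability.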

DOI: 10.1109/ICSE48619.2023.00048


FedSlice: Protecting Federated Learning Models from Malicious Participants with Model Slicing

作者: Zhang, Ziqi and Li, Yuanchun and Liu, Bingyan and Cai, Yifeng and Li, Ding and Guo, Yao and Chen, Xiangqun
关键词: No keywords

Abstract

Crowdsourcing Federated learning (CFL) is a new crowdsourcing development paradigm for Deep Neural Network (DNN) models, also called “software 2.0”. In practice, the privacy of CFL can be compromised by many attacks, such as free-rider attacks, adversarial attacks, gradient leakage attacks, and inference attacks. Conventional defensive techniques have low efficiency because they deploy heavy encryption techniques or rely on Trusted Execution Environments (TEEs). To improve the efficiency of protecting CFL from these attacks, this paper proposes FedSlice to prevent malicious participants from getting the whole server-side model while keeping the performance goal of CFL. FedSlice breaks the server-side model into several slices and delivers one slice to each participant. Thus, a malicious participant can only get a subset of the server-side model, preventing them from conducting effective attacks. We evaluate FedSlice against these attacks, and results show that FedSlice provides effective defense: the server-side model leakage is reduced from 100% to 43.45%, the success rate of adversarial attacks is reduced from 100% to 11.66%, the average accuracy of membership inference is reduced from 71.91% to 51.58%, and the data leakage from shared gradients is reduced to the level of random guesses. Besides, FedSlice only introduces less than 2% accuracy loss and about 14% computation overhead. To the best of our knowledge, this is the first paper to discuss defense methods against these attacks on the CFL framework.
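The slicing idea, where each participant receives only a disjoint part of the server-side model, can be illustrated with a toy round-robin partition. This is a sketch of the concept only; FedSlice's actual partitioning scheme is not described at this level of detail in the abstract:

```python
def model_slice(params, n_slices, participant_id):
    """Assign parameter indices to slices round-robin and return only
    the slice visible to the given participant (index -> value)."""
    return {i: p for i, p in enumerate(params) if i % n_slices == participant_id}

# A tiny "model" of six parameters, split into three slices.
server_model = [0.1, -0.3, 0.7, 0.2, -0.5, 0.9]
visible = model_slice(server_model, n_slices=3, participant_id=1)
# Participant 1 holds only 2 of the 6 parameters; the rest never leaves
# the server, limiting what a malicious participant can reconstruct.
```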

DOI: 10.1109/ICSE48619.2023.00049


PTPDroid: Detecting Violated User Privacy Disclosures to Third-Parties of Android Apps

作者: Tan, Zeya and Song, Wei
关键词: android app, privacy policy, third-party entities, violation detection, taint analysis, empirical study

Abstract

Android apps frequently access personal information to provide customized services. Since such information is sensitive in general, regulators require Android app vendors to publish privacy policies that describe what information is collected and why it is collected. Existing work mainly focuses on the types of the collected data but seldom considers the entities that collect user privacy, which could falsely classify problematic declarations about user privacy collected by third-parties as clear disclosures. To address this problem, we propose PTPDroid, a flow-to-policy consistency checking approach and an automated tool, to comprehensively uncover violated disclosures to third-parties in the privacy policy. Our experiments on real-world apps demonstrate the effectiveness and superiority of PTPDroid, and our empirical study on 1,000 popular real-world apps reveals that violated user privacy disclosures to third-parties are prevalent in practice.

DOI: 10.1109/ICSE48619.2023.00050


AdHere: Automated Detection and Repair of Intrusive Ads

作者: Yan, Yutian and Zheng, Yunhui and Liu, Xinyue and Medvidovic, Nenad and Wang, Weihang
关键词: ad experience, advertising practice, better ads standards

Abstract

Today, more than 3 million websites rely on online advertising revenue. Despite the monetary incentives, ads often frustrate users by disrupting their experience, interrupting content, and slowing browsing. To improve ad experiences, leading media associations define Better Ads Standards for ads that fall below user expectations. However, little is known about how well websites comply with these standards and whether existing approaches are sufficient for developers to quickly resolve such issues. In this paper, we propose AdHere, a technique that can detect intrusive ads that do not comply with Better Ads Standards and suggest repair proposals. AdHere works by first parsing the initial web page to a DOM tree to search for potential static ads, and then using mutation observers to monitor and detect intrusive (dynamic/static) ads on the fly. To handle ads’ volatile nature, AdHere includes two detection algorithms for desktop and mobile ads to identify different ad violations during three phases of page load events. It recursively applies the detection algorithms to resolve nested layers of DOM elements inserted by ad delegations. We evaluate AdHere on Alexa Top 1 Million Websites. The results show that AdHere is effective in detecting violating ads and suggesting repair proposals. Compared to the currently available alternative, AdHere detected intrusive ads on 4,656 more mobile websites and 3,911 more desktop websites, and improved recall by 16.6% and accuracy by 4.2%.

DOI: 10.1109/ICSE48619.2023.00051


Bad Snakes: Understanding and Improving Python Package Index Malware Scanning

作者: Vu, Duc-Ly and Newman, Zachary and Meyers, John Speed
关键词: open-source software (OSS) supply chain, malware detection, PyPI, qualitative study, quantitative study

Abstract

Open-source, community-driven package repositories see thousands of malware packages each year, but do not currently run automated malware detection systems. In this work, we explore the security goals of the repository administrators and the requirements for deploying such malware scanners via a case study of the Python ecosystem and PyPI repository, including interviews with administrators and maintainers. Further, we evaluate existing malware detection techniques for deployment in this setting by creating a benchmark dataset and comparing several existing tools: the malware checks implemented in PyPI, Bandit4Mal, and OSSGadget’s OSS Detect Backdoor. We find that repository administrators have exacting requirements for such malware detection tools. Specifically, they consider a false positive rate of even 0.1% to be unacceptably high, given the large number of package releases that might trigger false alerts. Measured tools have false positive rates between 15% and 97%; increasing thresholds for detection rules to reduce this rate renders the true positive rate useless. While automated tools are far from reaching these demands, we find that a socio-technical malware detection system has emerged to meet these needs: external security researchers perform repository malware scans, filter for useful results, and report the results to repository administrators. These parties face different incentives and constraints on their time and tooling. We conclude with recommendations for improving detection capabilities and strengthening the collaboration between security researchers and software repository administrators.

DOI: 10.1109/ICSE48619.2023.00052


FedDebug: Systematic Debugging for Federated Learning Applications

作者: Gill, Waris and Anwar, Ali and Gulzar, Muhammad Ali
关键词: software debugging, federated learning, testing, client, fault localization, neural networks, CNN

Abstract

In Federated Learning (FL), clients independently train local models and share them with a central aggregator to build a global model. The impermissibility of accessing clients’ data and the collaborative training make FL appealing for applications with data-privacy concerns, such as medical imaging. However, these FL characteristics pose unprecedented challenges for debugging. When a global model’s performance deteriorates, identifying the responsible rounds and clients is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the global model’s accuracy or let future FL rounds retune the model, both of which are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances FL debugging on two novel fronts. First, FedDebug enables interactive debugging of realtime collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FedDebug’s breakpoint can help inspect an FL state (round, client, and global model) and move between rounds and clients’ models seamlessly, enabling a fine-grained step-by-step inspection. Second, FedDebug automatically identifies the client(s) responsible for lowering the global model’s performance without any testing data and labels—both are essential for existing debugging techniques. FedDebug’s strengths come from adapting differential testing in conjunction with neuron activations to determine the client(s) deviating from normal behavior. FedDebug achieves 100% accuracy in finding a single faulty client and 90.3% accuracy in finding multiple faulty clients. FedDebug’s interactive debugging incurs 1.2% overhead during training, while it localizes a faulty client in only 2.1% of a round’s training time. With FedDebug, we bring effective debugging practices to federated learning, improving the quality and productivity of FL application developers.
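The differential-testing intuition behind the fault localization step can be approximated by flagging the client whose neuron-activation profile deviates most from the rest. This is a deliberately simplified sketch; FedDebug's actual analysis over activations is more involved, and the client names and numbers below are made up:

```python
import statistics

def most_deviating_client(activation_profiles):
    """Given per-client lists of mean neuron activations, return the
    client whose average activation deviates most from the median
    across clients -- a crude stand-in for differential testing."""
    means = {c: statistics.fmean(a) for c, a in activation_profiles.items()}
    med = statistics.median(means.values())
    return max(means, key=lambda c: abs(means[c] - med))

# Hypothetical activation summaries: c3 behaves very differently.
profiles = {
    "c1": [0.20, 0.30, 0.25],
    "c2": [0.22, 0.28, 0.30],
    "c3": [0.90, 0.95, 0.85],
}
suspect = most_deviating_client(profiles)
```

Note that this oracle needs no test labels: it only compares clients against each other, which mirrors the paper's claim of working without testing data.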

DOI: 10.1109/ICSE48619.2023.00053


Practical and Efficient Model Extraction of Sentiment Analysis APIs

作者: Wu, Weibin and Zhang, Jianping and Wei, Victor Junqiu and Chen, Xixian and Zheng, Zibin and King, Irwin and Lyu, Michael R.
关键词: model extraction, sentiment analysis APIS, active learning, architecture search

Abstract

Despite the stunning performance of deep learning models, developing them from scratch is a formidable task. This has popularized Machine-Learning-as-a-Service (MLaaS), where general users can access the trained models of MLaaS providers via Application Programming Interfaces (APIs) on a pay-per-query basis. Unfortunately, the success of MLaaS is under threat from model extraction attacks, where attackers intend to extract a local model of equivalent functionality to the target MLaaS model. However, existing studies on model extraction of text analytics APIs frequently assume adversaries have strong knowledge about the victim model, like its architecture and parameters, which hardly holds in practice. Besides, since the attacker’s and the victim’s training data can be considerably discrepant, it is non-trivial to perform efficient model extraction. In this paper, to advance the understanding of such attacks, we propose a framework, PEEP, for practical and efficient model extraction of sentiment analysis APIs with only query access. Specifically, PEEP features a learning-based scheme, which employs out-of-domain public corpora and a novel query strategy to construct proxy training data for model extraction. Besides, PEEP introduces a greedy search algorithm to settle an appropriate architecture for the extracted model. We conducted extensive experiments with two victim models across three datasets and two real-life commercial sentiment analysis APIs. Experimental results corroborate that PEEP can consistently outperform the state-of-the-art baselines in terms of effectiveness and efficiency.

DOI: 10.1109/ICSE48619.2023.00054


CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models

作者: Niu, Changan and Li, Chuanyi and Ng, Vincent and Luo, Bin
关键词: pre-training of source code, cross-task transfer learning, few-shot learning, AI for SE

Abstract

Despite recent advances showing that a model pre-trained on large-scale source code data is able to gain appreciable generalization capability, it still requires a sizeable amount of data on the target task for fine-tuning. Moreover, the effectiveness of the model generalization is largely affected by the size and quality of the fine-tuning data, which is detrimental for target tasks with limited or unavailable resources. Therefore, cross-task generalization, with the goal of improving the generalization of the model to unseen tasks, is of strong research and application value. In this paper, we propose a large-scale benchmark that includes 216 existing code-related tasks. We annotate each task with the corresponding meta information, such as task description and instruction, which contains detailed information about the task and a solution guide. This also helps us easily create a wide variety of “training/evaluation” task splits to evaluate the various cross-task generalization capabilities of the model. We then perform preliminary experiments to demonstrate that the cross-task generalization of models can be largely improved by in-context learning methods such as few-shot learning and learning from task instructions, which shows the promising prospects of conducting cross-task learning research on our benchmark. We hope that the collection of the datasets and our benchmark will facilitate future work that is not limited to cross-task generalization.

DOI: 10.1109/ICSE48619.2023.00055


ECSTATIC: An Extensible Framework for Testing and Debugging Configurable Static Analysis

作者: Mordahl, Austin and Zhang, Zenong and Soles, Dakota and Wei, Shiyi
关键词: program analysis, testing and debugging

Abstract

Testing and debugging the implementation of static analysis is a challenging task, often involving significant manual effort from domain experts in a tedious and unprincipled process. In this work, we propose an approach that greatly improves the automation of this process for static analyzers with configuration options. At the core of our approach is the novel adaptation of the theoretical partial order relations that exist between these options to reason about the correctness of actual results from running the static analyzer with different configurations. This allows for automated testing of static analyzers with clearly defined oracles, followed by automated delta debugging, even in cases where ground truths are not defined over the input programs. To apply this approach to many static analysis tools, we design and implement ECSTATIC, an easy-to-extend, open-source framework. We have integrated four popular static analysis tools, SOOT, WALA, DOOP, and FlowDroid, into ECSTATIC. Our evaluation shows that ECSTATIC detects 74 partial order bugs in the four tools and produces reduced bug-inducing programs to assist debugging. We reported 42 bugs; in all cases where we received responses, the tool developers confirmed the reported tool behavior was unintended. So far, three bugs have been fixed and there are ongoing discussions to fix more.
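The partial-order oracle can be stated compactly: if configuration A is documented to be at least as sound as configuration B, then A's findings must subsume B's, and any finding reported only under B signals a potential analyzer bug. A minimal sketch of that check, with illustrative finding names (the exact oracle details are ECSTATIC's, not shown here):

```python
def partial_order_violations(findings_more_sound, findings_less_sound):
    """Given result sets from two configurations where the first is
    documented as at least as sound as the second, return findings
    that violate the expected superset relation (empty = oracle holds)."""
    return set(findings_less_sound) - set(findings_more_sound)

# Oracle holds: the more sound configuration reports a superset.
ok = partial_order_violations({"flow1", "flow2", "flow3"}, {"flow1", "flow2"})

# Oracle violated: "flow2" appears only under the less sound config,
# so the analyzer's behavior contradicts its documented partial order.
suspicious = partial_order_violations({"flow1"}, {"flow1", "flow2"})
```

Because the oracle compares two runs of the same tool, no ground truth over the input programs is needed, which matches the paper's framing.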

DOI: 10.1109/ICSE48619.2023.00056


Responsibility in Context: On Applicability of Slicing in Semantic Regression Analysis

作者: Badihi, Sahar and Ahmed, Khaled and Li, Yi and Rubin, Julia
关键词: program slicing, slice minimization, regression failures, case study

Abstract

Numerous program slicing approaches aim to help developers troubleshoot regression failures - one of the most time-consuming development tasks. The main idea behind these approaches is to identify a subset of interdependent program statements relevant to the failure, minimizing the amount of code developers need to inspect. Accuracy and reduction rate achieved by slicing are the key considerations toward their applicability in practice: inspecting only the statements in a slice should be faster and more efficient than inspecting the code in full. In this paper, we report on our experiment applying one of the most recent and accurate slicing approaches, dual slicing, to the task of troubleshooting regression failures. As subjects, we use projects from the popular Defects4J benchmark and a systematically-collected set of eight large, open-source client-library project pairs with at least one library upgrade failure, which we refer to as LibRench. The results of our experiments show that the produced slices, while effective in reducing the scope of manual inspection, are still too large to be comfortably analyzed by a human. When inspecting these slices, we observe that most statements in a slice deal with the propagation of information between changed code blocks; these statements are essential for obtaining the necessary context for the changes but are not responsible for the failure directly. Motivated by this insight, we propose a novel approach, implemented in a tool named InPreSS, for further reducing the size of a slice by accurately identifying and summarizing the propagation-related code blocks. Our evaluation of InPreSS shows that it is able to produce slices that are 76% shorter than the original ones (207 vs. 2,007 execution statements, on average), thus reducing the amount of information developers need to inspect without losing the necessary contextual information.

DOI: 10.1109/ICSE48619.2023.00057


Does the Stream API Benefit from Special Debugging Facilities? A Controlled Experiment on Loops and Streams with Specific Debuggers

作者: Reichl, Jan and Hanenberg, Stefan and Gruhn, Volker
关键词: software engineering, programming techniques, debugging aids, usability testing

Abstract

Java’s Stream API, which makes massive use of lambda expressions, permits a more declarative way of defining operations on collections in comparison to traditional loops. While experimental results suggest that the use of the Stream API has measurable benefits with respect to code readability (in comparison to loops), a remaining question is whether it has other implications. And one of such implications is, for example, tooling in general and debugging in particular because of the following: While the traditional loop-based approach applies filters one after another to single elements, the Stream API applies filters on whole collections. In the meantime there are dedicated debuggers for the Stream API, but it remains unclear whether such a debugger (on the Stream API) has a measurable benefit in comparison to the traditional stepwise debugger (on loops). The present paper introduces a controlled experiment on the debugging of filter operations using a stepwise debugger versus a stream debugger. The results indicate that under the experiment’s settings the stream debugger has a significant (p<.001) and large, positive effect [EQUATION]. However, the experiment reveals that additional factors interact with the debugger treatment such as whether or not the failing object is known upfront. The mentioned factor has a strong and large disordinal interaction effect with the debugger (p<.001; η²p = .928): In case an object is known upfront that can be used to identify a failing filter, the stream debugger is even less efficient than the stepwise debugger [EQUATION]. Hence, while we found an overall positive effect of the stream debugger, the question of whether debugging is easier on loops or streams cannot be answered without taking the other variables into account. Consequently, we see a contribution of the present paper not only in the comparison of different debuggers but also in the identification of additional factors.

DOI: 10.1109/ICSE48619.2023.00058


Fonte: Finding Bug Inducing Commits from Failures

作者: An, Gabin and Hong, Jingun and Kim, Naryeong and Yoo, Shin
关键词: bug inducing commit, fault localisation, git, weighted bisection, batch testing

Abstract

A Bug Inducing Commit (BIC) is a commit that introduces a software bug into the codebase. Knowing the relevant BIC for a given bug can provide valuable information for debugging as well as bug triaging. However, existing BIC identification techniques are either too expensive (because they require the failing tests to be executed against previous versions for bisection) or inapplicable at the debugging time (because they require post hoc artefacts such as bug reports or bug fixes). We propose Fonte, an efficient and accurate BIC identification technique that only requires test coverage. Fonte combines Fault Localisation (FL) with BIC identification and ranks commits based on the suspiciousness of the code elements that they modified. Fonte reduces the search space of BICs using failure coverage as well as a filter that detects commits that are merely style changes. Our empirical evaluation using 130 real-world BICs shows that Fonte significantly outperforms state-of-the-art BIC identification techniques based on Information Retrieval as well as neural code embedding models, achieving at least 39% higher MRR. We also report that the ranking scores produced by Fonte can be used to perform weighted bisection, further reducing the cost of BIC identification. Finally, we apply Fonte to a large-scale industry project with over 10M lines of code, and show that it can rank the actual BIC within the top five commits for 87% of the studied real batch-testing failures, and save the BIC inspection cost by 32% on average.
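The combination of fault localisation and commit ranking can be sketched with the standard Ochiai suspiciousness metric aggregated per commit. Whether Fonte uses exactly this formula and this aggregation is not stated in the abstract; the code below is an illustrative approximation with made-up element and commit names:

```python
import math

def ochiai(ef, nf, ep):
    """Ochiai suspiciousness of a code element.
    ef/nf: failing tests that do / do not cover the element,
    ep: passing tests that cover it."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def rank_commits(element_scores, commit_of):
    """Score each commit by the summed suspiciousness of the code
    elements it last modified; return commits, most suspicious first."""
    scores = {}
    for elem, s in element_scores.items():
        c = commit_of[elem]
        scores[c] = scores.get(c, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)

# Toy coverage data: element "a" is covered by both failing tests.
elems = {"a": ochiai(2, 0, 0), "b": ochiai(1, 1, 3), "c": ochiai(0, 2, 5)}
commit_of = {"a": "c42", "b": "c17", "c": "c17"}
ranking = rank_commits(elems, commit_of)
```

The resulting ranking is what could then seed a weighted bisection, spending verification effort on the highest-scored commits first.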

DOI: 10.1109/ICSE48619.2023.00059


RepresentThemAll: A Universal Learning Representation of Bug Reports

作者: Fang, Sen and Zhang, Tao and Tan, Youshuai and Jiang, He and Xia, Xin and Sun, Xiaobing
关键词: No keywords

Abstract

Deep learning techniques have shown promising performance in automated software maintenance tasks associated with bug reports. Currently, all existing studies learn the customized representation of bug reports for a specific downstream task. Despite early success, training multiple models for multiple downstream tasks faces three issues: complexity, cost, and compatibility, due to the customization, disparity, and uniqueness of these automated approaches. To resolve the above challenges, we propose RepresentThemAll, a pre-trained approach that can learn the universal representation of bug reports and handle multiple downstream tasks. Specifically, RepresentThemAll is a universal bug report framework that is pre-trained with two carefully designed learning objectives: one is the dynamic masked language model and the other is a contrastive learning objective, “find yourself”. We evaluate the performance of RepresentThemAll on four downstream tasks, including duplicate bug report detection, bug report summarization, bug priority prediction, and bug severity prediction. Our experimental results show that RepresentThemAll outperforms all baseline approaches on all considered downstream tasks after well-designed fine-tuning.

DOI: 10.1109/ICSE48619.2023.00060


Demystifying Exploitable Bugs in Smart Contracts

作者: Zhang, Zhuo and Zhang, Brian and Xu, Wen and Lin, Zhiqiang
关键词: blockchain, smart contract, vulnerability, empirical study

Abstract

Exploitable bugs in smart contracts have caused significant monetary loss. Despite the substantial advances in smart contract bug finding, exploitable bugs and real-world attacks are still trending. In this paper, we systematically investigate 516 unique real-world smart contract vulnerabilities in years 2021–2022, and study how many can be exploited by malicious users and cannot be detected by existing analysis tools. We further categorize the bugs that cannot be detected by existing tools into seven types and study their root causes, distributions, difficulties to audit, consequences, and repair strategies. For each type, we abstract them to a bug model (if possible), facilitating finding similar bugs in other contracts and future automation. We leverage the findings in auditing real world smart contracts, and so far we have been rewarded with $102,660 bug bounties for identifying 15 critical zero-day exploitable bugs, which could have caused up to $22.52 million in monetary loss if exploited.

DOI: 10.1109/ICSE48619.2023.00061


Understanding and Detecting On-the-Fly Configuration Bugs

作者: Wang, Teng and Jia, Zhouyang and Li, Shanshan and Zheng, Si and Yu, Yue and Xu, Erci and Peng, Shaoliang and Liao, Xiangke
关键词: on-the-fly configuration updates, bug detection, metamorphic testing

Abstract

Software systems introduce an increasing number of configuration options to provide flexibility, and support updating the options on the fly to provide persistent services. This mechanism, however, may affect system reliability, leading to unexpected results like software crashes or functional errors. We refer to the bugs caused by on-the-fly configuration updates as on-the-fly configuration bugs, or OCBugs for short. In this paper, we conducted the first in-depth study on 75 real-world OCBugs from 5 widely used systems to understand the symptoms, root causes, and triggering conditions of OCBugs. Based on our study, we designed and implemented Parachute, an automated testing framework to detect OCBugs. Our key insight is that the value of a configuration option, whether loaded at the startup phase or updated on the fly, should have the same effects on the target program. Parachute generates tests for on-the-fly configuration updates by mutating the existing tests and conducts differential analysis to identify OCBugs. We evaluated Parachute on 7 real-world software systems. The results show that Parachute detected 75% (42/56) of the known OCBugs, and reported 13 unknown bugs, 11 of which have been confirmed or fixed by developers as of the time of writing.
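The key insight is a metamorphic relation: for any option value, the startup-load path and the on-the-fly-update path must leave the system in the same observable state. The toy `Server` class below is a made-up example to illustrate that oracle, not code from Parachute:

```python
class Server:
    """Toy configurable system: 'workers' can be set at startup
    or updated on the fly."""
    def __init__(self, workers=1):
        self.workers = workers

    def update_workers(self, workers):
        # On-the-fly reconfiguration path; a buggy implementation
        # might forget to propagate the new value everywhere.
        self.workers = workers

def behavior_startup(value):
    """Observable state when the option is loaded at startup."""
    return Server(workers=value).workers

def behavior_on_the_fly(value):
    """Observable state when the option is updated at runtime."""
    s = Server()
    s.update_workers(value)
    return s.workers

def is_ocbug(value):
    """Metamorphic oracle: any mismatch between the two paths
    signals an on-the-fly configuration bug (OCBug)."""
    return behavior_startup(value) != behavior_on_the_fly(value)
```

Because the oracle compares the program against itself, no specification of the "correct" option effect is needed, which is what makes the differential analysis automatable.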

DOI: 10.1109/ICSE48619.2023.00062


Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation

作者: Mahbub, Parvez and Shuvo, Ohiduzzaman and Rahman, Mohammad Masudur
关键词: software bug, bug explanation, software engineering, software maintenance, natural language processing, deep learning, transformers

Abstract

Software bugs claim ≈ 50% of development time and cost the global economy billions of dollars. Once a bug is reported, the assigned developer attempts to identify and understand the source code responsible for the bug and then corrects the code. Over the last five decades, there has been significant research on automatically finding or correcting software bugs. However, there has been little research on automatically explaining the bugs to the developers, which is essential but a highly challenging task. In this paper, we propose Bugsplainer, a transformer-based generative model that generates natural language explanations for software bugs by learning from a large corpus of bug-fix commits. Bugsplainer can leverage structural information and buggy patterns from the source code to generate an explanation for a bug. Our evaluation using three performance metrics shows that Bugsplainer can generate understandable and good explanations according to Google’s standard, and can outperform multiple baselines from the literature. We also conduct a developer study involving 20 participants where the explanations from Bugsplainer were found to be more accurate, more precise, more concise, and more useful than the baselines.

DOI: 10.1109/ICSE48619.2023.00063


Is it Enough to Recommend Tasks to Newcomers? Understanding Mentoring on Good First Issues

作者: Tan, Xin and Chen, Yiran and Wu, Haohua and Zhou, Minghui and Zhang, Li
关键词: newcomer, mentoring, open source, good first issue

Abstract

Newcomers are critical for the success and continuity of open source software (OSS) projects. To attract newcomers and facilitate their onboarding, many OSS projects recommend tasks for newcomers, such as good first issues (GFIs). Previous studies have preliminarily investigated the effects of GFIs and techniques to identify suitable GFIs. However, it is still unclear whether just recommending tasks is enough and how significant mentoring is for newcomers. To better understand mentoring in OSS communities, we analyze the resolution process of 48,402 GFIs from 964 repositories through a mixed-methods approach. We investigate the extent, the mentorship structures, the discussed topics, and the relevance of expert involvement. We find that ~70% of GFIs have expert participation, with each GFI usually having one expert who makes two comments. Half of GFIs receive their first expert comment within 8.5 hours of a newcomer comment. Through analysis of the collaboration networks of newcomers and experts, we observe that community mentorship presents four types of structure: centralized mentoring, decentralized mentoring, collaborative mentoring, and distributed mentoring. As for discussed topics, we identify 14 newcomer challenges and 18 kinds of expert mentoring content. By fitting generalized linear models, we find that expert involvement positively correlates with newcomers’ successful contributions but negatively correlates with newcomers’ retention. Our study demonstrates the status and significance of mentoring in OSS projects, which provides rich practical implications for optimizing the mentoring process and helping newcomers contribute smoothly and successfully.

DOI: 10.1109/ICSE48619.2023.00064


From Organizations to Individuals: Psychoactive Substance Use by Professional Programmers

作者: Newman, Kaia and Endres, Madeline and Weimer, Westley and Johnson, Brittany
关键词: software engineering, mental health, drug use, productivity, qualitative methods

Abstract

Psychoactive substances, which influence the brain to alter perceptions and moods, have the potential to have positive and negative effects on critical software engineering tasks. They are widely used in software, but that use is not well understood. We present the results of the first qualitative investigation of the experiences of, and challenges faced by, psychoactive substance users in professional software communities. We conduct a thematic analysis of hour-long interviews with 26 professional programmers who use psychoactive substances at work. Our results provide insight into individual motivations and impacts, including mental health and the relationships between various substances and productivity. Our findings elaborate on socialization effects, including soft skills, stigma, and remote work. The analysis also highlights implications for organizational policy, including positive and negative impacts on recruitment and retention. By exploring individual usage motivations, social and cultural ramifications, and organizational policy, we demonstrate how substance use can permeate all levels of software development.

DOI: 10.1109/ICSE48619.2023.00065


On the Self-Governance and Episodic Changes in Apache Incubator Projects: An Empirical Study

作者: Yin, Likang and Zhang, Xiyu and Filkov, Vladimir
关键词: No keywords

Abstract

Sustainable Open Source Software (OSS) projects are characterized by the ability to attract new project members and maintain an energetic project community. Building sustainable OSS projects from a nascent state requires effective project governance and socio-technical structure to be interleaved, in a complex and dynamic process. Although individual disciplines have studied each separately, little is known about how governance and software development work together in practice toward sustainability. Prior work has shown that many OSS projects experience large, episodic changes over short periods of time, which can propel them or drag them down. However, sustainable projects typically manage to come out unscathed from such changes, while others do not. The natural questions arise: Can we identify the back-and-forth between governance and socio-technical structure that leads to sustainability following episodic events? And what about those that do not lead to sustainability? From a data set of social, technical, and policy digital traces from 262 sustainability-labeled ASF incubator projects, here we employ a large-scale empirical study to characterize episodic changes in socio-technical aspects measured by Change Intervals (CI), governance rules and regulations in the form of Institutional Statements (IS), and the temporal relationships between them. We find that sustainable projects can adapt themselves to institutional statements more efficiently during episodic changes, and that institutional discussions can lead to episodic change intervals in socio-technical aspects of the projects, and vice versa. In practice, these results can provide timely guidance beyond socio-technical considerations, adding rules and regulations to the mix, toward a unified analytical framework for OSS project sustainability.

DOI: 10.1109/ICSE48619.2023.00066


Socio-Technical Anti-Patterns in Building ML-Enabled Software: Insights from Leaders on the Forefront

作者: Mailach, Alina and Siegmund, Norbert
关键词: No keywords

Abstract

Although machine learning (ML)-enabled software systems seem to be a success story considering their rise in economic power, there are consistent reports from companies and practitioners struggling to bring ML models into production. Many papers have focused on specific, purely technical aspects, such as testing and pipelines, but only a few on socio-technical aspects. Driven by numerous anecdotes and reports from practitioners, our goal is to collect and analyze socio-technical challenges of productionizing ML models centered around and within teams. To this end, we conducted the largest qualitative empirical study in this area, involving the manual analysis of 66 hours of talks that have been recorded by the MLOps community. By analyzing talks from practitioners for practitioners of a community with over 11,000 members in its Slack workspace, we found 17 anti-patterns, often rooted in organizational or management problems. We further list recommendations to overcome these problems, ranging from technical solutions over guidelines to organizational restructuring. Finally, we contextualize our findings with previous research, confirming existing results, validating our own, and highlighting new insights.

DOI: 10.1109/ICSE48619.2023.00067


Moving on from the Software Engineers’ Gambit: An Approach to Support the Defense of Software Effort Estimates

作者: Matsubara, Patricia G. F. and Steinmacher, Igor and Gadelha, Bruno and Conte, Tayana
关键词: software effort estimation, negotiation, behavioral software engineering, defense of estimates

Abstract

Pressure for higher productivity and faster delivery is increasingly pervading software organizations. This can lead software engineers to act like chess players playing a gambit—making sacrifices of their technically sound estimates, thus submitting their teams to time pressure. In turn, time pressure can have varied detrimental effects, such as poor product quality and emotional distress, decreasing productivity, which leads to more time pressure and delays: a hard-to-stop vicious cycle. This reveals a need for moving on from the more passive strategy of yielding to pressure to a more active one of defending software estimates. Therefore, we propose an approach to support software estimators in acquiring knowledge on how to carry out such a defense, by introducing negotiation principles encapsulated in a set of defense lenses, presented through a digital simulation. We evaluated the proposed approach through a controlled experiment with software practitioners from different companies. We collected data on participants’ attitudes, subjective norms, perceived behavioral control, and intentions to perform the defense of their estimates in light of the Theory of Planned Behavior. We employed a frequentist and a Bayesian approach to data analysis. Results show improved scores among experimental group participants after engaging with the digital simulation and learning about the lenses. They were also more inclined to choose a defense action when facing pressure scenarios than a control group exposed to questions to reflect on the reasons and outcomes of pressure over estimates. Qualitative evidence reveals that practitioners perceived the set of lenses as useful in their current work environments. Collectively, these results show the effectiveness of the proposed approach and its perceived relevance for the industry, despite the small amount of time required to engage with it.

DOI: 10.1109/ICSE48619.2023.00068


Concrat: An Automatic C-to-Rust Lock API Translator for Concurrent Programs

作者: Hong, Jaemin and Ryu, Sukyoung
关键词: No keywords

Abstract

Concurrent programs suffer from data races. To prevent data races, programmers use locks. However, programs can eliminate data races only when they acquire and release the correct locks at the correct times. The lock API of C, in which people have developed a large portion of legacy system programs, does not validate the correct use of locks. On the other hand, Rust, a recently developed system programming language, provides a lock API that guarantees the correct use of locks via type checking. This makes rewriting legacy system programs in Rust a promising way to retrofit safety into them. Unfortunately, manual C-to-Rust translation is extremely laborious due to the discrepancies between their lock APIs. Even the state-of-the-art automatic C-to-Rust translator retains the C lock API, expecting developers to replace it with the Rust lock API. In this work, we propose an automatic tool to replace the C lock API with the Rust lock API. It facilitates C-to-Rust translation of concurrent programs with less human effort than the current practice. Our tool consists of a Rust code transformer that takes a lock summary as an input and a static analyzer that efficiently generates precise lock summaries. We show that the transformer is scalable and widely applicable while preserving the semantics; it transforms 66 KLOC in 2.6 seconds and successfully handles 74% of real-world programs. We also show that the analyzer is scalable and precise; it analyzes 66 KLOC in 4.3 seconds.

DOI: 10.1109/ICSE48619.2023.00069


Triggers for Reactive Synthesis Specifications

作者: Amram, Gal and Ma’ayan, Dor and Maoz, Shahar and Pistiner, Or and Ringert, Jan Oliver
关键词: No keywords

Abstract

Reactive synthesis is an automated procedure to obtain a correct-by-construction reactive system from its temporal logic specification. Two of the main challenges in bringing reactive synthesis to practice are its very high worst-case complexity and the difficulty of writing declarative specifications using basic LTL operators. To address the first challenge, researchers have suggested the GR(1) fragment of LTL, which has an efficient polynomial time symbolic synthesis algorithm. To address the second challenge, specification languages include higher-level constructs that aim at allowing engineers to write succinct and readable specifications. One such construct is the triggers operator, as supported, e.g., in the Property Specification Language (PSL). In this work we introduce triggers into specifications for reactive synthesis. The effectiveness of our contribution relies on a novel encoding of regular expressions using symbolic finite automata (SFA) and on a novel semantics for triggers that, in contrast to PSL triggers, admits an efficient translation into GR(1). We show that our triggers are expressive and succinct, and prove that our encoding is optimal. We have implemented our ideas on top of the Spectra language and synthesizer. We demonstrate the usefulness and effectiveness of using triggers in specifications for synthesis, as well as the challenges involved in using them, via a study of more than 300 triggers written by undergraduate students who participated in a project class on writing specifications for synthesis. To the best of our knowledge, our work is the first to introduce triggers into specifications for reactive synthesis.

DOI: 10.1109/ICSE48619.2023.00070


Using Reactive Synthesis: An End-to-End Exploratory Case Study

作者: Ma’ayan, Dor and Maoz, Shahar
关键词: No keywords

Abstract

Reactive synthesis is an automated procedure to obtain a correct-by-construction reactive system from its temporal logic specification. Despite its attractiveness and major research progress in the past decades, reactive synthesis is still at an early stage and has not gained popularity outside academia. We conducted an exploratory case study in which we followed students in a semester-long university workshop class on their end-to-end use of a reactive synthesizer, from writing the specifications to executing the synthesized controllers. The data we collected includes more than 500 versions of more than 80 specifications, as well as more than 2500 Slack messages, all written by the class participants. Our grounded theory analysis reveals that the use of reactive synthesis has clear benefits for certain tasks and that adequate specification language constructs assist in the specification writing process. However, inherent issues such as unrealizability, non-well-separation, the gap of knowledge between the users and the synthesizer, and considerable running times prevent reactive synthesis from fulfilling its promise. Based on our analysis, we propose action items in the directions of language and specification quality, tools for analysis and execution, and process and methodology, all towards making reactive synthesis more applicable for software engineers.

DOI: 10.1109/ICSE48619.2023.00071


Syntax and Domain Aware Model for Unsupervised Program Translation

作者: Liu, Fang and Li, Jia and Zhang, Li
关键词: program translation, neural networks, syntax structure, unsupervised learning

Abstract

With the development of software and society, there is growing interest in software migration. Manually migrating projects between languages is error-prone and expensive. In recent years, researchers have begun to explore automatic program translation using supervised deep learning techniques by learning from large-scale parallel code corpora. However, parallel resources are scarce in the programming language domain, and it is costly to collect bilingual data manually. To address this issue, several unsupervised program translation systems have been proposed. However, these systems still rely on huge monolingual source code corpora for training, which is very expensive. Besides, these models cannot perform well when translating languages that are not seen during the pre-training procedure. In this paper, we propose SDA-Trans, a syntax and domain-aware model for program translation, which leverages syntax structure and domain knowledge to enhance cross-lingual transfer ability. SDA-Trans adopts unsupervised training on a smaller-scale corpus, including Python and Java monolingual programs. The experimental results on function translation tasks between Python, Java, and C++ show that SDA-Trans outperforms many large-scale pre-trained models, especially for unseen language translation.

DOI: 10.1109/ICSE48619.2023.00072


Developer-Intent Driven Code Comment Generation

作者: Mu, Fangwen and Chen, Xiao and Shi, Lin and Wang, Song and Wang, Qing
关键词: code comment generation, intent-controllable comment generation, automated comment-intent labeling

Abstract

Existing automatic code comment generators mainly focus on producing a general description of functionality for a given code snippet without considering developer intentions. However, in real-world practice, comments are complicated, which often contain information reflecting various intentions of developers, e.g., functionality summarization, design rationale, implementation details, code properties, etc. To bridge the gap between automatic code comment generation and real-world comment practice, we define Developer-Intent Driven Code Comment Generation, which can generate intent-aware comments for the same source code with different intents. To tackle this challenging task, we propose DOME, an approach that utilizes Intent-guided Selective Attention to explicitly select intent-relevant information from the source code, and produces various comments reflecting different intents. Our approach is evaluated on two real-world Java datasets, and the experimental results show that our approach outperforms the state-of-the-art baselines. A human evaluation also confirms the significant potential of applying DOME in practical usage, enabling developers to comment code effectively according to their own needs.

DOI: 10.1109/ICSE48619.2023.00073


Data Quality Matters: A Case Study of Obsolete Comment Detection

作者: Xu, Shengbin and Yao, Yuan and Xu, Feng and Gu, Tianxiao and Xu, Jingwei and Ma, Xiaoxing
关键词: obsolete comment detection, machine learning for software engineering, data quality

Abstract

Machine learning methods have achieved great success in many software engineering tasks. However, as a data-driven paradigm, how data quality impacts the effectiveness of these methods remains largely unexplored. In this paper, we explore this problem in the context of just-in-time obsolete comment detection. Specifically, we first conduct data cleaning on the existing benchmark dataset, and empirically observe that with only 0.22% label corrections and even 15.0% less data, the existing obsolete comment detection approaches can achieve up to 10.7% relative accuracy improvement. To further mitigate the data quality issues, we propose an adversarial learning framework to simultaneously estimate the data quality and make the final predictions. Experimental evaluations show that this adversarial learning framework can further improve the relative accuracy by up to 18.1% compared to the state-of-the-art method. Although our current results are from the obsolete comment detection problem, we believe that the proposed two-phase solution, which handles data quality issues through both the data aspect and the algorithm aspect, is also generalizable and applicable to other machine learning based software engineering tasks.

DOI: 10.1109/ICSE48619.2023.00074


Revisiting Learning-Based Commit Message Generation

作者: Dong, Jinhao and Lou, Yiling and Hao, Dan and Tan, Lin
关键词: commit message generation, deep learning, pattern-based

Abstract

Commit messages summarize code changes and help developers understand the intention behind them. To alleviate human effort in writing commit messages, researchers have proposed various automated commit message generation techniques, among which learning-based techniques have achieved great success in recent years. However, existing evaluation of learning-based commit message generation relies on the automatic metrics (e.g., BLEU) widely used in natural language processing (NLP) tasks, which are aggregated scores calculated based on the similarity between generated commit messages and the ground truth. Therefore, it remains unclear what generated commit messages look like and what kind of commit messages could be precisely generated by existing learning-based techniques. To fill this knowledge gap, this work performs the first study to systematically investigate the detailed commit messages generated by learning-based techniques. In particular, we first investigate the frequent patterns of the commit messages generated by state-of-the-art learning-based techniques. Surprisingly, we find the majority (~90%) of their generated commit messages belong to simple patterns (i.e., addition/removal/fix/avoidance patterns). To further explore the reasons, we then study the impact of datasets, input representations, and model components. We surprisingly find that existing learning-based techniques have competitive performance even when the inputs are only represented by change marks (i.e., “+”/“-”/“”). This indicates that existing learning-based techniques poorly utilize syntax and semantics in the code while mostly focusing on change marks, which could be the major reason for generating so many pattern-matching commit messages. We also find that the pattern ratio in the training set might positively affect the pattern ratio of generated commit messages, and that model components might have different impacts on the pattern ratio.

DOI: 10.1109/ICSE48619.2023.00075


Commit Message Matters: Investigating Impact and Evolution of Commit Message Quality

作者: Li, Jiawei and Ahmed, Iftekhar
关键词: commit message quality, software defect proneness, empirical analysis

Abstract

Commit messages play an important role in communication among developers. To measure the quality of commit messages, researchers have defined what semantically constitutes a Good commit message: it should have both the summary of the code change (What) and the motivation/reason behind it (Why). The presence of issue report/pull request links referenced in a commit message has been treated as a way of providing Why information. In this study, we found several quality issues that could hamper the links’ ability to provide Why information. Based on this observation, we developed a machine learning classifier for automatically identifying whether a commit message has What and Why information by considering both the commit messages and the link contents. This classifier outperforms state-of-the-art machine learning classifiers with a 12-percentage-point improvement in F1 score. With the improved classifier, we conducted a mixed-method empirical analysis and found that: (1) commit message quality has an impact on software defect proneness, and (2) the overall quality of commit messages decreases over time, while developers believe they are writing better commit messages. All the research artifacts (i.e., tools, scripts, and data) of this study are available on the accompanying website [2].

DOI: 10.1109/ICSE48619.2023.00076


PILAR: Studying and Mitigating the Influence of Configurations on Log Parsing

作者: Dai, Hetong and Tang, Yiming and Li, Heng and Shang, Weiyi
关键词: No keywords

Abstract

The significance of logs has been widely acknowledged with the adoption of various log analysis techniques that assist in software engineering tasks. Many log analysis techniques require structured logs as input, while raw logs are typically unstructured. Automated log parsing is proposed to convert unstructured raw logs into structured log templates. Some log parsers achieve promising accuracy, yet they rely on significant effort from users to tune their parameters to achieve optimal results. In this paper, we first conduct an empirical study to understand the influence of the configurable parameters of six state-of-the-art log parsers on their parsing results in three aspects: 1) varying the parameters while using the same dataset, 2) keeping the same parameters while using different datasets, and 3) using different samples from the same dataset. Our results show that all these parsers are sensitive to the parameters, posing challenges to their adoption in practice. To mitigate such challenges, we propose PILAR (Parameter Insensitive Log Parser), an entropy-based log parsing approach. We compare PILAR with the existing log parsers on the same three aspects and find that PILAR is the most parameter-insensitive one. In addition, PILAR achieves the second highest parsing accuracy and efficiency among all the state-of-the-art log parsers. This paper paves the way for easing the adoption of log analysis in software engineering practice.
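The entropy intuition behind parameter-insensitive parsing can be illustrated with a toy sketch (this is not PILAR's actual algorithm; the whitespace tokenization, equal-length log assumption, and the `threshold` value are simplifying assumptions): token positions whose values vary widely across log lines carry high entropy and are likely parameters, while stable positions belong to the template.

```python
import math
from collections import Counter

def column_entropy(values):
    """Shannon entropy of the token values observed at one position."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def extract_template(logs, threshold=1.0):
    """Mark high-entropy positions (many distinct values) as parameters <*>.
    Assumes all lines share the same number of whitespace-separated tokens."""
    columns = list(zip(*(line.split() for line in logs)))
    template = []
    for col in columns:
        template.append("<*>" if column_entropy(col) > threshold else col[0])
    return " ".join(template)

logs = [
    "Connected to 10.0.0.1 port 22",
    "Connected to 10.0.0.2 port 22",
    "Connected to 10.0.0.3 port 22",
]
print(extract_template(logs))  # Connected to <*> port 22
```

The IP column takes three distinct values over three lines (entropy log2(3) ≈ 1.58), so it is abstracted; every other column is constant (entropy 0) and kept verbatim.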

DOI: 10.1109/ICSE48619.2023.00077


Did We Miss Something Important? Studying and Exploring Variable-Aware Log Abstraction

作者: Li, Zhenhao and Luo, Chuan and Chen, Tse-Hsun (Peter) and Shang, Weiyi and He, Shilin and Lin, Qingwei and Zhang, Dongmei
关键词: software logs, log abstraction, deep learning

Abstract

Due to the sheer size of software logs, developers rely on automated techniques for log analysis. One of the first and most important steps of automated log analysis is log abstraction, which parses the raw logs into a structured format. Prior log abstraction techniques aim to identify and abstract all the dynamic variables in logs and output a static log template for automated log analysis. However, these abstracted dynamic variables may also contain important information that is useful to different tasks in log analysis. In this paper, we investigate the characteristics of dynamic variables and their importance in practice, and explore the potential of a variable-aware log abstraction technique. Through manual investigations and surveys with practitioners, we find that different categories of dynamic variables record various information that can be important depending on the given tasks, and that distinguishing dynamic variables during log abstraction can further assist in log analysis. We then propose a deep learning based log abstraction approach, named VALB, which can identify different categories of dynamic variables and preserve the value of specified categories of dynamic variables along with the log templates (i.e., variable-aware log abstraction). Through an evaluation on a widely used log abstraction benchmark, we find that VALB outperforms other state-of-the-art log abstraction techniques on general log abstraction (i.e., when abstracting all the dynamic variables) and also achieves a high variable-aware log abstraction accuracy that further identifies the category of the dynamic variables. Our study highlights the potential of leveraging the important information recorded in the dynamic variables to further improve the process of log analysis.
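As a toy illustration of what "variable-aware" abstraction means in practice (the regex categories below are hypothetical stand-ins; VALB itself identifies categories with a deep learning model, not hand-written patterns):

```python
import re

# Hypothetical category patterns, ordered from most to least specific.
CATEGORIES = [
    ("ip",  re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("num", re.compile(r"\d+")),
]

def abstract(line, keep=()):
    """Whole-token abstraction: replace each dynamic token with its category
    tag, preserving the concrete value for categories listed in `keep`."""
    out = []
    for tok in line.split():
        for name, pat in CATEGORIES:
            if pat.fullmatch(tok):
                out.append(f"<{name}:{tok}>" if name in keep else f"<{name}>")
                break
        else:
            out.append(tok)  # static template token
    return " ".join(out)

line = "Block 17 sent to 10.0.0.1"
print(abstract(line))                # Block <num> sent to <ip>
print(abstract(line, keep=("ip",)))  # Block <num> sent to <ip:10.0.0.1>
```

A downstream task that needs, say, IP addresses can request that category via `keep` while the rest of the line is still abstracted into a template.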

DOI: 10.1109/ICSE48619.2023.00078


On the Temporal Relations between Logging and Code

作者: Ding, Zishuo and Tang, Yiming and Li, Yang and Li, Heng and Shang, Weiyi
关键词: software logging, logging text, temporal relations

Abstract

Prior work shows that misleading logging texts (i.e., the textual descriptions in logging statements) can be counterproductive for developers during their use of logs. One of the most important types of information provided by logs is the temporal information of the recorded system behavior. For example, a logging text may use the perfective aspect to describe the fact that an important system event has finished. Although prior work has performed extensive studies on automated logging suggestions, few of these studies investigate the temporal relations between logging and code. In this work, we make the first attempt to comprehensively study the temporal relations between logging and its corresponding source code. In particular, we focus on two types of temporal relations: (1) logical temporal relations, which can be inferred from the execution order between the logging statement and the corresponding source code; and (2) semantic temporal relations, which can be inferred from the semantic meaning of the logging text. We first perform qualitative analyses to study these two types of logging-code temporal relations and the inconsistency between them. As a result, we derive rules to detect these two types of temporal relations and their inconsistencies. Based on these rules, we propose a tool named TempoLo to automatically detect issues of temporal inconsistency between logging and code. Through an evaluation of four projects, we find that TempoLo can effectively detect temporal inconsistencies with a small number of false positives. To gather developers’ feedback on whether such inconsistencies are worth fixing, we reported 15 detected instances from these projects to developers. Thirteen instances from three projects have been confirmed and fixed, while two instances of the remaining project are pending at the time of this writing.
Our work lays the foundation for describing temporal relations between logging and code and demonstrates the potential for a deeper understanding of the relationship between logging and code.
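A minimal hand-made illustration of the kind of inconsistency in question (the `write` helper is a hypothetical stand-in, and the example is not taken from the paper):

```python
import logging

def write(path, data):
    """Hypothetical stand-in for the real persistence routine."""
    pass

# Inconsistent: the perfective "Saved" is logged BEFORE the write executes,
# so the log claims completion of an event that may still fail.
def save_results_bad(path, data):
    logging.info("Saved results to %s", path)
    write(path, data)

# Consistent: the logging statement's position matches the tense of its text.
def save_results_good(path, data):
    write(path, data)
    logging.info("Saved results to %s", path)
```

If `write` throws in the first variant, the log trail misleadingly reports a completed save, which is exactly the mismatch between execution order and logging-text semantics described above.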

DOI: 10.1109/ICSE48619.2023.00079


How Do Developers’ Profiles and Experiences Influence their Logging Practices? An Empirical Study of Industrial Practitioners

作者: Rong, Guoping and Gu, Shenghui and Shen, Haifeng and Zhang, He and Kuang, Hongyu
关键词: logging practice, intention, concern, fulfill

Abstract

Logs record the behavioral data of running programs and are typically generated by executing log statements. Software developers generally carry out logging practices with clear intentions and associated concerns (I&Cs). However, I&Cs may not be properly fulfilled in source code as log placement — specifically, the determination of a log statement’s context and content — is often susceptible to an individual’s profile and experience. Some industrial studies have been conducted to discern developers’ main logging I&Cs and the way I&Cs are fulfilled. However, the findings are only based on the developers from a single company in each individual study and hence have limited generalizability. More importantly, there is a lack of a comprehensive and deep understanding of the relationships between developers’ profiles and experiences and their logging practices from a wider perspective. To fill this significant gap, we conducted an empirical study using mixed methods comprising questionnaire surveys, semi-structured interviews, and code analyses with practitioners from a wide range of companies across a variety of industrial domains. Results reveal that while developers share common logging I&Cs and conduct logging practices mainly in the coding stage, their profiles and experiences profoundly influence their logging I&Cs and the way the I&Cs are fulfilled. These findings pave the way to facilitate the acceptance of important logging I&Cs and the adoption of good logging practices by developers.

DOI: 10.1109/ICSE48619.2023.00080


When to Say What: Learning to Find Condition-Message Inconsistencies

作者: Bouzenia, Islem and Pradel, Michael
关键词: No keywords

Abstract

Programs often emit natural language messages, e.g., in logging statements or exceptions raised on unexpected paths. To be meaningful to users and developers, the message, i.e., what to say, must be consistent with the condition under which it gets triggered, i.e., when to say it. However, checking for inconsistencies between conditions and messages is challenging because the conditions are expressed in the logic of the programming language, while messages are informally expressed in natural language. This paper presents CMI-Finder, an approach for detecting condition-message inconsistencies. CMI-Finder is based on a neural model that takes a condition and a message as its input and then predicts whether the two are consistent. To address the problem of obtaining realistic, diverse, and large-scale training data, we present six techniques to automatically generate large numbers of inconsistent examples to learn from. Moreover, we describe and compare three neural models, which are based on binary classification, triplet loss, and fine-tuning, respectively. Our evaluation applies the approach to 300K condition-message statements extracted from 42 million lines of Python code. The best model achieves a precision of 78% at a recall of 72% on a dataset of past bug fixes. Applying the approach to the newest versions of popular open-source projects reveals 50 previously unknown bugs, 19 of which have been confirmed by the developers so far.
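A hand-made toy example (not from the paper's dataset) makes the condition-message mismatch concrete:

```python
# Inconsistent pair: the message says the value "must be positive", yet the
# guard fires exactly when the value IS positive.
def set_timeout_bad(seconds):
    if seconds > 0:
        raise ValueError("timeout must be positive")
    return seconds

# Consistent pair: the condition (when to say it) matches
# the message (what to say).
def set_timeout_good(seconds):
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    return seconds
```

The first variant is syntactically valid and type-correct, which is why detecting it requires relating the programming-language condition to the natural-language message, as the approach above does.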

DOI: 10.1109/ICSE48619.2023.00081


SemParser: A Semantic Parser for Log Analytics

作者: Huo, Yintong and Su, Yuxin and Lee, Cheryl and Lyu, Michael R.
关键词: No keywords

Abstract

Logs, being run-time information automatically generated by software, record system events and activities with their timestamps. Before obtaining more insights into the run-time status of the software, a fundamental step of log analysis, called log parsing, is employed to extract structured templates and parameters from the semi-structured raw log messages. However, current log parsers are all syntax-based and regard each message as a character string, ignoring the semantic information included in parameters and templates. Thus, we propose the first semantic-based parser SemParser to unlock the critical bottleneck of mining semantics from log messages. It contains two steps, an end-to-end semantics miner and a joint parser. Specifically, the first step aims to identify explicit semantics inside a single log, and the second step is responsible for jointly inferring implicit semantics and computing structural outputs according to the contextual knowledge base of the logs. To analyze the effectiveness of our semantic parser, we first demonstrate that it can derive rich semantics from log messages collected from six widely-applied systems with an average F1 score of 0.985. Then, we conduct two representative downstream tasks, showing that current downstream models improve their performance with appropriately extracted semantics by 1.2%-11.7% and 8.65% on two anomaly detection datasets and a failure identification dataset, respectively. We believe these findings provide insights into semantically understanding log messages for the log analysis community.

DOI: 10.1109/ICSE48619.2023.00082


Badge: Prioritizing UI Events with Hierarchical Multi-Armed Bandits for Automated UI Testing

作者: Ran, Dezhi and Wang, Hao and Wang, Wenyu and Xie, Tao
关键词: GUI testing, mobile testing, mobile app, android, multi-armed bandits, reinforcement learning

Abstract

To assure high quality of mobile applications (apps for short), automated UI testing triggers events (associated with UI elements on app UIs) without human intervention, aiming to maximize code coverage and find unique crashes. To achieve high test effectiveness, automated UI testing prioritizes a UI event based on its exploration value (e.g., the increased code coverage of future exploration rooted from the UI event). Various strategies have been proposed to estimate the exploration value of a UI event without considering its exploration diversity (reflecting the variance of covered code entities achieved by explorations rooted from this UI event across its different triggerings), resulting in low test effectiveness, especially on complex mobile apps. To address the preceding problem, in this paper, we propose a new approach named Badge to prioritize UI events considering both their exploration values and exploration diversity for effective automated UI testing. In particular, we design a hierarchical multi-armed bandit model to effectively estimate the exploration value and exploration diversity of a UI event based on its historical explorations along with historical explorations rooted from UI events in the same UI group. We evaluate Badge on 21 highly popular industrial apps widely used by previous related work. Experimental results show that Badge outperforms state-of-the-art/practice tools with 18%-146% relative code coverage improvement and finding 1.19–5.20x as many unique crashes, demonstrating the effectiveness of Badge. Further experimental studies confirm the benefits brought by Badge’s individual algorithms.
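The value-estimation core of such a bandit formulation can be sketched with plain UCB1 (Badge's hierarchical model and its exploration-diversity terms are not reproduced here; the reward is a stand-in for, e.g., newly covered code from an exploration):

```python
import math

class UCBEventPrioritizer:
    """Minimal UCB1 bandit over UI events: balance the estimated exploration
    value of an event against the uncertainty of that estimate."""
    def __init__(self, events):
        self.counts = {e: 0 for e in events}
        self.rewards = {e: 0.0 for e in events}
        self.total = 0

    def select(self):
        # Trigger every event at least once, then pick by mean reward
        # plus an exploration bonus that shrinks as an event is retried.
        for e, n in self.counts.items():
            if n == 0:
                return e
        return max(self.counts, key=lambda e:
                   self.rewards[e] / self.counts[e]
                   + math.sqrt(2 * math.log(self.total) / self.counts[e]))

    def update(self, event, reward):
        # reward: e.g., amount of newly covered code after triggering `event`.
        self.counts[event] += 1
        self.rewards[event] += reward
        self.total += 1
```

After one pull each, an event whose exploration yielded more coverage is preferred, but events with few triggerings keep a large bonus and are retried, which is the value/uncertainty trade-off the abstract describes.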

DOI: 10.1109/ICSE48619.2023.00083


Efficiency Matters: Speeding Up Automated Testing with GUI Rendering Inference

作者: Feng, Sidong and Xie, Mulong and Chen, Chunyang
关键词: efficient android GUI testing, GUI rendering, machine learning

Abstract

Due to the importance of Android app quality assurance, many automated GUI testing tools have been developed. Although the test algorithms have been improved, the impact of GUI rendering has been overlooked. On the one hand, setting a long waiting time to execute events on fully rendered GUIs slows down the testing process. On the other hand, setting a short waiting time will cause the events to execute on partially rendered GUIs, which negatively affects the testing effectiveness. An optimal waiting time should strike a balance between effectiveness and efficiency. We propose AdaT, a lightweight image-based approach to dynamically adjust the inter-event time based on GUI rendering state. Given the real-time rendering stream of the GUI, AdaT presents a deep learning model to infer the rendering state, and synchronizes with the testing tool to schedule the next event when the GUI is fully rendered. The evaluations demonstrate the accuracy, efficiency, and effectiveness of our approach. We also integrate our approach with the existing automated testing tool to demonstrate the usefulness of AdaT in covering more activities and executing more events on fully rendered GUIs.
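A naive frame-stability heuristic makes the waiting-time trade-off concrete (AdaT replaces this kind of polling with a learned classifier over a single frame; `capture` is a caller-supplied screenshot function, so this is only an illustrative baseline):

```python
import time

def wait_until_rendered(capture, poll=0.05, timeout=2.0):
    """Treat the GUI as fully rendered once two consecutive captures are
    identical; give up after `timeout` seconds. Polling like this wastes
    time on stable screens and can be fooled by animations, which motivates
    inferring the rendering state from the frame itself."""
    deadline = time.monotonic() + timeout
    prev = capture()
    while time.monotonic() < deadline:
        time.sleep(poll)
        cur = capture()
        if cur == prev:
            return True
        prev = cur
    return False
```

A fixed long `poll`/`timeout` recreates the "slow but safe" extreme from the abstract, while a tiny one recreates "fast but fires on partially rendered GUIs".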

DOI: 10.1109/ICSE48619.2023.00084


CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-Trained Large Language Models

作者: Lemieux, Caroline and Inala, Jeevana Priya and Lahiri, Shuvendu K. and Sen, Siddhartha
关键词: No keywords

Abstract

Search-based software testing (SBST) generates high-coverage test cases for programs under test with a combination of test case generation and mutation. SBST’s performance relies on there being a reasonable probability of generating test cases that exercise the core logic of the program under test. Given such test cases, SBST can then explore the space around them to exercise various parts of the program. This paper explores whether Large Language Models (LLMs) of code, such as OpenAI’s Codex, can be used to help SBST’s exploration. Our proposed algorithm, CodaMosa, conducts SBST until its coverage improvements stall, then asks Codex to provide example test cases for under-covered functions. These examples help SBST redirect its search to more useful areas of the search space. On an evaluation over 486 benchmarks, CodaMosa achieves statistically significantly higher coverage on many more benchmarks (173 and 279) than it reduces coverage on (10 and 4), compared to SBST and LLM-only baselines.
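The control loop described above can be sketched as follows (both callbacks are caller-supplied stand-ins, not a real SBST engine or Codex client; the stall heuristic and parameter names are assumptions for illustration):

```python
def codamosa_loop(sbst_step, ask_llm_for_tests, max_iters=50, stall_limit=5):
    """Sketch of the CodaMosa-style loop: run search-based test generation
    until coverage stalls, then seed the search population with
    LLM-provided example tests for under-covered code."""
    best, stall, population = 0.0, 0, []
    for _ in range(max_iters):
        population, coverage = sbst_step(population)
        if coverage > best:
            best, stall = coverage, 0
        else:
            stall += 1
        if stall >= stall_limit:  # coverage plateau detected
            population = population + ask_llm_for_tests()
            stall = 0
    return best
```

The key design point is that the LLM is consulted only at plateaus, so the cheap search does most of the work and the LLM merely redirects it to more useful regions of the search space.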

DOI: 10.1109/ICSE48619.2023.00085


TaintMini: Detecting Flow of Sensitive Data in Mini-Programs with Static Taint Analysis

作者: Wang, Chao and Ko, Ronny and Zhang, Yue and Yang, Yuqing and Lin, Zhiqiang
关键词: mini-programs, taint analysis, privacy leaks detection, empirical study

Abstract

Mini-programs, which are programs running inside mobile super apps such as WeChat, often have access to privacy-sensitive information, such as location data and phone numbers, through APIs provided by the super apps. This access poses a risk of privacy-sensitive data leaks, either accidentally from carelessly programmed mini-programs or intentionally from malicious ones. To address this concern, it is crucial to track the flow of sensitive data in mini-programs for either human analysis or automated tools. Although existing taint analysis techniques have been widely studied, they face unique challenges in tracking sensitive data flows in mini-programs, such as cross-language, cross-page, and cross-mini-program data flows. This paper presents a novel framework, TaintMini, which addresses these challenges by using a novel universal data flow graph approach that captures data flows within and across mini-programs. We evaluated TaintMini on 238,866 mini-programs and detected 27,184 that contain sensitive data flows. We also applied TaintMini to detect colluding privacy-leaking mini-programs and identified 455 programs among them that clearly violate their privacy policies.
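The core of any such analysis is forward taint propagation from sensitive sources to sinks over a data-flow graph. The sketch below is a generic, simplified stand-in for TaintMini's universal data flow graph, and the node names are invented for the example:

```python
from collections import deque

def tainted_sinks(flow_graph, sources, sinks):
    """Forward taint propagation: flow_graph maps each node to the
    nodes its value flows into; return the sinks reachable from any
    taint source (i.e., potential sensitive-data leaks)."""
    tainted, work = set(sources), deque(sources)
    while work:
        node = work.popleft()
        for succ in flow_graph.get(node, []):
            if succ not in tainted:
                tainted.add(succ)
                work.append(succ)
    return tainted & set(sinks)
```

The framework's contribution is building one such graph that spans WebView/JavaScript boundaries, pages, and even multiple mini-programs, so this reachability check works across all of them.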

DOI: 10.1109/ICSE48619.2023.00086


AChecker: Statically Detecting Smart Contract Access Control Vulnerabilities

作者: Ghaleb, Asem and Rubin, Julia and Pattabiraman, Karthik
关键词: smart contract, access control, dataflow analysis

Abstract

As most smart contracts have a financial nature and handle valuable assets, smart contract developers use access control to protect assets managed by smart contracts from being misused by malicious or unauthorized people. Unfortunately, programming languages used for writing smart contracts, such as Solidity, were not designed with a permission-based security model in mind. Therefore, smart contract developers implement access control checks based on their judgment and in an ad-hoc manner, which results in several vulnerabilities in smart contracts, called access control vulnerabilities. Further, the inconsistency in implementing access control makes it difficult to reason about whether a contract meets access control needs and is free of access control vulnerabilities. In this work, we propose AChecker, an approach for detecting access control vulnerabilities. Unlike prior work, AChecker does not rely on predefined patterns or contract transaction history. Instead, it infers the access control implemented in smart contracts via static dataflow analysis. Moreover, the approach performs further symbolic-based analysis to distinguish cases when unauthorized people can obtain control of the contract as intended functionality. We evaluated AChecker on three public datasets of real-world smart contracts, including one which consists of contracts with assigned access control CVEs, and compared its effectiveness with eight analysis tools. The evaluation results showed that AChecker outperforms these tools in terms of both precision and recall. In addition, AChecker flagged vulnerabilities in 21 frequently-used contracts on the Ethereum blockchain with 90% precision.

DOI: 10.1109/ICSE48619.2023.00087


Fine-Grained Commit-Level Vulnerability Type Prediction by CWE Tree Structure

作者: Pan, Shengyi and Bao, Lingfeng and Xia, Xin and Lo, David and Li, Shanping
关键词: software security, vulnerability type, CWE

Abstract

Identifying security patches via code commits to allow early warnings and timely fixes for Open Source Software (OSS) has received increasing attention. However, the existing detection methods can only identify the presence of a patch (i.e., a binary classification) but fail to pinpoint the vulnerability type. In this work, we take the first step to categorize the security patches into fine-grained vulnerability types. Specifically, we use the Common Weakness Enumeration (CWE) as the label and perform fine-grained classification using categories at the third level of the CWE tree. We first formulate the task as a Hierarchical Multi-label Classification (HMC) problem, i.e., inferring a path (a sequence of CWE nodes) from the root of the CWE tree to the node at the target depth. We then propose an approach named TreeVul with a hierarchical and chained architecture, which manages to utilize the structure information of the CWE tree as prior knowledge of the classification task. We further propose a tree structure aware and beam search based inference algorithm for retrieving the optimal path with the highest merged probability. We collect a large security patch dataset from NVD, consisting of 6,541 commits from 1,560 GitHub OSS repositories. Experimental results show that TreeVul significantly outperforms the best performing baselines, with improvements of 5.9%, 25.0%, and 7.7% in terms of weighted F1-score, macro F1-score, and MCC, respectively. We further conduct a user study and a case study to verify the practical value of TreeVul in enriching the binary patch detection results and improving the data quality of NVD, respectively.
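The beam-search inference over the CWE tree can be sketched as below. The toy tree, the per-node scores, and scoring a path by the product of its node probabilities are illustrative assumptions standing in for TreeVul's learned per-level classifiers:

```python
import math

def beam_search_path(children, scores, root, depth, beam_width=2):
    """Tree-structure-aware beam search: extend paths only along
    parent->child edges of the tree and keep the beam_width paths
    with the highest merged (log-)probability at each level."""
    beams = [([root], 0.0)]                      # (path, log-probability)
    for _ in range(depth):
        candidates = []
        for path, lp in beams:
            for child in children.get(path[-1], []):
                candidates.append((path + [child], lp + math.log(scores[child])))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]                           # best root-to-target-depth path
```

Note how the beam can recover from a locally suboptimal first level: a greedy search would commit to the highest-scoring child of the root, while the beam keeps alternatives whose descendants may yield a higher merged probability.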

DOI: 10.1109/ICSE48619.2023.00088


Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation

作者: Sun, Jiamou and Xing, Zhenchang and Lu, Qinghua and Xu, Xiwei and Zhu, Liming and Hoang, Thong and Zhao, Dehai
关键词: No keywords

Abstract

Open-source software is widely used for its convenience. For various reasons, open-source maintainers often fix vulnerabilities silently, leaving their users unaware of the updates and exposed to threats. Previous works all focus on black-box binary detection of silent dependency alerts, which suffers from high false-positive rates, so open-source software users need to analyze and explain the AI predictions themselves. Explainable AI has emerged as a complement to black-box AI models, providing details in various forms to explain AI decisions. Noticing that there is still no technique that can discover silent dependency alerts on time, in this work we propose a framework using an encoder-decoder model with a binary detector to provide explainable silent dependency alert prediction. Our model generates four types of vulnerability key aspects, including vulnerability type, root cause, attack vector, and impact, to enhance the trustworthiness of alert predictions and users' acceptance of them. Through experiments with several models and inputs, we confirm that CodeBERT with both commit messages and code changes achieves the best results. Our user study shows that explainable alert predictions can help users find silent dependency alerts more easily than black-box predictions. To the best of our knowledge, this is the first research work on the application of Explainable AI to silent dependency alert prediction, which opens the door to related domains.

DOI: 10.1109/ICSE48619.2023.00089


Reusing Deep Neural Network Models through Model Re-Engineering

作者: Qi, Binhang and Sun, Hailong and Gao, Xiang and Zhang, Hongyu and Li, Zhaotian and Liu, Xudong
关键词: model reuse, deep neural network, re-engineering, DNN modularization

Abstract

Training deep neural network (DNN) models, which has become an important task in today's software development, is often costly in terms of computational resources and time. With the inspiration of software reuse, building DNN models through reusing existing ones has gained increasing attention recently. Prior approaches to DNN model reuse have two main limitations: 1) reusing the entire model, while only a small part of the model's functionalities (labels) are required, would cause much overhead (e.g., computational and time costs for inference), and 2) model reuse would inherit the defects and weaknesses of the reused model, and hence put the new system under threat of security attacks. To solve the above problems, we propose SeaM, a tool that re-engineers a trained DNN model to improve its reusability. Specifically, given a target problem and a trained model, SeaM utilizes a gradient-based search method to search for the model's weights that are relevant to the target problem. The re-engineered model, which only retains the relevant weights, is then reused to solve the target problem. Evaluation results on widely-used models show that the re-engineered models produced by SeaM contain only 10.11% of the weights of the original models, resulting in a 42.41% reduction in inference time. For the target problem, the re-engineered models even outperform the original models in classification accuracy by 5.85%. Moreover, reusing the re-engineered models inherits on average 57% fewer defects than reusing the entire model. We believe our approach to reducing reuse overhead and defect inheritance is one important step forward for practical model reuse.
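The retain-only-relevant-weights idea can be sketched as follows. SeaM derives relevance via a gradient-based search; in this toy version the relevance scores are simply given as input, and the keep ratio is an illustrative parameter:

```python
def reengineer_weights(weights, relevance, keep_ratio=0.1):
    """Toy sketch of re-engineering by weight retention: keep only the
    top keep_ratio fraction of weights by relevance score and zero out
    the rest, yielding a smaller effective model for the target problem."""
    k = max(1, int(len(weights) * keep_ratio))
    keep = set(sorted(range(len(weights)),
                      key=lambda i: relevance[i], reverse=True)[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]
```

Zeroed weights can then be pruned entirely, which is where the reported inference-time savings come from: the reused model computes only with the retained fraction.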

DOI: 10.1109/ICSE48619.2023.00090


PyEvolve: Automating Frequent Code Changes in Python ML Systems

作者: Dilhara, Malinda and Dig, Danny and Ketkar, Ameya
关键词: No keywords

Abstract

Because of the naturalness of software and the rapid evolution of Machine Learning (ML) techniques, frequently repeated code change patterns (CPATs) occur often. They range from simple API migrations to changes involving several complex control structures such as for loops. While manually performing CPATs is tedious, the current state-of-the-art techniques for inferring transformation rules are not advanced enough to handle unseen variants of complex CPATs, resulting in a low recall rate. In this paper, we present a novel, automated workflow that mines CPATs, infers the transformation rules, and then transplants them automatically to new target sites. We designed, implemented, evaluated and released this in a tool, PyEvolve. At its core is a novel data-flow- and control-flow-aware transformation rule inference engine. Our technique allows us to advance the state of the art for transformation-by-example tools; without it, 70% of the code changes that PyEvolve transforms would not be possible to automate. Our thorough empirical evaluation of over 40,000 transformations shows 97% precision and 94% recall. Developers accepted 90% of the CPATs generated by PyEvolve in popular open-source projects, confirming that its changes are useful.

DOI: 10.1109/ICSE48619.2023.00091


DeepArc: Modularizing Neural Networks for the Model Maintenance

作者: Ren, Xiaoning and Lin, Yun and Xue, Yinxing and Liu, Ruofan and Sun, Jun and Feng, Zhiyong and Dong, Jin Song
关键词: architecture, modularization, neural networks

Abstract

Neural networks are an emerging data-driven programming paradigm widely used in many areas. Unlike traditional software systems consisting of decomposable modules, a neural network is usually delivered as a monolithic package, raising challenges for some maintenance tasks such as model restructure and re-adaption. In this work, we propose DeepArc, a novel modularization method for neural networks, to reduce the cost of model maintenance tasks. Specifically, DeepArc decomposes a neural network into several consecutive modules, each of which encapsulates consecutive layers with similar semantics. The network modularization facilitates practical tasks such as refactoring the model to preserve existing features (e.g., model compression) and enhancing the model with new features (e.g., fitting new samples). The modularization and encapsulation allow us to restructure or retrain the model by only pruning and tuning a few localized neurons and layers. Our experiments show that (1) DeepArc can boost the runtime efficiency of the state-of-the-art model compression techniques by 14.8%; (2) compared to the traditional model retraining, DeepArc only needs to train less than 20% of the neurons on average to fit adversarial samples and repair under-performing models, leading to 32.85% faster training performance while achieving similar model prediction performance.
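The layer-grouping step can be illustrated with a toy boundary-cutting rule: given similarities between the semantics of consecutive layers (however those are measured), start a new module wherever the similarity drops below a threshold. Both the similarity inputs and the threshold are assumptions for illustration, not DeepArc's actual metric:

```python
def modularize(layer_sims, threshold=0.8):
    """Toy sketch of DeepArc-style modularization: layer_sims[i] is the
    semantic similarity between layer i and layer i+1; cut a module
    boundary wherever similarity falls below the threshold, so each
    module encapsulates consecutive layers with similar semantics."""
    modules, current = [], [0]
    for i, sim in enumerate(layer_sims):
        if sim >= threshold:
            current.append(i + 1)     # layer i+1 joins the current module
        else:
            modules.append(current)   # similarity drop: close the module
            current = [i + 1]
    modules.append(current)
    return modules
```

Maintenance tasks then operate at module granularity, e.g., retraining only the module whose layers are responsible for the behavior being repaired.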

DOI: 10.1109/ICSE48619.2023.00092


Decomposing a Recurrent Neural Network into Modules for Enabling Reusability and Replacement

作者: Imtiaz, Sayem Mohammad and Batole, Fraol and Singh, Astha and Pan, Rangeet and Cruz, Breno Dantas and Rajan, Hridesh
关键词: recurrent neural networks, decomposing, modules, modularity

Abstract

Can we take a recurrent neural network (RNN) trained to translate between languages and augment it to support a new natural language without retraining the model from scratch? Can we fix the faulty behavior of the RNN by replacing portions associated with the faulty behavior? Recent works on decomposing a fully connected neural network (FCNN) and convolutional neural network (CNN) into modules have shown the value of engineering deep models in this manner, which is standard in traditional SE but foreign for deep learning models. However, prior works focus on the image-based multi-class classification problems and cannot be applied to RNN due to (a) different layer structures, (b) loop structures, (c) different types of input-output architectures, and (d) usage of both nonlinear and logistic activation functions. In this work, we propose the first approach to decompose an RNN into modules. We study different types of RNNs, i.e., Vanilla, LSTM, and GRU. Further, we show how such RNN modules can be reused and replaced in various scenarios. We evaluate our approach against 5 canonical datasets (i.e., Math QA, Brown Corpus, Wiki-toxicity, Clinc OOS, and Tatoeba) and 4 model variants for each dataset. We found that decomposing a trained model has a small cost (Accuracy: -0.6%, BLEU score: +0.10%). Also, the decomposed modules can be reused and replaced without needing to retrain.

DOI: 10.1109/ICSE48619.2023.00093


Chronos: Time-Aware Zero-Shot Identification of Libraries from Vulnerability Reports

作者: Lyu, Yunbo and Le-Cong, Thanh and Kang, Hong Jin and Widyasari, Ratnadira and Zhao, Zhipeng and Le, Xuan-Bach D. and Li, Ming and Lo, David
关键词: zero-shot learning, library identification, unseen labels, extreme multi-label classification, vulnerability reports

Abstract

Tools that alert developers about library vulnerabilities depend on accurate, up-to-date vulnerability databases which are maintained by security researchers. These databases record the libraries related to each vulnerability. However, the vulnerability reports may not explicitly list every library and human analysis is required to determine all the relevant libraries. Human analysis may be slow and expensive, which motivates the need for automated approaches. Researchers and practitioners have proposed to automatically identify libraries from vulnerability reports using extreme multi-label learning (XML). While state-of-the-art XML techniques showed promising performance, their experimental settings do not practically fit what happens in reality. Previous studies randomly split the vulnerability reports data for training and testing their models without considering the chronological order of the reports. This may unduly train the models on chronologically newer reports while testing the models on chronologically older ones. However, in practice, one often receives chronologically new reports, which may be related to previously unseen libraries. Under this practical setting, we observe that the performance of current XML techniques declines substantially, e.g., F1 decreased from 0.7 to 0.24 under experiments without and with consideration of chronological order of vulnerability reports. We propose a practical library identification approach, namely Chronos, based on zero-shot learning. The novelty of Chronos is three-fold. First, Chronos fits into the practical pipeline by considering the chronological order of vulnerability reports. Second, Chronos enriches the data of the vulnerability descriptions and labels using a carefully designed data enhancement step. Third, Chronos exploits the temporal ordering of the vulnerability reports using a cache to prioritize prediction of versions of libraries that recently had reports of vulnerabilities. In our experiments, Chronos achieves an average F1-score of 0.75, 3x better than the best XML-based approach. Data enhancement and the time-aware adjustment improve Chronos over the vanilla zero-shot learning model by 27% in average F1.
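The time-aware cache adjustment can be sketched as a score boost for libraries with recent vulnerability reports. The flat additive boost and its value are assumptions for illustration, not Chronos's exact formula:

```python
def rank_libraries(base_scores, recent_cache, boost=0.5):
    """Sketch of a time-aware adjustment: libraries that recently had
    vulnerability reports (the cache) get their zero-shot similarity
    score boosted before the final ranking."""
    adjusted = {lib: s + (boost if lib in recent_cache else 0.0)
                for lib, s in base_scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)
```

The intuition is temporal locality: a library reported vulnerable last week is a stronger prior for today's report than its text similarity alone suggests.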

DOI: 10.1109/ICSE48619.2023.00094


Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem

作者: Wu, Yulun and Yu, Zeliang and Wen, Ming and Li, Qiang and Zou, Deqing and Jin, Hai
关键词: maven, ecosystem security, vulnerability

Abstract

Modern software systems are increasingly relying on dependencies from the ecosystem. A recent estimation shows that around 35% of an open-source project's code comes from the libraries it depends on. Unfortunately, open-source libraries are often threatened by various vulnerability issues, and the number of disclosed vulnerabilities has been increasing steadily over the years. Such vulnerabilities can pose significant security threats to the whole ecosystem, not only to the vulnerable libraries themselves but also to the corresponding downstream projects. Many Software Composition Analysis (SCA) tools have been proposed, aiming to detect vulnerable libraries or components by referring to existing vulnerability databases. However, recent studies report that such tools often generate a large number of false alerts; in particular, up to 73.3% of the projects depending on vulnerable libraries are actually safe. To devise more precise tools, it is important to understand the threats of vulnerabilities holistically across the ecosystem, as a number of existing studies have set out to do. However, previous studies either analyze at a very coarse granularity (e.g., without analyzing the source code) or are limited in their study scale. This study aims to bridge these gaps. In particular, we collect 44,450 instances of 〈CVE, upstream, downstream〉 relations and analyze around 50 million invocations made from downstream to upstream projects to understand the potential threats of upstream vulnerabilities to downstream projects in the Maven ecosystem. Our investigation makes interesting yet significant findings with respect to multiple aspects, including the reachability of vulnerabilities, the complexity of the reachable paths, and how downstream projects and developers perceive upstream vulnerabilities. We believe such findings can not only provide a holistic understanding of the threats of upstream vulnerabilities in the Maven ecosystem, but also guide future research in this field.

DOI: 10.1109/ICSE48619.2023.00095


SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript

作者: Bhuiyan, Masudul Hasan Masud and Parthasarathy, Adithya Srinivas and Vasilakis, Nikos and Pradel, Michael and Staicu, Cristian-Alexandru
关键词: No keywords

Abstract

Npm is the largest software ecosystem in the world, offering millions of free, reusable packages. In recent years, various security threats to packages published on npm have been reported, including vulnerabilities that affect millions of users. To continuously improve techniques for detecting vulnerabilities and mitigating attacks that exploit them, a reusable benchmark of vulnerabilities would be highly desirable. Ideally, such a benchmark should be realistic, come with executable exploits, and include fixes of vulnerabilities. Unfortunately, there currently is no such benchmark, forcing researchers to repeatedly develop their own evaluation datasets and making it difficult to compare techniques with each other. This paper presents SecBench.js, the first comprehensive benchmark suite of vulnerabilities and executable exploits for npm. The benchmark comprises 600 vulnerabilities, which cover the five most common vulnerability classes for server-side JavaScript. Each vulnerability comes with a payload that exploits the vulnerability and an oracle that validates successful exploitation. SecBench.js enables various applications, of which we explore three in this paper: (i) crosschecking SecBench.js against public security advisories reveals 168 vulnerable versions in 19 packages that are mislabeled in the advisories; (ii) applying simple code transformations to the exploits in our suite helps identify flawed fixes of vulnerabilities; (iii) dynamically analyzing calls to common sink APIs, e.g., exec(), yields a ground truth of code locations for evaluating vulnerability detectors. Beyond providing a reusable benchmark to the community, our work identified 20 zero-day vulnerabilities, most of which are already acknowledged by practitioners.

DOI: 10.1109/ICSE48619.2023.00096


On Privacy Weaknesses and Vulnerabilities in Software Systems

作者: Sangaroonsilp, Pattaraporn and Dam, Hoa Khanh and Ghose, Aditya
关键词: privacy, vulnerabilities, threats, CWE, CVE, software

Abstract

In this digital era, our privacy is under constant threat as our personal data and traceable online/offline activities are frequently collected, processed and transferred by many software applications. Privacy attacks are often formed by exploiting vulnerabilities found in those software applications. The Common Weakness Enumeration (CWE) and Common Vulnerabilities and Exposures (CVE) systems are currently the main sources that software engineers rely on for understanding and preventing publicly disclosed software vulnerabilities. However, our study of all 922 weaknesses in the CWE and 156,537 vulnerabilities registered in the CVE to date has found a very small coverage of privacy-related vulnerabilities in both systems: only 4.45% in CWE and 0.1% in CVE. These also cover only a small number of the privacy threat areas that have been raised in existing privacy software engineering research, privacy regulations and frameworks, and by relevant reputable organisations. The actionable insights generated from our study led to the introduction of 11 new common privacy weaknesses to supplement the CWE system, making it a source for both security and privacy vulnerabilities.

DOI: 10.1109/ICSE48619.2023.00097


Detecting Exception Handling Bugs in C++ Programs

作者: Zhang, Hao and Luo, Ji and Hu, Mengze and Yan, Jun and Zhang, Jian and Qiu, Zongyan
关键词: static analysis, exception handling, bug finding

Abstract

Exception handling is a mechanism in modern programming languages. Studies have shown that exception handling code is error-prone. However, there is still limited research on detecting exception handling bugs, especially for C++ programs. To tackle this issue, we try to precisely represent the exception control flow in C++ programs and propose an analysis method that makes use of the control flow to detect such bugs. More specifically, we first extend the control flow graph by introducing five different kinds of basic blocks, and then modify the classic symbolic execution framework by extending the program state to a quadruple and properly processing try, throw and catch statements. Based on the above techniques, we develop a static analysis tool on top of the Clang Static Analyzer to detect exception handling bugs. We ran our tool on high-starred projects from GitHub and found 36 exception handling bugs in 8 projects, with a precision of 84%. We compared our tool with four state-of-the-art static analysis tools (Cppcheck, Clang Static Analyzer, Facebook Infer and IKOS) on projects from GitHub and handmade benchmarks. On the GitHub projects, the other tools are not able to detect any of the exception handling bugs found by our tool. On the handmade benchmarks, our tool has significantly higher recall.

DOI: 10.1109/ICSE48619.2023.00098


Learning to Boost Disjunctive Static Bug-Finders

作者: Ko, Yoonseok and Oh, Hakjoo
关键词: No keywords

Abstract

We present a new learning-based approach for accelerating disjunctive static bug-finders. Industrial static bug-finders usually perform disjunctive analysis, differentiating program states along different execution paths of a program. Such path-sensitivity is essential for reducing false positives but it also increases analysis costs exponentially. Therefore, practical bug-finders use a state-selection heuristic to keep track of a small number of beneficial states only. However, designing a good heuristic for real-world programs is challenging; as a result, modern static bug-finders still suffer from low cost/bug-finding efficiency. In this paper, we aim to address this problem by learning effective state-selection heuristics from data. To this end, we present a novel data-driven technique that efficiently collects alarm-triggering traces, learns multiple candidate models, and adaptively chooses the best model tailored for each target program. We evaluate our approach with Infer and show that our technique significantly improves Infer’s bug-finding efficiency for a range of open-source C programs.
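A learned state-selection heuristic of the kind described can be sketched as scoring each abstract analysis state with a model and keeping only the top-k. The linear model and the state features below are illustrative assumptions, not the paper's learned heuristic:

```python
def select_states(states, featurize, model_weights, k=3):
    """Sketch of a data-driven state-selection heuristic for a
    disjunctive analysis: score each state with a learned linear model
    over its features and keep only the k most beneficial states,
    bounding the exponential blow-up of path-sensitivity."""
    def score(state):
        return sum(w * f for w, f in zip(model_weights, featurize(state)))
    return sorted(states, key=score, reverse=True)[:k]
```

The approach's key difficulty, which this sketch omits, is obtaining training data: the paper collects alarm-triggering traces efficiently and then adaptively picks the best of several candidate models per target program.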

DOI: 10.1109/ICSE48619.2023.00099


Predicting Bugs by Monitoring Developers during Task Execution

作者: Laudato, Gennaro and Scalabrino, Simone and Novielli, Nicole and Lanubile, Filippo and Oliveto, Rocco
关键词: bug prediction, human aspects of software engineering, biometric sensors, empirical software engineering

Abstract

Knowing which parts of the source code will be defective can allow practitioners to better allocate testing resources. For this reason, many approaches have been proposed to achieve this goal. Most state-of-the-art predictive models rely on product and process metrics, i.e., they predict the defectiveness of a component by considering what developers did. However, there is still limited evidence of the benefits that can be achieved in this context by monitoring how developers complete a development task. In this paper, we present an empirical study in which we aim at understanding whether measuring human aspects on developers while they write code can help predict the introduction of defects. First, we introduce a new developer-based model which relies on behavioral, psychophysical, and control factors that can be measured during the execution of development tasks. Then, we run a controlled experiment involving 20 software developers to understand if our developer-based model is able to predict the introduction of bugs. Our results show that a developer-based model is able to achieve a similar accuracy compared to a state-of-the-art code-based model, i.e., a model that uses only features measured from the source code. We also observed that by combining the models it is possible to obtain the best results (~84% accuracy).

DOI: 10.1109/ICSE48619.2023.00100


Detecting Isolation Bugs via Transaction Oracle Construction

作者: Dou, Wensheng and Cui, Ziyu and Dai, Qianwang and Song, Jiansen and Wang, Dong and Gao, Yu and Wang, Wei and Wei, Jun and Chen, Lei and Wang, Hanmo and Zhong, Hua and Huang, Tao
关键词: database system, transaction, isolation, oracle

Abstract

Transactions are used to maintain the data integrity of databases, and have become an indispensable feature in modern Database Management Systems (DBMSs). Despite extensive efforts in testing DBMSs and verifying transaction processing mechanisms, isolation bugs still exist in widely-used DBMSs when these DBMSs violate their claimed transaction isolation levels. Isolation bugs can cause severe consequences, e.g., incorrect query results and database states. In this paper, we propose a novel transaction testing approach, Transaction oracle construction (Troc), to automatically detect isolation bugs in DBMSs. The core idea of Troc is to decouple a transaction into independent statements, and execute them on their own database views, which are constructed under the guidance of the claimed transaction isolation level. Any divergence between the actual transaction execution and the independent statement execution indicates an isolation bug. We implement and evaluate Troc on three widely-used DBMSs, i.e., MySQL, MariaDB, and TiDB. We have detected 5 previously-unknown isolation bugs in the latest versions of these DBMSs.

DOI: 10.1109/ICSE48619.2023.00101


SmallRace: Static Race Detection for Dynamic Languages - A Case on Smalltalk

作者: Cui, Siwei and Gao, Yifei and Unterguggenberger, Rainer and Pichler, Wilfried and Livingstone, Sean and Huang, Jeff
关键词: No keywords

Abstract

Smalltalk, one of the first object-oriented programming languages, has had a tremendous influence on the evolution of computer technology. Due to the simplicity and productivity provided by the language, Smalltalk is still in active use today by many companies with large legacy codebases and with new code written every day. A crucial problem in Smalltalk programming is the race condition. Like in any other parallel language, debugging race conditions is inherently challenging, but in Smalltalk, it is even more challenging due to its dynamic nature. Being a purely dynamically-typed language, Smalltalk allows assigning any object to any variable without type restrictions, and allows forking new threads to execute arbitrary anonymous code blocks passed as objects. In Smalltalk, race conditions can be introduced easily, but are difficult to prevent at runtime. We present SmallRace, a novel static race detection framework designed for multithreaded dynamic languages, with a focus on Smalltalk. A key component of SmallRace is SmallIR, a subset of LLVM IR, in which all variables are declared with the same type: a generic pointer i8*. This allows SmallRace to design an effective interprocedural thread-sensitive pointer analysis to infer the concrete types of dynamic variables. SmallRace automatically translates Smalltalk source code into SmallIR, supports most of the modern Smalltalk syntax in VisualWorks, and generates actionable race reports with detailed debugging information. Importantly, SmallRace has been used to analyze a production codebase in a large company with over a million lines of code, and it has found tens of complex race conditions in the production code.

DOI: 10.1109/ICSE48619.2023.00102


“STILL AROUND”: Experiences and Survival Strategies of Veteran Women Software Developers

作者: van Breukelen, Sterre and Barcomb, Ann and Baltes, Sebastian and Serebrenik, Alexander
关键词: age, gender, intersectionality, software development, interview study, qualitative research

Abstract

The intersection of ageism and sexism can create a hostile environment for veteran software developers belonging to marginalized genders. In this study, we conducted 14 interviews to examine the experiences of people at this intersection, primarily women, to discover the strategies they employed to successfully remain in the field. We identified 283 codes, which fell into three main categories: Strategies, Experiences, and Perception. Several strategies we identified, such as (Deliberately) Not Trying to Look Younger, were not previously described in the software engineering literature. We found that, in some companies, older women developers are recognized as having particular value, further strengthening the known benefits of diversity in the workforce. Based on the experiences and strategies, we suggest that organizations employing software developers consider the benefits of hiring veteran women software developers. For example, companies can draw upon the life experiences of older women developers in order to better understand the needs of customers from a similar demographic. While we recognize that many of the strategies employed by our study participants are a response to systemic issues, we still consider that, in the short-term, there is benefit in describing these strategies for developers who are experiencing such issues today.

DOI: 10.1109/ICSE48619.2023.00103


When and Why Test Generators for Deep Learning Produce Invalid Inputs: An Empirical Study

作者: Riccio, Vincenzo and Tonella, Paolo
关键词: software testing, deep learning

Abstract

Testing Deep Learning (DL) based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets. Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues of the DL systems by triggering misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not recognisable as part of the input domain, thus providing an unreliable quality assessment. Automated validators can ease the burden of manually checking the validity of inputs for human testers, although input validity is a concept difficult to formalise and, thus, automate. In this paper, we investigate to what extent TIGs can generate valid inputs, according to both automated and human validators. We conduct a large empirical study, involving 2 different automated validators, 220 human assessors, 5 different TIGs and 3 classification tasks. Our results show that 84% of the artificially generated inputs are valid, according to automated validators, but their expected label is not always preserved. Automated validators reach a good consensus with humans (78% accuracy), but still have limitations when dealing with feature-rich datasets.

DOI: 10.1109/ICSE48619.2023.00104


Fuzzing Automatic Differentiation in Deep-Learning Libraries

作者: Yang, Chenyuan and Deng, Yinlin and Yao, Jiayi and Tu, Yuxing and Li, Hanchi and Zhang, Lingming
关键词: No keywords

Abstract

Deep learning (DL) has attracted wide attention and has been widely deployed in recent years. As a result, more and more research efforts have been dedicated to testing DL libraries and frameworks. However, existing work largely overlooked one crucial component of any DL system, automatic differentiation (AD), which is the basis for the recent development of DL. To this end, we propose ∇Fuzz, the first general and practical approach specifically targeting the critical AD component in DL libraries. Our key insight is that each DL library API can be abstracted into a function processing tensors/vectors, which can be differentially tested under various execution scenarios (for computing outputs/gradients with different implementations). We have implemented ∇Fuzz as a fully automated API-level fuzzer targeting AD in DL libraries, which utilizes differential testing on different execution scenarios to test both first-order and high-order gradients, and also includes automated filtering strategies to remove false positives caused by numerical instability. We have performed an extensive study on four of the most popular and actively-maintained DL libraries, PyTorch, TensorFlow, JAX, and OneFlow. The results show that ∇Fuzz substantially outperforms state-of-the-art fuzzers in terms of both code coverage and bug detection. To date, ∇Fuzz has detected 173 bugs for the studied DL libraries, with 144 already confirmed by developers (117 of which are previously unknown bugs and 107 are related to AD). Remarkably, ∇Fuzz contributed 58.3% (7/12) of all high-priority AD bugs for PyTorch and JAX during a two-month period. None of the confirmed AD bugs were detected by existing fuzzers.
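The underlying oracle idea, checking one way of computing a gradient against another, can be sketched with a finite-difference cross-check. This is not ∇Fuzz's implementation, and the seeded bug below is hypothetical:

```python
# Differential testing of gradients (illustrative sketch of the oracle idea
# behind AD fuzzing): compare a claimed "AD" gradient against a central
# finite-difference estimate, and report inputs where they disagree.

def numeric_grad(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

def check(f, grad_f, xs, tol=1e-4):
    """Return inputs where the claimed gradient disagrees with the estimate."""
    return [x for x in xs if abs(grad_f(x) - numeric_grad(f, x)) > tol]

f = lambda x: x ** 3
good_grad = lambda x: 3 * x ** 2
buggy_grad = lambda x: 2 * x ** 2   # seeded bug: wrong power-rule coefficient

print(check(f, good_grad, [0.0, 0.5, 1.0, 2.0]))   # []
print(check(f, buggy_grad, [0.0, 0.5, 1.0, 2.0]))  # [0.5, 1.0, 2.0]
```

Note that the buggy gradient agrees at x = 0, which is why fuzzers probe many inputs, and why filtering for numerical instability matters when the tolerance is tight.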

DOI: 10.1109/ICSE48619.2023.00105


Lightweight Approaches to DNN Regression Error Reduction: An Uncertainty Alignment Perspective

作者: Li, Zenan and Zhang, Maorun and Xu, Jingwei and Yao, Yuan and Cao, Chun and Chen, Taolue and Ma, Xiaoxing and Lü
关键词: software regression, deep neural networks, uncertainty alignment, model ensemble

Abstract

Regression errors of Deep Neural Network (DNN) models refer to cases where predictions were correct under the old-version model but wrong under the new-version model. They frequently occur when upgrading DNN models in production systems, causing disproportionate user experience degradation. In this paper, we propose a lightweight regression error reduction approach with two goals: 1) requiring no model retraining, and indeed no data, and 2) not sacrificing accuracy. The proposed approach is built upon a key insight rooted in unmanaged model uncertainty, which is intrinsic to DNN models but has not been thoroughly explored, especially in the context of quality assurance of DNN models. Specifically, we propose a simple yet effective ensemble strategy that estimates and aligns the two models' uncertainty. We show that a Pareto improvement that reduces the regression errors without compromising the overall accuracy can be guaranteed in theory and largely achieved in practice. Comprehensive experiments with various representative models and datasets confirm that our approach significantly outperforms the state-of-the-art alternatives.
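The phenomenon itself is easy to pin down in code: a regression error is an index the old model got right and the new model gets wrong, even when the new model is more accurate overall. A minimal sketch with made-up predictions:

```python
# Counting DNN regression errors (a sketch of the notion studied in the
# paper): cases correct under the old model but wrong under the new one.

def regression_errors(y_true, old_pred, new_pred):
    return [i for i, (y, o, n) in enumerate(zip(y_true, old_pred, new_pred))
            if o == y and n != y]

y_true   = [0, 1, 1, 0, 1]
old_pred = [0, 1, 0, 0, 0]   # 60% accurate
new_pred = [0, 1, 1, 1, 1]   # 80% accurate, yet it regresses on index 3

print(regression_errors(y_true, old_pred, new_pred))  # [3]
```

The example shows why overall accuracy alone is a misleading upgrade criterion: the new model is strictly better in aggregate but still breaks a previously working case.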

DOI: 10.1109/ICSE48619.2023.00106


Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion

作者: Yuan, Yuanyuan and Pang, Qi and Wang, Shuai
关键词: No keywords

Abstract

Various deep neural network (DNN) coverage criteria have been proposed to assess DNN test inputs and steer input mutations. The coverage is characterized via neurons having certain outputs, or the discrepancy between neuron outputs. Nevertheless, recent research indicates that neuron coverage criteria show little correlation with test suite quality. In general, DNNs approximate distributions, by incorporating hierarchical layers, to make predictions for inputs. Thus, we advocate deducing DNN behaviors from their approximated distributions from a layer perspective. A test suite should be assessed using its induced layer output distributions. Accordingly, to fully examine DNN behaviors, input mutation should be directed toward diversifying the approximated distributions. This paper summarizes eight design requirements for DNN coverage criteria, taking into account distribution properties and practical concerns. We then propose a new criterion, NeuraL Coverage (NLC), that satisfies all design requirements. NLC treats a single DNN layer as the basic computational unit (rather than a single neuron) and captures four critical properties of neuron output distributions. Thus, NLC accurately describes how DNNs comprehend inputs via approximated distributions. We demonstrate that NLC is significantly correlated with the diversity of a test suite across a number of tasks (classification and generation) and data formats (image and text). Its capacity to discover DNN prediction errors is promising. Test input mutation guided by NLC results in greater quality and diversity of exposed erroneous behaviors.
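As a rough illustration of why a layer, rather than a neuron, is a natural unit, the toy criterion below tracks the per-dimension range of a layer's output vectors and reports whether a new input enlarges the observed region. This is an assumption-laden sketch of the layer-wise idea, not the NLC definition:

```python
# Layer-wise, distribution-aware coverage sketch: an input adds coverage when
# its layer-output vector falls outside the output region observed so far.

class LayerRangeCoverage:
    def __init__(self):
        self.lo, self.hi = None, None

    def update(self, vec):
        """Return True if `vec` enlarges the observed output region."""
        if self.lo is None:
            self.lo, self.hi = list(vec), list(vec)
            return True
        grew = False
        for i, v in enumerate(vec):
            if v < self.lo[i]: self.lo[i], grew = v, True
            if v > self.hi[i]: self.hi[i], grew = v, True
        return grew

cov = LayerRangeCoverage()
print(cov.update([0.0, 0.0]))  # True  (first observation)
print(cov.update([1.0, 1.0]))  # True  (extends both dimensions)
print(cov.update([0.5, 0.5]))  # False (inside the observed region)
```

A per-neuron criterion would score the three inputs independently per dimension; treating the whole layer vector as the unit is what lets a criterion reason about the joint output distribution.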

DOI: 10.1109/ICSE48619.2023.00107


Code Review of Build System Specifications: Prevalence, Purposes, Patterns, and Perceptions

作者: Nejati, Mahtab and Alfadel, Mahmoud and McIntosh, Shane
关键词: build systems, build specifications, code review

Abstract

Build systems automate the integration of source code into executables. Maintaining build systems is known to be challenging. Lax build maintenance can lead to costly build breakages or unexpected software behaviour. Code review is a broadly adopted practice to improve software quality. Yet, little is known about how code review is applied to build specifications. In this paper, we present the first empirical study of how code review is practiced in the context of build specifications. Through quantitative analysis of 502,931 change sets from the Qt and Eclipse communities, we observe that changes to build specifications are discussed during code review at least two times less frequently than production and test code changes. A qualitative analysis of 500 change sets reveals that (i) comments on changes to build specifications are more likely to point out defects than the rates reported in the literature for production and test code, and (ii) evolvability and dependency-related issues are the most frequently raised patterns of issues. Follow-up interviews with nine developers with 1–40 years of experience point out social and technical factors that hinder rigorous review of build specifications, such as a prevailing lack of understanding of and interest in build systems among developers, and the lack of dedicated tooling to support the code review of build specifications.

DOI: 10.1109/ICSE48619.2023.00108


Better Automatic Program Repair by Using Bug Reports and Tests Together

作者: Motwani, Manish and Brun, Yuriy
关键词: No keywords

Abstract

Automated program repair is already deployed in industry, but concerns remain about repair quality. Recent research has shown that one of the main reasons repair tools produce incorrect (but seemingly correct) patches is imperfect fault localization (FL). This paper demonstrates that combining information from natural-language bug reports and test executions when localizing faults can have a significant positive impact on repair quality. For example, existing repair tools with such FL are able to correctly repair 7 defects in the Defects4J benchmark that no prior tools have repaired correctly. We develop Blues, the first information-retrieval-based, statement-level FL technique that requires no training data. We further develop RAFL, the first unsupervised method for combining multiple FL techniques, which outperforms a supervised method. Using RAFL, we create SBIR by combining Blues with a spectrum-based (SBFL) technique. Evaluated on 815 real-world defects, SBIR consistently ranks buggy statements higher than its underlying techniques. We then modify three state-of-the-art repair tools, Arja, SequenceR, and SimFix, to use SBIR, SBFL, and Blues as their internal FL. We evaluate the quality of the produced patches on 689 real-world defects. Arja and SequenceR significantly benefit from SBIR: Arja using SBIR correctly repairs 28 defects, but only 21 using SBFL, and only 15 using Blues; SequenceR using SBIR correctly repairs 12 defects, but only 10 using SBFL, and only 4 using Blues. SimFix (which has internal mechanisms to overcome poor FL) correctly repairs 30 defects using SBIR and SBFL, but only 13 using Blues. Our work is the first investigation of simultaneously using multiple software artifacts for automated program repair, and our promising findings suggest that future research in this direction is likely to be fruitful.
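The spectrum-based side of SBIR can be illustrated with the classic Ochiai formula, which scores a statement by how strongly its coverage correlates with failing tests. This is a textbook SBFL sketch with a made-up coverage matrix, not the paper's exact configuration:

```python
# Spectrum-based fault localization (SBFL) with the Ochiai formula.
from math import sqrt

def ochiai(coverage, failing):
    """coverage: {test: set of covered statements}; failing: failing tests."""
    total_fail = len(failing)
    stmts = set().union(*coverage.values())
    scores = {}
    for s in stmts:
        ef = sum(1 for t in failing if s in coverage[t])  # covered by failing
        ep = sum(1 for t in coverage                      # covered by passing
                 if t not in failing and s in coverage[t])
        scores[s] = ef / sqrt(total_fail * (ef + ep)) if ef else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

cov = {"t1": {1, 2, 3}, "t2": {1, 3}, "t3": {1, 2}}
ranking = ochiai(cov, failing={"t1", "t2"})
print(ranking[0][0])  # statement 3 is the prime suspect
```

Statement 3 is covered by every failing test and no passing test, so it scores 1.0; statement 1 is covered by everything and scores lower, which is the intuition SBFL exploits.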

DOI: 10.1109/ICSE48619.2023.00109


CCTest: Testing and Repairing Code Completion Systems

作者: Li, Zongjie and Wang, Chaozheng and Liu, Zhibo and Wang, Haoxuan and Chen, Dong and Wang, Shuai and Gao, Cuiyun
关键词: No keywords

Abstract

Code completion, a highly valuable topic in the software development domain, has been increasingly promoted by recent advances in large language models (LLMs). To date, visible LLM-based code completion frameworks such as GitHub Copilot and GPT are trained using deep learning over vast quantities of unstructured text and open source code. As a paramount component and cornerstone of daily programming tasks, code completion has largely boosted professionals' efficiency in building real-world software systems. In contrast to this flourishing market, we find that code completion systems often output suspicious results, and to date, an automated testing and enhancement framework for code completion systems is not available. This research proposes CCTest, a framework to test and repair code completion systems in black-box settings. CCTest features a set of novel mutation strategies, namely program structure-consistent (PSC) mutations, to generate mutated code completion inputs. Then, it detects inconsistent outputs, representing possibly erroneous cases, among all the completed code cases. Moreover, CCTest repairs the code completion outputs by selecting the output that best reflects the “average” appearance of all output cases as the final output of the code completion system. With around 18K test inputs, we detected 33,540 inputs that can trigger erroneous cases (with a true positive rate of 86%) from eight popular LLM-based code completion systems. With repairing, we show that the accuracy of code completion systems is notably increased by 40% and 67% with respect to BLEU score and Levenshtein edit similarity.
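The "average appearance" repair step can be approximated by a medoid: among the completions produced for the mutated variants of an input, pick the candidate with the smallest total edit distance to all the others. The sketch below is illustrative, not CCTest's exact procedure:

```python
# Medoid-style repair sketch: choose the completion closest (in total
# Levenshtein distance) to all other candidate completions.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[-1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def repair(candidates):
    return min(candidates,
               key=lambda c: sum(levenshtein(c, o) for o in candidates))

outs = ["return a + b", "return a + b", "return a - b", "return a + b;"]
print(repair(outs))  # "return a + b"
```

The two outliers each sit at distance >= 1 from the majority cluster, so the common completion wins; the same idea also flags the outliers as the "inconsistent outputs" worth reporting.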

DOI: 10.1109/ICSE48619.2023.00110


KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair

作者: Jiang, Nan and Lutellier, Thibaud and Lou, Yiling and Tan, Lin and Goldwasser, Dan and Zhang, Xiangyu
关键词: automated program repair, abstract syntax tree, deep learning

Abstract

Automated Program Repair (APR) improves software reliability by generating patches for a buggy program automatically. Recent APR techniques leverage deep learning (DL) to build models that learn to generate patches from existing patches and code corpora. While promising, DL-based APR techniques suffer from the abundant syntactically or semantically incorrect patches in the patch space. These patches often disobey the syntactic and semantic domain knowledge of source code and thus cannot be correct patches that fix a bug. We propose a DL-based APR approach, KNOD, which incorporates domain knowledge to guide patch generation in a direct and comprehensive way. KNOD has two major novelties: (1) a novel three-stage tree decoder, which directly generates Abstract Syntax Trees of patched code according to the inherent tree structure, and (2) a novel domain-rule distillation, which leverages syntactic and semantic rules and teacher-student distributions to explicitly inject the domain knowledge into the decoding procedure during both the training and inference phases. We evaluate KNOD on three widely-used benchmarks. KNOD fixes 72 bugs on Defects4J v1.2, 25 bugs on QuixBugs, and 50 bugs on the additional Defects4J v2.0 benchmark, outperforming all existing APR tools.

DOI: 10.1109/ICSE48619.2023.00111


Rete: Learning Namespace Representation for Program Repair

作者: Parasaram, Nikhil and Barr, Earl T. and Mechtaev, Sergey
关键词: program repair, deep learning, patch prioritisation, variable representation

Abstract

A key challenge of automated program repair is finding correct patches in the vast search space of candidate patches. Real-world programs define large namespaces of variables that considerably contribute to the search space explosion. Existing program repair approaches neglect information about the program namespace, which makes them inefficient and increases the chance of test-overfitting. We propose Rete, a new program repair technique that learns project-independent information about program namespaces and uses it to navigate the search space of patches. Rete uses a neural network to extract project-independent information about variable CDU chains (def-use chains augmented with control flow). Then, it ranks patches by jointly ranking variables and the patch templates into which the variables are inserted. We evaluated Rete on 142 bugs extracted from two datasets, ManyBugs and BugsInPy. Our experiments demonstrate that Rete generates six new correct patches that fix bugs that previous tools did not repair, an improvement of 31% and 59% over the existing state of the art.

DOI: 10.1109/ICSE48619.2023.00112


AI-Based Question Answering Assistance for Analyzing Natural-Language Requirements

作者: Ezzini, Saad and Abualhaija, Sallam and Arora, Chetan and Sabetzadeh, Mehrdad
关键词: natural-language requirements, question answering (QA), language models, natural language processing (NLP), natural language generation (NLG), BERT, T5

Abstract

By virtue of being prevalently written in natural language (NL), requirements are prone to various defects, e.g., inconsistency and incompleteness. As such, requirements are frequently subject to quality assurance processes. These processes, when carried out entirely manually, are tedious and may further overlook important quality issues due to time and budget pressures. In this paper, we propose QAssist - a question-answering (QA) approach that provides automated assistance to stakeholders, including requirements engineers, during the analysis of NL requirements. Posing a question and getting an instant answer is beneficial in various quality-assurance scenarios, e.g., incompleteness detection. Answering requirements-related questions automatically is challenging since the scope of the search for answers can go beyond the given requirements specification. To that end, QAssist provides support for mining external domain-knowledge resources. Our work is one of the first initiatives to bring together QA and external domain knowledge for addressing requirements engineering challenges. We evaluate QAssist on a dataset covering three application domains and containing a total of 387 question-answer pairs. We experiment with state-of-the-art QA methods, based primarily on recent large-scale language models. In our empirical study, QAssist localizes the answer to a question to three passages within the requirements specification and within the external domain-knowledge resource with an average recall of 90.1% and 96.5%, respectively. QAssist extracts the actual answer to the posed question with an average accuracy of 84.2%.

DOI: 10.1109/ICSE48619.2023.00113


Strategies, Benefits and Challenges of App Store-Inspired Requirements Elicitation

作者: Ferrari, Alessio and Spoletini, Paola
关键词: No keywords

Abstract

App store-inspired elicitation is the practice of exploring competitors’ apps, to get inspiration for requirements. This activity is common among developers, but little insight is available on its practical use, advantages and possible issues. This paper aims to empirically analyse this technique in a realistic scenario, in which it is used to extend the requirements of a product that were initially captured by means of more traditional requirements elicitation interviews. Considering this scenario, we conduct an experimental simulation with 58 analysts and collect qualitative data. We perform thematic analysis of the data to identify strategies, benefits, and challenges of app store-inspired elicitation, as well as differences with respect to interviews in the considered elicitation setting. Our results show that: (1) specific guidelines and procedures are required to better conduct app store-inspired elicitation; (2) current search features made available by app stores are not suitable for this practice, and more tool support is required to help analysts in the retrieval and evaluation of competing products; (3) while interviews focus on the why dimension of requirements engineering (i.e., goals), app store-inspired elicitation focuses on how (i.e., solutions), offering indications for implementation and improved usability. Our study provides a framework for researchers to address existing challenges and suggests possible benefits to fostering app store-inspired elicitation among practitioners.

DOI: 10.1109/ICSE48619.2023.00114


Data-Driven Recurrent Set Learning for Non-termination Analysis

作者: Han, Zhilei and He, Fei
关键词: program termination, recurrent set, data-driven approach, black-box learning

Abstract

Termination is a fundamental liveness property for program verification. In this paper, we revisit the problem of non-termination analysis and propose the first data-driven learning algorithm for synthesizing recurrent sets, in which the non-terminating samples are effectively speculated by a novel method. To ensure convergence of learning, we develop a learning algorithm that is guaranteed to converge to a valid recurrent set if one exists, and thus establish its relative completeness. The methods are implemented in a prototype tool, and experimental results on public benchmarks show its efficacy in proving non-termination, as it outperforms state-of-the-art tools both in terms of cases solved and performance. Evaluation on nonlinear programs also demonstrates its ability to handle complex programs.
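The certificate being synthesized can be shown on a textbook loop: a recurrent set is a set of states from which the loop guard always holds and one iteration stays inside the set, so entering it witnesses non-termination. Below is a hedged sketch that validates such a candidate by sampling, not the paper's learning algorithm:

```python
# Recurrent-set validation by sampling, for the loop `while x != 0: x -= 2`.
# Candidate recurrent set R: the odd integers. Once x is odd, the guard
# always holds (an odd x is never 0) and x - 2 is odd again.

guard = lambda x: x != 0
step  = lambda x: x - 2
in_R  = lambda x: x % 2 == 1          # candidate recurrent set: odd x

def check_recurrent(samples):
    """Every sampled state in R must satisfy the guard and step back into R."""
    return all(guard(x) and in_R(step(x)) for x in samples if in_R(x))

print(check_recurrent(range(-50, 51)))  # True: odd states witness non-termination
```

Sampling only refutes bad candidates; a sound tool must still prove the two inclusion conditions for all states in R, which is where the learning-with-convergence guarantee comes in.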

DOI: 10.1109/ICSE48619.2023.00115


Compiling Parallel Symbolic Execution with Continuations

作者: Wei, Guannan and Jia, Songlin and Gao, Ruiqi and Deng, Haotian and Tan, Shangyin and Brač
关键词: symbolic execution, compiler, code generation, metaprogramming, continuation

Abstract

Symbolic execution is a powerful program analysis and testing technique. Symbolic execution engines are usually implemented as interpreters, and the induced interpretation overhead can dramatically inhibit performance. Alternatively, implementation choices based on instrumentation provide a limited ability to transform programs. However, the use of compilation and code generation techniques beyond simple instrumentation remains underexplored for engine construction, leaving potential performance gains untapped. In this paper, we show how to tap some of these gains using sophisticated compilation techniques: We present GenSym, an optimizing symbolic-execution compiler that generates symbolic code which explores paths and generates tests in parallel. The key insight of GenSym is to compile symbolic execution tasks into cooperative concurrency via continuation-passing style, which further enables efficient parallelism. The design and implementation of GenSym is based on partial evaluation and generative programming techniques, which make it high-level and performant at the same time. We compare the performance of GenSym against the prior symbolic-execution compiler LLSC and the state-of-the-art symbolic interpreter KLEE. The results show an average 4.6× speedup.
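The continuation-passing idea can be sketched in miniature (Python here, purely illustrative): a branch on a symbolic condition schedules two continuations on a worklist instead of recursing, so path exploration becomes a pool of cooperating tasks that a scheduler is free to run in parallel. All names below are assumptions for the sketch:

```python
# Toy CPS path exploration: branches fork by pushing continuations onto a
# worklist, each carrying its accumulated path condition.

def explore(program):
    worklist, paths = [(program, [])], []
    while worklist:
        k, pc = worklist.pop()
        k(pc, worklist, paths)
    return paths

def branch(cond, then_k, else_k):
    """Fork: schedule both continuations with extended path conditions."""
    def k(pc, worklist, paths):
        worklist.append((then_k, pc + [cond]))
        worklist.append((else_k, pc + ["not " + cond]))
    return k

def done(result):
    def k(pc, worklist, paths):
        paths.append((pc, result))
    return k

# if (x < 0) return -x; else return x;   -- with x symbolic
prog = branch("x < 0", done("-x"), done("x"))
for pc, res in sorted(explore(prog)):
    print(pc, "=>", res)
```

Because every pending path is just a first-class continuation plus a path condition, distributing the worklist across threads requires no interpreter state to be shared, which is the property the compilation scheme exploits.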

DOI: 10.1109/ICSE48619.2023.00116


Verifying Data Constraint Equivalence in FinTech Systems

作者: Wang, Chengpeng and Fan, Gang and Yao, Peisen and Pan, Fuxiong and Zhang, Charles
关键词: equivalence verification, data constraints, fin-tech systems

Abstract

Data constraints are widely used in FinTech systems for monitoring data consistency and diagnosing anomalous data manipulations. However, many equivalent data constraints are created redundantly during the development cycle, slowing down the FinTech systems and causing unnecessary alerts. We present EqDAC, an efficient decision procedure to determine data constraint equivalence. We first propose a symbolic representation for semantic encoding and then introduce two lightweight analyses to refute and prove equivalence, respectively, both of which provably run in polynomial time. We evaluate EqDAC on 30,801 data constraints in a FinTech system. It is shown that EqDAC detects 11,538 equivalent data constraints in three hours. It also supports efficient equivalence searching with an average time cost of 1.22 seconds, enabling the system to check new data constraints upon submission.
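The two-phase structure (a cheap refuter before an expensive prover) can be sketched with random testing as the refuter. The constraints below are hypothetical, and the symbolic encoding itself is not reproduced:

```python
# Cheap refutation phase, sketched: random rows quickly separate
# inequivalent data constraints; only survivors need a proving phase.
import random

def refute(c1, c2, trials=1000, seed=0):
    """Return a counterexample row if the constraints differ, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        row = {"amount": rng.randint(-10, 10), "fee": rng.randint(0, 10)}
        if c1(row) != c2(row):
            return row
    return None

c_a = lambda r: r["amount"] - r["fee"] >= 0
c_b = lambda r: r["amount"] >= r["fee"]   # equivalent rewriting of c_a
c_c = lambda r: r["amount"] > r["fee"]    # subtly different (strict)

print(refute(c_a, c_b))          # None: no counterexample found
print(refute(c_a, c_c) is None)  # False: rows with amount == fee separate them
```

Failing to refute is of course not a proof; the point of the split is that most redundant-constraint queries are settled by one of the two cheap analyses before any heavyweight reasoning is needed.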

DOI: 10.1109/ICSE48619.2023.00117


Tolerate Control-Flow Changes for Sound Data Race Prediction

作者: Zhu, Shihao and Guo, Yuqi and Zhang, Long and Cai, Yan
关键词: concurrency bugs, data races, control flow, static information

Abstract

Data races seriously threaten the correctness of concurrent programs. Earlier works can report false positives. Recently, trace-based predictive analysis has achieved sound results by inferring feasible traces based on sound partial orders or constraint solvers. However, these approaches share the same assumption: any read event may affect the control flow of a predicted trace. Thus, being control-flow sensitive, they have to force every read event (in an inferred trace) to read either the same value or a value from the same event as in the original trace, although some slightly relax this requirement. This (even with relaxation) severely limits their predictive ability, and many true data races can be missed. We introduce the concept of Fix-Point Event and propose a new partial order model. This allows us not only to predict races with witness traces (like existing works, with no control-flow changes) but also to soundly infer the existence of witness traces with potential control-flow changes. Thus, we achieve higher concurrency coverage and soundly detect more data races. We have implemented the above as a tool, ToccRace, and conducted experiments on a benchmark of seven real-world programs and the large-scale software MySQL, where MySQL produced 427 traces with a total size of 3.4TB. Compared with the state-of-the-art sound data race detector SeqCheck, ToccRace is significantly more effective, detecting 84.4%/200% more unique/dynamic races on the benchmark programs and 52.22%/49.8% more unique/dynamic races on MySQL, while incurring reasonable time and memory costs (about 1.1x/43.5x on the benchmark programs and 10x/1.03x on MySQL). Furthermore, ToccRace is sound, and is complete for two threads.

DOI: 10.1109/ICSE48619.2023.00118


Fill in the Blank: Context-Aware Automated Text Input Generation for Mobile GUI Testing

作者: Liu, Zhe and Chen, Chunyang and Wang, Junjie and Che, Xing and Huang, Yuekai and Hu, Jun and Wang, Qing
关键词: text input generation, GUI testing, android app, large language model, prompt-tuning

Abstract

Automated GUI testing is widely used to help ensure the quality of mobile apps. However, many GUIs require appropriate text inputs to proceed to the next page, which remains a prominent obstacle to testing coverage. Considering the diversity and semantic requirements of valid inputs (e.g., flight departure, movie name), it is challenging to automate text input generation. Inspired by the fact that pre-trained Large Language Models (LLMs) have made outstanding progress in text generation, we propose an approach named QTypist, based on LLMs, for intelligently generating semantic input text according to the GUI context. To boost the performance of LLMs in the mobile testing scenario, we develop a prompt-based data construction and tuning method which automatically extracts the prompts and answers for model tuning. We evaluate QTypist on 106 apps from Google Play, and the results show that the passing rate of QTypist is 87%, which is 93% higher than the best baseline. We also integrate QTypist with automated GUI testing tools; it can cover 42% more app activities and 52% more pages, and subsequently helps reveal 122% more bugs compared with the raw tools.
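The prompt-construction idea (turning nearby GUI context into a fill-in-the-blank query for the model) can be sketched with plain string templates. The wording and field names below are hypothetical, not QTypist's actual template or tuning pipeline:

```python
# Illustrative sketch: assemble a fill-in-the-blank prompt from GUI context
# (app name, page title, field label, optional hint text).

def build_prompt(app_name, page_title, field_label, hint=""):
    parts = [f'In the app "{app_name}", on the "{page_title}" page,']
    if hint:
        parts.append(f'the input field (hint: "{hint}")')
    else:
        parts.append("the input field")
    parts.append(f'labelled "{field_label}" should be filled with: ___')
    return " ".join(parts)

prompt = build_prompt("SkyBooker", "Search Flights", "Departure city",
                      hint="e.g. London")
print(prompt)
```

The blank at the end is what the language model completes; the surrounding widget text is what gives the completion its semantics (a city name here, rather than arbitrary characters).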

DOI: 10.1109/ICSE48619.2023.00119


作者: Chiou, Paul T. and Alotaibi, Ali S. and Halfond, William G. J.
关键词: No keywords

Abstract

The ability to navigate the Web via the keyboard interface is critical to people with various types of disabilities. However, modern websites often violate web accessibility guidelines for keyboard navigability with respect to web dialogs. In this paper, we present a novel approach for automatically detecting web accessibility bugs that prevent or hinder keyboard users’ ability to navigate dialogs in web pages. An extensive evaluation of our technique on real-world subjects showed that our technique is effective in detecting these dialog-related keyboard navigation failures.

DOI: 10.1109/ICSE48619.2023.00120


Columbus: Android App Testing through Systematic Callback Exploration

作者: Bose, Priyanka and Das, Dipanjan and Vasan, Saastha and Mariani, Sebastiano and Grishchenko, Ilya and Continella, Andrea and Bianchi, Antonio and Kruegel, Christopher and Vigna, Giovanni
关键词: No keywords

Abstract

With the continuous rise in the popularity of Android mobile devices, automated testing of apps has become more important than ever. Android apps are event-driven programs. Unfortunately, generating all possible types of events by interacting with an app's interface is challenging for an automated testing approach. Callback-driven testing eliminates the need for event generation by directly invoking app callbacks. However, existing callback-driven testing techniques assume prior knowledge of Android callbacks, and they rely on a human expert, who is familiar with the Android API, to write stub code that prepares callback arguments before invocation. Since the Android API is very large and keeps evolving, prior techniques could only support a small fraction of the callbacks present in the Android framework. In this work, we introduce Columbus, a callback-driven testing technique that employs two strategies to eliminate the need for human involvement: (i) it automatically identifies callbacks by simultaneously analyzing both the Android framework and the app under test; (ii) it uses a combination of under-constrained symbolic execution (primitive arguments) and type-guided dynamic heap introspection (object arguments) to generate valid and effective inputs. Lastly, Columbus integrates two novel feedback mechanisms—data dependency and crash-guidance—during testing to increase the likelihood of triggering crashes and maximize coverage. In our evaluation, Columbus outperforms state-of-the-art model-driven, checkpoint-based, and callback-driven testing tools both in terms of crashes and coverage.
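The object-argument strategy can be mimicked in Python with the gc module: rather than constructing an argument from scratch, scavenge a live instance of the required type from the heap. This is a loose analogue of type-guided heap introspection; the Session type and callback below are hypothetical, not Columbus code:

```python
# Type-guided heap introspection, sketched: find a live instance of the type
# a callback needs, then invoke the callback with it.
import gc

class Session:                      # hypothetical app-defined type
    def __init__(self, user):
        self.user = user

def find_instance(cls):
    """Return some live instance of `cls` from the heap, if one exists."""
    for obj in gc.get_objects():
        if type(obj) is cls:
            return obj
    return None

def on_resume(session):             # hypothetical callback under test
    return f"resumed for {session.user}"

live = Session("alice")             # created elsewhere by the running app
arg = find_instance(Session)
print(on_resume(arg))               # "resumed for alice"
```

The appeal of scavenging over construction is that a harvested object carries realistic internal state set up by the app itself, which synthetic arguments rarely have.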

DOI: 10.1109/ICSE48619.2023.00121


GameRTS: A Regression Testing Framework for Video Games

作者: Yu, Jiongchi and Wu, Yuechen and Xie, Xiaofei and Le, Wei and Ma, Lei and Chen, Yingfeng and Hu, Jingyu and Zhang, Fan
关键词: No keywords

Abstract

Continuous game quality assurance is of great importance to satisfy the increasing demands of users. To respond to game issues reported by users in a timely manner, game companies often create and maintain a large number of releases, updates, and tweaks in a short time. Regression testing is an essential technique adopted to detect regression issues during the evolution of the game software. However, due to the special characteristics of game software (e.g., frequent updates and long-running tests), traditional regression testing techniques are not directly applicable. To bridge this gap, in this paper, we perform an early exploratory study to investigate the challenges in regression testing of video games. We first performed empirical studies to better understand the game development process, bugs introduced during game evolution, and context sensitivity. Based on the results of the study, we proposed the first regression test selection (RTS) technique for game software, which strikes a compromise between safety and practicality. In particular, we model the test suite of game software as a State Transition Graph (STG) and then perform RTS on the STG. We establish the dependencies between the states/actions of the STG and game files, including game art resources, game design files, and source code, and perform change impact analysis to identify the states/actions (in the STG) that potentially execute such changes. We implemented our framework in a tool named GameRTS and evaluated its usefulness on 10 tasks of a large-scale commercial game, covering a total of 1,429 commits over three versions. The experimental results demonstrate the usefulness and effectiveness of GameRTS in game RTS. For most tasks, GameRTS selected only one trace from the STG, which can significantly reduce testing time. Furthermore, GameRTS detected all the regression bugs from the test evaluation suites. Compared with file-level RTS, GameRTS selected fewer states/actions/traces (13.77%, 23.97%, and 6.85%, respectively). In addition, GameRTS identified 2 new critical regression bugs in the game.
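The selection step can be sketched with a hypothetical action-to-file dependency map: a trace is selected only if some action on it depends on a changed file. This is a toy version of the STG-based change impact analysis, not GameRTS itself:

```python
# Change-impact trace selection on a state transition graph (sketch).

deps = {                      # action -> game files it depends on (hypothetical)
    "open_shop":  {"ui/shop.art", "src/shop.py"},
    "buy_item":   {"src/shop.py", "design/items.json"},
    "start_race": {"src/race.py"},
}
traces = [
    ["open_shop", "buy_item"],
    ["start_race"],
]

def select(traces, changed):
    """Keep only traces that execute at least one changed file."""
    return [t for t in traces
            if any(deps[a] & changed for a in t)]

print(select(traces, changed={"design/items.json"}))  # [['open_shop', 'buy_item']]
```

Changing an item-design file selects only the shop trace and skips the unrelated race trace, which is where the reduction in long-running test time comes from.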

DOI: 10.1109/ICSE48619.2023.00122


Autonomy Is an Acquired Taste: Exploring Developer Preferences for GitHub Bots

作者: Ghorbani, Amir and Cassee, Nathan and Robinson, Derek and Alami, Adam and Ernst, Neil A. and Serebrenik, Alexander and Wą
关键词: software bot, pull request, human aspects

Abstract

Software bots fulfill an important role in collective software development, and their adoption by developers promises increased productivity. Past research has identified that bots that communicate too often can irritate developers, which affects the utility of the bot. However, it is not clear what other properties of human-bot collaboration affect developers' preferences, or what impact these properties might have. The main idea of this paper is to explore characteristics affecting developer preferences for interactions between humans and bots, in the context of GitHub pull requests. We carried out an exploratory sequential study with interviews and a subsequent vignette-based survey. We find that developers generally prefer bots that are personable but show little autonomy; however, more experienced developers tend to prefer more autonomous bots. Based on this empirical evidence, we recommend that bot developers increase configuration options for bots so that individual developers and projects can configure bots to best align with their own preferences and project cultures.

DOI: 10.1109/ICSE48619.2023.00123


Flexible and Optimal Dependency Management via Max-SMT

作者: Pinckney, Donald and Cassano, Federico and Guha, Arjun and Bell, Jonathan and Culpo, Massimiliano and Gamblin, Todd
关键词: package-management, max-SMT, NPM, rosette, dependency-management, JavaScript

Abstract

Package managers such as NPM have become essential for software development. The NPM repository hosts over 2 million packages and serves over 43 billion downloads every week. Unfortunately, the NPM dependency solver has several shortcomings. 1) NPM is greedy and often fails to install the newest versions of dependencies; 2) NPM’s algorithm leads to duplicated dependencies and bloated code, which is particularly bad for web applications that need to minimize code size; 3) NPM’s vulnerability fixing algorithm is also greedy, and can even introduce new vulnerabilities; and 4) NPM’s ability to duplicate dependencies can break stateful frameworks and requires a lot of care to work around. Although existing tools try to address these problems, they are either brittle, rely on post hoc changes to the dependency tree, do not guarantee optimality, or are not composable. We present PacSolve, a unifying framework and implementation for dependency solving which allows for customizable constraints and optimization goals. We use PacSolve to build MaxNPM, a complete, drop-in replacement for NPM, which empowers developers to combine multiple objectives when installing dependencies. We evaluate MaxNPM with a large sample of packages from the NPM ecosystem and show that it can: 1) reduce more vulnerabilities in dependencies than NPM’s auditing tool in 33% of cases; 2) choose newer dependencies than NPM in 14% of cases; and 3) choose fewer dependencies than NPM in 21% of cases. All our code and data are open and available.
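A drastically simplified sketch of the core idea, treating version selection as an optimization problem with ordered objectives (PacSolve uses Max-SMT; brute force stands in here, and the packages, versions, and vulnerability counts are all invented):

```python
# Hypothetical sketch: pick one version per package, minimizing known
# vulnerabilities first and preferring newer versions second - a
# lexicographic objective like the ones MaxNPM lets developers combine.
from itertools import product

# candidate versions per package: (version, known_vulnerabilities)
candidates = {
    "left-pad": [(1, 2), (2, 0)],
    "lodash":   [(3, 1), (4, 1), (5, 0)],
}

def solve(candidates):
    best = None
    for combo in product(*candidates.values()):
        vulns = sum(v for _, v in combo)
        freshness = sum(ver for ver, _ in combo)
        # lexicographic objective: fewest vulnerabilities, then newest versions
        key = (vulns, -freshness)
        if best is None or key < best[0]:
            best = (key, dict(zip(candidates, [ver for ver, _ in combo])))
    return best[1]

print(solve(candidates))  # -> {'left-pad': 2, 'lodash': 5}
```

A real solver replaces the brute-force loop with Max-SMT so the same objectives scale to thousands of packages and version constraints.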

DOI: 10.1109/ICSE48619.2023.00124


Impact of Code Language Models on Automated Program Repair

作者: Jiang, Nan and Liu, Kevin and Lutellier, Thibaud and Tan, Lin
关键词: automated program repair, code language model, fine-tuning, deep learning

Abstract

Automated program repair (APR) aims to help developers improve software reliability by generating patches for buggy programs. Although many code language models (CLM) are developed and effective in many software tasks such as code completion, there has been little comprehensive, in-depth work to evaluate CLMs’ fixing capabilities and to fine-tune CLMs for the APR task. Firstly, this work is the first to evaluate ten CLMs on four APR benchmarks, which shows that, surprisingly, the best CLM, as is, fixes 72% more bugs than the state-of-the-art deep-learning (DL)-based APR techniques. Secondly, one of the four APR benchmarks was created by us in this paper to avoid data leakage for a fair evaluation. Thirdly, it is the first work to fine-tune CLMs with APR training data, which shows that fine-tuning brings 31%–1,267% improvement to CLMs and enables them to fix 46%–164% more bugs than existing DL-based APR techniques. Fourthly, this work studies the impact of buggy lines, showing that CLMs, as is, cannot make good use of the buggy lines to fix bugs, yet fine-tuned CLMs could potentially over-rely on buggy lines. Lastly, this work analyzes the size, time, and memory efficiency of different CLMs. This work shows promising directions for the APR domain, such as fine-tuning CLMs with APR-specific designs, and also raises awareness of fair and comprehensive evaluations of CLMs and calls for more transparent reporting of the open-source repositories used in the pre-training data to address the data leakage problem.

DOI: 10.1109/ICSE48619.2023.00125


Tare: Type-Aware Neural Program Repair

作者: Zhu, Qihao and Sun, Zeyu and Zhang, Wenjie and Xiong, Yingfei and Zhang, Lu
关键词: program repair, neural networks

Abstract

Automated program repair (APR) aims to reduce the effort of software development. With the development of deep learning, lots of DL-based APR approaches have been proposed using an encoder-decoder architecture. Despite the promising performance, these models share the same limitation: generating lots of untypable patches. The main reason for this phenomenon is that the existing models do not consider the constraints of code captured by a set of typing rules. In this paper, we propose Tare, a type-aware model for neural program repair that learns the typing rules. To encode an individual typing rule, we introduce three novel components: (1) a novel type of grammar, T-Grammar, that integrates the type information into a standard grammar, (2) a novel representation of code, T-Graph, that integrates the key information needed for type-checking an AST, and (3) a novel type-aware neural program repair approach, Tare, that encodes the T-Graph and generates patches guided by the T-Grammar. The experiments were conducted on three benchmarks: 393 bugs from Defects4J v1.2, 444 additional bugs from Defects4J v2.0, and 40 bugs from QuixBugs. Our results show that Tare repairs 62, 32, and 27 bugs on these benchmarks, respectively, and outperforms the existing APR approaches on all benchmarks. Further analysis also shows that, with the typing rule information, Tare tends to generate more compilable patches than the existing DL-based APR approaches.

DOI: 10.1109/ICSE48619.2023.00126


Template-Based Neural Program Repair

作者: Meng, Xiangxin and Wang, Xu and Zhang, Hongyu and Sun, Hailong and Liu, Xudong and Hu, Chunming
关键词: automated program repair, fix templates, neural machine translation, deep learning

Abstract

In recent years, template-based and NMT-based automated program repair methods have been widely studied and achieved promising results. However, there are still disadvantages in both methods. The template-based methods cannot fix the bugs whose types are beyond the capabilities of the templates and only use the syntax information to guide the patch synthesis, while the NMT-based methods tend to generate only a small range of fixed code for better performance and may suffer from the OOV (out-of-vocabulary) problem. To solve these problems, we propose a novel template-based neural program repair approach called TENURE that combines the template-based and NMT-based methods. First, we build two large-scale datasets for 35 fix templates from the template-based methods and one special fix template (single-line code generation) from the NMT-based methods, respectively. Second, encoder-decoder models are adopted to learn deep semantic features for generating patch intermediate representations (IRs) for different templates. An optimized copy mechanism is also used to alleviate the OOV problem. Third, based on the combined patch IRs for different templates, three tools are developed to recover real patches from the patch IRs, replace the unknown tokens, and filter out patch candidates with compilation errors by leveraging project-specific information. On Defects4J-v1.2, TENURE fixes 79 bugs and 52 bugs with perfect and Ochiai fault localization, respectively. It also repairs 50 and 32 bugs on Defects4J-v2.0. Compared with the existing template-based and NMT-based studies, TENURE achieves the best performance in all experiments.
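A toy sketch of the template idea: a fix template is a patch intermediate representation with holes, and a recovery step fills the holes with project-specific tokens (the template and tokens below are invented; TENURE learns the IRs with encoder-decoder models):

```python
# Hypothetical illustration of template-based patch recovery: a fix
# template (here, a null-check wrapper for Java-like code) is a string IR
# with named holes, filled with tokens from the buggy project.

NULL_CHECK_TEMPLATE = "if ({expr} != null) {{ {stmt} }}"

def recover_patch(template, holes):
    """Fill a template IR's holes to obtain a concrete patch."""
    return template.format(**holes)

patch = recover_patch(NULL_CHECK_TEMPLATE, {"expr": "user", "stmt": "user.save();"})
print(patch)  # -> if (user != null) { user.save(); }
```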

DOI: 10.1109/ICSE48619.2023.00127


Automated Repair of Programs from Large Language Models

作者: Fan, Zhiyu and Gao, Xiang and Mirchev, Martin and Roychoudhury, Abhik and Tan, Shin Hwei
关键词: No keywords

Abstract

Large language models such as Codex have shown the capability to produce code for many programming tasks. However, the success rate of existing models is low, especially for complex programming tasks. One of the reasons is that language models lack awareness of program semantics, resulting in incorrect programs, or even programs which do not compile. In this paper, we systematically study whether automated program repair (APR) techniques can fix the incorrect solutions produced by language models in LeetCode contests. The goal is to study whether APR techniques can enhance reliability in the code produced by large language models. Our study revealed that: (1) automatically generated code shares common programming mistakes with human-crafted solutions, indicating APR techniques may have potential to fix auto-generated code; (2) given bug location information provided by a statistical fault localization approach, the newly released Codex edit mode, which supports editing code, is similar to or better than the existing Java repair tools TBar and Recoder in fixing incorrect solutions. By analyzing the experimental results generated by these tools, we provide several suggestions: (1) enhancing APR tools to surpass limitations in patch space (e.g., by introducing more flexible fault localization) is desirable; (2) as large language models can derive more fix patterns by training on more data, future APR tools could shift focus from adding more fix patterns to synthesis/semantics-based approaches; and (3) combining language models with APR to curate patch ingredients is worth studying.

DOI: 10.1109/ICSE48619.2023.00128


Automated Program Repair in the Era of Large Pre-Trained Language Models

作者: Xia, Chunqiu Steven and Wei, Yuxiang and Zhang, Lingming
关键词: No keywords

Abstract

Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates (traditional) or directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially help avoid this issue. Very recently, researchers have directly leveraged LLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on the important APR problem is yet to be revealed. In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent state-of-the-art LLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use LLMs to generate patches: 1) generate the entire patch function, 2) fill in a chunk of code given the prefix and suffix, and 3) output a single-line fix. We apply the LLMs under these repair settings on 5 datasets across 3 different languages and compare the LLMs in the number of bugs fixed, generation speed, and compilation rate. We also compare the LLMs against recent state-of-the-art APR tools. Our study demonstrates that directly applying state-of-the-art LLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied LLMs, the scaling effect exists for APR: larger models tend to achieve better performance. Also, we show for the first time that the suffix code after the buggy line (adopted in infilling-style APR) is important for generating not only more fixes but also more patches with a higher compilation rate.
Besides patch generation, the LLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking or patch correctness checking. Lastly, we show that LLM-based APR can be further substantially boosted via: 1) increasing the sample size, and 2) incorporating fix template information.
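The three repair settings can be pictured as three prompt shapes for a code LLM (the sentinel token and helper names below are ours, for illustration only; real models use their own infilling tokens):

```python
# Hypothetical sketch of the study's three repair settings as prompt
# construction (no model is called here).

def make_prompts(prefix, buggy_line, suffix, signature):
    return {
        # 1) regenerate the whole patched function from its signature
        "function": signature + "\n",
        # 2) infill the buggy chunk given surrounding prefix and suffix
        "infill": prefix + "<FILL_ME>" + suffix,
        # 3) emit a single replacement line for the marked buggy line
        "line": prefix + "// bug: " + buggy_line.strip() + "\n",
    }

prompts = make_prompts(
    prefix="int abs(int x) {\n",
    buggy_line="  return x;\n",
    suffix="}\n",
    signature="int abs(int x) {",
)
print(sorted(prompts))  # -> ['function', 'infill', 'line']
```

The infilling shape is the one the study finds benefits most from the suffix code after the buggy line.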

DOI: 10.1109/ICSE48619.2023.00129


Faster or Slower? Performance Mystery of Python Idioms Unveiled with Empirical Evidence

作者: Zhang, Zejun and Xing, Zhenchang and Xia, Xin and Xu, Xiwei and Zhu, Liming and Lu, Qinghua
关键词: No keywords

Abstract

The usage of Python idioms is popular among Python developers. In a formative study of 101 Python-idiom performance-related questions on Stack Overflow, we find that developers often get confused about the performance impact of Python idioms and use anecdotal toy code or rely on personal project experience, which is often contradictory in performance outcomes. There has been no large-scale, systematic empirical evidence to reconcile these performance debates. In this paper, we create a large synthetic dataset with 24,126 pairs of non-idiomatic and functionally-equivalent idiomatic code for the nine unique Python idioms identified in [1], and reuse a large real-project dataset of 54,879 such code pairs provided in [1]. We develop a reliable performance measurement method to compare the speedup or slowdown of idiomatic code against its non-idiomatic counterpart, and analyze the performance discrepancies between the synthetic and real-project code, the relationships between code features and performance changes, and the root causes of performance changes at the bytecode level. We summarize our findings as actionable suggestions for using Python idioms.
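A minimal version of the kind of measurement the paper scales up, comparing a non-idiomatic loop against its functionally equivalent list comprehension (one of the idiom families studied; sizes and repetition counts here are arbitrary):

```python
# Time a non-idiomatic loop vs. the idiomatic list comprehension.
# The two must be functionally equivalent before any timing comparison
# is meaningful; the observed speedup is machine-dependent.
import timeit

def non_idiomatic(n):
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def idiomatic(n):
    return [i * i for i in range(n)]

assert non_idiomatic(1000) == idiomatic(1000)  # functional equivalence first

t_loop = timeit.timeit(lambda: non_idiomatic(1000), number=200)
t_comp = timeit.timeit(lambda: idiomatic(1000), number=200)
print(f"speedup of comprehension: {t_loop / t_comp:.2f}x")
```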

DOI: 10.1109/ICSE48619.2023.00130


Testability Refactoring in Pull Requests: Patterns and Trends

作者: Reich, Pavel and Maalej, Walid
关键词: pull request mining, software quality, refactoring patterns, software testability, mining software repositories

Abstract

To create unit tests, it may be necessary to refactor the production code, e.g. by widening access to specific methods or by decomposing classes into smaller units that are easier to test independently. We report on an extensive study to understand such composite refactoring procedures for the purpose of improving testability. We collected and studied 346,841 Java pull requests from 621 GitHub projects. First, we compared the atomic refactorings in two populations: pull requests with changed test-pairs (i.e. with co-changes in production and test code and thus potentially including testability refactoring) and pull requests without test-pairs. We found significantly more atomic refactorings in test-pairs pull requests, such as Change Variable Type Operation or Change Parameter Type. Second, we manually analyzed the code changes of 200 pull requests, where developers explicitly mention the terms “testability” or “refactor + test”. We identified ten composite refactoring procedures for the purpose of testability, which we call testability refactoring patterns. Third, we manually analyzed an additional 524 test-pairs pull requests: both randomly selected and ones where we expected to find testability refactorings, e.g. in pull requests about dependency or concurrency issues. About 25% of all analyzed pull requests actually included testability refactoring patterns. The most frequent were extract a method for override or for invocation, widen access to a method for invocation, and extract a class for invocation. We also report on frequent atomic refactorings which co-occur with the patterns and discuss the implications of our findings for research, practice, and education.

DOI: 10.1109/ICSE48619.2023.00131


Usability-Oriented Design of Liquid Types for Java

作者: Gamboa, Catarina and Canelas, Paulo and Timperley, Christopher and Fonseca, Alcides
关键词: usability, java, refinement types, liquid types

Abstract

Developers want to detect bugs as early in the development lifecycle as possible, as the effort and cost to fix them increases with the incremental development of features. Ultimately, bugs that are only found in production can have catastrophic consequences. Type systems are effective at detecting many classes of bugs during development, often providing immediate feedback both at compile-time and while typing due to editor integration. Unfortunately, more powerful static and dynamic analysis tools have not had the same success, because they produce false positives, are not immediate, or are not integrated into the language. Liquid Types extend the language’s type system with predicates, augmenting the classes of bugs that the compiler or IDE can catch compared to the simpler type systems available in mainstream programming languages. However, previous implementations of Liquid Types have not used human-centered methods for designing or evaluating their extensions. Therefore, this paper investigates how Liquid Types can be integrated into a mainstream programming language, Java, by proposing a new design that aims to lower the barriers to entry and adapts to problems that Java developers commonly encounter at runtime. Following a participatory design methodology, we conducted a developer survey to design the syntax of LiquidJava, our prototype. To evaluate whether the added effort of writing Liquid Types in Java would convince users to adopt them, we conducted a user study with 30 Java developers. The results show that LiquidJava helped users detect and fix more bugs and that Liquid Types are easy to interpret and learn with few resources. At the end of the study, all users reported interest in adopting LiquidJava for their projects.
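LiquidJava checks refinement predicates statically at compile time; as a rough, language-neutral illustration of what such a predicate expresses, here is a dynamic analogue in Python (the decorator name and predicate are ours, not the LiquidJava API):

```python
# A refinement type pairs a base type with a predicate, e.g.
# "int n where n >= 0". Liquid type systems discharge the predicate
# statically; this sketch merely checks it at call time.

def refined(predicate, message):
    def decorator(fn):
        def wrapper(x):
            assert predicate(x), message
            return fn(x)
        return wrapper
    return decorator

@refined(lambda n: n >= 0, "argument must satisfy the refinement {n >= 0}")
def sqrt_floor(n):
    return int(n ** 0.5)

print(sqrt_floor(16))  # -> 4
```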

DOI: 10.1109/ICSE48619.2023.00132


Towards Understanding Fairness and its Composition in Ensemble Machine Learning

作者: Gohar, Usman and Biswas, Sumon and Rajan, Hridesh
关键词: fairness, ensemble, machine learning, models

Abstract

Machine Learning (ML) software has been widely adopted in modern society, with reported fairness implications for minority groups based on race, sex, age, etc. Many recent works have proposed methods to measure and mitigate algorithmic bias in ML models. The existing approaches focus on single classifier-based ML models. However, real-world ML models are often composed of multiple independent or dependent learners in an ensemble (e.g., Random Forest), where fairness composes in a non-trivial way. How does fairness compose in ensembles? What are the fairness impacts of the learners on the ultimate fairness of the ensemble? Can fair learners result in an unfair ensemble? Furthermore, studies have shown that hyperparameters influence the fairness of ML models. Ensemble hyperparameters are more complex since they affect how learners are combined in different categories of ensembles. Understanding the impact of ensemble hyperparameters on fairness will help programmers design fair ensembles. Today, we do not fully understand these effects for different ensemble algorithms. In this paper, we comprehensively study popular real-world ensembles: Bagging, Boosting, Stacking, and Voting. We have developed a benchmark of 168 ensemble models collected from Kaggle on four popular fairness datasets. We use existing fairness metrics to understand the composition of fairness. Our results show that ensembles can be designed to be fairer without using mitigation techniques. We also identify the interplay between fairness composition and data characteristics to guide fair ensemble design. Finally, our benchmark can be leveraged for further research on fair ensembles. To the best of our knowledge, this is one of the first and largest studies on fairness composition in ensembles yet presented in the literature.
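A toy illustration of the central question, using statistical parity difference (SPD) on hand-crafted predictions: three individually fair learners can combine into a less fair majority-vote ensemble:

```python
# Each learner below predicts the positive class at the same rate for
# groups A and B (SPD = 0), yet their majority vote does not. The data
# is synthetic and chosen to make the effect visible.

def spd(preds, groups):
    """|P(pred=1 | A) - P(pred=1 | B)| for binary predictions."""
    rate = lambda g: sum(p for p, gr in zip(preds, groups) if gr == g) / groups.count(g)
    return abs(rate("A") - rate("B"))

groups = ["A", "A", "A", "B", "B", "B"]
learners = [
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 1, 0],
]
ensemble = [int(sum(col) >= 2) for col in zip(*learners)]  # majority vote

print([round(spd(l, groups), 2) for l in learners])  # -> [0.0, 0.0, 0.0]
print(round(spd(ensemble, groups), 2))               # -> 0.33
```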

DOI: 10.1109/ICSE48619.2023.00133


Fairify: Fairness Verification of Neural Networks

作者: Biswas, Sumon and Rajan, Hridesh
关键词: fairness, machine learning

Abstract

Fairness of machine learning (ML) software has become a major concern in the recent past. Although recent research on testing and improving fairness has demonstrated impact on real-world software, providing fairness guarantees in practice is still lacking. Certification of ML models is challenging because of the complex decision-making process of the models. In this paper, we propose Fairify, an SMT-based approach to verify the individual fairness property in neural network (NN) models. Individual fairness ensures that any two similar individuals get similar treatment irrespective of their protected attributes, e.g., race, sex, age. Verifying this fairness property is hard because of the global checking and non-linear computation nodes in NNs. We propose a sound approach to make individual fairness verification tractable for developers. The key idea is that many neurons in the NN always remain inactive when a smaller part of the input domain is considered. So, Fairify leverages white-box access to the models in production and then applies formal-analysis-based pruning. Our approach adopts input partitioning and then prunes the NN for each partition to provide a fairness certification or a counterexample. We leverage interval arithmetic and activation heuristics of the neurons to perform the pruning as necessary. We evaluated Fairify on 25 real-world neural networks collected from four different sources, and demonstrated its effectiveness, scalability, and performance over baselines and closely related work. Fairify is also configurable based on the domain and size of the NN. Our novel formulation of the problem can answer targeted verification queries with relaxations and counterexamples, which have practical implications.
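A tiny version of the pruning insight: propagating input intervals through a linear+ReLU layer proves some neurons inactive on a given input partition, so they can be pruned before verification (the weights below are invented for illustration):

```python
# Interval arithmetic through one linear+ReLU layer: a neuron whose
# upper bound is <= 0 on the whole input partition is provably inactive
# there and can be dropped from the SMT encoding.

def interval_relu_layer(weights, biases, lo, hi):
    """Return per-neuron output intervals for inputs in [lo, hi] (per dim)."""
    bounds = []
    for w_row, b in zip(weights, biases):
        ub = b + sum(w * (hi[i] if w > 0 else lo[i]) for i, w in enumerate(w_row))
        lb = b + sum(w * (lo[i] if w > 0 else hi[i]) for i, w in enumerate(w_row))
        bounds.append((max(0.0, lb), max(0.0, ub)))  # ReLU clamps at 0
    return bounds

W = [[1.0, -2.0], [-1.0, -1.0]]
b = [0.5, 0.0]
bounds = interval_relu_layer(W, b, lo=[0.0, 0.0], hi=[1.0, 1.0])
inactive = [i for i, (_, ub) in enumerate(bounds) if ub == 0.0]
print(bounds, inactive)  # neuron 1 is always inactive on this partition
```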

DOI: 10.1109/ICSE48619.2023.00134


Leveraging Feature Bias for Scalable Misprediction Explanation of Machine Learning Models

作者: Gesi, Jiri and Shen, Xinyun and Geng, Yunfan and Chen, Qihong and Ahmed, Iftekhar
关键词: machine learning, data imbalance, rule induction, misprediction explanation

Abstract

Interpreting and debugging machine learning models is necessary to ensure the robustness of the machine learning models. Explaining mispredictions can help significantly in doing so. While recent works on misprediction explanation have proven promising in generating interpretable explanations for mispredictions, the state-of-the-art techniques “blindly” deduce misprediction explanation rules from all data features, which may not be scalable depending on the number of features. To alleviate this problem, we propose an efficient misprediction explanation technique named Bias Guided Misprediction Diagnoser (BGMD), which leverages two pieces of prior knowledge about data: a) data often exhibit highly skewed feature distributions and b) trained models in many cases perform poorly on subdatasets with under-represented features. Next, we propose a technique named MAPS (Mispredicted Area UPweight Sampling). During model retraining, MAPS increases the weights of the subdataset that belongs to the group prone to be mispredicted because it contains under-represented features. Thus, MAPS makes the retrained model pay more attention to the under-represented features. Our empirical study shows that our proposed BGMD outperforms the state-of-the-art misprediction diagnoser and reduces diagnosis time by 92%. Furthermore, MAPS outperforms two state-of-the-art techniques on fixing the machine learning model’s performance on mispredicted data without compromising performance on all data. All the research artifacts (i.e., tools, scripts, and data) of this study are available in the accompanying website [1].
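A hand-rolled sketch of the upweighting idea behind MAPS (the subgroup test and weight values are illustrative, not from the paper):

```python
# Upweight training samples that are both mispredicted and carry an
# under-represented feature, so retraining pays more attention to them.

def maps_weights(samples, mispredicted, in_rare_region, boost=3.0):
    """samples: list of sample ids; returns per-sample weights for retraining."""
    return [
        boost if (s in mispredicted and in_rare_region(s)) else 1.0
        for s in samples
    ]

samples = [0, 1, 2, 3, 4]
mispredicted = {1, 3}
rare = lambda s: s >= 3          # pretend samples 3-4 carry a rare feature
print(maps_weights(samples, mispredicted, rare))  # -> [1.0, 1.0, 1.0, 3.0, 1.0]
```

These weights would then be passed to a learner's sample-weight parameter during retraining.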

DOI: 10.1109/ICSE48619.2023.00135


Information-Theoretic Testing and Debugging of Fairness Defects in Deep Neural Networks

作者: Monjezi, Verya and Trivedi, Ashutosh and Tan, Gang and Tizpaz-Niari, Saeid
关键词: No keywords

Abstract

Deep feedforward neural networks (DNNs) are increasingly deployed in socioeconomically critical decision-support software systems. DNNs are exceptionally good at finding minimal, sufficient statistical patterns within their training data. Consequently, DNNs may learn to encode decisions—amplifying existing biases or introducing new ones—that may disadvantage protected individuals/groups and may stand to violate legal protections. While the existing search-based software testing approaches have been effective in discovering fairness defects, they do not supplement these defects with debugging aids—such as severity and causal explanations—crucial to help developers triage and decide on the next course of action. Can we measure the severity of fairness defects in DNNs? Are these defects symptomatic of improper training, or do they merely reflect biases present in the training data? To answer such questions, we present DICE: an information-theoretic testing and debugging framework to discover and localize fairness defects in DNNs. The key goal of DICE is to assist software developers in triaging fairness defects by ordering them by their severity. Towards this goal, we quantify fairness in terms of the protected information (in bits) used in decision making. A quantitative view of fairness defects not only helps in ordering these defects; our empirical evaluation shows that it also improves search efficiency due to the resulting smoothness of the search space. Guided by the quantitative fairness, we present a causal debugging framework to localize inadequately trained layers and neurons responsible for fairness defects. Our experiments over ten DNNs, developed for socially critical tasks, show that DICE efficiently characterizes the amounts of discrimination, effectively generates discriminatory instances (vis-a-vis the state-of-the-art techniques), and localizes layers/neurons with significant biases.
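A minimal version of "protected information in bits" is the mutual information between the protected attribute and the model's decision, estimated from joint counts (DICE's actual quantity and estimator are more involved; the data below is synthetic):

```python
# Mutual information I(A; Yhat) in bits between a protected attribute A
# and a binary decision Yhat, from empirical joint counts.
from math import log2
from collections import Counter

def mutual_information_bits(attr, pred):
    n = len(attr)
    p_a = Counter(attr)
    p_y = Counter(pred)
    p_ay = Counter(zip(attr, pred))
    return sum(
        (c / n) * log2((c / n) / ((p_a[a] / n) * (p_y[y] / n)))
        for (a, y), c in p_ay.items()
    )

# A decision that copies the protected attribute leaks a full bit;
# a constant decision leaks nothing.
attr = [0, 0, 1, 1]
print(round(mutual_information_bits(attr, attr), 3))          # -> 1.0
print(round(mutual_information_bits(attr, [1, 1, 1, 1]), 3))  # -> 0.0
```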

DOI: 10.1109/ICSE48619.2023.00136


Demystifying Privacy Policy of Third-Party Libraries in Mobile Apps

作者: Zhao, Kaifa and Zhan, Xian and Yu, Le and Zhou, Shiyao and Zhou, Hao and Luo, Xiapu and Wang, Haoyu and Liu, Yepang
关键词: privacy policy, third-party library, android

Abstract

The privacy of personal information has received significant attention in mobile software. Although researchers have designed methods to identify conflicts between app behavior and privacy policies, little is known about the privacy compliance issues relevant to third-party libraries (TPLs). Regulators have enacted articles to regulate the usage of personal information by TPLs (e.g., the CCPA requires businesses to clearly notify consumers whether or not they share consumers’ data with third parties). However, it remains challenging to investigate the privacy compliance issues of TPLs for three reasons: 1) Difficulties in collecting TPLs’ privacy policies. In contrast to Android apps, which are distributed through markets like Google Play and must provide privacy policies, there is no unique platform for collecting the privacy policies of TPLs. 2) Difficulties in analyzing TPLs’ user privacy access behaviors. TPLs are mainly provided as binary files, such as jar or aar, and their full functionality usually cannot be executed independently of host apps. 3) Difficulties in identifying consistency between a TPL’s functionality and its privacy policy, and between a host app’s privacy policy and its data sharing with TPLs. This requires analyzing not only the privacy policies of TPLs and host apps but also their functionalities. In this paper, we propose an automated system named ATPChecker to analyze whether Android TPLs comply with privacy-related regulations. We construct a dataset that contains a list of 458 TPLs, 247 TPL privacy policies, 187 TPL binary files, and 641 host apps and their privacy policies. Then, we analyze the bytecode of TPLs and host apps, design natural language processing systems to analyze privacy policies, and implement an expert system to identify TPL usage-related regulation compliance. The experimental results show that 23% of TPLs violate regulation requirements for providing privacy policies. Over 47% of TPLs fail to disclose data usage in their privacy policies. Over 65% of host apps share user data with TPLs, while 65% of them fail to disclose their interactions with TPLs. Our findings remind developers to be mindful of TPL usage when developing apps or writing privacy policies to avoid violating regulations.

DOI: 10.1109/ICSE48619.2023.00137


Cross-Domain Requirements Linking via Adversarial-Based Domain Adaptation

作者: Chang, Zhiyuan and Li, Mingyang and Wang, Qing and Li, Shoubin and Wang, Junjie
关键词: cross-domain requirements linking, domain adaptation, adversarial learning

Abstract

Requirements linking is at the core of software system maintenance and evolution, and it is critical to assuring software quality. In practice, however, requirements links are frequently absent or incorrectly labeled, and reconstructing them is time-consuming and error-prone. Numerous learning-based approaches have been put forth to address the problem. However, these approaches lose effectiveness on cold-start projects with few labeled samples. To this end, we propose RADIATION, an adversarial-based domain adaptation approach for cross-domain requirements linking. Generally, RADIATION first adopts an IDF-based masking strategy to filter the domain-specific features. Then it pre-trains a linking model in the source domain with sufficient labeled samples and adapts the model to target domains using a distance-enhanced adversarial technique, without using any labeled target samples. Evaluation on five public datasets shows that RADIATION achieves 66.4% precision and 89.2% recall, and significantly outperforms state-of-the-art baselines by 13.4%–42.9% in F1. In addition, the designed components, i.e., IDF-based Masking and Distance-enhanced Loss, significantly improve performance.
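A rough sketch of an IDF-based masking step (the corpus, threshold, and masking criterion below are invented for illustration; RADIATION's exact criterion for identifying domain-specific features differs):

```python
# Compute per-token IDF over a small requirements corpus and replace
# tokens below an IDF threshold (i.e., tokens frequent across the
# domain's documents) with a [MASK] placeholder.
from math import log

def idf(token, docs):
    df = sum(token in doc for doc in docs)
    return log(len(docs) / df) if df else float("inf")

def mask_low_idf(doc, docs, threshold=0.5):
    return [tok if idf(tok, docs) > threshold else "[MASK]" for tok in doc]

docs = [
    ["the", "uav", "shall", "log", "altitude"],
    ["the", "uav", "shall", "report", "battery"],
    ["the", "uav", "shall", "land", "safely"],
]
print(mask_low_idf(docs[0], docs))
# -> ['[MASK]', '[MASK]', '[MASK]', 'log', 'altitude']
```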

DOI: 10.1109/ICSE48619.2023.00138


On-Demand Security Requirements Synthesis with Relational Generative Adversarial Networks

作者: Koscinski, Viktoria and Hashemi, Sara and Mirakhorli, Mehdi
关键词: software security requirements, requirements engineering, generative adversarial networks

Abstract

Security requirements engineering is a manual and error-prone activity that is often neglected due to the knowledge gap between cybersecurity professionals and software requirements engineers. In this paper, we aim to automate the process of recommending and synthesizing security requirements specifications and therefore supporting requirements engineers in soliciting and specifying security requirements. We investigate the use of Relational Generative Adversarial Networks (GANs) in automatically synthesizing security requirements specifications. We evaluate our approach using a real case study of the Court Case Management System (CCMS) developed for the Indiana Supreme Court’s Division of State Court Administration. We present an approach based on RelGAN to generate security requirements specifications for the CCMS. We show that RelGAN is practical for synthesizing security requirements specifications as indicated by subject matter experts. Based on this study, we demonstrate promising results for the use of GANs in the software requirements synthesis domain. We also provide a baseline for synthesizing requirements, highlight limitations and weaknesses of RelGAN and define opportunities for further investigations.

DOI: 10.1109/ICSE48619.2023.00139


Measuring Secure Coding Practice and Culture: A Finger Pointing at the Moon is Not the Moon

作者: Ryan, Ita and Roedig, Utz and Stol, Klaas-Jan
关键词: secure coding, security compliance

Abstract

Software security research has a core problem: it is impossible to prove the security of complex software. A low number of known defects may simply indicate that the software has not been attacked yet, or that successful attacks have not been detected. A high defect count may be the result of white-hat hacker targeting, or of a successful bug bounty program which prevented insecurities from persisting in the wild. This makes it difficult to measure the security of non-trivial software. Researchers instead usually measure effort directed towards ensuring software security. However, different researchers use their own tailored measures, usually devised from industry secure coding guidelines. Not only is there no agreed way to measure effort, there is also no agreement on what effort entails. Qualitative studies emphasise the importance of security culture in an organisation. Where software security practices are introduced solely to ensure compliance with legislative or industry standards, a box-ticking attitude to security may result. The security culture may be weak or non-existent, making it likely that precautions not explicitly mentioned in the standards will be missed. Thus, researchers need both a way to assess software security practice and a way to measure software security culture. To assess security practice, we converted the empirically-established 12 most common software security activities into questions. To assess security culture, we devised a number of questions grounded in prior literature. We ran a secure development survey with both sets of questions, obtaining organic responses from 1,100 software coders in 59 countries. We used proven common activities to assess security practice, and made a first attempt to quantitatively assess aspects of security culture in the broad developer population. Our results show that some coders still work in environments where there is little to no attempt to ensure code security. 
Security practice and culture do not always correlate, and some organisations with strong secure coding practice have weak secure coding culture. This may lead to problems in defect prevention and sustained software security effort.

DOI: 10.1109/ICSE48619.2023.00140


What Challenges Do Developers Face about Checked-in Secrets in Software Artifacts?

作者: Basak, Setu Kumar and Neil, Lorenzo and Reaves, Bradley and Williams, Laurie
关键词: No keywords

Abstract

Throughout 2021, GitGuardian’s monitoring of public GitHub repositories revealed a two-fold increase in the number of secrets (database credentials, API keys, and other credentials) exposed compared to 2020, accumulating more than six million secrets. To our knowledge, the challenges developers face to avoid checked-in secrets are not yet characterized. The goal of our paper is to aid researchers and tool developers in understanding and prioritizing opportunities for future research and tool automation for mitigating checked-in secrets through an empirical investigation of challenges and solutions related to checked-in secrets. We extract 779 questions related to checked-in secrets on Stack Exchange and apply qualitative analysis to determine the challenges and the solutions posed by others for each of the challenges. We identify 27 challenges and 13 solutions. The four most common challenges, in ranked order, are: (i) store/version of secrets during deployment; (ii) store/version of secrets in source code; (iii) ignore/hide of secrets in source code; and (iv) sanitize VCS history. The three most common solutions, in ranked order, are: (i) move secrets out of source code/version control and use template config file; (ii) secret management in deployment; and (iii) use local environment variables. Our findings indicate that the same solution has been mentioned to mitigate multiple challenges. However, our findings also identify an increasing trend in questions lacking accepted solutions substantiating the need for future research and tool automation on managing secrets.
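
As an illustration of the most common solution category above (moving secrets out of source code into local environment variables, documented by a template config file), here is a minimal Python sketch; the names `DB_PASSWORD` and `load_db_password` are hypothetical, not from the paper:

```python
import os

def load_db_password() -> str:
    """Read the secret from the environment instead of checking it in.

    DB_PASSWORD is a hypothetical variable name; a template config file
    would document it without ever containing the real value.
    """
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD is not set; see config.template")
    return password

# In practice the deployment environment sets this; done here only for the demo.
os.environ["DB_PASSWORD"] = "example-only"
print(load_db_password())  # the secret never appears in version control
```

The same pattern generalizes to the "secret management in deployment" solution: the code reads a name, and the environment (or a secret manager) supplies the value.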

DOI: 10.1109/ICSE48619.2023.00141


Lejacon: A Lightweight and Efficient Approach to Java Confidential Computing on SGX

作者: Miao, Xinyuan and Lin, Ziyi and Wang, Shaojun and Yu, Lei and Li, Sanhong and Wang, Zihan and Nie, Pengbo and Chen, Yuting and Shen, Beijun and Jiang, He
关键词: software guard extensions, separation compilation, native confidential computing service, runtime, secure closed-world

Abstract

Intel’s SGX is a confidential computing technique. It allows key functionalities of C/C++/native applications to be confidentially executed in hardware enclaves. However, numerous cloud applications are written in Java. To support their confidential computing, state-of-the-art approaches deploy Java Virtual Machines (JVMs) in enclaves and perform confidential computing on JVMs. However, these JVM-in-enclave solutions suffer from serious limitations, such as the heavy overhead of running JVMs in enclaves, large attack surfaces, and deep computation stacks. To mitigate these limitations, we formalize a Secure Closed-World (SCW) principle and then propose Lejacon, a lightweight and efficient approach to Java confidential computing. The key idea is, given a Java application, to (1) separately compile its confidential computing tasks into a bundle of Native Confidential Computing (NCC) services; and (2) run the NCC services in enclaves on the Trusted Execution Environment (TEE) side, while running the non-confidential code on a JVM on the Rich Execution Environment (REE) side. The two sides interact with each other, protecting confidential computing tasks while keeping the Trusted Computing Base (TCB) size small. We implement Lejacon and evaluate it against OcclumJ (a state-of-the-art JVM-in-enclave solution) on a set of benchmarks using the BouncyCastle cryptography library. The evaluation results clearly show the strengths of Lejacon: it achieves competitive performance in running Java confidential code in enclaves and, compared with OcclumJ, achieves speedups of up to 16.2×.

DOI: 10.1109/ICSE48619.2023.00142


Keyword Extraction from Specification Documents for Planning Security Mechanisms

作者: Poozhithara, Jeffy Jahfar and Asuncion, Hazeline U. and Lagesse, Brent
关键词: vulnerability prediction, CVE, CWE, keyword extraction

Abstract

Software development companies invest heavily in both time and money to provide post-production support and fix security vulnerabilities in their products. Current techniques identify vulnerabilities from source code using static and dynamic analyses. However, this does not help integrate security mechanisms early, in the architectural design phase. We develop VDocScan, a technique for predicting vulnerabilities from specification documents, even before the development stage. We evaluate VDocScan using an extensive dataset of CVE vulnerability reports mapped to over 3,600 product documentation sets. An evaluation across 8 CWE vulnerability pillars shows that even interpretable whitebox classifiers predict vulnerabilities with up to 61.1% precision and 78% recall. Further, improving the relevance of extracted keywords, addressing class imbalance, segregating products into categories such as operating systems, web applications, and hardware, and using blackbox ensemble models such as the random forest classifier improve performance to 96% precision and 91.1% recall. The high precision and recall show that VDocScan can anticipate, during the design phase, vulnerabilities that would otherwise be detected over a product’s lifetime, so that necessary security mechanisms can be incorporated early. Performance is consistently high for vulnerabilities whose mode of introduction is architecture and design.
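
The keyword-extraction step can be pictured with a deliberately naive frequency-based sketch; VDocScan's actual extraction and classification pipeline is far more sophisticated, and all names below are illustrative:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a curated one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "with"}

def extract_keywords(document: str, top_k: int = 5) -> list[str]:
    """Rank alphabetic tokens by frequency after stop-word removal."""
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_k)]

doc = ("The server accepts remote connections and parses user input. "
       "Unvalidated user input is passed to the SQL query builder.")
print(extract_keywords(doc, 3))
```

In the paper's setting, vectors built from such keywords would then feed a classifier (e.g., a random forest) predicting which CWE pillars a product is likely to exhibit.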

DOI: 10.1109/ICSE48619.2023.00143


Dependency Facade: The Coupling and Conflicts between Android Framework and Its Customization

作者: Jin, Wuxia and Dai, Yitong and Zheng, Jianguo and Qu, Yu and Fan, Ming and Huang, Zhenyu and Huang, Dezhi and Liu, Ting
关键词: android, downstream, dependencies, merge conflict

Abstract

Mobile device vendors develop their customized Android OS (termed downstream) based on Google Android (termed upstream) to support new features. During daily independent development, the downstream also periodically merges changes from each new upstream release into its development branches to keep in sync with the upstream. Due to the large number of commits to be merged, heavy code conflicts are reported when auto-merge operations fail. Prior work has studied conflicts in this scenario. However, the coupling between the downstream and the upstream (which we term the dependency facade), as well as how merge conflicts relate to this coupling, remains unclear. To address this issue, we first propose DepFCD to reveal the dependency facade from three aspects: interface-level dependencies that indicate a clear design boundary, intrusion-level dependencies that blur the boundary, and dependency constraints imposed by the upstream non-SDK restrictions. We then empirically investigate these three aspects (RQ1, RQ2, RQ3) and merge conflicts (RQ4) on the dependency facade. To support the study, we collect four open-source downstream projects and one industrial project, with 15 downstream and 15 corresponding upstream versions. Our study reveals interesting observations and suggests earlier mitigation of merge conflicts through a well-managed dependency facade. It will benefit research on the coupling between upstream and downstream as well as downstream maintenance practice.

DOI: 10.1109/ICSE48619.2023.00144


Test Selection for Unified Regression Testing

作者: Wang, Shuai and Lian, Xinyu and Marinov, Darko and Xu, Tianyin
关键词: No keywords

Abstract

Today’s software failures have two dominating root causes: code bugs and misconfigurations. To combat failure-inducing software changes, unified regression testing (URT) is needed to synergistically test the changed code and all changed production configurations for deployment reliability. However, URT can incur high cost, as it needs to run a large number of tests under multiple configurations. Regression test selection (RTS) can reduce regression testing cost. Unfortunately, no existing RTS technique reasons about code and configuration changes collectively. We introduce Unified Regression Test Selection (uRTS) to effectively reduce the cost of URT. uRTS supports project changes to 1) code only, 2) configurations only, and 3) both code and configurations. It selects regular tests and configuration tests with a unified selection algorithm that analyzes the code and configuration dependencies of each test across runs and across configurations. uRTS provides the same safety guarantee as the state-of-the-art RTS while selecting fewer tests and, more importantly, reducing the end-to-end testing time. We implemented uRTS on top of Ekstazi (an RTS tool for code changes) and Ctest (a configuration testing framework). We evaluate uRTS on hundreds of code revisions and dozens of configurations of five large projects. The results show that uRTS reduces the end-to-end testing time, on average, by 3.64X compared to executing all tests and 1.87X compared to a competitive reference solution that directly extends RTS for URT.
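
The unified selection idea can be sketched as follows; the data shapes and names are assumptions for illustration, not uRTS's actual implementation:

```python
def select_tests(tests, changed_code, changed_config):
    """Select a test if its code dependencies intersect the changed code
    OR its configuration dependencies intersect the changed configurations.

    tests: name -> (code_deps, config_deps), both sets (illustrative shape).
    """
    selected = []
    for name, (code_deps, config_deps) in tests.items():
        if code_deps & changed_code or config_deps & changed_config:
            selected.append(name)
    return selected

tests = {
    "testParse":  ({"Parser.java"}, {"parser.buffer.size"}),
    "testRender": ({"Renderer.java"}, set()),
}
print(select_tests(tests, {"Parser.java"}, set()))  # only testParse is affected
```

The safety argument mirrors classic RTS: a test whose tracked dependencies are disjoint from every change cannot change its outcome, so skipping it is safe.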

DOI: 10.1109/ICSE48619.2023.00145


ATM: Black-Box Test Case Minimization Based on Test Code Similarity and Evolutionary Search

作者: Pan, Rongqi and Ghaleb, Taher A. and Briand, Lionel
关键词: test case minimization, test suite reduction, tree-based similarity, AST, genetic algorithm, black-box testing

Abstract

Executing large test suites is time- and resource-consuming, sometimes impossible, and such test suites typically contain many redundant test cases. Hence, test case (suite) minimization is used to remove redundant test cases that are unlikely to detect new faults. However, most test case minimization techniques rely on code coverage (white-box), model-based features, or requirements specifications, which are not always (entirely) accessible to test engineers. Code coverage analysis also leads to scalability issues, especially when applied to large industrial systems. Recently, a set of novel techniques called FAST-R was proposed that relies solely on test case code for minimization and is much more efficient than white-box techniques. However, it achieved comparatively low fault detection capability for Java projects, making its application challenging in practice. In this paper, we propose ATM (AST-based Test case Minimizer), a similarity-based, search-based test case minimization technique that takes a specific budget as input and also relies exclusively on the source code of test cases, but attempts to achieve higher fault detection through finer-grained similarity analysis and a dedicated search algorithm. ATM transforms test case code into Abstract Syntax Trees (ASTs) and relies on four tree-based similarity measures to apply evolutionary search, specifically genetic algorithms, to minimize test cases. We evaluated the effectiveness and efficiency of ATM on a large dataset of 16 Java projects with 661 faulty versions, using three budgets ranging from 25% to 75% of test suites.
ATM achieved significantly higher fault detection rates (0.82 on average), compared to FAST-R (0.61 on average) and random minimization (0.52 on average), when running only 50% of the test cases, within practically acceptable time (1.1 – 4.3 hours, on average, per project version), given that minimization is only occasionally applied when many new test cases are created (major releases). Results achieved for other budgets were consistent.
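
The similarity-driven minimization can be sketched with a greedy stand-in; ATM itself uses four AST-based similarity measures and a genetic algorithm, so the textual `SequenceMatcher` similarity and greedy loop below are simplifications:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Crude textual stand-in for ATM's AST-based similarity measures.
    return SequenceMatcher(None, a, b).ratio()

def minimize(tests: dict[str, str], budget: int) -> list[str]:
    """Greedily keep `budget` tests, always adding the test least similar
    to anything already kept, so redundant near-duplicates are dropped."""
    kept: list[str] = []
    remaining = dict(tests)
    while remaining and len(kept) < budget:
        name = min(
            remaining,
            key=lambda n: max((similarity(remaining[n], tests[k]) for k in kept),
                              default=0.0),
        )
        kept.append(name)
        del remaining[name]
    return kept

suite = {
    "t1": "assertEquals(2, add(1, 1));",
    "t2": "assertEquals(3, add(1, 2));",   # near-duplicate of t1
    "t3": "assertTrue(list.isEmpty());",
}
print(minimize(suite, 2))  # keeps t1 and the dissimilar t3, drops t2
```

A genetic algorithm replaces this greedy loop in ATM: candidate subsets are evolved with a fitness function that rewards low pairwise similarity within the budget.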

DOI: 10.1109/ICSE48619.2023.00146


Measuring and Mitigating Gaps in Structural Testing

作者: Hossain, Soneya Binta and Dwyer, Matthew B. and Elbaum, Sebastian and Nguyen-Tuong, Anh
关键词: code coverage, checked coverage, test oracles, mutation testing, fault-detection effectiveness

Abstract

Structural code coverage is a popular test adequacy metric that measures the percentage of program structure (e.g., statement, branch, decision) executed by a test suite. While structural coverage has several benefits, previous studies suggested that code coverage is not a good indicator of a test suite’s fault-detection effectiveness as coverage computation does not consider test oracle quality. In this research, we formally define the coverage gap in structural testing as the percentage of program structure that is executed but not observed by any test oracles. Our large-scale empirical study of 13 Java applications, 16K test cases and 51.6K test assertions shows that even for mature test suites, the gap can be as high as 51 percentage points (pp) and 34pp on average. Our study reveals that the coverage gap strongly and negatively correlates with a test suite’s fault-detection effectiveness. To mitigate gaps, we propose a lightweight static analysis of program dependencies to produce a ranked recommendation of test focus methods that can reduce the gap and improve test suite quality. When considering 34.8K assertions in the test suite as ground truth, the recommender suggests two-thirds of the focus methods written by developers within the top five recommendations.
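
The coverage-gap definition reduces to simple set arithmetic over program elements; a sketch, with hypothetical element names:

```python
def coverage_gap(executed: set[str], checked: set[str]) -> float:
    """Percentage of executed program elements never observed by any test oracle:
    gap = |executed - checked| / |executed| * 100."""
    if not executed:
        return 0.0
    unchecked = executed - checked
    return 100.0 * len(unchecked) / len(executed)

executed = {"m1", "m2", "m3", "m4"}   # elements covered by the test suite
checked  = {"m1", "m3"}               # elements whose effects reach an assertion
print(coverage_gap(executed, checked))  # 50.0
```

The paper's recommender then ranks "focus methods" whose observation by an oracle would shrink `executed - checked` the most.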

DOI: 10.1109/ICSE48619.2023.00147


Heterogeneous Anomaly Detection for Software Systems via Semi-Supervised Cross-Modal Attention

作者: Lee, Cheryl and Yang, Tianyi and Chen, Zhuangbin and Su, Yuxin and Yang, Yongqiang and Lyu, Michael R.
关键词: software system, anomaly detection, cross-modal learning

Abstract

Prompt and accurate detection of system anomalies is essential to ensuring the reliability of software systems. Unlike manual efforts that exploit all available run-time information, existing approaches usually leverage only a single type of monitoring data (often logs or metrics) or fail to make effective use of the joint information among different types of data. Consequently, many false predictions occur. To better understand the manifestations of system anomalies, we conduct a systematic study on a large amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates that logs and metrics can manifest system anomalies collaboratively and complementarily, and that neither alone is sufficient. Thus, integrating heterogeneous data can help recover the complete picture of a system’s health status. In this context, we propose Hades, the first end-to-end semi-supervised approach to effectively identify system anomalies based on heterogeneous data. Our approach employs a hierarchical architecture to learn a global representation of the system status by fusing log semantics and metric patterns. It captures discriminative features and meaningful interactions from heterogeneous data via a cross-modal attention module trained in a semi-supervised manner. We evaluate Hades extensively on large-scale simulated data and on datasets from Huawei Cloud. The experimental results demonstrate the effectiveness of our model in detecting system anomalies. We also release the code and the annotated dataset for replication and future research.

DOI: 10.1109/ICSE48619.2023.00148


Recommending Root-Cause and Mitigation Steps for Cloud Incidents Using Large Language Models

作者: Ahmed, Toufique and Ghosh, Supriyo and Bansal, Chetan and Zimmermann, Thomas and Zhang, Xuchao and Rajmohan, Saravan
关键词: incident management, service quality, GPT-3.x, large language models

Abstract

Incident management for cloud services is a complex process involving several steps, with a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort to root-cause and mitigate production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we conduct the first large-scale study to evaluate the effectiveness of these models for helping engineers root-cause and mitigate production incidents. We perform a rigorous study at Microsoft on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.

DOI: 10.1109/ICSE48619.2023.00149


Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-Source Data

作者: Lee, Cheryl and Yang, Tianyi and Chen, Zhuangbin and Su, Yuxin and Lyu, Michael R.
关键词: microservices, root cause localization, anomaly detection, traces

Abstract

The complexity and dynamism of microservices pose significant challenges to system reliability, making automated troubleshooting crucial. Effective root cause localization after anomaly detection is crucial for ensuring the reliability of microservice systems. However, two significant issues exist in current approaches: (1) microservices generate traces, system logs, and key performance indicators (KPIs), yet existing approaches usually consider traces only, failing to understand the system fully because traces cannot depict all anomalies; (2) troubleshooting microservices generally comprises two main phases, anomaly detection and root cause localization, which existing studies treat as independent, ignoring their close correlation. Even worse, inaccurate detection results can deeply affect localization effectiveness. To overcome these limitations, we propose Eadro, the first end-to-end framework that integrates anomaly detection and root cause localization based on multi-source data for troubleshooting large-scale microservices. The key insights behind Eadro are that anomalies manifest across different data sources and that detection and localization are closely connected. Thus, Eadro models intra-service behaviors and inter-service dependencies from traces, logs, and KPIs, while leveraging the shared knowledge of the two phases via multi-task learning. Experiments on two widely-used benchmark microservice systems demonstrate that Eadro outperforms state-of-the-art approaches by a large margin. The results also show the usefulness of integrating multi-source data. We release our code and data to facilitate future research.

DOI: 10.1109/ICSE48619.2023.00150


LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly

作者: Yu, Guangba and Chen, Pengfei and Li, Pairui and Weng, Tianjun and Zheng, Haibing and Deng, Yuetang and Zheng, Zibin
关键词: log hotspot, eBPF, log reduction, log parsing

Abstract

Modern systems generate a massive amount of logs to detect and diagnose system faults, which incurs expensive storage costs and runtime overhead. After investigating real-world production logs, we observe that most of the logging overhead is due to a small number of log templates, referred to as log hotspots. We therefore conduct a systematic study of log hotspots in an industrial system, WeChat, which motivates us to identify log hotspots and reduce them on the fly. In this paper, we propose LogReducer, a non-intrusive and language-independent log reduction framework based on eBPF (Extended Berkeley Packet Filter), consisting of both online and offline processes. After two months of running the offline process of LogReducer in WeChat, log storage overhead has dropped from 19.7 PB per day to 12.0 PB (i.e., about a 39.08% decrease). Practical implementation and experimental evaluations in a test environment demonstrate that the online process of LogReducer can control the logging overhead of hotspots while preserving logging effectiveness. Moreover, with the help of LogReducer, log hotspot handling time can be reduced from an average of 9 days in production to 10 minutes in the test environment.
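
The hotspot-identification idea can be sketched as follows; the digit-masking "template" extraction below is a naive stand-in for LogReducer's real log parsing:

```python
import re
from collections import Counter

def find_hotspots(log_lines, threshold=0.5):
    """Return the smallest set of templates accounting for at least
    `threshold` of the total log volume (hotspots, by the paper's intuition
    that a few templates dominate logging overhead)."""
    volume = Counter()
    for line in log_lines:
        template = re.sub(r"\d+", "<*>", line)  # naive templating: mask numbers
        volume[template] += len(line)           # weight by bytes, not just count
    total = sum(volume.values())
    hotspots, covered = [], 0
    for template, size in volume.most_common():
        hotspots.append(template)
        covered += size
        if covered / total >= threshold:
            break
    return hotspots

logs = ["conn 1 opened", "conn 2 opened", "conn 3 opened", "fatal error"]
print(find_hotspots(logs))  # the repeated connection template dominates volume
```

In the real system, such hotspot templates are then throttled or filtered on the fly via eBPF, without modifying the application.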

DOI: 10.1109/ICSE48619.2023.00151


Aries: Efficient Testing of Deep Neural Networks via Labeling-Free Accuracy Estimation

作者: Hu, Qiang and Guo, Yuejun and Xie, Xiaofei and Cordy, Maxime and Papadakis, Mike and Ma, Lei and Traon, Yves Le
关键词: deep learning testing, performance estimation, distribution shift

Abstract

Deep learning (DL) plays an increasingly important role in our daily life due to its competitive performance across industrial application domains. As the core of DL-enabled systems, deep neural networks (DNNs) need to be carefully evaluated to ensure that the produced models match the expected requirements. In practice, the de facto standard for assessing the quality of DNNs in industry is to check their performance (accuracy) on a collected set of labeled test data. However, preparing such labeled data is often not easy, partly because data labeling is labor-intensive, especially with massive new unlabeled data arriving every day. Recent studies show that test selection for DNNs is a promising direction that tackles this issue by selecting minimal representative data to label and using these data to assess the model. However, it still requires human effort and cannot be fully automated. In this paper, we propose a novel technique, named Aries, that can estimate the performance of DNNs on new unlabeled data using only information obtained from the original test data. The key insight behind our technique is that the model should have similar prediction accuracy on data that have similar distances to the decision boundary. We performed a large-scale evaluation of our technique on two widely-studied datasets, CIFAR-10 and Tiny-ImageNet, four widely-studied DNN models including ResNet101 and DenseNet121, and 13 types of data transformation methods. Results show that the accuracy estimated by Aries is only 0.03%–2.60% off the true accuracy. Moreover, Aries outperforms the state-of-the-art labeling-free methods in 50 out of 52 cases and selection-labeling-based methods in 96 out of 128 cases.
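
The key insight can be sketched with a simple bucketing scheme: use the labeled test set to learn per-bucket accuracy over a distance-to-boundary proxy (here, top-class confidence), then weight those accuracies by where the unlabeled data falls. This is a toy approximation for intuition, not Aries's actual estimator:

```python
def bucket(confidence: float, n_buckets: int = 5) -> int:
    """Map a distance-to-boundary proxy (top-class confidence) to a bucket."""
    return min(int(confidence * n_buckets), n_buckets - 1)

def estimate_accuracy(labeled, unlabeled_confidences, n_buckets=5):
    """labeled: (confidence, correct) pairs from the original labeled test set.
    Assumes accuracy is similar for inputs at similar distances to the boundary."""
    hits = [0] * n_buckets
    totals = [0] * n_buckets
    for conf, correct in labeled:
        b = bucket(conf, n_buckets)
        totals[b] += 1
        hits[b] += correct
    acc = [h / t if t else 0.0 for h, t in zip(hits, totals)]
    # Weight each bucket's accuracy by how often the NEW data falls into it.
    counts = [0] * n_buckets
    for conf in unlabeled_confidences:
        counts[bucket(conf, n_buckets)] += 1
    n = len(unlabeled_confidences)
    return sum(a * c / n for a, c in zip(acc, counts))

labeled = [(0.95, 1), (0.9, 1), (0.55, 0), (0.5, 1)]
print(estimate_accuracy(labeled, [0.92, 0.93, 0.52]))
```

No labels for the new data are needed: only model confidences, which are free to compute.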

DOI: 10.1109/ICSE48619.2023.00152


CC: Causality-Aware Coverage Criterion for Deep Neural Networks

作者: Ji, Zhenlan and Ma, Pingchuan and Yuan, Yuanyuan and Wang, Shuai
关键词: No keywords

Abstract

Deep neural network (DNN) testing approaches have grown rapidly in recent years to test the correctness and robustness of DNNs. In particular, DNN coverage criteria are frequently used to evaluate the quality of a test suite, and a number of coverage criteria based on neuron-wise, layer-wise, and path-/trace-wise coverage patterns have been published to date. However, existing criteria are insufficient to represent how one neuron influences subsequent neurons; hence, we lack a notion of how neurons, functioning as causes and effects, jointly produce a DNN prediction. Given recent advances in interpreting DNN internals using causal inference, we present the first causality-aware DNN coverage criterion, which evaluates a test suite by quantifying the extent to which the suite provides new causal relations for testing DNNs. Performing standard causal inference on DNNs presents both theoretical and practical hurdles. We introduce CC (causal coverage), a practical and efficient coverage criterion that integrates a set of optimizations using DNN domain-specific knowledge. We illustrate the efficacy of CC using diverse real-world inputs and adversarial inputs, such as adversarial examples (AEs) and backdoor inputs. We demonstrate that CC outperforms previous DNN criteria under various settings with moderate cost.

DOI: 10.1109/ICSE48619.2023.00153


Balancing Effectiveness and Flakiness of Non-Deterministic Machine Learning Tests

作者: Xia, Chunqiu Steven and Dutta, Saikat and Misailovic, Sasa and Marinov, Darko and Zhang, Lingming
关键词: No keywords

Abstract

Testing Machine Learning (ML) projects is challenging due to the inherent non-determinism of various ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices when selecting assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false positive failures, they often set the bounds too loose, potentially missing critical bugs. We present FASER, the first systematic approach for balancing the trade-off between the fault-detection effectiveness and flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem between the competing objectives by varying the assertion bound. FASER leverages 1) statistical methods to estimate the flakiness rate, and 2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 23 of the 87 studied tests have conservative bounds and proposes tighter assertion bounds that maximize the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers, each fixing one test, of which 14 have already been accepted.
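
The bound-search idea can be sketched as follows; FASER's statistical modeling and mutation testing are replaced here by a direct empirical flakiness estimate over observed sample runs, and all numbers are illustrative:

```python
def tightest_bound(samples, expected, flakiness_budget=0.05, steps=100):
    """Pick the smallest assertion bound whose estimated flakiness rate
    (fraction of correct runs that would still fail the assertion
    abs(actual - expected) <= bound) stays within the budget."""
    deviations = sorted(abs(s - expected) for s in samples)
    max_dev = deviations[-1]
    for i in range(1, steps + 1):
        bound = max_dev * i / steps
        failures = sum(1 for d in deviations if d > bound)
        if failures / len(deviations) <= flakiness_budget:
            return bound  # tightest bound found: maximal fault detection
    return max_dev

# Illustrative runs of a non-deterministic training metric:
samples = [0.90, 0.91, 0.89, 0.92, 0.90, 0.88, 0.91, 0.90, 0.89, 0.90]
print(tightest_bound(samples, expected=0.90, flakiness_budget=0.1))
```

A tighter bound detects more injected faults (mutation testing would measure this), while a looser bound fails less often on correct but noisy runs; the loop walks from tight to loose and stops as soon as flakiness is acceptable.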

DOI: 10.1109/ICSE48619.2023.00154


Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems

作者: Haq, Fitash Ul and Shin, Donghwan and Briand, Lionel C.
关键词: DNN testing, reinforcement learning, many objective search, self-driving cars, online testing

Abstract

Deep Neural Networks (DNNs) have been widely used to perform real-world tasks in cyber-physical systems such as Autonomous Driving Systems (ADS). Ensuring the correct behavior of such DNN-Enabled Systems (DES) is a crucial topic. Online testing is a promising way to test such systems with their application environments (simulated or real) in a closed loop, taking into account the continuous interaction between the systems and their environments. However, the environmental variables (e.g., lighting conditions) that might change during a system’s operation in the real world, causing the DES to violate requirements (safety, functional), are often kept constant during the execution of an online test scenario due to two major challenges: (1) the space of all possible scenarios to explore would become even larger if they changed, and (2) there are typically many requirements to test simultaneously. In this paper, we present MORLOT (Many-Objective Reinforcement Learning for Online Testing), a novel online testing approach that addresses these challenges by combining Reinforcement Learning (RL) and many-objective search. MORLOT leverages RL to incrementally generate sequences of environmental changes while relying on many-objective search to determine the changes so that they are more likely to achieve any of the uncovered objectives. We empirically evaluate MORLOT using CARLA, a high-fidelity simulator widely used for autonomous driving research, integrated with Transfuser, a DNN-enabled ADS for end-to-end driving. The evaluation results show that MORLOT is significantly more effective and efficient than the alternatives, with a large effect size, making it a good option for testing DES with dynamically changing environments while accounting for multiple safety requirements.

DOI: 10.1109/ICSE48619.2023.00155


Reliability Assurance for Deep Neural Network Architectures against Numerical Defects

作者: Li, Linyi and Zhang, Yuhao and Ren, Luyao and Xiong, Yingfei and Xie, Tao
关键词: neural network, numerical defect, testing, fix

Abstract

With the widespread deployment of deep neural networks (DNNs), ensuring the reliability of DNN-based systems is of great importance. Serious reliability issues such as system failures can be caused by numerical defects, one of the most frequent defects in DNNs. To assure high reliability against numerical defects, in this paper, we propose the RANUM approach including novel techniques for three reliability assurance tasks: detection of potential numerical defects, confirmation of potential-defect feasibility, and suggestion of defect fixes. To the best of our knowledge, RANUM is the first approach that confirms potential-defect feasibility with failure-exhibiting tests and suggests fixes automatically. Extensive experiments on the benchmarks of 63 real-world DNN architectures show that RANUM outperforms state-of-the-art approaches across the three reliability assurance tasks. In addition, when the RANUM-generated fixes are compared with developers’ fixes on open-source projects, in 37 out of 40 cases, RANUM-generated fixes are equivalent to or even better than human fixes.
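
The flavor of detecting and fixing a numerical defect can be sketched for the common log-of-non-positive case; this interval-style feasibility check and clipping fix are simplifications of what RANUM actually does:

```python
import math

def log_defect_feasible(lo: float, hi: float) -> bool:
    """Interval-style check: the potential defect is feasible if the
    operator's input range [lo, hi] can reach the invalid region (x <= 0),
    i.e., a failure-exhibiting test input could exist."""
    return lo <= 0.0

def safe_log(x: float, eps: float = 1e-12) -> float:
    """A typical automatic fix: clip the argument away from the invalid region."""
    return math.log(max(x, eps))

print(log_defect_feasible(-1.0, 5.0))  # range crosses zero: defect is feasible
print(safe_log(0.0))                   # clipped instead of raising a math error
```

RANUM additionally generates concrete failure-exhibiting tests to confirm feasibility and suggests fixes at the architecture level rather than one call site at a time.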

DOI: 10.1109/ICSE48619.2023.00156


Demystifying Issues, Challenges, and Solutions for Multilingual Software Development

作者: Yang, Haoran and Lian, Weile and Wang, Shaowei and Cai, Haipeng
关键词: multilingual software, development issues, language interfacing, software build, data format, interoperability

Abstract

Developing a software project using multiple languages together has been a dominant practice for years. Yet it remains unclear what issues developers encounter during such development, which challenges cause those issues, and what solutions developers receive. In this paper, we aim to answer these questions via a study of developer discussions on Stack Overflow. By manually analyzing 586 highly relevant posts spanning 14 years, we observed a large variety (11 categories) of issues, dominated by those concerning interfacing and data handling among different languages. Behind these issues, we found that a major challenge developers faced is the diversity and complexity of multilingual code building and interoperability. Another key challenge lies in developers’ lack of particular technical background on the diverse features of various languages (e.g., threading and memory management mechanisms). Meanwhile, Stack Overflow itself served as a key source of solutions to these challenges: the majority (73%) of the posts eventually received accepted answers, most within a week (36.5% within 24 hours and 25% in the next 6 days). Based on our findings on these issues, challenges, and solutions, we provide actionable insights and suggestions for both multilingual-software researchers and developers.

DOI: 10.1109/ICSE48619.2023.00157


Automated Summarization of Stack Overflow Posts

作者: Kou, Bonan and Chen, Muhao and Zhang, Tianyi
关键词: stack overflow, text summarization, deep learning

Abstract

Software developers often resort to Stack Overflow (SO) to fill their programming needs. Given the abundance of relevant posts, navigating them and comparing different solutions is tedious and time-consuming. Recent work has proposed automatically summarizing SO posts into concise text to facilitate navigation. However, these techniques rely only on information retrieval methods or heuristics for text summarization, which is insufficient to handle the ambiguity and sophistication of natural language. This paper presents a deep-learning-based framework called Assort for SO post summarization. Assort includes two complementary learning methods, AssortS and AssortIS, to address the lack of labeled training data for SO post summarization. AssortS is designed to directly train a novel ensemble learning model with BERT embeddings and domain-specific features to account for the unique characteristics of SO posts. By contrast, AssortIS is designed to reuse pre-trained models while addressing the domain shift challenge when no training data is present (i.e., zero-shot learning). Both AssortS and AssortIS outperform six existing techniques by at least 13% and 7%, respectively, in terms of F1 score. Furthermore, a human study shows that participants significantly preferred summaries generated by AssortS and AssortIS over the best baseline, while the preference difference between AssortS and AssortIS was small.

DOI: 10.1109/ICSE48619.2023.00158


Semi-Automatic, Inline and Collaborative Web Page Code Curations

作者: Rutishauser, Roy and Meyer, Andr'{e
关键词: semi-automated link curation, knowledge management, web browsing, collaboration

Abstract

Software developers spend about a quarter of their workday using the web to fulfill various information needs. Searching for relevant information online can be time-consuming, yet acquired information is rarely systematically persisted for later reference. In this work, we introduce SALI, an approach for semi-automated inline linking of web pages to source code locations. SALI helps developers naturally capture high-quality, explicit links between web pages and specific source code locations by recommending links for curation within the IDE. Through two laboratory studies, we examined the developer’s ability to both curate and consume links between web pages and specific source code locations while performing software development tasks. The studies were performed with 20 subjects working on realistic software change tasks from widely-used open-source projects. Results show that developers continuously and concisely curate web pages at meaningful locations in the code with little effort. Additionally, we found that other developers could use these curations while performing new and different change tasks to speed up relevant information gathering within unfamiliar codebases by a factor of 2.4.

DOI: 10.1109/ICSE48619.2023.00159


Identifying Key Classes for Initial Software Comprehension: Can We Do it Better?

作者: Pan, Weifeng and Du, Xin and Ming, Hua and Kim, Dae-Kyoo and Yang, Zijiang
关键词: complex networks, field theory, key classes, PageRank, program comprehension

Abstract

Key classes are excellent starting points for developers, especially newcomers, to comprehend an unknown software system. Though many unsupervised key class identification approaches have been proposed in the literature by representing software as class dependency networks (aka software networks) and using some network metrics (e.g., h-index, a-index, and coreness), they are never aware of the field where the nodes exist and the effect of the field on the importance of the nodes in it. According to the classic field theory in physics, every material particle is in a field through which it exerts an impact on other particles in the field via non-contact interactions (e.g., electromagnetic force, gravity, and nuclear force). Similarly, every node in a software network might also exist in a field, which might affect the importance of class nodes in it. In this paper, we propose an approach, iFit, to identify key classes in object-oriented software systems. First, we represent software as a CSNWD (Weighted Directed Class-level Software Network) to capture the topological structure of software, including classes, their couplings, and the direction and strength of couplings. Second, we assume that the nodes in the CSNWD exist in a gravitation-like field and propose a new metric, CG (Cumulative Gravitation-like importance), to measure the importance of classes. CG is inspired by Newton’s gravitational formula and uses the PageRank value computed by a biased-PageRank algorithm as the masses of classes. Finally, classes in the system are sorted in descending order according to their CG values, and a cutoff is utilized, that is, the top-ranked classes are recommended as key classes. The experiments were performed on a data set composed of six open-source Java systems from the literature. The results show that iFit is superior to the baseline approaches on 93.75% of the total cases, and is scalable to large-scale software systems.
Besides, we find that iFit is neutral to the weighting mechanisms used to assign the weights for different coupling types in the CSNWD, that is, when applying iFit to identify key classes, we can use any one of the weighting mechanisms.
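As a rough illustration of the gravitation-like idea, the sketch below computes node "masses" with plain PageRank and scores each class as CG(i) = Σ_j m_i·m_j / d(i, j)², summed over the classes j reachable from i. This is an illustrative reading of the abstract, not the paper's exact formula: the biased PageRank, coupling weights, and cutoff selection are not reproduced, and the tiny class-dependency network is invented.

```python
def pagerank(adj, damping=0.85, iters=50):
    # Plain (unbiased) PageRank; dangling nodes spread their mass evenly.
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in adj}
        for v, outs in adj.items():
            targets = outs if outs else list(adj)
            for w in targets:
                nxt[w] += damping * pr[v] / len(targets)
        pr = nxt
    return pr

def bfs_distances(adj, src):
    # Shortest-path distances from src in the directed class network.
    dist, frontier = {src: 0}, [src]
    while frontier:
        nxt = []
        for v in frontier:
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    nxt.append(w)
        frontier = nxt
    return dist

def cg_scores(adj):
    # CG(i) = sum over reachable j of m_i * m_j / d(i, j)^2.
    mass = pagerank(adj)
    return {i: sum(mass[i] * mass[j] / d ** 2
                   for j, d in bfs_distances(adj, i).items() if d > 0)
            for i in adj}

# Tiny invented class-dependency network: A uses B and C; B uses C.
adj = {"A": ["B", "C"], "B": ["C"], "C": []}
scores = cg_scores(adj)
ranking = sorted(scores, key=scores.get, reverse=True)
```

A pure sink like C scores zero here because it reaches no other node, which mirrors the intuition that key classes sit where dependencies radiate outward.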

DOI: 10.1109/ICSE48619.2023.00160


Improving API Knowledge Discovery with ML: A Case Study of Comparable API Methods

作者: Nam, Daye and Myers, Brad and Vasilescu, Bogdan and Hellendoorn, Vincent
关键词: No keywords

Abstract

Developers constantly learn new APIs, but often lack necessary information from documentation, resorting instead to popular question-and-answer platforms such as Stack Overflow. In this paper, we investigate how to use recent machine-learning-based knowledge extraction techniques to automatically identify pairs of comparable API methods and the sentences describing the comparison from Stack Overflow answers. We first built a prototype that can be stocked with a dataset of comparable API methods and provides tool-tips to users in search results and in API documentation. We conducted a user study with this tool based on a dataset of TensorFlow comparable API methods spanning 198 hand-annotated facts from Stack Overflow posts. This study confirmed that providing comparable API methods can be useful for helping developers understand the design space of APIs: developers using our tool were significantly more aware of the comparable API methods and better understood the differences between them. We then created SOREL, a knowledge extraction tool for comparable API methods trained on our hand-annotated corpus, which achieves 71% precision and 55% recall at discovering our manually extracted facts and discovers 433 pairs of comparable API methods from thousands of unseen Stack Overflow posts. This work highlights the merit of jointly studying programming assistance tools and constructing machine learning techniques to power them.

DOI: 10.1109/ICSE48619.2023.00161


Evidence Profiles for Validity Threats in Program Comprehension Experiments

作者: Bar'{o
关键词: program comprehension, threats to validity, empirical software engineering

Abstract

Searching for clues, gathering evidence, and reviewing case files are all techniques used by criminal investigators to draw sound conclusions and avoid wrongful convictions. Medicine, too, has a long tradition of evidence-based practice, in which administering a treatment without evidence of its efficacy is considered malpractice. Similarly, in software engineering (SE) research, we can develop sound methodologies and mitigate threats to validity by basing study design decisions on evidence. Echoing a recent call for the empirical evaluation of design decisions in program comprehension experiments, we conducted a two-phase study consisting of systematic literature searches, snowballing, and thematic synthesis. We found (1) which validity threat categories are most often discussed in primary studies of code comprehension, and we collected evidence to build (2) the evidence profiles for the three most commonly reported threats to validity. We discovered that few mentions of validity threats in primary studies (31 of 409) included a reference to supporting evidence. For the three most commonly mentioned threats, namely the influence of programming experience, program length, and the selected comprehension measures, almost all cited studies (17 of 18) did not meet our criteria for evidence. We show that for many threats to validity that are currently assumed to be influential across all studies, their actual impact may depend on the design and context of each specific study. Researchers should discuss threats to validity within the context of their particular study and support their discussions with evidence. The present paper can be one resource for evidence, and we call for more meta-studies of this type to be conducted, which will then inform design decisions in primary studies.
Further, although we have applied our methodology in the context of program comprehension, our approach can also be used in other SE research areas to enable evidence-based experiment design decisions and meaningful discussions of threats to validity.

DOI: 10.1109/ICSE48619.2023.00162


Developers’ Visuo-Spatial Mental Model and Program Comprehension

作者: Bouraffa, Abir and Fuhrmann, Gian-Luca and Maalej, Walid
关键词: code comprehension, code navigation, developer productivity, IDE design, code visualization, cognitive studies

Abstract

Previous works from research and industry have proposed a spatial representation of code in a canvas, arguing that a navigational code space confers developers the freedom to organise elements according to their understanding. By allowing developers to translate logical relatedness into spatial proximity, this code representation could aid in code navigation and comprehension. However, the association between developers’ code comprehension and their visuo-spatial mental model of the code is not yet well understood. This mental model is affected on the one hand by the spatial code representation and on the other by the visuo-spatial working memory of developers. We address this knowledge gap by conducting an online experiment with 20 developers following a between-subject design. The control group used a conventional tab-based code visualization, while the experimental group used a code canvas to complete three code comprehension tasks. Furthermore, we measure the participants’ visuo-spatial working memory using a Corsi Block test at the end of the tasks. Our results suggest that, overall, neither the spatial representation of code nor the visuo-spatial working memory of developers has a significant impact on comprehension performance. However, we identified significant differences in the time dedicated to different comprehension activities such as navigation, annotation, and UI interactions.

DOI: 10.1109/ICSE48619.2023.00163


Two Sides of the Same Coin: Exploiting the Impact of Identifiers in Neural Code Comprehension

作者: Gao, Shuzheng and Gao, Cuiyun and Wang, Chaozheng and Sun, Jun and Lo, David and Yu, Yue
关键词: No keywords

Abstract

Previous studies have demonstrated that neural code comprehension models are vulnerable to identifier naming. By renaming as few as one identifier in the source code, the models would output completely irrelevant results, indicating that identifiers can be misleading for model prediction. However, identifiers are not completely detrimental to code comprehension, since the semantics of identifier names can be related to the program semantics. Well exploiting the two opposite impacts of identifiers is essential for enhancing the robustness and accuracy of neural code comprehension, and still remains under-explored. In this work, we propose to model the impact of identifiers from a novel causal perspective, and propose a counterfactual reasoning-based framework named CREAM. CREAM explicitly captures the misleading information of identifiers through multitask learning in the training stage, and reduces the misleading impact by counterfactual inference in the inference stage. We evaluate CREAM on three popular neural code comprehension tasks, including function naming, defect detection and code classification. Experimental results show that CREAM not only significantly outperforms baselines in terms of robustness (e.g., +37.9% on the function naming task at F1 score), but also achieves improved results on the original datasets (e.g., +0.5% on the function naming task at F1 score).

DOI: 10.1109/ICSE48619.2023.00164


SeeHow: Workflow Extraction from Programming Screencasts through Action-Aware Video Analytics

作者: Zhao, Dehai and Xing, Zhenchang and Xia, Xin and Ye, Deheng and Xu, Xiwei and Zhu, Liming
关键词: screencast, computer vision, workflow extraction, action recognition

Abstract

Programming screencasts (e.g., video tutorials on YouTube or live coding streams on Twitch) are an important knowledge source for developers to learn programming knowledge, especially the workflow of completing a programming task. Nonetheless, the image nature of programming screencasts limits the accessibility of screencast content and the workflow embedded in it, making it hard to access and interact with the content and workflow in programming screencasts. Existing non-intrusive methods are limited to extracting either primitive human-computer interaction (HCI) actions or coarse-grained video fragments. In this work, we leverage Computer Vision (CV) techniques to build a programming screencast analysis tool which can automatically extract code-line editing steps (enter text, delete text, edit text and select text) from screencasts. Given a programming screencast, our approach outputs a sequence of coding steps and code snippets involved in each step, which we refer to as programming workflow. The proposed method is evaluated on 41 hours of tutorial videos and live coding screencasts with diverse programming environments. The results demonstrate that our tool can extract code-line editing steps accurately and that the extracted workflow steps can be intuitively understood by developers.

DOI: 10.1109/ICSE48619.2023.00165


AidUI: Toward Automated Recognition of Dark Patterns in User Interfaces

作者: Mansur, S M Hasan and Salma, Sabiha and Awofisayo, Damilola and Moran, Kevin
关键词: dark pattern, UI analysis, UI design

Abstract

Past studies have illustrated the prevalence of UI dark patterns, or user interfaces that can lead end-users toward (unknowingly) taking actions that they may not have intended. Such deceptive UI designs can be either intentional (to benefit an online service) or unintentional (through complicit design practices) and can result in adverse effects on end users, such as oversharing personal information or financial loss. While significant research progress has been made toward the development of dark pattern taxonomies across different software domains, developers and users currently lack guidance to help recognize, avoid, and navigate these often subtle design motifs. However, automated recognition of dark patterns is a challenging task, as the instantiation of a single type of pattern can take many forms, leading to significant variability. In this paper, we take the first step toward understanding the extent to which common UI dark patterns can be automatically recognized in modern software applications. To do this, we introduce AidUI, a novel automated approach that uses computer vision and natural language processing techniques to recognize a set of visual and textual cues in application screenshots that signify the presence of ten unique UI dark patterns, allowing for their detection, classification, and localization. To evaluate our approach, we have constructed ContextDP, the current largest dataset of fully-localized UI dark patterns that spans 175 mobile and 83 web UI screenshots containing 301 dark pattern instances. The results of our evaluation illustrate that AidUI achieves an overall precision of 0.66, recall of 0.67, F1-score of 0.65 in detecting dark pattern instances, reports few false positives, and is able to localize detected patterns with an IoU score of 0.84.
Furthermore, a significant subset of our studied dark patterns can be detected quite reliably (F1 score of over 0.82), and future research directions may allow for improved detection of additional patterns. This work demonstrates the plausibility of developing tools to aid developers in recognizing and appropriately rectifying deceptive UI patterns.

DOI: 10.1109/ICSE48619.2023.00166


Carving UI Tests to Generate API Tests and API Specification

作者: Yandrapally, Rahulkrishna and Sinha, Saurabh and Tzoref-Brill, Rachel and Mesbah, Ali
关键词: web application testing, API testing, test generation, UI testing, end-to-end testing, test carving, API specification inference

Abstract

Modern web applications make extensive use of API calls to update the UI state in response to user events or server-side changes. For such applications, API-level testing can play an important role, in-between unit-level testing and UI-level (or end-to-end) testing. Existing API testing tools require API specifications (e.g., OpenAPI), which often may not be available or, when available, be inconsistent with the API implementation, thus limiting the applicability of automated API testing to web applications. In this paper, we present an approach that leverages UI testing to enable API-level testing for web applications. Our technique navigates the web application under test and automatically generates an API-level test suite, along with an OpenAPI specification that describes the application’s server-side APIs (for REST-based web applications). A key element of our solution is a dynamic approach for inferring API endpoints with path parameters via UI navigation and directed API probing. We evaluated the technique for its accuracy in inferring API specifications and the effectiveness of the “carved” API tests. Our results on seven open-source web applications show that the technique achieves 98% precision and 56% recall in inferring endpoints. The carved API tests, when added to test suites generated by two automated REST API testing tools, increase statement coverage by 52% and 29% and branch coverage by 99% and 75%, on average. The main benefits of our technique are: (1) it enables API-level testing of web applications in cases where existing API testing tools are inapplicable and (2) it creates API-level test suites that cover server-side code efficiently while exercising APIs as they would be invoked from an application’s web UI, and that can augment existing API test suites.
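The path-parameter inference step can be pictured roughly as follows. This naive sketch only merges observed request paths whose segments differ; the paper's technique additionally uses UI navigation and directed API probing. All URLs and the `{param}` placeholder are invented for illustration.

```python
from collections import defaultdict

def infer_endpoints(urls):
    """Group observed request paths by segment count and replace segments
    that vary across a group with a {param} placeholder. A deliberately
    naive sketch: unrelated same-length endpoints would be over-merged."""
    groups = defaultdict(list)
    for url in urls:
        groups[url.count("/")].append(url.strip("/").split("/"))
    endpoints = set()
    for paths in groups.values():
        template = list(paths[0])
        for parts in paths[1:]:
            for k, (a, b) in enumerate(zip(template, parts)):
                if a != b:
                    template[k] = "{param}"
        endpoints.add("/" + "/".join(template))
    return endpoints

# Paths carved from hypothetical UI-driven requests.
eps = infer_endpoints(["/users/1", "/users/2", "/users/1/posts"])
```

From `/users/1` and `/users/2` the sketch infers the endpoint `/users/{param}`, while the lone `/users/1/posts` stays literal until more observations (or probing) disambiguate it.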

DOI: 10.1109/ICSE48619.2023.00167


Ex Pede Herculem: Augmenting Activity Transition Graph for Apps via Graph Convolution Network

作者: Liu, Zhe and Chen, Chunyang and Wang, Junjie and Su, Yuhui and Huang, Yuekai and Hu, Jun and Wang, Qing
关键词: GUI testing, deep learning, program analysis, empirical study

Abstract

Mobile apps are indispensable for people’s daily life. With the increase of GUI functions, apps have become more complex and diverse. As Android apps are event-driven, the Activity Transition Graph (ATG) becomes an important way of app abstraction and graphical user interface (GUI) modeling. Although existing works provide static and dynamic analysis to build ATGs for applications, the completeness of the obtained ATG is poor due to the low coverage of these techniques. To tackle this challenge, we propose a novel approach, ArchiDroid, to automatically augment the ATG via graph convolution network. It models both the semantics of activities and the graph structure of activity transitions to predict the transition between activities based on the seed ATG extracted by static analysis. The evaluation demonstrates that ArchiDroid can achieve 86% precision and 94% recall in predicting the transition between activities for augmenting ATG. We further apply the augmented ATG in two downstream tasks, i.e., guidance in automated GUI testing and assistance in app function design. Results show that the automated GUI testing tool integrated with ArchiDroid achieves 43% more activity coverage and detects 208% more bugs. Besides, ArchiDroid can predict the missing transition with 85% accuracy in real-world apps for assisting the app function design, and an interview case study further demonstrates its usefulness.

DOI: 10.1109/ICSE48619.2023.00168


Sustainability is Stratified: Toward a Better Theory of Sustainable Software Engineering

作者: McGuire, Sean and Schultz, Erin and Ayoola, Bimpe and Ralph, Paul
关键词: sustainable development, software engineering, sustainable software engineering, scoping review, meta-synthesis

Abstract

Background: Sustainable software engineering (SSE) means creating software in a way that meets present needs without undermining our collective capacity to meet our future needs. It is typically conceptualized as several intersecting dimensions or “pillars”—environmental, social, economic, technical and individual. However, these pillars are theoretically underdeveloped and require refinement. Objectives: The objective of this paper is to generate a better theory of SSE. Method: First, a scoping review was conducted to understand the state of research on SSE and identify existing models thereof. Next, a meta-synthesis of qualitative research on SSE was conducted to critique and improve the existing models identified. Results: 961 potentially relevant articles were extracted from five article databases. These articles were de-duplicated and then screened independently by two screeners, leaving 243 articles to examine. Of these, 109 were non-empirical, the most common empirical method was systematic review, and no randomized controlled experiments were found. Most papers focus on ecological sustainability (158) and the sustainability of software products (148) rather than processes. A meta-synthesis of 36 qualitative studies produced several key propositions, most notably, that sustainability is stratified (has different meanings at different levels of abstraction) and multisystemic (emerges from interactions among multiple social, technical, and sociotechnical systems). Conclusion: The academic literature on SSE is surprisingly non-empirical. More empirical evaluations of specific sustainability interventions are needed. The sustainability of software development products and processes should be conceptualized as multisystemic and stratified, and assessed accordingly.

DOI: 10.1109/ICSE48619.2023.00169


DLInfer: Deep Learning with Static Slicing for Python Type Inference

作者: Yan, Yanyan and Feng, Yang and Fan, Hongcheng and Xu, Baowen
关键词: type inference, Python, static slicing

Abstract

Python programming language has gained enormous popularity in the past decades. While its flexibility significantly improves software development productivity, the dynamic typing feature challenges software maintenance and quality assurance. To facilitate programming and type error checking, the Python programming language has provided a type hint mechanism enabling developers to annotate type information for variables. However, this manual annotation process often requires plenty of resources and may introduce errors.In this paper, we propose a deep learning type inference technique, namely DLInfer, to automatically infer the type information for Python programs. DLInfer collects slice statements for variables through static analysis and then vectorizes them with the Unigram Language Model algorithm. Based on the vectorized slicing features, we designed a bi-directional gated recurrent unit model to learn the type propagation information for inference. To validate the effectiveness of DLInfer, we conduct an extensive empirical study on 700 open-source projects. We evaluate its accuracy in inferring three kinds of fundamental types, including built-in, library, and user-defined types. By training with a large-scale dataset, DLInfer achieves an average of 98.79% Top-1 accuracy for the variables that can get type information through static analysis and manual annotation. Further, DLInfer achieves 83.03% type inference accuracy on average for the variables that can only obtain the type information through dynamic analysis. The results indicate DLInfer is highly effective in inferring types. It is promising to apply it to assist in various software engineering tasks for Python programs.
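A toy version of the slice-collection step might look like this, assuming a "slice" is simply the set of statements that assign a variable; DLInfer's actual static slicing and Unigram Language Model vectorization are far richer, and the snippet being sliced is invented.

```python
import ast

def slice_statements(source, var):
    """Collect the source lines on which `var` is (re)assigned --
    a rough stand-in for the static slicing DLInfer performs."""
    tree = ast.parse(source)
    lines = source.splitlines()
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AugAssign, ast.AnnAssign)):
            targets = [node.target]
        else:
            continue
        for t in targets:
            if isinstance(t, ast.Name) and t.id == var:
                hits.append(lines[node.lineno - 1].strip())
    return hits

code = "x = []\nx.append(1)\ny = 0\nx = sorted(x)\n"
trace = slice_statements(code, "x")
```

The collected statements (here for `x`) would then be tokenized and fed to the sequence model to predict the variable's type.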

DOI: 10.1109/ICSE48619.2023.00170


ViolationTracker: Building Precise Histories for Static Analysis Violations

作者: Yu, Ping and Wu, Yijian and Peng, Xin and Peng, Jiahan and Zhang, Jian and Xie, Peicheng and Zhao, Wenyun
关键词: No keywords

Abstract

Automatic static analysis tools (ASATs) detect source code violations of static analysis rules and are usually used as a guard for source code quality. The adoption of ASATs, however, is often challenged because of several problems such as a large number of false alarms, invalid rule priorities, and inappropriate rule configurations. Research has shown that tracking the history of the violations is a promising way to solve the above problems because the facts of violation fixing may reflect the developers’ subjective expectations on the violation detection results. Precisely identifying the revisions that induce or fix a violation is however challenging because of the imprecise matching of violations between code revisions and because merge commits are ignored in the maintenance history. In this paper, we propose ViolationTracker, an approach to precisely matching the violation instances between adjacent revisions and building the lifecycle of violations with the identification of inducing, fixing, deleting, and reopening of each violation case. The approach employs code entity anchoring heuristics for violation matching and considers merge commits that used to be ignored in existing research. We evaluate ViolationTracker with a manually-validated dataset that consists of 500 violation instances and 158 threads of 30 violation cases with detailed evolution history from open-source projects. ViolationTracker achieves over 93% precision and 98% recall on violation matching, outperforming the state-of-the-art approach, and 99.4% precision on rebuilding the histories of violation cases. We also show that ViolationTracker is useful to identify actionable violations. A preliminary empirical study reveals the possibility to prioritize static analysis rules according to further analysis on the actionable rates of the rules.
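The revision-to-revision matching idea can be sketched as follows, assuming a violation is anchored by its rule id, enclosing code entity, and normalized statement text rather than by its line number (which shifts between revisions). The field names and sample data are illustrative, not ViolationTracker's actual heuristics.

```python
def match_violations(prev, curr):
    """Pair violation instances across two adjacent revisions by an
    anchor key; unmatched current instances are candidates for
    violation-inducing changes, unmatched previous ones for fixes."""
    def key(v):
        # Normalize whitespace so formatting-only edits still match.
        return (v["rule"], v["entity"], " ".join(v["code"].split()))

    prev_index = {key(v): v for v in prev}
    matched, new = [], []
    for v in curr:
        if key(v) in prev_index:
            matched.append((prev_index[key(v)], v))
        else:
            new.append(v)
    curr_keys = {key(v) for v in curr}
    fixed = [v for v in prev if key(v) not in curr_keys]
    return matched, new, fixed

prev = [{"rule": "S100", "entity": "Foo.bar", "code": "int x=0;", "line": 10}]
curr = [{"rule": "S100", "entity": "Foo.bar", "code": "int  x=0;", "line": 14},
        {"rule": "S200", "entity": "Foo.baz", "code": "y++;", "line": 20}]
matched, new, fixed = match_violations(prev, curr)
```

Here the S100 instance matches despite its line number and spacing changing, while the S200 instance surfaces as newly induced; iterating this over a commit history (including merge commits) yields each violation's lifecycle.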

DOI: 10.1109/ICSE48619.2023.00171


Compiler Test-Program Generation via Memoized Configuration Search

作者: Chen, Junjie and Suo, Chenyao and Jiang, Jiajun and Chen, Peiqi and Li, Xingjian
关键词: compiler testing, test program generation, reinforcement learning, configuration

Abstract

To ensure compilers’ quality, compiler testing has received more and more attention, and test-program generation is the core task. In recent years, some approaches have been proposed to explore test configurations for generating more effective test programs, but they either are restricted by historical bugs or suffer from the cost-effectiveness issue. Here, we propose a novel test-program generation approach (called MCS) to further improve the performance of compiler testing. MCS conducts memoized search via multi-agent reinforcement learning (RL) for guiding the construction of effective test configurations based on the memoization for the explored test configurations during the on-the-fly compiler-testing process. During the process, the elaborate coordination among configuration options can be also well learned by multi-agent RL, which is required for generating bug-triggering test programs. Specifically, MCS considers the diversity among test configurations to efficiently explore the input space and the testing results under each explored configuration to learn which portions of space are more bug-triggering. Our extensive experiments on GCC and LLVM demonstrate the performance of MCS, significantly outperforming the state-of-the-art test-program generation approaches in bug detection. Also, MCS detects 16 new bugs on the latest trunk revisions of GCC and LLVM, and all of them have been confirmed or fixed by developers. MCS has been deployed by a global IT company (i.e., Huawei) for testing their in-house compiler, and detects 10 new bugs (covering all the 5 bugs detected by the compared approaches), all of which have been confirmed.

DOI: 10.1109/ICSE48619.2023.00172


Generating Test Databases for Database-Backed Applications

作者: Yan, Cong and Nath, Suman and Lu, Shan
关键词: automated testing, test data generation, database-backed application, database-state generation

Abstract

Database-backed applications are widely used. To effectively test these applications, one needs to design not only user inputs but also database states, which imposes unique challenges. First, valid database states have to satisfy complicated constraints determined by application semantics, and hence are difficult to synthesize. Second, the state space of a database is huge, as an application can contain tens to hundreds of tables with up to tens of fields per table. Making things worse, each test involving database operations takes significant time to run. Consequently, unhelpful database states and running tests on them can severely waste testing resources. We propose DBGriller, a tool that generates database states to facilitate thorough testing of database-backed applications. To effectively generate valid database states, DBGriller strategically injects minor mutations into existing database states and transforms part of the application-under-test into a stand-alone validity checker. To tackle the huge database state space and save testing time, DBGriller uses program analysis to identify a novel branch-projected DB view that can be used to filter out database states that are unlikely to increase the testing branch coverage. Our evaluation on 9 popular open-source database applications shows that DBGriller can effectively increase branch coverage of existing tests and expose previously unknown bugs.
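A minimal caricature of the mutate-then-validate loop, with a hand-written stand-in for the validity checker that DBGriller would extract from the application under test; all schema names and values are invented, and the branch-projected filtering step is omitted.

```python
import random

def valid(state):
    """Stand-in validity checker: every order must reference an
    existing user (an invented application-semantics constraint)."""
    user_ids = {u["id"] for u in state["users"]}
    return all(o["user_id"] in user_ids for o in state["orders"])

def mutate(state, rng):
    """Minor mutation of an existing database state: deep-copy the
    state and tweak one foreign-key field (99 is a dangling value)."""
    new = {"users": [dict(u) for u in state["users"]],
           "orders": [dict(o) for o in state["orders"]]}
    rng.choice(new["orders"])["user_id"] = rng.choice([1, 2, 99])
    return new

rng = random.Random(0)
seed = {"users": [{"id": 1}, {"id": 2}],
        "orders": [{"id": 10, "user_id": 1}]}
# Keep only mutants that the validity checker accepts.
states = [s for s in (mutate(seed, rng) for _ in range(20)) if valid(s)]
```

Discarding invalid mutants up front is what keeps the (expensive) application-level tests from being wasted on database states the application could never reach.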

DOI: 10.1109/ICSE48619.2023.00173


Testing Database Engines via Query Plan Guidance

作者: Ba, Jinsheng and Rigger, Manuel
关键词: automated testing, test case generation

Abstract

Database systems are widely used to store and query data. Test oracles have been proposed to find logic bugs in such systems, that is, bugs that cause the database system to compute an incorrect result. To realize a fully automated testing approach, such test oracles are paired with a test case generation technique; a test case refers to a database state and a query on which the test oracle can be applied. In this work, we propose the concept of Query Plan Guidance (QPG) for guiding automated testing towards “interesting” test cases. SQL and other query languages are declarative. Thus, to execute a query, the database system translates every operator in the source language to one of the potentially many so-called physical operators that can be executed; the tree of physical operators is referred to as the query plan. Our intuition is that by steering testing towards exploring a variety of unique query plans, we also explore more interesting behaviors—some of which are potentially incorrect. To this end, we propose a mutation technique that gradually applies promising mutations to the database state, causing the DBMS to create potentially unseen query plans for subsequent queries. We applied our method to three mature, widely-used, and extensively-tested database systems—SQLite, TiDB, and CockroachDB—and found 53 unique, previously unknown bugs. Our method exercises 4.85–408.48× more unique query plans than a naive random generation method.
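The notion of a "unique query plan" can be made concrete with SQLite's EXPLAIN QUERY PLAN: serialize the plan to a string and count distinct strings. A plan-guided fuzzer would then reward database-state mutations whose subsequent queries produce previously unseen plans. The schema and queries below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE t (a INT, b INT);"
    "CREATE INDEX idx_a ON t(a);")

def query_plan(sql):
    # Serialize SQLite's physical plan into a hashable string;
    # the plan detail is the last column of each EXPLAIN row.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)

# The indexed predicate and the unindexed one yield different
# physical plans (index search vs. full scan), i.e. two "unique" plans.
seen = set()
for q in ["SELECT * FROM t WHERE a = 1",
          "SELECT * FROM t WHERE b = 1"]:
    seen.add(query_plan(q))
```

In QPG terms, a mutation to the database state (say, dropping `idx_a` or skewing the data) that flips a query from one plan in `seen` to a new one is a promising direction to keep exploring.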

DOI: 10.1109/ICSE48619.2023.00174


Testing Database Systems via Differential Query Execution

作者: Song, Jiansen and Dou, Wensheng and Cui, Ziyu and Dai, Qianwang and Wang, Wei and Wei, Jun and Zhong, Hua and Huang, Tao
关键词: database system, DBMS testing, logic bug

Abstract

Database Management Systems (DBMSs) provide efficient data retrieval and manipulation for many applications through Structured Query Language (SQL). Incorrect implementations of DBMSs can result in logic bugs, which cause SELECT queries to fetch incorrect results, or UPDATE and DELETE queries to generate incorrect database states. Existing approaches mainly focus on detecting logic bugs in SELECT queries. However, logic bugs in UPDATE and DELETE queries have not been tackled. In this paper, we propose a novel and general approach, which we have termed Differential Query Execution (DQE), to detect logic bugs in SELECT, UPDATE and DELETE queries of DBMSs. The core idea of DQE is that different SQL queries with the same predicate usually access the same rows in a database. For example, a row updated by an UPDATE query with a predicate ϕ should also be fetched by a SELECT query with the same predicate ϕ. If not, a logic bug is revealed in the target DBMS. To evaluate the effectiveness and generality of DQE, we apply DQE on five production-level DBMSs, i.e., MySQL, MariaDB, TiDB, CockroachDB and SQLite. In total, we have detected 50 unique bugs in these DBMSs, 41 of which have been confirmed, and 11 have been fixed. We expect that the simplicity and generality of DQE can greatly improve the reliability of DBMSs.
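The core differential check can be sketched in a few lines with SQLite: a SELECT and an UPDATE sharing a predicate should touch the same number of rows, and a mismatch flags a candidate logic bug. The table, column, and predicate here are invented for illustration.

```python
import sqlite3

def dqe_check(setup_sql, predicate):
    """Differential Query Execution sketch: count rows matched by a
    SELECT with predicate φ, then rows actually modified by an UPDATE
    with the same φ; a disagreement would reveal a logic bug."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    selected = conn.execute(
        f"SELECT COUNT(*) FROM t WHERE {predicate}").fetchone()[0]
    cur = conn.execute(f"UPDATE t SET flag = 1 WHERE {predicate}")
    updated = cur.rowcount  # rows the UPDATE actually touched
    conn.close()
    return selected, updated

sel, upd = dqe_check(
    "CREATE TABLE t (x INT, flag INT DEFAULT 0);"
    "INSERT INTO t (x) VALUES (1), (2), (3);",
    "x > 1")
```

On a correct DBMS both counts agree (here, two rows satisfy `x > 1`); running the same pair of queries against several engines and diffing the counts is what generalizes the oracle across MySQL, MariaDB, TiDB, CockroachDB, and SQLite.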

DOI: 10.1109/ICSE48619.2023.00175


Analyzing the Impact of Workloads on Modeling the Performance of Configurable Software Systems

作者: Mühlbauer, Stefan
关键词: No keywords

Abstract

Modern software systems often exhibit numerous configuration options to tailor them to user requirements, including the system’s performance behavior. Performance models derived via machine learning are an established approach for estimating and optimizing configuration-dependent software performance. Most existing approaches in this area rely on software performance measurements conducted with a single workload (i.e., input fed to a system). This single workload, however, is often not representative of a software system’s real-world application scenarios. Understanding to what extent configuration and workload—individually and combined—cause a software system’s performance to vary is key to understand whether performance models are generalizable across different configurations and workloads. Yet, so far, this aspect has not been systematically studied. To fill this gap, we conducted a systematic empirical study across 25 258 configurations from nine real-world configurable software systems to investigate the effects of workload variation on system-level performance and for individual configuration options. We explore driving causes for workload-configuration interactions by enriching performance observations with option-specific code coverage information. Our results demonstrate that workloads can induce substantial performance variation and interact with configuration options, often in non-monotonous ways. This limits not only the generalizability of single-workload models, but also challenges assumptions for existing transfer-learning techniques. As a result, workloads should be considered when building performance prediction models to maintain and improve representativeness and reliability.
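
A workload-configuration interaction of the kind the study describes can be shown with a toy model (the option, workloads, and runtimes below are hypothetical numbers, not measurements from the paper): the same option speeds up one workload and slows down another, so a model trained on a single workload mispredicts the option's effect.

```python
# Hypothetical runtimes (seconds) for one binary option under two workloads.
def runtime(cache_on, workload):
    if workload == "read-heavy":
        return 8.0 if cache_on else 12.0    # caching helps reads
    else:  # write-heavy: cache-maintenance overhead dominates
        return 14.0 if cache_on else 10.0

# Effect of enabling the option = runtime saved (negative means slowdown).
effect = {w: runtime(False, w) - runtime(True, w)
          for w in ("read-heavy", "write-heavy")}
print(effect)   # opposite signs -> workload-configuration interaction
```

The opposite signs are the crux: a single-workload performance model would label the option universally beneficial or harmful, which is exactly the generalizability gap the study quantifies.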

DOI: 10.1109/ICSE48619.2023.00176


Twins or False Friends? A Study on Energy Consumption and Performance of Configurable Software

作者: Weber, Max and Kaltenecker, Christian and Sattler, Florian and Apel, Sven and Siegmund, Norbert
关键词: No keywords

Abstract

Reducing energy consumption of software is an increasingly important objective, and there has been extensive research for data centers, smartphones, and embedded systems. However, when it comes to software, we lack working tools and methods to directly reduce energy consumption. For performance, we can resort to configuration options for tuning response time or throughput of a software system. For energy, it is still unclear whether the underlying assumption that runtime performance correlates with energy consumption holds, especially when it comes to optimization via configuration. To evaluate whether and to what extent this assumption is valid for configurable software systems, we conducted the largest empirical study of this kind to date. First, we searched the literature for reports on whether and why runtime performance correlates with energy consumption. We obtained a mixed, even contradicting picture from positive to negative correlation, and that configurability has not been considered yet as a factor for this variance. Second, we measured and analyzed both the runtime performance and energy consumption of 14 real-world software systems. We found that, in many cases, it depends on the software system’s configuration whether runtime performance and energy consumption correlate and that, typically, only a few configuration options influence the degree of correlation. A fine-grained analysis at the function level revealed that only a few functions are relevant to obtain an accurate proxy for energy consumption and that knowing them allows one to infer individual transfer factors between runtime performance and energy consumption.
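
The study's central question, whether runtime is a good proxy for energy, boils down to a correlation over configurations. A minimal sketch (measurement values are invented for illustration) computes Pearson correlation for two hypothetical configuration regions, one where energy tracks runtime and one where a power-hungry option decouples them:

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, stdlib only.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical measurements across five configurations of one system:
runtime     = [10, 12, 15, 18, 25]   # seconds
energy_corr = [30, 35, 44, 55, 76]   # joules; scales with runtime
energy_ind  = [50, 48, 51, 49, 50]   # joules; dominated by a fixed power draw

print(round(pearson(runtime, energy_corr), 2))
print(round(pearson(runtime, energy_ind), 2))
```

The first region correlates almost perfectly while the second barely correlates at all, which is the configuration-dependent behavior the paper reports: whether the "twins" assumption holds depends on which options are set.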

DOI: 10.1109/ICSE48619.2023.00177


Learning Deep Semantics for Test Completion

作者: Nie, Pengyu and Banerjee, Rahul and Li, Junyi Jessy and Mooney, Raymond J. and Gligoric, Milos
关键词: test completion, deep neural networks, programming language semantics

Abstract

Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test. We develop TECO—a deep learning model using code semantics for test completion. The key insight underlying TECO is that predicting the next statement in a test method requires reasoning about code execution, which is hard to do with only syntax-level data that existing code completion models use. TECO extracts and uses six kinds of code semantics data, including the execution result of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TECO, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TECO achieves an exact-match accuracy of 18, which is 29% higher than the best baseline using syntax-level data only. When measuring the functional correctness of the generated next statement, TECO can generate runnable code in 29% of the cases compared to 18% obtained by the best baseline. Moreover, TECO is significantly better than prior work on test oracle generation.

DOI: 10.1109/ICSE48619.2023.00178


SKCODER: A Sketch-Based Approach for Automatic Code Generation

作者: Li, Jia and Li, Yongmin and Li, Ge and Jin, Zhi and Hao, Yiyang and Hu, Xing
关键词: code generation, deep learning

Abstract

Recently, deep learning techniques have shown great success in automatic code generation. Inspired by code reuse, some researchers propose copy-based approaches that can copy the content from similar code snippets to obtain better performance. Practically, human developers recognize the content in the similar code that is relevant to their needs, which can be viewed as a code sketch. The sketch is further edited to the desired code. However, existing copy-based approaches ignore the code sketches and tend to repeat the similar code without necessary modifications, which leads to generating wrong results. In this paper, we propose a sketch-based code generation approach named SKCODER to mimic developers’ code reuse behavior. Given a natural language requirement, SKCODER retrieves a similar code snippet, extracts relevant parts as a code sketch, and edits the sketch into the desired code. Our motivation is that the extracted sketch provides a well-formed pattern for telling models “how to write”. The post-editing further adds requirement-specific details into the sketch and outputs the complete code. We conduct experiments on two public datasets and a new dataset collected by this work. We compare our approach to 20 baselines using 5 widely used metrics. Experimental results show that (1) SKCODER can generate more correct programs, and outperforms the state-of-the-art CodeT5-base by 30.30%, 35.39%, and 29.62% on three datasets. (2) Our approach is effective for multiple code generation models and improves them by up to 120.1% in Pass@1. (3) We investigate three plausible code sketches and discuss the importance of sketches. (4) We manually evaluate the generated code and prove the superiority of our SKCODER in three aspects.
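
The retrieve-sketch-edit idea can be made concrete with a tiny heuristic sketch extractor (the paper uses a learned extractor; the token-overlap rule, the `<hole>` marker, and the example snippet below are simplifications invented here): keep retrieved lines that share tokens with the requirement and blank out the rest as holes for the editing stage.

```python
import re

def extract_sketch(requirement, snippet_lines):
    """Keep lines sharing a token with the requirement; others become holes."""
    req = set(re.findall(r"\w+", requirement.lower()))
    out = []
    for line in snippet_lines:
        toks = set(re.findall(r"\w+", line.lower()))
        out.append(line if toks & req else "<hole>")
    return out

requirement = "return the maximum value in a list"
retrieved = [                 # a similar-but-wrong snippet (computes the min)
    "def find_min(xs):",
    "    best = xs[0]",
    "    for x in xs:",
    "        if x < best:",
    "            best = x",
    "    return best",
]
sketch = extract_sketch(requirement, retrieved)
print(sketch)
```

The loop skeleton and the `return` survive as the reusable pattern, while the min-specific lines become holes, which is the division of labor SKCODER exploits: the sketch says "how to write", the editor fills in "what to write".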

DOI: 10.1109/ICSE48619.2023.00179


An Empirical Comparison of Pre-Trained Models of Source Code

作者: Niu, Changan and Li, Chuanyi and Ng, Vincent and Chen, Dongxiao and Ge, Jidong and Luo, Bin
关键词: pre-training of source code, AI for SE

Abstract

While a large number of pre-trained models of source code have been successfully developed and applied to a variety of software engineering (SE) tasks in recent years, our understanding of these pre-trained models is arguably fairly limited. With the goal of advancing our understanding of these models, we perform the first systematic empirical comparison of 19 recently-developed pre-trained models of source code on 13 SE tasks. To gain additional insights into these models, we adopt a recently-developed 4-dimensional categorization of pre-trained models, and subsequently investigate whether there are correlations between different categories of pre-trained models and their performances on different SE tasks.

DOI: 10.1109/ICSE48619.2023.00180


On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot

作者: Mastropaolo, Antonio and Pascarella, Luca and Guglielmi, Emanuela and Ciniselli, Matteo and Scalabrino, Simone and Oliveto, Rocco and Bavota, Gabriele
关键词: empirical study, recommender systems

Abstract

Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step forward, also because of its unprecedented ability to automatically generate even entire functions from their natural language description. While the usefulness of Copilot is evident, it is still unclear to what extent it is robust. Specifically, we do not know the extent to which semantic-preserving changes in the natural language description provided to the model have an effect on the generated code function. In this paper we present an empirical study in which we aim at understanding whether different but semantically equivalent natural language descriptions result in the same recommended function. A negative answer would pose questions on the robustness of deep learning (DL)-based code generators since it would imply that developers using different wordings to describe the same code would obtain different recommendations. We asked Copilot to automatically generate 892 Java methods starting from their original Javadoc description. Then, we generated different semantically equivalent descriptions for each method both manually and automatically, and we analyzed the extent to which predictions generated by Copilot changed. Our results show that modifying the description results in different code recommendations in ~46% of cases. Also, differences in the semantically equivalent descriptions might impact the correctness of the generated code (±28%).

DOI: 10.1109/ICSE48619.2023.00181


Source Code Recommender Systems: The Practitioners’ Perspective

作者: Ciniselli, Matteo and Pascarella, Luca and Aghajani, Emad and Scalabrino, Simone and Oliveto, Rocco and Bavota, Gabriele
关键词: code recommender systems, empirical study, practitioners’ survey

Abstract

The automatic generation of source code is one of the long-lasting dreams in software engineering research. Several techniques have been proposed to speed up the writing of new code. For example, code completion techniques can recommend to developers the next few tokens they are likely to type, while retrieval-based approaches can suggest code snippets relevant for the task at hand. Also, deep learning has been used to automatically generate code statements starting from a natural language description. While research in this field is very active, there is no study investigating what the users of code recommender systems (i.e., software practitioners) actually need from these tools. We present a study involving 80 software developers to investigate the characteristics of code recommender systems they consider important. The output of our study is a taxonomy of 70 “requirements” that should be considered when designing code recommender systems. For example, developers would like the recommended code to use the same coding style as the code under development. Also, code recommenders being “aware” of the developers’ knowledge (e.g., what are the framework/libraries they already used in the past) and able to customize the recommendations based on this knowledge would be appreciated by practitioners. The taxonomy output of our study points to a wide set of future research directions for code recommenders.

DOI: 10.1109/ICSE48619.2023.00182


Safe Low-Level Code without Overhead is Practical

作者: Pirelli, Solal
关键词: programming languages, safety

Abstract

Developers write low-level systems code in unsafe programming languages due to performance concerns. The lack of safety causes bugs and vulnerabilities that safe languages avoid. We argue that safety without run-time overhead is possible through type invariants that prove the safety of potentially unsafe operations. We empirically show that Rust and C# can be extended with such features to implement safe network device drivers without run-time overhead, and that Ada has these features already.

DOI: 10.1109/ICSE48619.2023.00183


Sibyl: Improving Software Engineering Tools with SMT Selection

作者: Leeson, Will and Dwyer, Matthew B and Filieri, Antonio
关键词: graph neural networks, satisfiable modulo theories, algorithm selection

Abstract

SMT solvers are often used in the back end of different software engineering tools—e.g., program verifiers, test generators, or program synthesizers. There are a plethora of algorithmic techniques for solving SMT queries. Among the available SMT solvers, each employs its own combination of algorithmic techniques that are optimized for different fragments of logics and problem types. The most efficient solver can change with small changes in the SMT query, which makes it nontrivial to decide which solver to use. Consequently, designers of software engineering tools often select a single solver, based on familiarity or convenience, and tailor their tool towards it. Choosing an SMT solver at design time misses the opportunity to optimize query solve times and, for tools where SMT solving is a bottleneck, the performance loss can be significant. In this work, we present Sibyl, an automated SMT selector based on graph neural networks (GNNs). Sibyl creates a graph representation of a given SMT query and uses GNNs to predict how each solver in a suite of SMT solvers would perform on said query. Sibyl learns to predict based on features of SMT queries that are specific to the population on which it is trained - avoiding the need for manual feature engineering. Once trained, Sibyl makes fast and accurate predictions which can substantially reduce the time needed to solve a set of SMT queries. We evaluate Sibyl in four scenarios in which SMT solvers are used: in competition, in a symbolic execution engine, in a bounded model checker, and in a program synthesis tool. We find that Sibyl improves upon the state of the art in nearly every case and provide evidence that it generalizes better than existing techniques. Further, we evaluate Sibyl’s overhead and demonstrate that it has the potential to speed up a variety of different software engineering tools.
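
Stripped of the GNN, algorithm selection is "predict the fastest solver from query features". A toy nearest-neighbour selector (the two features, solver names, and training labels are entirely hypothetical; Sibyl itself learns graph features automatically) illustrates the mechanic:

```python
# Training data: (num assertions, quantifier depth) -> fastest solver label.
# Labels are hypothetical: "A" wins on quantifier-free, "B" on quantified queries.
train = [
    ((5, 0), "A"),
    ((8, 0), "A"),
    ((6, 3), "B"),
    ((9, 4), "B"),
]

def select(feat):
    # 1-nearest neighbour by squared Euclidean distance in feature space.
    return min(train, key=lambda t: sum((a - b) ** 2
                                        for a, b in zip(t[0], feat)))[1]

print(select((7, 0)), select((7, 4)))
```

A new quantifier-free query is routed to solver A and a deeply quantified one to solver B; Sibyl replaces both the hand-made features and the nearest-neighbour rule with a learned GNN, but the selection contract is the same.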

DOI: 10.1109/ICSE48619.2023.00184


CoCoSoDa: Effective Contrastive Learning for Code Search

作者: Shi, Ensheng and Wang, Yanlin and Gu, Wenchao and Du, Lun and Zhang, Hongyu and Han, Shi and Zhang, Dongmei and Sun, Hongbin
关键词: code search, contrastive learning, soft data augmentation, momentum mechanism

Abstract

Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still a lot of room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors in contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks or replaces some tokens in input sequences with their types to generate positive samples. A momentum mechanism is used to generate large and consistent representations of negative samples in a mini-batch through maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning is used to pull together representations of code-query pairs and push apart the unpaired code snippets and queries. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 18 baselines and especially exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) We adapt our techniques to several different pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT and observe a significant boost in their performance in code search. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore reasons behind the good performance of our model.
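
The soft data augmentation step is easy to sketch (the masking probability, the `<mask>` token, and the type tags below are illustrative choices, not CoCoSoDa's exact vocabulary): each token is randomly kept, masked, or replaced by its type tag, yielding a positive sample for contrastive training.

```python
import random

def soft_augment(tokens, types, p=0.3, seed=0):
    """Randomly mask a token or replace it with its type tag."""
    rng = random.Random(seed)
    out = []
    for tok, ty in zip(tokens, types):
        if rng.random() < p:
            # Half the time mask, half the time substitute the token's type.
            out.append("<mask>" if rng.random() < 0.5 else f"<{ty}>")
        else:
            out.append(tok)
    return out

code  = ["def", "add", "(", "a", ",", "b", ")", ":"]
types = ["kw", "ident", "punct", "ident", "punct", "ident", "punct", "punct"]
print(soft_augment(code, types))
```

Because only surface tokens change while structure and types are preserved, the augmented sequence remains semantically close to the original, which is what makes it a useful positive pair.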

DOI: 10.1109/ICSE48619.2023.00185


Coverage Guided Fault Injection for Cloud Systems

作者: Gao, Yu and Dou, Wensheng and Wang, Dong and Feng, Wenhan and Wei, Jun and Zhong, Hua and Huang, Tao
关键词: cloud system, crash recovery bug, fault injection, bug detection, fuzzing

Abstract

To support high reliability and availability, modern cloud systems are designed to be resilient to node crashes and reboots. That is, a cloud system should gracefully recover from node crashes/reboots and continue to function. However, node crashes/reboots that occur under special timing can trigger crash recovery bugs that lie in incorrect crash recovery protocols and their implementations. To ensure that a cloud system is free from crash recovery bugs, some fault injection approaches have been proposed to test whether a cloud system can correctly recover from various crash scenarios. These approaches are not effective in exploring the huge crash scenario space without developers’ knowledge. In this paper, we propose CrashFuzz, a fault injection testing approach that can effectively test crash recovery behaviors and reveal crash recovery bugs in cloud systems. CrashFuzz mutates the combinations of possible node crashes and reboots according to runtime feedback, and prioritizes the combinations that are prone to increase code coverage and trigger crash recovery bugs for smart exploration. We have implemented CrashFuzz and evaluated it on three popular open-source cloud systems, i.e., ZooKeeper, HDFS and HBase. CrashFuzz has detected 4 unknown bugs and 1 known bug. Compared with other fault injection approaches, CrashFuzz can detect more crash recovery bugs and achieve higher code coverage.
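
The coverage-feedback loop behind this style of fault injection can be sketched deterministically (the "system under test" and its coverage map are a made-up stand-in; CrashFuzz instruments real cloud systems): mutate a set of fault-injection points, keep any combination that reaches new recovery code, and mutate the kept seeds further.

```python
# Toy system: coverage reached depends on which injection points fire.
def run_with_faults(faults):
    cov = {"start"}
    if 1 in faults:
        cov.add("recover_path_1")
    if 1 in faults and 2 in faults:      # deeper path needs both faults
        cov.add("recover_path_2")
    return cov

seen_cov = set()
frontier = [frozenset()]                 # start with "no faults injected"
corpus = []                              # interesting fault combinations
while frontier:
    cand = frontier.pop()
    cov = run_with_faults(cand)
    if cov - seen_cov:                   # new coverage -> keep and mutate
        seen_cov |= cov
        corpus.append(cand)
        frontier.extend(cand ^ {p} for p in (1, 2, 3))

print(sorted(seen_cov))
```

The combination {1, 2} is only discovered by mutating the already-interesting seed {1}, which is the point of feedback guidance: deep recovery paths are reached incrementally rather than by blind sampling of the combination space.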

DOI: 10.1109/ICSE48619.2023.00186


DIVER: Oracle-Guided SMT Solver Testing with Unrestricted Random Mutations

作者: Kim, Jongwook and So, Sunbeom and Oh, Hakjoo
关键词: software testing, fuzzing, SMT solver

Abstract

We present DIVER, a novel technique for effectively finding critical bugs in SMT solvers. Ensuring the correctness of SMT solvers is becoming increasingly important as many applications use solvers as a foundational basis. In response, several approaches for testing SMT solvers, which are classified into differential testing and oracle-guided approaches, have been proposed to date. However, they are still unsatisfactory in that (1) differential testing approaches cannot validate unique yet important features of solvers, and (2) oracle-guided approaches cannot generate diverse tests due to their reliance on limited mutation rules. DIVER aims to complement these shortcomings, particularly focusing on finding bugs that are missed by existing approaches. To this end, we present a new testing technique that performs oracle-guided yet unrestricted random mutations. We have used DIVER to validate the most recent versions of three popular SMT solvers: CVC5, Z3 and dReal. In total, DIVER found 25 new bugs, of which 21 are critical and directly affect the reliability of the solvers. We also empirically prove DIVER’s own strength by showing that existing tools are unlikely to find the bugs discovered by DIVER.
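
The "oracle-guided yet unrestricted" idea can be illustrated with a toy (arithmetic predicates evaluated by `eval` stand in for SMT formulas and a real solver; the mutation rule is a simplification invented here): hold on to a concrete model known to satisfy the seed formula, mutate the formula freely, but only in ways that provably preserve truth under that model, then demand the "solver" still accepts every mutant.

```python
import random

# The oracle: a concrete model known to satisfy the seed formula.
model = {"x": 3, "y": 5}
seed_formula = "x + y == 8"

def solve(formula, env):
    # Stand-in "solver": evaluate the formula under an assignment.
    return eval(formula, {"__builtins__": {}}, dict(env))

rng = random.Random(0)
mutants = []
for _ in range(10):
    # Random atom grafted on: conjunction preserves truth when the atom
    # holds under the model, disjunction preserves it unconditionally.
    atom = f"x > {rng.randint(-5, 2)}"
    op = "and" if solve(atom, model) else "or"
    mutant = f"({seed_formula}) {op} ({atom})"
    # Metamorphic oracle: every mutant must remain satisfied by the model.
    assert solve(mutant, model), f"bug candidate: {mutant}"
    mutants.append(mutant)
print(len(mutants))
```

Because the mutation space is not limited to a fixed rule set, yet the oracle still knows the expected answer (SAT under the stored model), a solver that rejects a mutant has exposed a bug, without needing a second solver to differential-test against.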

DOI: 10.1109/ICSE48619.2023.00187


An Empirical Study of Deep Learning Models for Vulnerability Detection

作者: Steenhoek, Benjamin and Rahman, Md Mahbubur and Jiles, Richard and Le, Wei
关键词: deep learning, vulnerability detection, empirical study

Abstract

Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for vulnerability detection. In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR. We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models’ outputs. We investigated models trained for specific types of vulnerabilities compared to a model that is trained on all the vulnerabilities at once. We explored the types of programs DL may consider “hard” to handle. We investigated the relations of training data sizes and training data composition with model performance. Finally, we studied model interpretations and analyzed important features that the models used to make predictions. We believe that our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models. All of our datasets, code, and results are available at https://doi.org/10.6084/m9.figshare.20791240.

DOI: 10.1109/ICSE48619.2023.00188


DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection

作者: Wang, Wenbo and Nguyen, Tien N. and Wang, Shaohua and Li, Yi and Zhang, Jiyuan and Yadavally, Aashish
关键词: neural vulnerability detection, graph neural network, class separation

Abstract

The advances of machine learning (ML) including deep learning (DL) have enabled several approaches to implicitly learn vulnerable code patterns to automatically detect software vulnerabilities. A recent study showed that despite successes, the existing ML/DL-based vulnerability detection (VD) models are limited in the ability to distinguish between the two classes of vulnerability and benign code. We propose DEEPVD, a graph-based neural network VD model that emphasizes class-separation features between vulnerability and benign code. DEEPVD leverages three types of class-separation features at different levels of abstraction: statement types (similar to Part-of-Speech tagging), Post-Dominator Tree (covering regular flows of execution), and Exception Flow Graph (covering the exception and error-handling flows). We conducted several experiments to evaluate DEEPVD in a real-world vulnerability dataset of 303 projects with 13,130 vulnerable methods. Our results show that DEEPVD relatively improves over the state-of-the-art ML/DL-based VD approaches 13%–29.6% in precision, 15.6%–28.9% in recall, and 16.4%–25.8% in F-score. Our ablation study confirms that our designed features and components help DEEPVD achieve high class-separability for vulnerability and benign code.

DOI: 10.1109/ICSE48619.2023.00189


Enhancing Deep Learning-Based Vulnerability Detection by Building Behavior Graph Model

作者: Yuan, Bin and Lu, Yifan and Fang, Yilin and Wu, Yueming and Zou, Deqing and Li, Zhen and Li, Zhi and Jin, Hai
关键词: vulnerability detection, behavior graph, deep learning

Abstract

Software vulnerabilities have posed huge threats to cyberspace security, and there is an increasing demand for automated vulnerability detection (VD). In recent years, deep learning-based (DL-based) vulnerability detection systems have been proposed for the purpose of automatic feature extraction from source code. Although these methods can achieve ideal performance on synthetic datasets, the accuracy drops a lot when detecting real-world vulnerability datasets. Moreover, these approaches limit their scopes within a single function, being not able to leverage the information between functions. In this paper, we attempt to extract the function’s abstract behaviors, figure out the relationships between functions, and use this global information to assist DL-based VD to achieve higher performance. To this end, we build a Behavior Graph Model and use it to design a novel framework, namely VulBG. To examine the ability of our constructed Behavior Graph Model, we choose several existing DL-based VD models (e.g., TextCNN, ASTGRU, CodeBERT, Devign, and VulCNN) as our baseline models and conduct evaluations on two real-world datasets: the balanced FFMpeg+Qemu dataset and the unbalanced Chrome+Debian dataset. Experimental results indicate that VulBG enables all baseline models to detect more real vulnerabilities, thus improving the overall detection performance.

DOI: 10.1109/ICSE48619.2023.00190


Vulnerability Detection with Graph Simplification and Enhanced Graph Representation Learning

作者: Wen, Xin-Cheng and Chen, Yupan and Gao, Cuiyun and Zhang, Hongyu and Zhang, Jie M. and Liao, Qing
关键词: software vulnerability, graph simplification, graph representation learning

Abstract

Prior studies have demonstrated the effectiveness of Deep Learning (DL) in automated software vulnerability detection. Graph Neural Networks (GNNs) have proven effective in learning the graph representations of source code and are commonly adopted by existing DL-based vulnerability detection methods. However, the existing methods are still limited by the fact that GNNs inherently struggle to capture the connections between long-distance nodes in a code structure graph. Besides, they do not well exploit the multiple types of edges in a code structure graph (such as edges representing data flow and control flow). Consequently, despite achieving state-of-the-art performance, the existing GNN-based methods tend to fail to capture global information (i.e., long-range dependencies among nodes) of code graphs. To mitigate these issues, in this paper, we propose a novel vulnerability detection framework with grAph siMplification and enhanced graph rePresentation LEarning, named AMPLE. AMPLE mainly contains two parts: 1) graph simplification, which aims at reducing the distances between nodes by shrinking the node sizes of code structure graphs; 2) enhanced graph representation learning, which involves one edge-aware graph convolutional network module for fusing heterogeneous edge information into node representations and one kernel-scaled representation module for well capturing the relations between distant graph nodes. Experiments on three public benchmark datasets show that AMPLE outperforms the state-of-the-art methods by 0.39%-35.32% and 7.64%-199.81% with respect to the accuracy and F1 score metrics, respectively. The results demonstrate the effectiveness of AMPLE in learning global information of code graphs for vulnerability detection.
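
The graph-simplification idea, shrinking node counts so long-range dependencies become short-range, can be sketched with a toy chain-collapsing pass (this is an illustrative heuristic, not AMPLE's actual simplification rules): merge a node into its predecessor whenever the edge between them is the predecessor's only out-edge and the node's only in-edge.

```python
def collapse_chains(adj):
    """Collapse linear chains: a->b->c with no other edges becomes a->c."""
    adj = {u: set(vs) for u, vs in adj.items()}
    indeg = {}
    for vs in adj.values():
        for v in vs:
            indeg[v] = indeg.get(v, 0) + 1
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            vs = adj.get(u, set())
            if len(vs) == 1:                      # u's only out-edge
                (v,) = vs
                if indeg.get(v) == 1 and v in adj and v != u:
                    adj[u] = adj.pop(v)           # u absorbs v
                    changed = True
                    break
    return adj

# Toy code-structure graph: a straight-line prefix feeding a branch.
cfg = {"a": {"b"}, "b": {"c"}, "c": {"d", "e"}, "d": set(), "e": set()}
simplified = collapse_chains(cfg)
print(simplified)
```

The five-node graph shrinks to three nodes and the branch targets are now one hop from the entry, which is exactly why simplification helps a GNN with a fixed message-passing radius see "global" structure.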

DOI: 10.1109/ICSE48619.2023.00191


Does Data Sampling Improve Deep Learning-Based Vulnerability Detection? Yeas! and Nays!

作者: Yang, Xu and Wang, Shaowei and Li, Yi and Wang, Shaohua
关键词: vulnerability detection, deep learning, data sampling, interpretable AI

Abstract

Recent progress in Deep Learning (DL) has sparked interest in using DL to detect software vulnerabilities automatically, and promising results have been demonstrated at detecting vulnerabilities. However, one prominent and practical issue for vulnerability detection is data imbalance. A prior study observed that the performance of state-of-the-art (SOTA) DL-based vulnerability detection (DLVD) approaches drops precipitously on real-world imbalanced data, with a 73% drop of F1-score on average across studied approaches. Such a significant performance drop can disable the practical usage of any DLVD approaches. Data sampling is effective in alleviating data imbalance for machine learning models and has been demonstrated in various software engineering tasks. Therefore, in this study, we conducted a systematic and extensive study to assess the impact of data sampling for the data imbalance problem in DLVD from two aspects: i) the effectiveness of DLVD, and ii) the ability of DLVD to reason correctly (making a decision based on real vulnerable statements). We found that in general, oversampling outperforms undersampling, and sampling on raw data outperforms sampling on latent space, with random oversampling on raw data typically performing the best among all studied techniques (including the advanced ones SMOTE and OSS). Surprisingly, OSS does not help alleviate the data imbalance issue in DLVD. If recall is pursued, random undersampling is the best choice. Random oversampling on raw data also improves the ability of DLVD approaches to learn real vulnerable patterns. However, for a significant portion of cases (at least 33% in our datasets), DLVD approaches cannot ground their predictions in real vulnerable statements. We provide actionable suggestions and a roadmap to practitioners and researchers.
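
Random oversampling on raw data, the technique the study finds strongest overall, is a few lines of stdlib Python (the 1-vulnerable-to-9-benign toy dataset is illustrative): duplicate randomly chosen minority-class samples until the classes balance.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_cls = {}
    for xi, yi in zip(X, y):
        by_cls.setdefault(yi, []).append(xi)
    n_max = max(len(items) for items in by_cls.values())
    Xb, yb = [], []
    for cls, items in by_cls.items():
        picks = items + [rng.choice(items) for _ in range(n_max - len(items))]
        Xb += picks
        yb += [cls] * n_max
    return Xb, yb

X = list(range(10))
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 1 vulnerable vs. 9 benign samples
Xb, yb = random_oversample(X, y)
print(yb.count(0), yb.count(1))      # balanced: 9 9
```

Oversampling happens on the raw samples before any encoding, matching the paper's "raw data" setting as opposed to sampling in the model's latent space.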

DOI: 10.1109/ICSE48619.2023.00192


Incident-Aware Duplicate Ticket Aggregation for Cloud Systems

作者: Liu, Jinyang and He, Shilin and Chen, Zhuangbin and Li, Liqun and Kang, Yu and Zhang, Xu and He, Pinjia and Zhang, Hongyu and Lin, Qingwei and Xu, Zhangwei and Rajmohan, Saravan and Zhang, Dongmei and Lyu, Michael R.
关键词: duplicate tickets, incidents, cloud systems

Abstract

In cloud systems, incidents are potential threats to customer satisfaction and business revenue. When customers are affected by incidents, they often request customer support service (CSS) from the cloud provider by submitting a support ticket. Many tickets could be duplicate as they are reported in a distributed and uncoordinated manner. Thus, aggregating such duplicate tickets is essential for efficient ticket management. Previous studies mainly rely on tickets’ textual similarity to detect duplication; however, duplicate tickets in a cloud system could carry semantically different descriptions due to the complex service dependency of the cloud system. To tackle this problem, we propose iPACK, an incident-aware method for aggregating duplicate tickets by fusing the failure information between the customer side (i.e., tickets) and the cloud side (i.e., incidents). We extensively evaluate iPACK on three datasets collected from the production environment of a large-scale cloud platform, Azure. The experimental results show that iPACK can precisely and comprehensively aggregate duplicate tickets, achieving an F1 score of 0.871~0.935 and outperforming state-of-the-art methods by 12.4%~31.2%.
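
The key contrast with text-similarity aggregation can be shown in a few lines (ticket texts and incident links below are invented): once tickets are linked to the incident that caused them, duplicates group together even when their descriptions share no words.

```python
# Tickets with different wording; the ticket->incident links are hypothetical
# and stand in for the cloud-side correlation iPACK performs.
tickets = {
    "T1": "cannot log in to portal",
    "T2": "VM provisioning stuck",
    "T3": "authentication page times out",
}
incident_of = {"T1": "INC-7", "T2": "INC-9", "T3": "INC-7"}

# Aggregate by causing incident rather than by textual similarity.
groups = {}
for t in tickets:
    groups.setdefault(incident_of[t], []).append(t)
print(groups)
```

T1 and T3 describe the same outage in semantically different terms, so a text-similarity detector would miss them, yet incident-aware aggregation groups them under the same incident.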

DOI: 10.1109/ICSE48619.2023.00193


Large Language Models are Few-Shot Testers: Exploring LLM-Based General Bug Reproduction

作者: Kang, Sungmin and Yoon, Juyeon and Yoo, Shin
关键词: test generation, natural language processing, software engineering

Abstract

Many automated test generation techniques have been developed to aid developers with writing tests. To facilitate full automation, most existing techniques aim to either increase coverage, or generate exploratory inputs. However, existing test generation techniques largely fall short of achieving more semantic objectives, such as generating tests to reproduce a given bug report. Reproducing bugs is nonetheless important, as our empirical study shows that the number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size. Meanwhile, due to the difficulties of transforming the expected program semantics in bug reports into test oracles, existing failure reproduction techniques tend to deal exclusively with program crashes, a small subset of all bug reports. To automate test generation from general bug reports, we propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks. Since LLMs themselves cannot execute the target buggy code, we focus on post-processing steps that help us discern when LLMs are effective, and rank the produced tests according to their validity. Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases (251 out of 750), while suggesting a bug reproducing test in first place for 149 bugs. To mitigate data contamination (i.e., the possibility of the LLM simply remembering the test code either partially or in whole), we also evaluate LIBRO against 31 bug reports submitted after the collection of the LLM training data terminated: LIBRO produces bug reproducing tests for 32% of the studied bug reports. Overall, our results show LIBRO has the potential to significantly enhance developer efficiency by automatically generating tests from bug reports.
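
The post-processing stage, discard invalid candidates and rank bug-reproducing tests first, can be sketched with toy Python "tests" (the buggy function and candidate strings are invented; LIBRO works on Java and Defects4J): a candidate that fails to compile is discarded, and a candidate that fails on the buggy code is a reproduction candidate and is ranked ahead of passing ones.

```python
# Toy subject: a buggy add() that subtracts instead of adding.
def buggy_add(a, b):
    return a - b

candidates = [                          # hypothetical LLM-generated tests
    "assert buggy_add(2, 2) == 4",      # fails on the buggy code -> reproduces
    "assert buggy_add(0, 0) == 0",      # passes -> does not reproduce the bug
    "assert buggy_add(2, ==",           # syntactically invalid -> discarded
]

def status(src):
    try:
        compile(src, "<test>", "exec")
    except SyntaxError:
        return "invalid"
    try:
        exec(src, {"buggy_add": buggy_add})
        return "passes"
    except AssertionError:
        return "fails"

# Keep valid tests; rank failing (bug-reproducing) ones first.
ranked = sorted((s for s in candidates if status(s) != "invalid"),
                key=lambda s: status(s) != "fails")
print(status(ranked[0]))
```

The ranking signal needs no oracle beyond execution: a test that fails on the buggy version is, by construction, a candidate reproduction of the reported bug.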

DOI: 10.1109/ICSE48619.2023.00194


On the Reproducibility of Software Defect Datasets

作者: Zhu, Hao-Nan and Rubio-González, Cindy
关键词: software reproducibility, software defects, software maintenance, software quality

Abstract

Software defect datasets are crucial to facilitating the evaluation and comparison of techniques in fields such as fault localization, test generation, and automated program repair. However, software defect artifacts are not immune to breakage, which undermines their reproducibility. In this paper, we conduct a study on the reproducibility of software defect artifacts. First, we study five state-of-the-art Java defect datasets. Despite the multiple strategies applied by dataset maintainers to ensure reproducibility, all datasets are prone to breakages. Second, we conduct a case study in which we systematically test the reproducibility of 1,795 software artifacts over a 13-month period. We find that 62.6% of the artifacts break at least once, and 15.3% of the artifacts break multiple times. We manually investigate the root causes of breakages and handcraft 10 patches, which are automatically applied to 1,055 distinct artifacts in 2,948 fixes. Based on the nature of the root causes, we propose automated dependency caching and artifact isolation to prevent further breakage. In particular, we show that isolating artifacts to eliminate external dependencies increases reproducibility to 95% or higher, which is on par with the level of reproducibility exhibited by the most reliable manually curated dataset.

DOI: 10.1109/ICSE48619.2023.00195


Context-Aware Bug Reproduction for Mobile Apps

作者: Huang, Yuchao and Wang, Junjie and Liu, Zhe and Wang, Song and Chen, Chunyang and Li, Mingyang and Wang, Qing
关键词: No keywords

Abstract

Bug reports are vital for software maintenance, as they allow developers to be informed of the problems encountered in the software. Before fixing a bug, developers need to reproduce it, which is an extremely time-consuming and tedious task, so automating this process is highly desirable. However, doing so is challenging given the imprecise or incomplete natural language used to describe reproducing steps, and the missing or ambiguous information about GUI components. In this paper, we propose ScopeDroid, a context-aware bug reproduction approach that automatically reproduces crashes from textual bug reports for mobile apps. It first constructs a state transition graph (STG) and extracts the contextual information of components. We then design a multi-modal neural matching network to derive a fuzzy matching matrix between all candidate GUI events and the reproducing steps. With the STG and the matching information, it plans an exploration path for reproducing the bug, and enriches the initial STG iteratively. We evaluate the approach on 102 bug reports from 69 popular Android apps; it successfully reproduces 63.7% of the crashes, outperforming the state-of-the-art baselines by 32.6% and 38.3%. We also evaluate the usefulness and robustness of ScopeDroid with promising results. Furthermore, to train the neural matching network, we develop a heuristic-based automated training data generation method, which can potentially motivate and facilitate other activities involving user interface operations.

DOI: 10.1109/ICSE48619.2023.00196


Read It, Don’t Watch It: Captioning Bug Recordings Automatically

作者: Feng, Sidong and Xie, Mulong and Xue, Yinxing and Chen, Chunyang
关键词: bug recording, video captioning, android app

Abstract

Screen recordings of mobile applications are easy to capture and include a wealth of information, making them a popular mechanism for users to inform developers of the problems encountered in bug reports. However, watching bug recordings and efficiently understanding the semantics of user actions can be time-consuming and tedious for developers. Inspired by video subtitles in the movie industry, we present CAPdroid, a lightweight approach to caption bug recordings automatically. CAPdroid is a purely image-based and non-intrusive approach that uses image processing and convolutional deep learning models to segment bug recordings, infer user action attributes, and generate subtitle descriptions. Automated experiments demonstrate the good performance of CAPdroid in inferring user actions from recordings, and a user study confirms the usefulness of our generated step descriptions in assisting developers with bug replay.

DOI: 10.1109/ICSE48619.2023.00197


DUETCS: Code Style Transfer through Generation and Retrieval

作者: Chen, Binger and Abedjan, Ziawasch
关键词: No keywords

Abstract

Coding style has a direct impact on code comprehension. Automatically transferring code style to match a user’s preference or to enforce consistency can facilitate project cooperation and maintenance, as well as maximize the value of open-source code. Existing work on automating code stylization is either limited to code formatting or requires human supervision to pre-define style checking and transformation rules. In this paper, we present unsupervised methods to assist automatic code style transfer for arbitrary code styles. The main idea is to leverage a Big Code database to learn style and content embeddings separately, in order to generate or retrieve a piece of code with the same functionality and the desired target style. We carefully encode style and content features so that a style embedding can be learned from arbitrary code. We explored the capabilities of novel attention-based style generation models and meta-learning, and implemented our ideas in DUETCS. We complement the learning-based approach with a retrieval mode, which uses the same embeddings to directly search for the desired piece of code in Big Code. Our experiments show that DUETCS captures more style aspects than existing baselines.

DOI: 10.1109/ICSE48619.2023.00198


On the Applicability of Language Models to Block-Based Programs

作者: Griebl, Elisabeth and Fein, Benedikt and Obermüller, Florian
关键词: block-based programs, scratch, natural language model, code completion, bugram

Abstract

Block-based programming languages like SCRATCH are increasingly popular for programming education and end-user programming. Recent program analyses build on the insight that source code can be modelled using techniques from natural language processing. Many of the regularities of source code that support this approach are due to the syntactic overhead imposed by textual programming languages. This syntactic overhead, however, is precisely what block-based languages remove in order to simplify programming. Consequently, it is unclear how well this modelling approach performs on block-based programming languages. In this paper, we investigate the applicability of language models for the popular block-based programming language SCRATCH. We model SCRATCH programs using n-gram models, the most fundamental type of language model, and transformers, a popular deep learning model. Evaluation on the example tasks of code completion and bug finding confirms that blocks inhibit predictability, but that the use of language models is nevertheless feasible. Our findings serve as a foundation for improving tooling and analyses for block-based languages.
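An n-gram model of the kind evaluated above can be sketched in a few lines: count which token follows each context of n−1 tokens, then complete with the most frequent continuation. The token names below stand in for SCRATCH block identifiers and are purely illustrative.

```python
# Minimal n-gram language model over token sequences (toy sketch,
# not the paper's implementation).
from collections import defaultdict, Counter


def train_ngram(sequences, n=2):
    """Count next-token frequencies for each (n-1)-token context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(len(seq)):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    return counts


def complete(counts, context):
    """Return the most likely next token after `context`, or None."""
    dist = counts.get(tuple(context))
    return dist.most_common(1)[0][0] if dist else None
```

Code completion then amounts to calling `complete` with the tokens preceding the cursor; bug finding flags token sequences the model assigns very low probability.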

DOI: 10.1109/ICSE48619.2023.00199


MTTM: Metamorphic Testing for Textual Content Moderation Software

作者: Wang, Wenxuan and Huang, Jen-tse and Wu, Weibin and Zhang, Jianping and Huang, Yizhan and Li, Shuqing and He, Pinjia and Lyu, Michael R.
关键词: software testing, metamorphic relations, NLP software, textual content moderation

Abstract

The exponential growth of social media platforms such as Twitter and Facebook has revolutionized textual communication and textual content publication in human society. However, they have been increasingly exploited to propagate toxic content, such as hate speech, malicious advertising, and pornography, which can lead to highly negative impacts (e.g., harmful effects on teen mental health). Researchers and practitioners have been enthusiastically developing and extensively deploying textual content moderation software to address this problem. However, we find that malicious users can evade moderation by changing only a few words in the toxic content. Moreover, modern content moderation software’s performance against malicious inputs remains underexplored. To this end, we propose MTTM, a Metamorphic Testing framework for Textual content Moderation software. Specifically, we conduct a pilot study on 2,000 text messages collected from real users and summarize eleven metamorphic relations across three perturbation levels: character, word, and sentence. MTTM employs these metamorphic relations on toxic textual content to generate test cases, which are still toxic yet likely to evade moderation. In our evaluation, we employ MTTM to test three commercial textual content moderation software products and two state-of-the-art moderation algorithms against three kinds of toxic content. The results show that MTTM achieves up to 83.9%, 51%, and 82.5% error finding rates (EFR) when testing commercial moderation software provided by Google, Baidu, and Huawei, respectively, and it obtains up to 91.2% EFR when testing the state-of-the-art algorithms from academia. In addition, we leverage the test cases generated by MTTM to retrain the model we explored, which largely improves model robustness (0%–5.9% EFR) while maintaining the accuracy on the original test set. A demo is available online.
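Character-level perturbations of the kind the framework applies can be sketched as small text transforms whose output a human still reads as the original toxic message. The two operators below are illustrative assumptions for demonstration, not the paper's eleven metamorphic relations.

```python
# Illustrative character-level metamorphic perturbations in the spirit
# of MTTM (assumed operators, not the paper's exact relations).
import random


def char_swap(text, seed=0):
    """Swap two adjacent characters at a deterministic random position."""
    rng = random.Random(seed)
    chars = list(text)
    i = rng.randrange(len(chars) - 1)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def homoglyph(text):
    """Replace some Latin letters with visually similar characters."""
    table = str.maketrans({"o": "0", "i": "1", "e": "3"})
    return text.translate(table)
```

The metamorphic oracle is then: if the original message is flagged as toxic, the perturbed variant should be flagged too; a variant that slips through is an error-revealing test case.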

DOI: 10.1109/ICSE48619.2023.00200


Metamorphic Shader Fusion for Testing Graphics Shader Compilers

作者: Xiao, Dongwei and Liu, Zhibo and Wang, Shuai
关键词: No keywords

Abstract

Computer graphics are powered by graphics APIs (e.g., OpenGL, Direct3D) and their associated shader compilers, which render high-quality images by compiling and optimizing user-written high-level shader programs into GPU machine code. Graphics rendering is extensively used in production scenarios like virtual reality (VR), gaming, autonomous driving, and robotics. Despite being developed by industrial manufacturers such as Intel, Nvidia, and AMD, shader compilers — like traditional software — may produce ill-rendered outputs. In turn, these errors may have negative consequences, from poor user experience in entertainment to accidents in driving assistance systems. This paper introduces FSHADER, a metamorphic testing (MT) framework designed specifically for shader compilers to uncover erroneous compilations and optimizations. FSHADER tests shader compilers by mutating input shader programs via four carefully-designed metamorphic relations (MRs). In particular, FSHADER fuses two shader programs via an MR and checks the visual consistency between the image rendered from the fused shader program and the result of fusing the individually rendered images. Our study of 12 shader compilers covers five mainstream GPU vendors: Intel, AMD, Nvidia, ARM, and Apple. We successfully uncover over 16K error-triggering inputs that generate incorrect rendering outputs. We manually locate and characterize the buggy optimizations, and developers have confirmed representative bugs.
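The fusion relation above can be illustrated with a toy model: rendering a fused shader should produce the same image as fusing the individually rendered images. Here "shaders" are plain functions of pixel coordinates and the fusion operator is pointwise addition; both are stand-ins for the paper's real GPU pipeline.

```python
# Toy illustration of the fusion metamorphic relation:
# render(fuse(f, g)) should equal fuse(render(f), render(g)).


def render(shader, width=4, height=4):
    return [[shader(x, y) for x in range(width)] for y in range(height)]


def fuse_shaders(f, g):
    return lambda x, y: f(x, y) + g(x, y)


def fuse_images(a, b):
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def mr_holds(f, g):
    """Check the metamorphic relation for a pair of shaders."""
    return render(fuse_shaders(f, g)) == fuse_images(render(f), render(g))
```

On a buggy compiler, an optimization that mishandles the fused program would make `mr_holds` fail without any reference oracle being needed.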

DOI: 10.1109/ICSE48619.2023.00201


MorphQ: Metamorphic Testing of the Qiskit Quantum Computing Platform

作者: Paltenghi, Matteo and Pradel, Michael
关键词: No keywords

Abstract

As quantum computing is becoming increasingly popular, the underlying quantum computing platforms are growing both in ability and complexity. Unfortunately, testing these platforms is challenging due to the relatively small number of existing quantum programs and because of the oracle problem, i.e., a lack of specifications of the expected behavior of programs. This paper presents MorphQ, the first metamorphic testing approach for quantum computing platforms. Our two key contributions are (i) a program generator that creates a large and diverse set of valid (i.e., non-crashing) quantum programs, and (ii) a set of program transformations that exploit quantum-specific metamorphic relationships to alleviate the oracle problem. Evaluating the approach by testing the popular Qiskit platform shows that the approach creates over 8k program pairs within two days, many of which expose crashes. Inspecting the crashes, we find 13 bugs, nine of which have already been confirmed. MorphQ widens the slim portfolio of testing techniques of quantum computing platforms, helping to create a reliable software stack for this increasingly important field.
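A quantum-specific metamorphic transformation of the kind described above can be illustrated with a toy example: inserting a gate immediately followed by its own inverse (here, two X gates on the same qubit) must leave the circuit's behavior unchanged. Circuits are plain gate lists and the simulator only handles X gates; MorphQ operates on real Qiskit programs.

```python
# Toy self-inverse-insertion transformation and a minimal basis-state
# simulator (illustrative only; not MorphQ's implementation).


def insert_self_inverse(circuit, qubit, pos):
    """Insert an X;X pair (a no-op overall) at position `pos`."""
    return circuit[:pos] + [("x", qubit), ("x", qubit)] + circuit[pos:]


def simulate(circuit, n_qubits=1):
    """Track each qubit's computational-basis state under X gates."""
    state = [0] * n_qubits
    for gate, q in circuit:
        if gate == "x":
            state[q] ^= 1
    return state
```

If the platform produces different results for the original and transformed circuit, the metamorphic relation is violated and a bug is revealed, sidestepping the oracle problem.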

DOI: 10.1109/ICSE48619.2023.00202


Automating Code-Related Tasks Through Transformers: The Impact of Pre-Training

作者: Tufano, Rosalia and Pascarella, Luca and Bavota, Gabriele
关键词: pre-training, code recommenders

Abstract

Transformers have gained popularity in the software engineering (SE) literature. These deep learning models are usually pre-trained through a self-supervised objective, meant to provide the model with basic knowledge about a language of interest (e.g., Java). A classic pre-training objective is the masked language model (MLM), in which a percentage of tokens from the input (e.g., a Java method) is masked, with the model in charge of predicting them. Once pre-trained, the model is then fine-tuned to support the specific downstream task of interest (e.g., code summarization). While there is evidence of the boost in performance provided by pre-training, little is known about the impact of the specific pre-training objective(s) used. Indeed, MLM is just one of the possible pre-training objectives, and recent work from the natural language processing field suggests that pre-training objectives tailored to the specific downstream task of interest may substantially boost the model’s performance. For example, in the case of code summarization, a tailored pre-training objective could be the identification of an appropriate name for a given method, considering the method name as an extreme summary. In this study, we focus on the impact of pre-training objectives on the performance of transformers when automating code-related tasks. We start with a systematic literature review aimed at identifying the pre-training objectives used in SE. Then, we pre-train 32 transformers using both (i) generic pre-training objectives usually adopted in SE; and (ii) pre-training objectives tailored to the specific code-related tasks subject of our experimentation, namely bug-fixing, code summarization, and code completion. We also compare the pre-trained models with non-pre-trained ones and show the advantage brought by pre-training in different scenarios, in which more or less fine-tuning data is available. Our results show that: (i) pre-training helps boost performance only if the amount of available fine-tuning data is small; (ii) the MLM objective is usually sufficient to maximize the prediction performance of the model, even when compared with pre-training objectives specialized for the downstream task at hand.
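The MLM objective discussed above can be sketched framework-free: mask a fraction of input tokens and record which positions the model must predict. The tokenization and mask rate below are illustrative.

```python
# Minimal sketch of masked language model (MLM) input preparation.
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Mask ~mask_rate of tokens; return masked sequence and the
    position -> original-token targets the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets
```

A task-tailored objective, by contrast, would replace this generic corruption with a task signal, e.g., hiding only the method name when pre-training for summarization.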

DOI: 10.1109/ICSE48619.2023.00203


Log Parsing with Prompt-Based Few-Shot Learning

作者: Le, Van-Hoang and Zhang, Hongyu
关键词: log parsing, few-shot learning, prompt-tuning, deep learning

Abstract

Logs generated by large-scale software systems provide crucial information for engineers to understand the system status and diagnose problems of the systems. Log parsing, which converts raw log messages into structured data, is the first step to enabling automated log analytics. Existing log parsers extract the common part as log templates using statistical features. However, these log parsers often fail to identify the correct templates and parameters because: 1) they often overlook the semantic meaning of log messages, and 2) they require domain-specific knowledge for different log datasets. To address the limitations of existing methods, in this paper, we propose LogPPT to capture the patterns of templates using prompt-based few-shot learning. LogPPT utilises a novel prompt tuning method to recognise keywords and parameters based on a small amount of labelled log data. In addition, an adaptive random sampling algorithm is designed to select a small yet diverse training set. We have conducted extensive experiments on 16 public log datasets. The experimental results show that LogPPT is effective and efficient for log parsing.
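The goal of selecting a small yet diverse training set can be sketched as a greedy farthest-point loop: repeatedly pick the candidate log farthest (by token-set distance) from everything already selected. The Jaccard distance and greedy strategy here are assumptions for illustration, not LogPPT's exact algorithm.

```python
# Greedy diverse-sampling sketch for picking logs to label.


def jaccard_distance(a, b):
    sa, sb = set(a.split()), set(b.split())
    return 1 - len(sa & sb) / len(sa | sb)


def select_diverse(logs, k):
    """Greedily select k logs, maximizing distance to the chosen set."""
    selected = [logs[0]]
    while len(selected) < k:
        best = max(
            (l for l in logs if l not in selected),
            key=lambda l: min(jaccard_distance(l, s) for s in selected),
        )
        selected.append(best)
    return selected
```

Labelling only such a diverse subset keeps annotation cost low while still covering the variety of templates in the dataset.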

DOI: 10.1109/ICSE48619.2023.00204


Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning

作者: Nashid, Noor and Sintaha, Mifta and Mesbah, Ali
关键词: large language models, transformers, few-shot learning, program repair, test assertion generation

Abstract

Large language models trained on massive code corpora can generalize to new tasks without the need for task-specific fine-tuning. In few-shot learning, these models take as input a prompt, composed of natural language instructions, a few instances of task demonstration, and a query and generate an output. However, the creation of an effective prompt for code-related tasks in few-shot learning has received little attention. We present a technique for prompt creation that automatically retrieves code demonstrations similar to the developer task, based on embedding or frequency analysis. We apply our approach, CEDAR, to two different programming languages, statically and dynamically typed, and two different tasks, namely, test assertion generation and program repair. For each task, we compare CEDAR with state-of-the-art task-specific and fine-tuned models. The empirical results show that, with only a few relevant code demonstrations, our prompt creation technique is effective in both tasks with an accuracy of 76% and 52% for exact matches in test assertion generation and program repair tasks, respectively. For assertion generation, CEDAR outperforms existing task-specific and fine-tuned models by 333% and 11%, respectively. For program repair, CEDAR yields 189% better accuracy than task-specific models and is competitive with recent fine-tuned models. These findings have practical implications for practitioners, as CEDAR could potentially be applied to multilingual and multitask settings without task or language-specific training with minimal examples and effort.
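The retrieval step described above can be sketched with frequency-based cosine similarity: score each stored demonstration against the developer's query and place the top matches into the prompt. The token-count similarity and prompt layout below are illustrative assumptions, not CEDAR's exact embedding pipeline.

```python
# Sketch of retrieval-based prompt construction for few-shot learning.
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity over whitespace-token frequency vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def build_prompt(query, demos, k=2):
    """Select the k demos most similar to `query` and lay out the prompt."""
    top = sorted(demos, key=lambda d: cosine(query, d["code"]), reverse=True)[:k]
    shots = "\n".join(f"Input: {d['code']}\nOutput: {d['fix']}" for d in top)
    return f"{shots}\nInput: {query}\nOutput:"
```

The LLM then completes the final `Output:` line, guided by demonstrations that resemble the task at hand.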

DOI: 10.1109/ICSE48619.2023.00205


An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry

作者: Jiang, Wenxin and Synovic, Nicholas and Hyatt, Matt and Schorlemmer, Taylor R. and Sethi, Rohan and Lu, Yung-Hsiang and Thiruvathukal, George K. and Davis, James C.
关键词: software reuse, empirical software engineering, machine learning, deep learning, software supply chain, engineering decision making, cybersecurity, trust

Abstract

Deep Neural Networks (DNNs) are being adopted as components in software systems. Creating and specializing DNNs from scratch has grown increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, machine learning engineers have begun to reuse large-scale pre-trained models (PTMs) and fine-tune these models for downstream tasks. Prior works have studied reuse practices for traditional software packages to guide software engineers towards better package maintenance and dependency management, but we lack a similar foundation of knowledge to guide behaviors in pre-trained model ecosystems. In this work, we present the first empirical investigation of PTM reuse. We interviewed 12 practitioners from the most popular PTM ecosystem, Hugging Face, to learn the practices and challenges of PTM reuse. From this data, we model the decision-making process for PTM reuse. Based on the identified practices, we describe useful attributes for model reuse, including provenance, reproducibility, and portability. Three challenges for PTM reuse are missing attributes, discrepancies between claimed and actual performance, and model risks. We substantiate these identified challenges with systematic measurements in the Hugging Face ecosystem. Our work informs future directions for optimizing deep learning ecosystems by automatically measuring useful attributes and potential attacks, and envisions future research on infrastructure and standardization for model registries.

DOI: 10.1109/ICSE48619.2023.00206


ContraBERT: Enhancing Code Pre-Trained Models via Contrastive Learning

作者: Liu, Shangqing and Wu, Bozhi and Xie, Xiaofei and Meng, Guozhu and Liu, Yang
关键词: code pre-trained models, contrastive learning, model robustness

Abstract

Large-scale pre-trained models such as CodeBERT and GraphCodeBERT have earned widespread attention from both academia and industry. Attributed to their superior ability in code representation, they have been further applied in multiple downstream tasks such as clone detection, code search, and code translation. However, it has also been observed that these state-of-the-art pre-trained models are susceptible to adversarial attacks: their performance drops significantly under simple perturbations such as renaming variables. This weakness may be inherited by their downstream models and thereby amplified at an unprecedented scale. To this end, we propose ContraBERT, an approach that aims to improve the robustness of pre-trained models via contrastive learning. Specifically, we design nine kinds of simple and complex data augmentation operators on programming language (PL) and natural language (NL) data to construct different variants. Furthermore, we continue to train the existing pre-trained models with masked language modeling (MLM) and a contrastive pre-training task on the original samples and their augmented variants to enhance the robustness of the model. Extensive experiments demonstrate that ContraBERT can effectively improve the robustness of existing pre-trained models. A further study also confirms that these robustness-enhanced models provide improvements over the original models on four popular downstream tasks.
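One of the simpler augmentation operators mentioned above, variable renaming, produces a semantically equivalent code variant to pair with the original during contrastive pre-training. A real implementation would rename via the AST; the word-boundary regex version below is a deliberate simplification.

```python
# Token-level variable-renaming augmentation (simplified sketch; an
# AST-based rename would be used in practice).
import re


def rename_variable(code, old, new):
    """Rename identifier `old` to `new`, matching whole words only so
    that identifiers merely containing `old` are left untouched."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)
```

The contrastive objective then pulls the representations of `code` and its renamed variant together while pushing unrelated samples apart, so that a later rename-based attack no longer moves the embedding much.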

DOI: 10.1109/ICSE48619.2023.00207


DStream: A Streaming-Based Highly Parallel IFDS Framework

作者: Wang, Xizao and Zuo, Zhiqiang and Bu, Lei and Zhao, Jianhua
关键词: interprocedural static analysis, IFDS analysis, streaming, data-parallel computation

Abstract

The IFDS framework supports interprocedural dataflow analysis with distributive flow functions over finite domains. A large class of interprocedural dataflow analysis problems can be formulated as IFDS problems and thus can be solved precisely with the IFDS framework. Unfortunately, scaling IFDS analysis to large-scale programs is challenging in terms of both massive memory consumption and low analysis efficiency. This paper presents DStream, a scalable system dedicated to precise and highly parallel IFDS analysis for large-scale programs. DStream leverages a streaming-based out-of-core computation model to significantly reduce the memory footprint and adopts fine-grained data parallelism to achieve efficiency. We implemented a taint analysis as a DStream instance analysis and compared DStream with three state-of-the-art tools. Our experiments validate that DStream outperforms all other tools, with average speedups from 4.37x to 14.46x on a commodity PC with limited available memory. Meanwhile, the experiments confirm that DStream successfully scales to large-scale programs which the state-of-the-art tools (e.g., FlowDroid and/or DiskDroid) fail to analyze.

DOI: 10.1109/ICSE48619.2023.00208


(Partial) Program Dependence Learning

作者: Yadavally, Aashish and Nguyen, Tien N. and Wang, Wenbo and Wang, Shaohua
关键词: neural partial program analysis, neural program dependence analysis, neural networks, deep learning

Abstract

Code fragments from developer forums often migrate to applications due to the practice of code reuse. Owing to the incomplete nature of such programs, analyzing them to determine the presence of potential vulnerabilities early is challenging. In this work, we introduce NEURALPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In our empirical evaluation, NEURALPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-scores for partial Java and C/C++ code range from 94.29%–97.17% and 92.46%–96.01%, respectively. We also test the usefulness of the PDGs predicted by NEURALPDA (i.e., PDG*) on the downstream task of method-level vulnerability detection. We find that the performance of a vulnerability detection tool utilizing PDG* is only 1.1% lower than that of one utilizing the PDGs generated by a program analysis tool. We also report the detection of 14 real-world vulnerable code snippets from StackOverflow by a machine learning-based vulnerability detection tool that employs the PDGs predicted by NEURALPDA for these code snippets.

DOI: 10.1109/ICSE48619.2023.00209


MirrorTaint: Practical Non-Intrusive Dynamic Taint Tracking for JVM-Based Microservice Systems

作者: Ouyang, Yicheng and Shao, Kailai and Chen, Kunqiu and Shen, Ruobing and Chen, Chao and Xu, Mingze and Zhang, Yuqun and Zhang, Lingming
关键词: dynamic taint analysis, microservice, JVM

Abstract

Taint analysis, i.e., labeling data and propagating the labels through data flows, has been widely used for analyzing program information flows and ensuring system/data security. Due to its important applications, various taint analysis techniques have been proposed, including static and dynamic taint analysis. However, existing taint analysis techniques can hardly be applied to the rising microservice systems used in industrial applications. To address this problem, in this paper we propose MirrorTaint, the first practical non-intrusive dynamic taint analysis technique to extensively support microservice systems on JVMs. In particular, by instrumenting the microservice systems, MirrorTaint constructs a set of data structures, with their respective policies for labeling/propagating taints, in its mirrored space. These data structures are essentially non-intrusive, i.e., they modify no program meta-data or runtime system. Then, during program execution, MirrorTaint replicates the stack-based JVM instruction execution in its mirrored space on the fly for dynamic taint tracking. We have evaluated MirrorTaint against state-of-the-art dynamic and static taint analysis systems on various popular open-source microservice systems. The results demonstrate that MirrorTaint achieves better compatibility, comparable precision, and higher recall (precision/recall: 97.9%/100.0%) than the state-of-the-art Phosphor (100.0%/9.9%) and FlowDroid (100.0%/28.2%). MirrorTaint also incurs lower runtime overhead than Phosphor (although both are dynamic techniques). Moreover, we performed a case study at Ant Group, a global billion-user FinTech company, comparing MirrorTaint with their mature developer-experience-based data checking system for automatically generated fund documents. The result shows that developer experience can be incomplete, causing the data checking system to cover only 84.0% of all data relations, while MirrorTaint automatically finds 99.0% of the relations with 100.0% precision. Lastly, we also applied MirrorTaint to successfully detect the recently widespread Log4j2 security vulnerability.
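The mirrored-space idea above can be illustrated with a toy shadow structure: program values are never touched, while a parallel map carries taint labels and unions them alongside each assignment. This is a simplification for demonstration, not MirrorTaint's JVM-level mechanism.

```python
# Toy non-intrusive taint tracking via a mirrored shadow map.


class MirrorTaintMap:
    def __init__(self):
        self.labels = {}  # variable name -> set of taint labels

    def source(self, var, label):
        """Mark `var` as coming from a taint source."""
        self.labels[var] = {label}

    def assign(self, dst, *srcs):
        """Mirror `dst = f(srcs...)`: union the operands' labels."""
        self.labels[dst] = set().union(
            *(self.labels.get(s, set()) for s in srcs))

    def is_tainted(self, var):
        return bool(self.labels.get(var))
```

Because all bookkeeping lives in this side structure, the analyzed program's own data layout and runtime stay unmodified, which is the non-intrusiveness property the abstract emphasizes.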

DOI: 10.1109/ICSE48619.2023.00210


VULGEN: Realistic Vulnerability Generation Via Pattern Mining and Deep Learning

作者: Nong, Yu and Ou, Yuzhe and Pradel, Michael and Chen, Feng and Cai, Haipeng
关键词: software vulnerability, data generation, bug injection, pattern mining, deep learning, vulnerability detection

Abstract

Building new, powerful data-driven defenses against prevalent software vulnerabilities requires sizable, quality vulnerability datasets, as does large-scale benchmarking of existing defense solutions. Automatic data generation would promisingly meet this need, yet little work has aimed at generating the much-needed quality vulnerable samples, and existing similar, adaptable techniques suffer critical limitations for that purpose. In this paper, we present VULGEN, the first injection-based vulnerability-generation technique that is not limited to a particular class of vulnerabilities. VULGEN combines the strengths of deterministic (pattern-based) and probabilistic (deep-learning/DL-based) program transformation approaches while mutually overcoming their respective weaknesses. This is achieved through close collaboration between pattern mining/application and DL-based injection localization, which separates the concerns of how and where to inject. By leveraging a large, pre-trained programming language model and only learning locations, VULGEN mitigates its own need for quality vulnerability data (for training the localization model). Extensive evaluations show that VULGEN significantly outperforms a state-of-the-art (SOTA) pattern-based peer technique as well as both Transformer- and GNN-based approaches in terms of the percentages of generated samples that are vulnerable and of those that also exactly match the ground truth (by 38.0–430.1% and 16.3–158.2%, respectively). The VULGEN-generated samples led to substantial performance improvements for two SOTA DL-based vulnerability detectors (up to 31.8% higher in F1), close to those brought by the ground-truth real-world samples and much higher than those brought by the same numbers of existing synthetic samples.

DOI: 10.1109/ICSE48619.2023.00211


Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects

作者: Zhang, Lyuye and Liu, Chengwei and Xu, Zhengzi and Chen, Sen and Fan, Lingling and Zhao, Lida and Wu, Jiahui and Liu, Yang
关键词: remediation, compatibility, Java, open-source software

Abstract

With the increasing disclosure of vulnerabilities in open-source software, software composition analysis (SCA) has been widely applied to reveal third-party libraries and the associated vulnerabilities in software projects. Beyond this revelation, SCA tools adopt various remediation strategies to fix vulnerabilities, the quality of which varies substantially. However, ineffective remediation could induce side effects, such as compilation failures, which impede acceptance by users. According to our studies, existing SCA tools cannot correctly handle users’ concerns regarding the compatibility of remediated projects. To this end, we propose Compatible Remediation of Third-party libraries (CORAL) for Maven projects to fix vulnerabilities without breaking the projects. The evaluation showed that CORAL not only fixed 87.56% of vulnerabilities, outperforming other tools (best: 75.32%), but also achieved a 98.67% successful compilation rate and a 92.96% successful unit test rate. Furthermore, we found that 78.45% of vulnerabilities in popular Maven projects could be fixed without breaking the compilation, while the remaining vulnerabilities (21.55%) could either only be fixed by upgrades that break the compilation or could not be fixed by upgrading at all.
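The compatibility-aware choice described above can be sketched as a version-selection rule: among available versions that contain the fix, prefer the smallest one in the same major version as the current dependency (a semver-style heuristic assumed here for illustration; CORAL's actual analysis is considerably richer).

```python
# Sketch of picking a compatible fix version for a vulnerable dependency.
# Versions are (major, minor, patch) tuples.


def pick_fix_version(current, fixed_in, available):
    """Prefer the smallest available version >= fixed_in that shares the
    current major version; fall back to any safe version, else None."""
    safe = sorted(v for v in available if v >= fixed_in)
    compatible = [v for v in safe if v[0] == current[0]]
    if compatible:
        return compatible[0]
    return safe[0] if safe else None
```

Returning `None` corresponds to the last category in the abstract: vulnerabilities that cannot be fixed by upgrading at all.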

DOI: 10.1109/ICSE48619.2023.00212


Automated Black-Box Testing of Mass Assignment Vulnerabilities in RESTful APIs

作者: Corradini, Davide and Pasqua, Michele and Ceccato, Mariano
关键词: REST API, security testing, black-box testing, automated software testing, mass assignment

Abstract

Mass assignment is one of the most prominent vulnerabilities in RESTful APIs; it originates from a misconfiguration in common web frameworks, which allows attackers to exploit naming conventions and automatic binding to craft malicious requests that (massively) override data supposed to be read-only. In this paper, we adopt a black-box testing perspective to automatically detect mass assignment vulnerabilities in RESTful APIs. Execution scenarios are generated purely from the OpenAPI specification, which lists the available operations and their message formats. Clustering is used to group similar operations and reveal read-only fields; the latter are candidates for mass assignment. Then, test interaction sequences are automatically generated by instantiating abstract testing templates, with the aim of using the found read-only fields to carry out a mass assignment attack. Test interactions are run, and their execution is assessed by a specific oracle in order to reveal whether the vulnerability could be successfully exploited. The proposed approach has been implemented and evaluated on a set of case studies written in different programming languages. The evaluation highlights that the approach is quite effective in detecting seeded vulnerabilities, with remarkably high accuracy.
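The core idea can be sketched in two steps: fields that appear in an operation's responses but never in its request bodies are read-only candidates, and a probing request that injects them tests for mass assignment. The field names below are illustrative; this is not a full OpenAPI parser.

```python
# Sketch of read-only field detection and attack-request crafting for
# mass assignment testing (illustrative field names and values).


def readonly_candidates(request_fields, response_fields):
    """Fields returned by the API but never accepted in requests."""
    return sorted(set(response_fields) - set(request_fields))


def craft_attack(base_body, candidates, injected_value="tampered"):
    """Inject read-only candidates into a valid request body."""
    body = dict(base_body)
    for field in candidates:
        body[field] = injected_value  # attempt to override read-only data
    return body
```

The oracle then re-reads the resource: if any injected field now holds the attacker-controlled value, the mass assignment attack succeeded.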

DOI: 10.1109/ICSE48619.2023.00213


CoLeFunDa: Explainable Silent Vulnerability Fix Identification

作者: Zhou, Jiayuan and Pacheco, Michael and Chen, Jinfu and Hu, Xing and Xia, Xin and Lo, David and Hassan, Ahmed E.
关键词: No keywords

Abstract

It is common practice for OSS users to leverage and monitor security advisories to discover newly disclosed OSS vulnerabilities and their corresponding patches for vulnerability remediation. Vulnerability fixes are commonly publicly available one week earlier than their disclosure. This time gap provides an opportunity for attackers to exploit the vulnerability. Hence, OSS users need to sense the fix as early as possible so that the vulnerability can be remediated before it is exploited. However, OSS projects commonly adopt a vulnerability disclosure policy that causes the majority of vulnerabilities to be fixed silently, meaning the commit with the fix does not indicate any vulnerability information. In this case, even if a fix is identified, it is hard for OSS users to understand the vulnerability and evaluate its potential impact. To improve early sensing of vulnerabilities, identifying silent fixes and providing their corresponding explanations (e.g., the corresponding common weakness enumeration (CWE) and exploitability rating) are equally important. However, it is challenging to identify silent fixes and provide explanations due to the limited and diverse data. To tackle this challenge, we propose CoLeFunDa, a framework consisting of a contrastive learner and FunDa, a novel approach for function change data augmentation. FunDa first augments the fix data (i.e., code changes) at the function level with unsupervised and supervised strategies. Then the contrastive learner leverages contrastive learning to effectively train a function change encoder, FCBERT, from the diverse fix data. Finally, we leverage FCBERT to fine-tune three downstream tasks: silent fix identification, CWE category classification, and exploitability rating classification. Our results show that CoLeFunDa outperforms all the state-of-the-art baselines on all downstream tasks. We also conducted a survey to verify the effectiveness of CoLeFunDa in practical usage. The results show that CoLeFunDa can categorize 62.5% (25 out of 40) of CVEs with correct CWE categories within its top 2 recommendations.

DOI: 10.1109/ICSE48619.2023.00214


Finding Causally Different Tests for an Industrial Control System

作者: Poskitt, Christopher M. and Chen, Yuqi and Sun, Jun and Jiang, Yu
关键词: cyber-physical systems, fuzzing, test diversity, equivalence classes, causality

Abstract

Industrial control systems (ICSs) are types of cyber-physical systems in which programs, written in languages such as ladder logic or structured text, control industrial processes through sensing and actuating. Given the use of ICSs in critical infrastructure, it is important to test their resilience against manipulations of sensor/actuator inputs. Unfortunately, existing methods fail to test them comprehensively, as they typically focus on finding the simplest-to-craft manipulations for a testing goal, and are also unable to determine when a test is simply a minor permutation of another, i.e., based on the same causal events. In this work, we propose a guided fuzzing approach for finding ‘meaningfully different’ tests for an ICS via a general formalisation of sensor/actuator-manipulation strategies. Our algorithm identifies the causal events in a test, generalises them to an equivalence class, and then updates the fuzzing strategy so as to find new tests that are causally different from those already identified. An evaluation of our approach on a real-world water treatment system shows that it is able to find 106% more causally different tests than the most comparable fuzzer. While we focus on diversifying the test suite of an ICS, our formalisation may be useful for other fuzzers that intercept communication channels.
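The core idea of deduplicating tests by causal events can be sketched in a few lines. This is a deliberate simplification of the paper's formalisation (which generalises causal events into richer equivalence classes); the component names (`MV101`, `LIT101`, `P301`) are illustrative placeholders, not taken from the paper.

```python
from collections import defaultdict

def equivalence_classes(tests):
    """Group tests whose manipulations rest on the same set of causal events.

    Two tests that differ only in ordering or in non-causal noise land in the
    same class, so the fuzzer can be steered toward genuinely new classes.
    """
    classes = defaultdict(list)
    for name, causal_events in tests:
        classes[frozenset(causal_events)].append(name)
    return classes

tests = [
    ("t1", ["open valve MV101", "spoof sensor LIT101"]),
    ("t2", ["spoof sensor LIT101", "open valve MV101"]),  # same causes, reordered
    ("t3", ["force pump P301 off"]),
]
print(len(equivalence_classes(tests)))  # 2 classes: t1 and t2 are causally the same
```

A fuzzer using such classes would reward only inputs whose causal-event set maps to a previously unseen key, rather than counting every raw test as new coverage.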

DOI: 10.1109/ICSE48619.2023.00215


Doppelgänger Test Generation for Revealing Bugs in Autonomous Driving Software

作者: Huai, Yuqi and Chen, Yuntianyi and Almanee, Sumaya and Ngo, Tuan and Liao, Xiang and Wan, Ziwen and Chen, Qi Alfred and Garcia, Joshua
关键词: cyber-physical systems, autonomous driving systems, search-based software testing

Abstract

Vehicles controlled by autonomous driving software (ADS) are expected to bring many social and economic benefits, but they are not yet broadly used due to concerns about their safety. Virtual tests, in which autonomous vehicles are tested in software simulation, are common practice because they are more efficient and safer than field operational tests. Specifically, search-based approaches are used to find particularly critical situations. These approaches make it possible to generate tests automatically; however, systematically producing bug-revealing tests for an ADS remains a major challenge. To address this challenge, we introduce DoppelTest, a test generation approach for ADSes that utilizes a genetic algorithm to discover bug-revealing violations by generating scenarios with multiple autonomous vehicles that account for traffic controls (e.g., traffic signals and stop signs). Our extensive evaluation shows that DoppelTest efficiently discovers 123 bug-revealing violations for a production-grade ADS (Baidu Apollo), which we classify into 8 unique bug categories.

DOI: 10.1109/ICSE48619.2023.00216


Generating Realistic and Diverse Tests for LiDAR-Based Perception Systems

作者: Christian, Garrett and Woodlief, Trey and Elbaum, Sebastian
关键词: software testing and validation, machine learning

Abstract

Autonomous systems rely on a perception component to interpret their surroundings, and when misinterpretations occur, they can and have led to serious and fatal system-level failures. Yet, existing methods for testing perception software remain limited both in their capacity to efficiently generate test data that translates to real-world performance and in their diversity to capture the long tail of rare but safety-critical scenarios. These limitations are particularly evident for perception systems based on LiDAR sensors, which have emerged as a crucial component in modern autonomous systems due to their ability to provide a 3D scan of the world and operate in all lighting conditions. To address these limitations, we introduce a novel approach for testing LiDAR-based perception systems. It leverages existing real-world data as a basis to generate realistic and diverse test cases, through mutations that preserve realism invariants while producing inputs rarely found in existing data sets, and it automatically crafts oracles that identify potentially safety-critical issues in perception performance. We implemented our approach and assessed its ability to identify perception failures, generating over 50,000 test inputs for five state-of-the-art LiDAR-based perception systems. We found that it efficiently generates test cases that yield perception errors which could have real consequences if these systems were deployed, and that it does so at a low rate of false positives.

DOI: 10.1109/ICSE48619.2023.00217


Rules of Engagement: Why and How Companies Participate in OSS

作者: Guizani, Mariam and Castro-Guzman, Aileen Abril and Sarma, Anita and Steinmacher, Igor
关键词: open source, OSS, companies in open source, motivations, diversity

Abstract

Company engagement in open source (OSS) is now the new norm. From large technology companies to startups, companies are participating in the OSS ecosystem by open-sourcing their technology, sponsoring projects through funding, or contributing paid developer time. However, our understanding of the OSS ecosystem is rooted in the “old world” model where individual contributors sustain OSS projects. In this work, we create a more comprehensive understanding of the hybrid OSS landscape by investigating what motivates companies to contribute and how they contribute to OSS. We conducted interviews with 20 participants who hold different roles (e.g., CEO, OSPO Lead, Ecosystem Strategist) at 17 different companies, ranging from large companies (e.g., Microsoft, RedHat, Google, Spotify) to startups. Data from semi-structured interviews reveal that company motivations can be categorized into four levels (Founders’ Vision, Reputation, Business Advantage, and Reciprocity) and that companies participate through different mechanisms (e.g., Developers’ Time, Mentoring Time, Advocacy & Promotion Time), each of which ties to the different types of motivations. We hope our findings nudge more companies to participate in the OSS ecosystem, helping make it robust, diverse, and sustainable.

DOI: 10.1109/ICSE48619.2023.00218


An Empirical Study on Software Bill of Materials: Where We Stand and the Road Ahead

作者: Xia, Boming and Bi, Tingting and Xing, Zhenchang and Lu, Qinghua and Zhu, Liming
关键词: software bill of materials, SBOM, bill of materials, responsible AI, empirical study

Abstract

The rapid growth of software supply chain attacks has attracted considerable attention to the software bill of materials (SBOM). SBOMs are a crucial building block for the transparency of software supply chains, which helps improve software supply chain security. Although there have been significant efforts from academia and industry to facilitate SBOM development, it is still unclear how practitioners perceive SBOMs and what challenges arise in adopting SBOMs in practice. Furthermore, existing SBOM-related studies tend to be ad hoc and lack a software engineering focus. To bridge this gap, we conducted the first empirical study that interviews and surveys SBOM practitioners. We applied a mixed qualitative and quantitative method, gathering data from 17 interviewees and 65 survey respondents from 15 countries across five continents to understand how practitioners perceive the SBOM field. We summarized 26 statements and grouped them into three topics on SBOM's state of practice. Based on the study results, we derived a goal model and highlighted future directions where practitioners can direct their efforts.
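For readers unfamiliar with the artifact the study examines, a minimal SBOM can be sketched as follows. The field names follow the public CycloneDX JSON format (one of the common SBOM standards, alongside SPDX); the component listed is a made-up example, not data from the study.

```python
import json

# A minimal CycloneDX-style SBOM: the document declares its format and lists
# the components (here, a single third-party library) that make up a product.
sbom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.4",
    "version": 1,
    "components": [
        {
            "type": "library",
            "name": "jackson-databind",  # example dependency, not from the paper
            "version": "2.12.7",
            "purl": "pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.12.7",
        }
    ],
}
print(json.dumps(sbom, indent=2))
```

The `purl` (package URL) is what lets downstream tools match a listed component against vulnerability databases, which is why the transparency the abstract mentions translates into concrete security benefit.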

DOI: 10.1109/ICSE48619.2023.00219




    1. Abstract
  • Does the Stream API Benefit from Special Debugging Facilities? A Controlled Experiment on Loops and Streams with Specific Debuggers
    1. Abstract
  • Fonte: Finding Bug Inducing Commits from Failures
    1. Abstract
  • RepresentThemAll: A Universal Learning Representation of Bug Reports
    1. Abstract
  • Demystifying Exploitable Bugs in Smart Contracts
    1. Abstract
  • Understanding and Detecting On-the-Fly Configuration Bugs
    1. Abstract
  • Explaining Software Bugs Leveraging Code Structures in Neural Machine Translation
    1. Abstract
  • Is it Enough to Recommend Tasks to Newcomers? Understanding Mentoring on Good First Issues
    1. Abstract
  • From Organizations to Individuals: Psychoactive Substance Use by Professional Programmers
    1. Abstract
  • On the Self-Governance and Episodic Changes in Apache Incubator Projects: An Empirical Study
    1. Abstract
  • Socio-Technical Anti-Patterns in Building ML-Enabled Software: Insights from Leaders on the Forefront
    1. Abstract
  • Moving on from the Software Engineers’ Gambit: An Approach to Support the Defense of Software Effort Estimates
    1. Abstract
  • Concrat: An Automatic C-to-Rust Lock API Translator for Concurrent Programs
    1. Abstract
  • Triggers for Reactive Synthesis Specifications
    1. Abstract
  • Using Reactive Synthesis: An End-to-End Exploratory Case Study
    1. Abstract
  • Syntax and Domain Aware Model for Unsupervised Program Translation
    1. Abstract
  • Developer-Intent Driven Code Comment Generation
    1. Abstract
  • Data Quality Matters: A Case Study of Obsolete Comment Detection
    1. Abstract
  • Revisiting Learning-Based Commit Message Generation
    1. Abstract
  • Commit Message Matters: Investigating Impact and Evolution of Commit Message Quality
    1. Abstract
  • PILAR: Studying and Mitigating the Influence of Configurations on Log Parsing
    1. Abstract
  • Did We Miss Something Important? Studying and Exploring Variable-Aware Log Abstraction
    1. Abstract
  • On the Temporal Relations between Logging and Code
    1. Abstract
  • How Do Developers’ Profiles and Experiences Influence their Logging Practices? An Empirical Study of Industrial Practitioners
    1. Abstract
  • When to Say What: Learning to Find Condition-Message Inconsistencies
    1. Abstract
  • SemParser: A Semantic Parser for Log Analytics
    1. Abstract
  • Badge: Prioritizing UI Events with Hierarchical Multi-Armed Bandits for Automated UI Testing
    1. Abstract
  • Efficiency Matters: Speeding Up Automated Testing with GUI Rendering Inference
    1. Abstract
  • CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-Trained Large Language Models
    1. Abstract
  • TaintMini: Detecting Flow of Sensitive Data in Mini-Programs with Static Taint Analysis
    1. Abstract
  • AChecker: Statically Detecting Smart Contract Access Control Vulnerabilities
    1. Abstract
  • Fine-Grained Commit-Level Vulnerability Type Prediction by CWE Tree Structure
    1. Abstract
  • Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation
    1. Abstract
  • Reusing Deep Neural Network Models through Model Re-Engineering
    1. Abstract
  • PyEvolve: Automating Frequent Code Changes in Python ML Systems
    1. Abstract
  • DeepArc: Modularizing Neural Networks for the Model Maintenance
    1. Abstract
  • Decomposing a Recurrent Neural Network into Modules for Enabling Reusability and Replacement
    1. Abstract
  • Chronos: Time-Aware Zero-Shot Identification of Libraries from Vulnerability Reports
    1. Abstract
  • Understanding the Threats of Upstream Vulnerabilities to Downstream Projects in the Maven Ecosystem
    1. Abstract
  • SecBench.js: An Executable Security Benchmark Suite for Server-Side JavaScript
    1. Abstract
  • On Privacy Weaknesses and Vulnerabilities in Software Systems
    1. Abstract
  • Detecting Exception Handling Bugs in C++ Programs
    1. Abstract
  • Learning to Boost Disjunctive Static Bug-Finders
    1. Abstract
  • Predicting Bugs by Monitoring Developers during Task Execution
    1. Abstract
  • Detecting Isolation Bugs via Transaction Oracle Construction
    1. Abstract
  • SmallRace: Static Race Detection for Dynamic Languages - A Case on Smalltalk
    1. Abstract
  • “STILL AROUND”: Experiences and Survival Strategies of Veteran Women Software Developers
    1. Abstract
  • When and Why Test Generators for Deep Learning Produce Invalid Inputs: An Empirical Study
    1. Abstract
  • Fuzzing Automatic Differentiation in Deep-Learning Libraries
    1. Abstract
  • Lightweight Approaches to DNN Regression Error Reduction: An Uncertainty Alignment Perspective
    1. Abstract
  • Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion
    1. Abstract
  • Code Review of Build System Specifications: Prevalence, Purposes, Patterns, and Perceptions
    1. Abstract
  • Better Automatic Program Repair by Using Bug Reports and Tests Together
    1. Abstract
  • CCTest: Testing and Repairing Code Completion Systems
    1. Abstract
  • KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair
    1. Abstract
  • Rete: Learning Namespace Representation for Program Repair
    1. Abstract
  • AI-Based Question Answering Assistance for Analyzing Natural-Language Requirements
    1. Abstract
  • Strategies, Benefits and Challenges of App Store-Inspired Requirements Elicitation
    1. Abstract
  • Data-Driven Recurrent Set Learning for Non-termination Analysis
    1. Abstract
  • Compiling Parallel Symbolic Execution with Continuations
    1. Abstract
  • Verifying Data Constraint Equivalence in FinTech Systems
    1. Abstract
  • Tolerate Control-Flow Changes for Sound Data Race Prediction
    1. Abstract
  • Fill in the Blank: Context-Aware Automated Text Input Generation for Mobile GUI Testing
    1. Abstract
  • Detecting Dialog-Related Keyboard Navigation Failures in Web Applications
    1. Abstract
  • Columbus: Android App Testing through Systematic Callback Exploration
    1. Abstract
  • GameRTS: A Regression Testing Framework for Video Games
    1. Abstract
  • Autonomy Is an Acquired Taste: Exploring Developer Preferences for GitHub Bots
    1. Abstract
  • Flexible and Optimal Dependency Management via Max-SMT
    1. Abstract
  • Impact of Code Language Models on Automated Program Repair
    1. Abstract
  • Tare: Type-Aware Neural Program Repair
    1. Abstract
  • Template-Based Neural Program Repair
    1. Abstract
  • Automated Repair of Programs from Large Language Models
    1. Abstract
  • Automated Program Repair in the Era of Large Pre-Trained Language Models
    1. Abstract
  • Faster or Slower? Performance Mystery of Python Idioms Unveiled with Empirical Evidence
    1. Abstract
  • Testability Refactoring in Pull Requests: Patterns and Trends
    1. Abstract
  • Usability-Oriented Design of Liquid Types for Java
    1. Abstract
  • Towards Understanding Fairness and its Composition in Ensemble Machine Learning
    1. Abstract
  • Fairify: Fairness Verification of Neural Networks
    1. Abstract
  • Leveraging Feature Bias for Scalable Misprediction Explanation of Machine Learning Models
    1. Abstract
  • Information-Theoretic Testing and Debugging of Fairness Defects in Deep Neural Networks
    1. Abstract
  • Demystifying Privacy Policy of Third-Party Libraries in Mobile Apps
    1. Abstract
  • Cross-Domain Requirements Linking via Adversarial-Based Domain Adaptation
    1. Abstract
  • On-Demand Security Requirements Synthesis with Relational Generative Adversarial Networks
    1. Abstract
  • Measuring Secure Coding Practice and Culture: A Finger Pointing at the Moon is Not the Moon
    1. Abstract
  • What Challenges Do Developers Face about Checked-in Secrets in Software Artifacts?
    1. Abstract
  • Lejacon: A Lightweight and Efficient Approach to Java Confidential Computing on SGX
    1. Abstract
  • Keyword Extraction from Specification Documents for Planning Security Mechanisms
    1. Abstract
  • Dependency Facade: The Coupling and Conflicts between Android Framework and Its Customization
    1. Abstract
  • Test Selection for Unified Regression Testing
    1. Abstract
  • ATM: Black-Box Test Case Minimization Based on Test Code Similarity and Evolutionary Search
    1. Abstract
  • Measuring and Mitigating Gaps in Structural Testing
    1. Abstract
  • Heterogeneous Anomaly Detection for Software Systems via Semi-Supervised Cross-Modal Attention
    1. Abstract
  • Recommending Root-Cause and Mitigation Steps for Cloud Incidents Using Large Language Models
    1. Abstract
  • Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-Source Data
    1. Abstract
  • LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly
    1. Abstract
  • Aries: Efficient Testing of Deep Neural Networks via Labeling-Free Accuracy Estimation
    1. Abstract
  • CC: Causality-Aware Coverage Criterion for Deep Neural Networks
    1. Abstract
  • Balancing Effectiveness and Flakiness of Non-Deterministic Machine Learning Tests
    1. Abstract
  • Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
    1. Abstract
  • Reliability Assurance for Deep Neural Network Architectures against Numerical Defects
    1. Abstract
  • Demystifying Issues, Challenges, and Solutions for Multilingual Software Development
    1. Abstract
  • Automated Summarization of Stack Overflow Posts
    1. Abstract
  • Semi-Automatic, Inline and Collaborative Web Page Code Curations
    1. Abstract
  • Identifying Key Classes for Initial Software Comprehension: Can We Do it Better?
    1. Abstract
  • Improving API Knowledge Discovery with ML: A Case Study of Comparable API Methods
    1. Abstract
  • Evidence Profiles for Validity Threats in Program Comprehension Experiments
    1. Abstract
  • Developers’ Visuo-Spatial Mental Model and Program Comprehension
    1. Abstract
  • Two Sides of the Same Coin: Exploiting the Impact of Identifiers in Neural Code Comprehension
    1. Abstract
  • SeeHow: Workflow Extraction from Programming Screencasts through Action-Aware Video Analytics
    1. Abstract
  • AidUI: Toward Automated Recognition of Dark Patterns in User Interfaces
    1. Abstract
  • Carving UI Tests to Generate API Tests and API Specification
    1. Abstract
  • Ex Pede Herculem: Augmenting Activity Transition Graph for Apps via Graph Convolution Network
    1. Abstract
  • Sustainability is Stratified: Toward a Better Theory of Sustainable Software Engineering
    1. Abstract
  • DLInfer: Deep Learning with Static Slicing for Python Type Inference
    1. Abstract
  • ViolationTracker: Building Precise Histories for Static Analysis Violations
    1. Abstract
  • Compiler Test-Program Generation via Memoized Configuration Search
    1. Abstract
  • Generating Test Databases for Database-Backed Applications
    1. Abstract
  • Testing Database Engines via Query Plan Guidance
    1. Abstract
  • Testing Database Systems via Differential Query Execution
    1. Abstract
  • Analyzing the Impact of Workloads on Modeling the Performance of Configurable Software Systems
    1. Abstract
  • Twins or False Friends? A Study on Energy Consumption and Performance of Configurable Software
    1. Abstract
  • Learning Deep Semantics for Test Completion
    1. Abstract
  • SKCODER: A Sketch-Based Approach for Automatic Code Generation
    1. Abstract
  • An Empirical Comparison of Pre-Trained Models of Source Code
    1. Abstract
  • On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot
    1. Abstract
  • Source Code Recommender Systems: The Practitioners’ Perspective
    1. Abstract
  • Safe Low-Level Code without Overhead is Practical
    1. Abstract
  • Sibyl: Improving Software Engineering Tools with SMT Selection
    1. Abstract
  • CoCoSoDa: Effective Contrastive Learning for Code Search
    1. Abstract
  • Coverage Guided Fault Injection for Cloud Systems
    1. Abstract
  • DIVER: Oracle-Guided SMT Solver Testing with Unrestricted Random Mutations
    1. Abstract
  • An Empirical Study of Deep Learning Models for Vulnerability Detection
    1. Abstract
  • DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection
    1. Abstract
  • Enhancing Deep Learning-Based Vulnerability Detection by Building Behavior Graph Model
    1. Abstract
  • Vulnerability Detection with Graph Simplification and Enhanced Graph Representation Learning
    1. Abstract
  • Does Data Sampling Improve Deep Learning-Based Vulnerability Detection? Yeas! and Nays!
    1. Abstract
  • Incident-Aware Duplicate Ticket Aggregation for Cloud Systems
    1. Abstract
  • Large Language Models are Few-Shot Testers: Exploring LLM-Based General Bug Reproduction
    1. Abstract
  • On the Reproducibility of Software Defect Datasets
    1. Abstract
  • Context-Aware Bug Reproduction for Mobile Apps
    1. Abstract
  • Read It, Don’t Watch It: Captioning Bug Recordings Automatically
    1. Abstract
  • DUETCS: Code Style Transfer through Generation and Retrieval
    1. Abstract
  • On the Applicability of Language Models to Block-Based Programs
    1. Abstract
  • MTTM: Metamorphic Testing for Textual Content Moderation Software
    1. Abstract
  • Metamorphic Shader Fusion for Testing Graphics Shader Compilers
    1. Abstract
  • MorphQ: Metamorphic Testing of the Qiskit Quantum Computing Platform
    1. Abstract
  • Automating Code-Related Tasks Through Transformers: The Impact of Pre-Training
    1. Abstract
  • Log Parsing with Prompt-Based Few-Shot Learning
    1. Abstract
  • Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning
    1. Abstract
  • An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry
    1. Abstract
  • ContraBERT: Enhancing Code Pre-Trained Models via Contrastive Learning
    1. Abstract
  • DStream: A Streaming-Based Highly Parallel IFDS Framework
    1. Abstract
  • (Partial) Program Dependence Learning
    1. Abstract
  • MirrorTaint: Practical Non-Intrusive Dynamic Taint Tracking for JVM-Based Microservice Systems
    1. Abstract
  • VULGEN: Realistic Vulnerability Generation Via Pattern Mining and Deep Learning
    1. Abstract
  • Compatible Remediation on Vulnerabilities from Third-Party Libraries for Java Projects
    1. Abstract
  • Automated Black-Box Testing of Mass Assignment Vulnerabilities in RESTful APIs
    1. Abstract
  • CoLeFunDa: Explainable Silent Vulnerability Fix Identification
    1. Abstract
  • Finding Causally Different Tests for an Industrial Control System
    1. Abstract
  • Doppelgänger Test Generation for Revealing Bugs in Autonomous Driving Software
    1. Abstract
  • Generating Realistic and Diverse Tests for LiDAR-Based Perception Systems
    1. Abstract
  • Rules of Engagement: Why and How Companies Participate in OSS
    1. Abstract
  • An Empirical Study on Software Bill of Materials: Where We Stand and the Road Ahead
    1. Abstract