ASPLOS 2024 | 逸翎清晗🌈

Accurate Disassembly of Complex Binaries Without Use of Compiler Metadata

Authors: Priyadarshan, Soumyakant and Nguyen, Huan and Sekar, R.
Keywords: No keywords

Abstract

Accurate disassembly of stripped binaries is the first step in binary analysis, instrumentation and reverse engineering. Complex instruction sets such as the x86 pose major challenges in this context because it is very difficult to distinguish between code and embedded data. To make progress, many recent approaches have either made optimistic assumptions (e.g., absence of embedded data) or relied on additional compiler-generated metadata (e.g., relocation info and/or exception handling metadata). Unfortunately, many complex binaries do contain embedded data, while lacking the additional metadata needed by these techniques. We therefore present a novel approach for accurate disassembly that uses statistical properties of data to detect code, and behavioral properties of code to flag data. We present new static analysis and data-driven probabilistic techniques that are then combined using a prioritized error correction algorithm to achieve results that are 3X to 4X more accurate than the best previous results.

DOI: 10.1145/3623278.3624766


BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Authors: Hellsten, Erik Orm and Souza, Artur and Lenfers, Johannes and Lacouture, Rubens and Hsu, Olivia and Ejjeh, Adel and Kjolstad, Fredrik and Steuwer, Michel and Olukotun, Kunle and Nardi, Luigi
Keywords: compiler optimizations, high-performance computing, bayesian optimization, autotuning, autoscheduling

Abstract

We introduce the Bayesian Compiler Optimization framework (BaCO), a general purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. Particularly, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized towards the autotuning domain. We demonstrate BaCO’s effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs respectively. For these domains, BaCO outperforms current state-of-the-art auto-tuners by delivering on average 1.36X–1.56X faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.9X–3.9X faster.
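
A minimal sketch of the kind of constrained autotuning search space described above, covering permutation, ordered, and continuous parameters with a known constraint. Random sampling stands in for BaCO's Bayesian optimization, and all parameter names and the cost model are hypothetical, not BaCO's API.

```python
# Illustrative constrained autotuning loop over mixed parameter types.
# Random search is a stand-in for Bayesian optimization; names are hypothetical.
import itertools
import random

LOOP_ORDERS = list(itertools.permutations(["i", "j", "k"]))  # permutation parameter
TILE_SIZES = [8, 16, 32, 64, 128]                            # ordered parameter

def satisfies_constraints(cfg):
    # Example of a known constraint: the tile must fit a (hypothetical) 64 KB scratchpad.
    return cfg["tile"] * cfg["tile"] * 4 <= 64 * 1024

def measure_runtime(cfg):
    # Placeholder for compiling and timing the generated code.
    tile_penalty = abs(cfg["tile"] - 32) / 32
    order_penalty = 0.0 if cfg["order"][-1] == "k" else 0.3
    return 1.0 + tile_penalty + order_penalty + cfg["unroll_prob"] * 0.1

def autotune(budget=50, seed=0):
    rng = random.Random(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(budget):
        cfg = {
            "order": rng.choice(LOOP_ORDERS),
            "tile": rng.choice(TILE_SIZES),
            "unroll_prob": rng.random(),      # continuous parameter
        }
        if not satisfies_constraints(cfg):    # reject invalid points up front
            continue
        t = measure_runtime(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

if __name__ == "__main__":
    print(autotune())
```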

DOI: 10.1145/3623278.3624770


CPS: A Cooperative Para-virtualized Scheduling Framework for Manycore Machines

Authors: Liu, Yuxuan and Xu, Tianqiang and Mi, Zeyu and Hua, Zhichao and Zang, Binyu and Chen, Haibo
Keywords: para-virtualized scheduling, cache group, manycore machine, performance scalability

Abstract

Today’s cloud platforms offer large virtual machine (VM) instances with multiple virtual CPUs (vCPU) on manycore machines. These machines typically have a deep memory hierarchy to enhance communication between cores. Although previous research has primarily focused on addressing the performance scalability issues caused by the double scheduling problem in virtualized environments, it has mainly concentrated on solving the preemption problem of synchronization primitives and the traditional NUMA architecture. This paper specifically targets a new aspect of scalability issues caused by the absence of runtime hypervisor-internal states (RHS). We demonstrate two typical RHS problems, namely the invisible pCPU (physical CPU) load and dynamic cache group mapping. These RHS problems result in a collapse in VM performance and low CPU utilization because the guest VM lacks visibility into the latest runtime internal states maintained by the hypervisor, such as pCPU load and vCPU-pCPU mappings. Consequently, the guest VM makes inefficient scheduling decisions. To address the RHS issue, we argue that the solution lies in exposing the latest scheduling decisions made by both the guest and host schedulers to each other. Hence, we present a cooperative para-virtualized scheduling framework called CPS, which facilitates the proactive exchange of timely scheduling information between the hypervisor and guest VMs. To ensure effective scheduling decisions for VMs, a series of techniques are proposed based on the exchanged information. We have implemented CPS in Linux KVM and have designed corresponding solutions to tackle the two RHS problems. Evaluation results demonstrate that CPS significantly improves the performance of PARSEC by 81.1% and FxMark by 1.01x on average for the two identified problems.

DOI: 10.1145/3623278.3624762


DataFlower: Exploiting the Data-flow Paradigm for Serverless Workflow Orchestration

Authors: Li, Zijun and Xu, Chuhao and Chen, Quan and Zhao, Jieru and Chen, Chen and Guo, Minyi
Keywords: FaaS, function-as-a-Service, serverless workflow, workflow orchestration, control-flow paradigm, data-flow paradigm

Abstract

Serverless computing that runs functions with auto-scaling is a popular task execution pattern in the cloud-native era. By connecting serverless functions into workflows, tenants can achieve complex functionality. Prior research adopts the control-flow paradigm to orchestrate a serverless workflow. However, the control-flow paradigm inherently results in long response latency, due to the heavy data persistence overhead, sequential resource usage, and late function triggering. Our investigation shows that the data-flow paradigm has the potential to resolve the above problems, with careful design and optimization. We propose DataFlower, a scheme that achieves the data-flow paradigm for serverless workflows. In DataFlower, a container is abstracted to be a function logic unit and a data logic unit. The function logic unit runs the functions, and the data logic unit handles the data transmission asynchronously. Moreover, a host-container collaborative communication mechanism is used to support efficient data transfer. Our experimental results show that compared to state-of-the-art serverless designs, DataFlower reduces the 99%-ile latency of the benchmarks by up to 35.4%, and improves the peak throughput by up to 3.8X.
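
A toy asyncio sketch of the data-flow idea: each stage's data-handling side forwards output downstream as soon as it is produced, so the next function is triggered by data arrival rather than by a central control-flow orchestrator. The stage names and payloads are hypothetical; this is not DataFlower's host-container mechanism.

```python
# Toy data-flow pipeline: downstream stages start as soon as data arrives.
import asyncio

async def stage(name, inbox, outbox, work):
    while True:
        item = await inbox.get()          # triggered by data arrival
        if item is None:                  # end-of-stream marker
            if outbox is not None:
                await outbox.put(None)
            break
        result = await work(item)
        if outbox is not None:
            await outbox.put(result)      # asynchronous hand-off downstream
        else:
            print(f"{name} produced {result}")

async def main():
    q1, q2 = asyncio.Queue(), asyncio.Queue()
    async def resize(x): return f"resized({x})"
    async def detect(x): return f"detected({x})"
    tasks = [
        asyncio.create_task(stage("resize", q1, q2, resize)),
        asyncio.create_task(stage("detect", q2, None, detect)),
    ]
    for frame in ["img0", "img1", "img2"]:
        await q1.put(frame)
    await q1.put(None)
    await asyncio.gather(*tasks)

asyncio.run(main())
```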

DOI: 10.1145/3623278.3624755


DREAM: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads

Authors: Kim, Seah and Kwon, Hyoukjun and Song, Jinook and Jo, Jihyuck and Chen, Yu-Hsin and Lai, Liangzhen and Chandra, Vikas
Keywords: scheduler, AR/VR, multi-model ML, hardware-software co-design

Abstract

Emerging real-time multi-model ML (RTMM) workloads such as AR/VR and drone control involve dynamic behaviors at various granularities: task, model, and layers within a model. Such dynamic behaviors introduce new challenges to the system software in an ML system since the overall system load is not completely predictable, unlike traditional ML workloads. In addition, RTMM workloads require real-time processing, involve highly heterogeneous models, and target resource-constrained devices. Under such circumstances, developing an effective scheduler gains more importance to better utilize underlying hardware considering the unique characteristics of RTMM workloads. Therefore, we propose a new scheduler, DREAM, which effectively handles various dynamicity in RTMM workloads targeting multi-accelerator systems. DREAM quantifies the unique requirements for RTMM workloads and utilizes the quantified scores to drive scheduling decisions, considering the current system load and other inference jobs on different models and input frames. DREAM utilizes tunable parameters that provide fast and effective adaptivity to dynamic workload changes. In our evaluation of five scenarios of RTMM workloads, DREAM reduces the overall UXCost, which is an equivalent metric of the energy-delay product (EDP) for RTMM defined in the paper, by 32.2% and 50.0% in the geometric mean (up to 80.8% and 97.6%) compared to state-of-the-art baselines, which shows the efficacy of our scheduling methodology.
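
A minimal sketch of score-driven scheduling with tunable parameters: each pending inference job gets a score from its deadline slack and a per-model priority, and the highest-scoring job is dispatched next. The weights and job fields are illustrative assumptions, not DREAM's actual formulation.

```python
# Illustrative score-based dispatch for real-time multi-model inference jobs.
from dataclasses import dataclass

@dataclass
class Job:
    model: str
    release_ms: float
    deadline_ms: float
    est_latency_ms: float
    priority: float        # e.g., user-facing tasks get a higher value

def score(job, now_ms, w_slack=1.0, w_priority=0.5):
    slack = job.deadline_ms - now_ms - job.est_latency_ms
    urgency = -slack                      # less slack -> more urgent
    return w_slack * urgency + w_priority * job.priority

def pick_next(jobs, now_ms, **weights):
    return max(jobs, key=lambda j: score(j, now_ms, **weights)) if jobs else None

jobs = [
    Job("hand_tracking", 0, 16, 5, priority=1.0),
    Job("depth_estimation", 0, 33, 12, priority=0.6),
    Job("eye_tracking", 0, 11, 3, priority=1.2),
]
print(pick_next(jobs, now_ms=4.0).model)   # dispatches the most urgent job
```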

DOI: 10.1145/3623278.3624753


Explainable-DSE: An Agile and Explainable Exploration of Efficient HW/SW Codesigns of Deep Learning Accelerators Using Bottleneck Analysis

Authors: Dave, Shail and Nowatzki, Tony and Shrivastava, Aviral
Keywords: design space exploration, domain-specific architectures, gray-box optimization, bottleneck model, hardware/software codesign, explainability, machine learning and systems

Abstract

Effective design space exploration (DSE) is paramount for hardware/software codesigns of deep learning accelerators that must meet strict execution constraints. For their vast search space, existing DSE techniques can require excessive trials to obtain a valid and efficient solution because they rely on black-box explorations that do not reason about design inefficiencies. In this paper, we propose Explainable-DSE - a framework for the DSE of accelerator codesigns using bottleneck analysis. By leveraging information about execution costs from bottleneck models, our DSE is able to identify bottlenecks and reason about design inefficiencies, thereby making bottleneck-mitigating acquisitions in further explorations. We describe the construction of bottleneck models for DNN accelerators. We also propose an API for expressing domain-specific bottleneck models and interfacing them with the DSE framework. Acquisitions of our DSE systematically cater to multiple bottlenecks that arise in executions of multi-functional workloads or multiple workloads with diverse execution characteristics. Evaluations for recent computer vision and language models show that Explainable-DSE mostly explores effectual candidates, achieving codesigns of 6X lower latency in 47X fewer iterations vs. non-explainable DSEs using evolutionary or ML-based optimizations. By taking minutes or tens of iterations, it enables opportunities for runtime DSEs.
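
A sketch of a bottleneck-guided DSE loop in the gray-box spirit described above: a roofline-style cost model reports whether the current design is compute- or bandwidth-bound, and the next candidate grows that resource within an area budget. The workload numbers and area model are hypothetical, not the paper's bottleneck models.

```python
# Bottleneck-mitigating design space exploration (toy cost and area models).
def latency(design, ops=1e9, bytes_moved=4e8):
    compute_time = ops / (design["pes"] * 1e9)           # PEs at 1 GOP/s each
    memory_time = bytes_moved / (design["bw_gbs"] * 1e9)
    bottleneck = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bottleneck

def area(design):
    return design["pes"] * 0.1 + design["bw_gbs"] * 0.05  # toy area model (mm^2)

def explore(area_budget=60.0, steps=20):
    design = {"pes": 64, "bw_gbs": 32}
    best = (latency(design)[0], dict(design))
    for _ in range(steps):
        _, bottleneck = latency(design)
        nxt = dict(design)
        if bottleneck == "compute":
            nxt["pes"] *= 2          # acquisition mitigates the identified bottleneck
        else:
            nxt["bw_gbs"] *= 2
        if area(nxt) > area_budget:  # respect the design constraint
            break
        design = nxt
        t_new, _ = latency(design)
        if t_new < best[0]:
            best = (t_new, dict(design))
    return best

print(explore())
```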

DOI: 10.1145/3623278.3624772


Exploiting the Regular Structure of Modern Quantum Architectures for Compiling and Optimizing Programs with Permutable Operators

Authors: Jin, Yuwei and Hua, Fei and Chen, Yanhao and Hayes, Ari and Zhang, Chi and Zhang, Eddy Z.
Keywords: quantum circuit compilation, QAOA, circuit fidelity

Abstract

A critical feature of today’s quantum circuits is that they have permutable two-qubit operators. The flexibility in ordering the permutable two-qubit gates leads to more compiler optimization opportunities. However, it also imposes significant challenges due to the additional degree of freedom. Our contributions are two-fold. We first propose a general methodology that can find structured solutions for scalable quantum hardware. It breaks down the complex compilation problem into two sub-problems that can be solved at small scale. Second, we show how such a structured method can be adapted to practical cases that handle sparsity of the input problem graphs and the noise variability in real hardware. We evaluate our method on IBM and Google architecture coupling graphs with up to 1,024 qubits and demonstrate better results in both depth and gate count: up to a 72% reduction in depth and a 66% reduction in gate count. Our real experiments on IBM Mumbai show that we can find better expected minimal energy than the state-of-the-art baseline.

DOI: 10.1145/3623278.3624751


Fast Instruction Selection for Fast Digital Signal Processing

Authors: Root, Alexander J and Ahmad, Maaz Bin Safeer and Sharlet, Dillon and Adams, Andrew and Kamil, Shoaib and Ragan-Kelley, Jonathan
Keywords: No keywords

Abstract

Modern vector processors support a wide variety of instructions for fixed-point digital signal processing. These instructions support a proliferation of rounding, saturating, and type conversion modes, and are often fused combinations of more primitive operations. While these are common idioms in fixed-point signal processing, it is difficult to use these operations in portable code. It is challenging for programmers to write down portable integer arithmetic in a C-like language that corresponds exactly to one of these instructions, and even more challenging for compilers to recognize when these instructions can be used. Our system, Pitchfork, defines a portable fixed-point intermediate representation, FPIR, that captures common idioms in fixed-point code. FPIR can be used directly by programmers experienced with fixed-point, or Pitchfork can automatically lift from integer operations into FPIR using a term-rewriting system (TRS) composed of verified manual and automatically-synthesized rules. Pitchfork then lowers from FPIR into target-specific fixed-point instructions using a set of target-specific TRSs. We show that this approach improves runtime performance of portably-written fixed-point signal processing code in Halide, across a range of benchmarks, by geomean 1.31x on x86 with AVX2, 1.82x on ARM Neon, and 2.44x on Hexagon HVX compared to a standard LLVM-based compiler flow, while maintaining or improving existing compile times.
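
A tiny term-rewriting sketch of the lifting step described above: expressions are nested tuples, and one rule rewrites the rounding right-shift of a product, (a*b + (1 << (n-1))) >> n, into a fused node. The op name "rounding_mul_shr" is hypothetical, not FPIR's actual vocabulary.

```python
# Minimal term-rewriting system: lift an integer rounding-shift idiom into a
# fused fixed-point op. Expressions are (op, operand, ...) tuples.
def rewrite(expr):
    if not isinstance(expr, tuple):
        return expr
    # Bottom-up: rewrite children first, then try rules at this node.
    expr = (expr[0],) + tuple(rewrite(e) for e in expr[1:])
    # Rule: (">>", ("+", ("*", a, b), ("<<", 1, ("-", n, 1))), n) -> fused op
    if (expr[0] == ">>" and isinstance(expr[1], tuple) and expr[1][0] == "+"
            and isinstance(expr[1][1], tuple) and expr[1][1][0] == "*"
            and isinstance(expr[1][2], tuple) and expr[1][2][0] == "<<"
            and expr[1][2][1] == 1
            and expr[1][2][2] == ("-", expr[2], 1)):
        a, b, n = expr[1][1][1], expr[1][1][2], expr[2]
        return ("rounding_mul_shr", a, b, n)
    return expr

e = (">>", ("+", ("*", "x", "y"), ("<<", 1, ("-", "n", 1))), "n")
print(rewrite(e))   # ('rounding_mul_shr', 'x', 'y', 'n')
```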

DOI: 10.1145/3623278.3624768


FITS: Inferring Intermediate Taint Sources for Effective Vulnerability Analysis of IoT Device Firmware

Authors: Liu, Puzhuo and Zheng, Yaowen and Sun, Chengnian and Qin, Chuan and Fang, Dongliang and Liu, Mingdong and Sun, Limin
Keywords: firmware, taint analysis, vulnerability

Abstract

Finding vulnerabilities in firmware is vital as any firmware vulnerability may lead to cyberattacks on the physical IoT devices. Taint analysis is one promising technique for finding firmware vulnerabilities thanks to its high coverage and scalability. However, sizable closed-source firmware makes it extremely difficult to analyze the complete data-flow paths from taint sources (i.e., interface library functions such as recv) to sinks. We observe that certain custom functions in binaries can be used as intermediate taint sources (ITSs). Compared to interface library functions, using custom functions as taint sources can significantly shorten the data-flow paths for analysis. However, inferring ITSs is challenging due to the complexity and customization of firmware. Moreover, the debugging information and symbol table of binaries in firmware are stripped; therefore, prior techniques for inferring taint sources are not applicable, leaving only laborious manual analysis. To this end, this paper proposes FITS to automatically infer ITSs. Specifically, FITS represents each function with a novel behavioral feature representation that captures the static and dynamic properties of the function, and ranks custom functions as taint sources through behavioral clustering and similarity scoring. We evaluated FITS on 59 large, real-world firmware samples. The inference results of FITS are accurate: at least one of the top-3 ranked custom functions can be used as an ITS with 89% precision. ITSs helped Karonte find 15 more bugs and helped the static taint engine find 339 more bugs. More importantly, 21 bugs have been awarded CVE IDs and rated high severity with media coverage.
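
A sketch of ranking candidate custom functions by similarity of behavioral feature vectors to a prototype built from known interface sources (e.g., recv). The feature set and scores are purely illustrative assumptions, not FITS's actual representation or clustering.

```python
# Rank candidate intermediate taint sources by cosine similarity to a prototype.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical features: [calls recv-like API, writes out-pointer arg,
#                         callee count (normalized), fraction of parsing ops]
prototype = [1.0, 1.0, 0.2, 0.8]
candidates = {
    "parse_packet":   [0.9, 1.0, 0.3, 0.7],
    "update_display": [0.0, 0.2, 0.6, 0.1],
    "read_config":    [0.7, 0.9, 0.2, 0.6],
}

ranked = sorted(candidates, key=lambda f: cosine(candidates[f], prototype), reverse=True)
print(ranked[:3])   # top-ranked candidates to treat as intermediate taint sources
```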

DOI: 10.1145/3623278.3624759


Flame: A Centralized Cache Controller for Serverless Computing

Authors: Yang, Yanan and Zhao, Laiping and Li, Yiming and Wu, Shihao and Hao, Yuechan and Ma, Yuchi and Li, Keqiu
Keywords: serverless computing, keep-alive, hotspot function, coldstart

Abstract

Caching functions is a promising way to mitigate coldstart overhead in serverless computing. However, as caching also increases the resource cost significantly, how to make caching decisions is still challenging. We find that the prior “local cache control” designs are insufficient to achieve high cache efficiency due to the workload skewness across servers. In this paper, inspired by the idea of software defined network management, we propose Flame, an efficient cache system to manage cached functions with a “centralized cache control” design. By decoupling the cache control plane from local servers and setting up a separate centralized controller, Flame is able to make caching decisions considering a global view of cluster status, enabling the optimized cache-hit ratio and resource efficiency. We evaluate Flame with real-world workloads and the evaluation results show that it can reduce the cache resource usage by 36% on average while improving the coldstart ratio by nearly 7x compared to the state-of-the-art method.
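
A sketch of what a centralized keep-alive decision could look like: the controller sees cluster-wide invocation counts and per-server load, keeps the hottest functions warm, and places each cached instance on the least-loaded server. The sizing heuristic is an illustrative assumption, not Flame's actual policy.

```python
# Centralized cache-placement decision from a global view of the cluster.
from collections import Counter

def plan_cache(invocations, servers, capacity_per_server=2):
    # invocations: {function: cluster-wide calls in the last window}
    # servers: {server: current cached-instance count}
    load = Counter(servers)
    free_slots = capacity_per_server * len(servers) - sum(load.values())
    plan = []
    for func, _count in sorted(invocations.items(), key=lambda kv: -kv[1]):
        if free_slots == 0:
            break
        target = min(load, key=load.get)        # least-loaded server, global view
        plan.append((func, target))
        load[target] += 1
        free_slots -= 1
    return plan

invocations = {"thumbnail": 900, "ocr": 40, "resize": 300, "translate": 5}
servers = {"s1": 1, "s2": 0, "s3": 2}
print(plan_cache(invocations, servers))
```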

DOI: 10.1145/3623278.3624769


FreePart: Hardening Data Processing Software via Framework-based Partitioning and Isolation

Authors: Ahad, Ali and Wang, Gang and Kim, Chung Hwan and Jana, Suman and Lin, Zhiqiang and Kwon, Yonghwi
Keywords: software isolation, software partitioning, data processing frameworks

Abstract

Data-processing-oriented software, especially machine learning applications, is heavily dependent on standard frameworks/libraries such as TensorFlow and OpenCV. As those frameworks have gained significant popularity, the exploitation of vulnerabilities in the frameworks has become a critical security concern. While software isolation can minimize the impact of exploitation, existing approaches suffer from difficulty analyzing complex program dependencies or excessive overhead, making them ineffective in practice. We propose FreePart, a framework-focused software partitioning technique specialized for data processing applications. It is based on an observation that the execution of a data processing application, including data flows and usage of critical data, is closely related to the invocations of framework APIs. Hence, we conduct a temporal partitioning of the host application’s execution based on the invocations of framework APIs and the data objects used by the APIs. By focusing on data accesses at runtime instead of static program code, it provides effective and practical isolation from the perspective of data. Our evaluation on 23 applications using popular frameworks (e.g., OpenCV, Caffe, PyTorch, and TensorFlow) shows that FreePart is effective against all attacks composed of 18 real-world vulnerabilities with a low overhead (3.68%).

DOI: 10.1145/3623278.3624760


HIR: An MLIR-based Intermediate Representation for Hardware Accelerator Description

Authors: Majumder, Kingshuk and Bondhugula, Uday
Keywords: HDL, HLS, MLIR, verilog, accelerator, FPGA

Abstract

The emergence of machine learning, image and audio processing on edge devices has motivated research towards power-efficient custom hardware accelerators. Though FPGAs are an ideal target for custom accelerators, the difficulty of hardware design and the lack of vendor agnostic, standardized hardware compilation infrastructure has hindered their adoption. This paper introduces HIR, an MLIR-based intermediate representation (IR) and a compiler to design hardware accelerators for affine workloads. HIR replaces the traditional datapath + FSM representation of hardware with datapath + schedules. We implement a compiler that automatically synthesizes the finite-state-machine (FSM) from the schedule description. The IR also provides high-level language features, such as loops and multi-dimensional tensors. The combination of explicit schedules and high-level language abstractions allows HIR to express synchronization-free, fine-grained parallelism, as well as high-level optimizations such as loop pipelining and overlapped execution of multiple kernels. Built as a dialect in MLIR, it draws from best IR practices learnt from communities like those of LLVM. While offering rich optimization opportunities and a high-level abstraction, the IR enables sharing of optimizations, utilities and passes with software compiler infrastructure. Our evaluation shows that the generated hardware design is comparable in performance and resource usage with Vitis HLS. We believe that such a common hardware compilation pipeline can help accelerate the research in language design for hardware description.

DOI: 10.1145/3623278.3624767


LightRidge: An End-to-end Agile Design Framework for Diffractive Optical Neural Networks

Authors: Li, Yingjie and Chen, Ruiyang and Lou, Minhan and Sensale-Rodriguez, Berardi and Gao, Weilu and Yu, Cunxi
Keywords: No keywords

Abstract

To lower the barrier to diffractive optical neural networks (DONNs) design, exploration, and deployment, we propose LightRidge, the first end-to-end optical ML compilation framework, which consists of (1) precise and differentiable optical physics kernels that enable complete explorations of DONNs architectures, (2) optical physics computation kernel acceleration that significantly reduces the runtime cost in training, emulation, and deployment of DONNs, and (3) versatile and flexible optical system modeling and a user-friendly domain-specific language (DSL). As a result, the LightRidge framework enables efficient end-to-end design and deployment of DONNs, and significantly reduces the efforts for programming, hardware-software codesign, and chip integration. Our results are experimentally conducted with physical optical systems, where we demonstrate: (1) the optical physics kernels precisely correlated to low-level physics and systems, (2) significant speedups in runtime with physics-aware emulation workloads compared to the state-of-the-art commercial system, (3) effective architectural design space exploration verified by the hardware prototype and on-chip integration case study, and (4) novel DONN design principles including successful demonstrations of advanced image classification and image segmentation tasks using DONNs architecture and topology.

DOI: 10.1145/3623278.3624757


Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism

Authors: Emami, Mahyar and Kashani, Sahand and Kamahori, Keisuke and Pourghannad, Mohammad Sepehr and Raj, Ritik and Larus, James R.
Keywords: No keywords

Abstract

The demise of Moore’s Law and Dennard Scaling has revived interest in specialized computer architectures and accelerators. Verification and testing of this hardware depend heavily upon cycle-accurate simulation of register-transfer-level (RTL) designs. The fastest software RTL simulators can simulate designs at 1–1000 kHz, i.e., more than three orders of magnitude slower than hardware. Improved simulators can increase designers’ productivity by speeding design iterations and permitting more exhaustive exploration. One possibility is to exploit low-level parallelism, as RTL expresses considerable fine-grain concurrency. Unfortunately, state-of-the-art RTL simulators often perform best on a single core since modern processors cannot effectively exploit fine-grain parallelism. This work presents Manticore: a parallel computer designed to accelerate RTL simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution model to eliminate fine-grain synchronization overhead. It relies entirely on a compiler to schedule resources and communication, which is feasible since RTL code contains few divergent execution paths. With static scheduling, communication and synchronization no longer incur runtime overhead, making fine-grain parallelism practical. Moreover, static scheduling dramatically simplifies processor implementation, significantly increasing the number of cores that fit on a chip. Our 225-core FPGA implementation running at 475 MHz outperforms a state-of-the-art RTL simulator running on desktop and server computers in 8 out of 9 benchmarks.

DOI: 10.1145/3623278.3624750


MiniMalloc: A Lightweight Memory Allocator for Hardware-Accelerated Machine Learning

Authors: Moffitt, Michael D.
Keywords: memory allocation, hardware acceleration, machine learning

Abstract

We present a new approach to static memory allocation, a key problem that arises in the compilation of machine learning models onto the resources of a specialized hardware accelerator. Our methodology involves a recursive depth-first search that limits exploration to a special class of canonical solutions, dramatically reducing the size of the search space. We also develop a spatial inference technique that exploits this special structure by pruning unpromising partial assignments and backtracking more effectively than otherwise possible. Finally, we introduce a new mechanism capable of detecting and eliminating dominated solutions from consideration. Empirical results demonstrate orders of magnitude improvement in performance as compared to the previous state-of-the-art on many benchmarks, as well as a substantial reduction in library size.
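
A simplified sketch of the underlying static allocation problem: buffers with known lifetimes may share address ranges only if their lifetimes do not overlap. A greedy first-fit placement stands in for MiniMalloc's canonical-solution search and inference techniques; the buffer set is hypothetical.

```python
# Static offset assignment for buffers with known lifetimes (greedy first-fit).
def overlaps(a, b):
    return a["start"] < b["end"] and b["start"] < a["end"]

def first_fit_offset(buf, placed):
    conflicts = sorted((p for p in placed if overlaps(buf, p)),
                       key=lambda p: p["offset"])
    offset = 0
    for p in conflicts:                     # slide past lifetime-conflicting buffers
        if offset + buf["size"] <= p["offset"]:
            break
        offset = max(offset, p["offset"] + p["size"])
    return offset

def allocate(buffers):
    placed = []
    for buf in sorted(buffers, key=lambda b: -b["size"]):   # big buffers first
        buf = dict(buf, offset=first_fit_offset(buf, placed))
        placed.append(buf)
    peak = max(b["offset"] + b["size"] for b in placed)
    return placed, peak

buffers = [
    {"name": "act0", "size": 64, "start": 0, "end": 2},
    {"name": "act1", "size": 32, "start": 1, "end": 3},
    {"name": "act2", "size": 64, "start": 2, "end": 4},   # can reuse act0's space
]
placed, peak = allocate(buffers)
print([(b["name"], b["offset"]) for b in placed], "peak =", peak)
```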

DOI: 10.1145/3623278.3624752


Predict; Don’t React for Enabling Efficient Fine-Grain DVFS in GPUs

Authors: Bharadwaj, Srikant and Das, Shomit and Mazumdar, Kaushik and Beckmann, Bradford M. and Kosonocky, Stephen
Keywords: dynamic voltage frequency scaling, graphics processing unit

Abstract

With the continuous improvement of on-chip integrated voltage regulators (IVRs) and fast, adaptive frequency control, dynamic voltage-frequency scaling (DVFS) transition times have shrunk from the microsecond to the nanosecond regime, providing immense opportunity to improve energy efficiency. The key to unlocking the continued improvement in V/f circuit technology is the creation of new, smarter DVFS mechanisms that better adapt to rapid fluctuations in workload demand. It is particularly important to optimize fine-grain DVFS mechanisms for graphics processing units (GPUs) as the chips become ever more important workhorses in the datacenter. However, a GPU’s massive amount of thread-level parallelism makes it uniquely difficult to determine the optimal V/f state at run-time. Existing solutions—mostly designed for single-threaded CPUs and longer time scales—fail to consider the seemingly chaotic, highly varying nature of GPU workloads at short time scales. This paper proposes a novel prediction mechanism, PCSTALL, that is tailored for emerging DVFS capabilities in GPUs and achieves near-optimal energy efficiency. Using the insights from our fine-grained workload analysis, we propose a wavefront-level program counter (PC) based DVFS mechanism that improves program behavior prediction accuracy by 32% on average as compared to the best performing prior predictor for a wide set of GPU applications at 1μs DVFS time epochs. Compared to the current state-of-art, our PC-based technique achieves 19% average improvement when optimized for Energy-Delay² Product (ED²P) at 50μs time epochs, reaching 32% when operated with 1μs DVFS technologies.
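
A sketch of the general idea of a program-counter-indexed predictor: per wavefront PC, keep an exponentially weighted average of the observed memory-stall fraction and map the prediction to a V/f level for the next epoch. The thresholds, levels, and update rule are illustrative assumptions, not PCSTALL's exact mechanism.

```python
# PC-indexed DVFS prediction: memory-bound PCs are steered to a lower V/f state.
class PCDvfsPredictor:
    LEVELS = [(0.6, "low-Vf"), (0.3, "mid-Vf"), (0.0, "high-Vf")]  # stall fraction -> V/f

    def __init__(self, alpha=0.25):
        self.alpha = alpha
        self.table = {}          # pc -> predicted stall fraction

    def update(self, pc, observed_stall_frac):
        old = self.table.get(pc, observed_stall_frac)
        self.table[pc] = (1 - self.alpha) * old + self.alpha * observed_stall_frac

    def choose_vf(self, pc):
        pred = self.table.get(pc, 0.0)     # unseen PCs default to high performance
        for threshold, level in self.LEVELS:
            if pred >= threshold:
                return level
        return "high-Vf"

pred = PCDvfsPredictor()
for stall in [0.7, 0.8, 0.75]:             # a memory-bound loop at PC 0x1a40
    pred.update(0x1a40, stall)
print(pred.choose_vf(0x1a40), pred.choose_vf(0x2000))   # low-Vf high-Vf
```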

DOI: 10.1145/3623278.3624756


RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns

Authors: Pan, Zaifeng and Zheng, Zhen and Zhang, Feng and Wu, Ruofan and Liang, Hao and Wang, Dalin and Qiu, Xiafei and Bai, Junjie and Lin, Wei and Du, Xiaoyong
Keywords: No keywords

Abstract

Embedding columns are important for deep recommendation models to achieve high accuracy, but they can be very time-consuming during inference. Machine learning (ML) compilers are used broadly in real businesses to optimize ML models automatically. Unfortunately, no existing work uses compilers to automatically accelerate the heavy embedding column computations during recommendation model inferences. To fill this gap, we propose RECom, the first ML compiler that aims at optimizing the massive embedding columns in recommendation models on the GPU. RECom addresses three major challenges. First, generating an efficient schedule on the GPU for the massive operators within embedding columns is difficult. Existing solutions usually lead to numerous small kernels and also lack inter-subgraph parallelism. We adopt a novel codegen strategy that fuses massive embedding columns into a single kernel and maps each column into a separate thread block on the GPU. Second, the complex shape computations under dynamic shape scenarios impede further graph optimizations. We develop a symbolic expression-based module to reconstruct all shape computations. Third, ML frameworks inevitably introduce redundant computations due to robustness considerations. We develop a subgraph optimization module that performs graph-level simplifications based on the entire embedding column context. Experiments on both in-house and open-source models show that RECom can achieve 6.61X and 1.91X improvements over state-of-the-art baselines in terms of end-to-end inference latency and throughput, respectively. RECom’s source code is publicly available at https://github.com/AlibabaResearch/recom.

DOI: 10.1145/3623278.3624761


ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices

Authors: Li, Ye and Tan, Jian and Wu, Bin and He, Xiao and Li, Feifei
Keywords: No keywords

Abstract

Years of experience in operating large-scale microservice systems strengthen our belief that their individual components, with inevitable anomalies, still demand a quantification of the influences on the end-to-end performance indicators. On a causal graph that represents the complex dependencies between the system components, anomalies detected at scattered locations, even when they look similar, could have different implications that call for contrasting remedial actions. To this end, we design ShapleyIQ, an online monitoring and diagnosis service that can effectively improve the system stability. It is guided by rigorous analysis on Shapley values for a causal graph. Notably, a new property on splitting invariance addresses the challenging exponential computation complexity problem of generic Shapley values by decomposition. This service has been deployed on a core infrastructure system on Alibaba Cloud, for over a year with more than 15,000 operations for 86 services across 2,546 machines. Since then, it has drastically improved the DevOps efficiency, and the system failures have been significantly reduced by 83.3%. We also conduct an offline evaluation on an open source microservice system TrainTicket, which is to pinpoint the root causes of the performance issues in hindsight. Extensive experiments and test cases show that our system can achieve 97.3% accuracy in identifying the top-1 root causes for these datasets, which significantly outperforms baseline algorithms by at least 28.7% in absolute difference.
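
A worked example of the Shapley-value idea on a tiny set of components: each component's influence on an end-to-end metric is its average marginal contribution over all orderings. The characteristic function below is a toy latency model, not ShapleyIQ's causal-graph formulation (whose splitting-invariance property is precisely what avoids this exponential enumeration).

```python
# Exact Shapley values for component influence on end-to-end latency (toy model).
from itertools import permutations

def end_to_end_latency(anomalous):
    # Toy model: baseline 100 ms; 'db' adds 80 ms, 'cache' adds 30 ms,
    # and the two together interact for an extra 20 ms.
    extra = 0
    if "db" in anomalous:
        extra += 80
    if "cache" in anomalous:
        extra += 30
    if {"db", "cache"} <= anomalous:
        extra += 20
    return 100 + extra

def shapley(players, value):
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition.add(p)
            phi[p] += value(coalition) - before   # marginal contribution
    return {p: phi[p] / len(orderings) for p in players}

print(shapley(["db", "cache", "frontend"], end_to_end_latency))
# db gets 90 ms, cache 40 ms, frontend 0 ms of the 130 ms excess latency
```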

DOI: 10.1145/3623278.3624771


Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks

Authors: Gan, Yu and Liu, Guiyang and Zhang, Xin and Zhou, Qi and Wu, Jiesheng and Jiang, Jiangwei
Keywords: No keywords

Abstract

Cloud microservices are being scaled up due to the rising demand for new features and the convenience of cloud-native technologies. However, the growing scale of microservices complicates the remote procedure call (RPC) dependency graph, exacerbates the tail-at-scale effect, and makes many of the empirical rules for detecting the root cause of end-to-end performance issues unreliable. Additionally, existing open-source microservice benchmarks are too small to evaluate performance debugging algorithms at production scale with hundreds or even thousands of services and RPCs. To address these challenges, we present Sleuth, a trace-based root cause analysis (RCA) system for large-scale microservices using unsupervised graph learning. Sleuth leverages a graph neural network to capture the causal impact of each span in a trace, and trace clustering using a trace distance metric to reduce the amount of traces required for root cause localization. A pre-trained Sleuth model can be transferred to different microservice applications without any retraining or with few-shot fine-tuning. To quantitatively evaluate the performance and scalability of Sleuth, we propose a method to generate microservice benchmarks comparable to production scale. The experiments on the existing benchmark suites and synthetic large-scale microservices indicate that Sleuth significantly outperforms the prior work in detection accuracy, performance, and adaptability on a large-scale deployment.

DOI: 10.1145/3623278.3624758


Supporting Descendants in SIMD-Accelerated JSONPath

Authors: Gienieczko, Mateusz and Murlak, Filip and Paperman, Charles
Keywords: json, jsonpath, simd, query language, data management

Abstract

Harnessing the power of SIMD can bring tremendous performance gains in data processing. In querying streamed JSON data, the state of the art leverages SIMD to fast forward significant portions of the document. However, it does not provide support for the descendant selector, which excludes many real-life queries and makes formulating many others hard. In this work, we aim to change this: we consider the fragment of JSONPath that supports child, descendant, wildcard, and labels. We propose a modular approach based on novel depth-stack automata that process a stream of events produced by a state-driven classifier, allowing fast forwarding parts of the input document irrelevant at the current stage of the computation. We implement our solution in Rust and compare it with the state of the art, confirming that our approach allows supporting descendants without sacrificing performance, and that reformulating natural queries using descendants brings impressive performance gains in many cases.
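
A minimal scalar sketch of evaluating a child + descendant query ($.store..price) over a stream of structural events, using a stack that records the automaton state at each nesting depth. This only illustrates the depth-stack idea; the paper's engine additionally classifies input with SIMD and fast-forwards irrelevant subtrees.

```python
# Depth-stack evaluation of the JSONPath-like query $.store..price over events.
ROOT, IN_STORE, DEAD = 0, 1, -1

def transition(state, key):
    if state == ROOT:
        return IN_STORE if key == "store" else DEAD
    if state == IN_STORE:                 # descendant step: stay armed at any depth
        return IN_STORE
    return DEAD

def accepts(state, key):
    return state == IN_STORE and key == "price"

def run(events):
    stack, pending_key, matches = [ROOT], None, []
    for ev, arg in events:
        if ev == "key":
            pending_key = arg
        elif ev == "start":               # open a nested object
            stack.append(transition(stack[-1], pending_key))
            pending_key = None
        elif ev == "end":
            stack.pop()
        elif ev == "atom":
            if accepts(stack[-1], pending_key):
                matches.append(arg)
    return matches

# {"store": {"book": {"price": 8.95}, "price": 19.0}, "other": {"price": 1}}
events = [
    ("key", "store"), ("start", None),
        ("key", "book"), ("start", None),
            ("key", "price"), ("atom", 8.95),
        ("end", None),
        ("key", "price"), ("atom", 19.0),
    ("end", None),
    ("key", "other"), ("start", None),
        ("key", "price"), ("atom", 1),
    ("end", None),
]
print(run(events))   # [8.95, 19.0] -- prices under "store" at any depth
```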

DOI: 10.1145/3623278.3624754


VarSaw: Application-tailored Measurement Error Mitigation for Variational Quantum Algorithms

Authors: Dangwal, Siddharth and Ravi, Gokul Subramanian and Das, Poulami and Smith, Kaitlin N. and Baker, Jonathan Mark and Chong, Frederic T.
Keywords: quantum computing, variational quantum algorithms, error mitigation, measurement errors, noisy intermediate-scale quantum, variational quantum eigensolver

Abstract

For potential quantum advantage, Variational Quantum Algorithms (VQAs) need high accuracy beyond the capability of today’s NISQ devices, and thus will benefit from error mitigation. In this work we are interested in mitigating measurement errors which occur during qubit measurements after circuit execution and tend to be the most error-prone operations, especially detrimental to VQAs. Prior work, JigSaw, has shown that measuring only small subsets of circuit qubits at a time and collecting results across all such ‘subset’ circuits can reduce measurement errors. Then, running the entire (‘global’) original circuit and extracting the qubit-qubit measurement correlations can be used in conjunction with the subsets to construct a high-fidelity output distribution of the original circuit. Unfortunately, the execution cost of JigSaw scales polynomially in the number of qubits in the circuit, and when compounded by the number of circuits and iterations in VQAs, the resulting execution cost quickly turns insurmountable. To combat this, we propose VarSaw, which improves JigSaw in an application-tailored manner, by identifying considerable redundancy in the JigSaw approach for VQAs: spatial redundancy across subsets from different VQA circuits and temporal redundancy across globals from different VQA iterations. VarSaw then eliminates these forms of redundancy by commuting the subset circuits and selectively executing the global circuits, reducing computational cost (in terms of the number of circuits executed) over naive JigSaw for VQA by 25x on average and up to 1000x, for the same VQA accuracy. Further, it can recover, on average, 45% of the infidelity from measurement errors in the noisy VQA baseline. Finally, it improves fidelity by 55%, on average, over JigSaw for a fixed computational budget. VarSaw can be accessed here: https://github.com/siddharthdangwal/VarSaw

DOI: 10.1145/3623278.3624764


Veil: A Protected Services Framework for Confidential Virtual Machines

Authors: Ahmad, Adil and Ou, Botong and Liu, Congyu and Zhang, Xiaokuan and Fonseca, Pedro
Keywords: confidential virtual machines, OS design, cloud security

Abstract

Confidential virtual machines (CVMs) enabled by AMD SEV provide a protected environment for sensitive computations on an untrusted cloud. Unfortunately, CVMs are typically deployed with huge and vulnerable operating system kernels, exposing the CVMs to attacks that exploit kernel vulnerabilities. Veil is a versatile CVM framework that efficiently protects critical system services like shielding sensitive programs, which cannot be entrusted to the buggy kernel. Veil leverages a new hardware primitive, virtual machine privilege levels (VMPL), to install a privileged security monitor inside the CVM. We overcome several challenges in designing Veil, including (a) creating unlimited secure domains with a limited number of VMPLs, (b) establishing resource-efficient domain switches, and (c) maintaining commodity kernel backwards-compatibility with only minor changes. Our evaluation shows that Veil incurs no discernible performance slowdown during normal CVM execution while incurring a modest overhead (2–64%) when running its protected services across real-world use cases.

DOI: 10.1145/3623278.3624763


λFS: A Scalable and Elastic Distributed File System Metadata Service using Serverless Functions

Authors: Carver, Benjamin and Han, Runzhou and Zhang, Jingyuan and Zheng, Mai and Cheng, Yue
Keywords: No keywords

Abstract

The metadata service (MDS) sits on the critical path for distributed file system (DFS) operations, and therefore it is key to the overall performance of a large-scale DFS. Common “serverful” MDS architectures, such as a single server or cluster of servers, have a significant shortcoming: either they are not scalable, or they make it difficult to achieve an optimal balance of performance, resource utilization, and cost. A modern MDS requires a novel architecture that addresses this shortcoming. To this end, we design and implement λFS, an elastic, high-performance metadata service for large-scale DFSes. λFS scales a DFS metadata cache elastically on a FaaS (Function-as-a-Service) platform and synthesizes a series of techniques to overcome the obstacles that are encountered when building large, stateful, and performance-sensitive applications on FaaS platforms. λFS takes full advantage of the unique benefits offered by FaaS—elastic scaling and massive parallelism—to realize a highly-optimized metadata service capable of sustaining up to 4.13X higher throughput, 90.40% lower latency, 85.99% lower cost, 3.33X better performance-per-cost, and better resource utilization and efficiency than a state-of-the-art DFS for an industrial workload.

DOI: 10.1145/3623278.3624765


A Fault-Tolerant Million Qubit-Scale Distributed Quantum Computer

Authors: Kim, Junpyo and Min, Dongmoon and Cho, Jungmin and Jeong, Hyeonseong and Byun, Ilkwon and Choi, Junhyuk and Hong, Juwon and Kim, Jangwoo
Keywords: fault-tolerant quantum computing, distributed quantum computing, quantum error correction, cryogenic computing, single flux quantum (SFQ)

Abstract

A million qubit-scale quantum computer is essential to realize quantum supremacy. Modern large-scale quantum computers integrate multiple quantum computers located in dilution refrigerators (DR) to overcome the limited cooling budget of each DR. However, a large-scale multi-DR quantum computer introduces its unique challenges (i.e., slow and erroneous inter-DR entanglement, increased qubit scale), and they make the baseline error handling mechanism ineffective by increasing the number of gate operations and the inter-DR communication latency to decode and correct errors. Without resolving these challenges, it is impossible to realize a fault-tolerant large-scale multi-DR quantum computer. In this paper, we propose a million qubit-scale distributed quantum computer which uses a novel error handling mechanism enabling fault-tolerant multi-DR quantum computing. First, we apply a low-overhead multi-DR error syndrome measurement (ESM) sequence to reduce both the number of gate operations and the error rate. Second, we apply a scalable multi-DR error decoding unit (EDU) architecture to maximize both the decoding speed and accuracy. Our multi-DR error handling SW-HW co-design improves the ESM latency, ESM errors, EDU latency, and EDU accuracy by 3.7 times, 2.4 times, 685 times, and 6.1 · 10^10 times, respectively. With our scheme applied to assumed voltage-scaled CMOS and mature ERSFQ technologies, we successfully build a fault-tolerant million qubit-scale quantum computer.

DOI: 10.1145/3620665.3640388


A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs

Authors: Davies, Michael and McDougall, Ian and Anandaraj, Selvaraj and Machchhar, Deep and Jain, Rithik and Sankaralingam, Karthikeyan
Keywords: No keywords

Abstract

We are in the age of AI, with rapidly changing algorithms and a somewhat synergistic change in hardware. MLPerf is a recent benchmark suite that serves as a way to compare and evaluate hardware. However, it has several drawbacks: it is dominated by CNNs, does a poor job of capturing the diversity of AI use cases, and represents only a sliver of production AI use cases. This paper performs a longitudinal study of state-of-art AI applications spanning vision, physical simulation, vision synthesis, language and speech processing, and tabular data processing, across three generations of hardware to understand how the AI revolution has panned out. We call this collection of applications and execution scaffolding the CaSiO suite. The paper reports on data gathered at the framework level, device API level, and hardware and microarchitecture level. The paper provides insights on the hardware-software revolution with pointers to future trends.

DOI: 10.1145/3620665.3640367


A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors

Authors: Kuper, Reese and Jeong, Ipoom and Yuan, Yifan and Wang, Ren and Ranganathan, Narayan and Rao, Nikhil and Hu, Jiayu and Kumar, Sanjay and Lantz, Philip and Kim, Nam Sung
Keywords: data streaming accelerator (DSA), accelerator, measurement

Abstract

As semiconductor power density is no longer constant with the technology process scaling down, we need different solutions if we are to continue scaling application performance. To this end, modern CPUs are integrating capable data accelerators on the chip, aiming to improve performance and efficiency for a wide range of applications and usages. One such accelerator is the Intel® Data Streaming Accelerator (DSA), introduced with Intel® 4th Generation Xeon® Scalable CPUs (Sapphire Rapids). DSA targets data movement operations in memory that are common sources of overhead in datacenter workloads and infrastructure. In addition, it supports a wider range of operations on streaming data, such as CRC32 calculations, computation of deltas between data buffers, and data integrity field (DIF) operations. This paper aims to introduce the latest features supported by DSA, dive deep into its versatility, and analyze its throughput benefits through a comprehensive evaluation with both microbenchmarks and real use cases. Along with the analysis of its characteristics and the rich software ecosystem of DSA, we summarize several insights and guidelines for the programmer to make the most out of DSA, and use an in-depth case study of DPDK Vhost to demonstrate how these guidelines benefit a real application.

DOI: 10.1145/3620665.3640401


Achieving Near-Zero Read Retry for 3D NAND Flash Memory

Authors: Ye, Min and Li, Qiao and Lv, Yina and Zhang, Jie and Ren, Tianyu and Wen, Daniel and Kuo, Tei-Wei and Xue, Chun Jason
Keywords: No keywords

Abstract

As flash-based storage devices age with program/erase (P/E) cycles, they require an increasing number of read retries for error correction, which in turn deteriorates their read performance. The design of read-retry methods is critical to flash read performance. Current flash chips embed pre-defined read retry tables (RRT) for retry, but these tables fail to consider the read granularity and error behaviors. We characterize different types of real flash chips, based on which we further develop models for the correlation among the optimal read offsets of read voltages required for reading each page. By leveraging characterization observations and the models, we propose a methodology to generate a tailored RRT for each flash model. We introduce a dynamic read retry procedure to pick proper read voltages from the table, followed by a proximity-search method for fine-tuning the read offsets. Experiments on real flash chips show that the proposed methodology can achieve near-zero retries. It reduces the average number of read retries to below 0.003 for data with high retention time at 8K P/E cycles, whereas the state-of-the-art approaches incur over 3 read retries on average once the flash is aged to 3K P/E cycles.
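
A sketch of the fine-tuning step in spirit: start from a model-predicted read-voltage offset and probe nearby offsets, keeping whichever yields the fewest raw bit errors. The error model and the read_page_errors() function are hypothetical stand-ins for issuing a real read and counting ECC-corrected errors.

```python
# Proximity search around a predicted read-voltage offset (toy error model).
def read_page_errors(offset, true_optimal=-7, noise=0):
    # Stand-in for a real read: errors grow with distance from the optimal offset.
    return abs(offset - true_optimal) ** 2 + noise

def proximity_search(predicted_offset, step=2, max_probes=10):
    best_offset = predicted_offset
    best_errors = read_page_errors(best_offset)
    probes = 1
    while probes < max_probes and best_errors > 0:
        improved = False
        for c in (best_offset - step, best_offset + step):
            errors = read_page_errors(c)
            probes += 1
            if errors < best_errors:
                best_offset, best_errors, improved = c, errors, True
        if not improved:
            if step == 1:
                break                     # local minimum at finest granularity
            step = max(1, step // 2)      # narrow the search around the best point
    return best_offset, best_errors

print(proximity_search(predicted_offset=-4))   # converges near the true offset
```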

DOI: 10.1145/3620665.3640372


An Encoding Scheme to Enlarge Practical DNA Storage Capacity by Reducing Primer-Payload Collisions

Authors: Wei, Yixun and Li, Bingzhe and Du, David H. C.
Keywords: DNA storage, DNA encoding scheme, primer-payload collision

Abstract

Deoxyribonucleic Acid (DNA), with its ultra-high storage density and long durability, is a promising long-term archival storage medium and is attracting much attention today. A DNA storage system encodes and stores digital data with synthetic DNA sequences and decodes DNA sequences back to digital data via sequencing. Many encoding schemes have been proposed to enlarge DNA storage capacity by increasing DNA encoding density. However, only increasing encoding density is insufficient because enhancing DNA storage capacity is a multifaceted problem. This paper assumes that random accesses are necessary for practical DNA archival storage. We identify all factors affecting DNA storage capacity under current technologies and systematically investigate the practical DNA storage capacity with several popular encoding schemes. The investigation result shows the collision between primers and DNA payload sequences is a major factor limiting DNA storage capacity. Based on this discovery, we designed a new encoding scheme called Collision Aware Code (CAC) to trade some encoding density for the reduction of primer-payload collisions. Compared with the best result among the five existing encoding schemes, CAC can extricate 120% more primers from collisions and increase the DNA tube capacity from 211.96 GB to 295.11 GB. Besides, we also evaluate CAC’s recoverability from DNA storage errors. The results show that CAC’s recoverability is comparable to that of existing encoding schemes.
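
A sketch of the collision problem the scheme works around: a primer "collides" if the primer or its reverse complement appears inside any payload strand, which would break PCR-based random access. The sequences below are toy examples (real primers are around 20 nt and payloads much longer), and this is a screening illustration, not CAC's encoding.

```python
# Screen candidate primers against payload sequences for collisions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def collides(primer, payloads):
    rc = reverse_complement(primer)
    return any(primer in p or rc in p for p in payloads)

def usable_primers(primers, payloads):
    return [p for p in primers if not collides(p, payloads)]

payloads = ["ACGTACGGTTCAGG", "TTACCGGATACGTT"]
primers = ["ACGTAC",            # collides: appears in payload 0
           "CCGGTA",            # collides: reverse complement TACCGG in payload 1
           "GGGCCC"]            # usable
print(usable_primers(primers, payloads))   # ['GGGCCC']
```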

DOI: 10.1145/3620665.3640417


Atalanta: A Bit is Worth a “Thousand” Tensor Values

Authors: Lascorz, Alberto Delmas and Mahmoud, Mostafa and Zadeh, Ali Hadi and Nikolic, Milos and Ibrahim, Kareem and Giannoula, Christina and Abdelhadi, Ameer and Moshovos, Andreas
Keywords: No keywords

Abstract

Atalanta is a lossless, hardware/software co-designed compression technique for the tensors of fixed-point quantized deep neural networks. Atalanta increases effective memory capacity, reduces off-die traffic, and/or helps to achieve the desired performance/energy targets while using smaller off-die memories during inference. Atalanta is architected to deliver nearly identical coding efficiency compared to Arithmetic Coding while avoiding its complexity, overhead, and bandwidth limitations. Indicatively, the Atalanta decoder and encoder units each use less than 50B of internal storage. In hardware, Atalanta is implemented as an assist over any machine learning accelerator transparently compressing/decompressing tensors just before the off-die memory controller. This work shows the performance and energy efficiency of Atalanta when implemented in a 65nm technology node. Atalanta reduces the data footprint of weights and activations to 60% and 48% respectively on average over a wide set of 8-bit quantized models and complements a wide range of quantization methods. Integrated with a Tensorcore-based accelerator, Atalanta boosts the speedup and energy efficiency to 1.44×.

DOI: 10.1145/3620665.3640356


AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference

Authors: Park, Jaehyun and Choi, Jaewan and Kyung, Kwanhee and Kim, Michael Jaemin and Kwon, Yongsuk and Kim, Nam Sung and Ahn, Jung Ho
Keywords: processing-in-memory, transformer-based generative model, DRAM

Abstract

The Transformer-based generative model (TbGM), comprising summarization (Sum) and generation (Gen) stages, has demonstrated unprecedented generative performance across a wide range of applications. However, it also demands immense amounts of compute and memory resources. In particular, the Gen stages, consisting of the attention and fully-connected (FC) layers, dominate the overall execution time. Meanwhile, we reveal that the conventional system with GPUs used for TbGM inference cannot efficiently execute the attention layer, even with batching, due to various constraints. To address this inefficiency, we first propose AttAcc, a processing-in-memory (PIM) architecture for efficient execution of the attention layer. Subsequently, for the end-to-end acceleration of TbGM inference, we propose a novel heterogeneous system architecture and optimizations that strategically use xPU and PIM together. It leverages the high memory bandwidth of AttAcc for the attention layer and the powerful compute capability of the conventional system for the FC layer. Lastly, we demonstrate that our GPU-PIM system outperforms the conventional system with the same memory capacity, improving performance and energy efficiency of running a 175B TbGM by up to 2.81×.

DOI: 10.1145/3620665.3640422


Avoiding Instruction-Centric Microarchitectural Timing Channels Via Binary-Code Transformations

Authors: Flanders, Michael and Sharma, Reshabh K and Michael, Alexandra E. and Grossman, Dan and Kohlbrenner, David
Keywords: No keywords

Abstract

With the end of Moore’s Law-based scaling, novel microarchitectural optimizations are being patented, researched, and implemented at an increasing rate. Previous research has examined recently published patents and papers and demonstrated ways these upcoming optimizations present new security risks via novel side channels. As these side channels are introduced by microarchitectural optimization, they are not generically solvable in source code. In this paper, we build program analysis and transformation tools for automatically mitigating the security risks introduced by future instruction-centric microarchitectural optimizations. We focus on two classes of optimizations that are not yet deployed: silent stores and computation simplification. Silent stores are known to leak secret data being written to memory by dropping in-flight stores that will have no effect. Computation simplification is known to leak operands to arithmetic instructions by shortcutting trivial computations at execution time. This presents problems that classical constant-time techniques cannot handle: register spills, address calculations, and the micro-ops of complex instructions are all potentially leaky. To address these problems, we design, implement, and evaluate a process and tool, cio, for detecting and mitigating these types of side channels in cryptographic code. cio is a backstop, providing verified mitigation for novel microarchitectural side-channels when more specialized and efficient hardware or software tools, such as microcode patches, are not yet available.

DOI: 10.1145/3620665.3640400


BitPacker: Enabling High Arithmetic Efficiency in Fully Homomorphic Encryption Accelerators

Authors: Samardzic, Nikola and Sanchez, Daniel
Keywords: No keywords

Abstract

Fully Homomorphic Encryption (FHE) enables computing directly on encrypted data. Though FHE is slow on a CPU, recent hardware accelerators compensate for most of FHE’s overheads, enabling real-time performance in complex programs like deep neural networks. However, the state-of-the-art FHE scheme, CKKS, is inefficient on accelerators. CKKS represents encrypted data using integers of widely different sizes (typically 30 to 60 bits). This leaves many bits unused in registers and arithmetic datapaths. This overhead is minor in CPUs, but accelerators are dominated by multiplications, so poor utilization causes large area and energy overheads. We present BitPacker, a new implementation of CKKS that keeps encrypted data packed in fixed-size words, enabling near-full arithmetic efficiency in accelerators. BitPacker is the first redesign of an FHE scheme that targets accelerators. On a state-of-the-art accelerator, BitPacker improves performance by gmean 59% and by up to 3×.
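
A minimal sketch of the fixed-size-word idea in isolation: pack a stream of residues with heterogeneous bit widths densely into 64-bit words instead of giving each residue its own 64-bit slot and wasting the high bits. The layout is a simplified illustration, not BitPacker's actual CKKS data representation.

```python
# Dense bit-packing of variable-width residues into fixed 64-bit words.
WORD_BITS = 64

def pack(values_and_widths):
    words, cur, used = [], 0, 0
    for value, width in values_and_widths:
        assert 0 <= value < (1 << width)
        take = min(width, WORD_BITS - used)
        cur |= (value & ((1 << take) - 1)) << used     # low bits into current word
        used += take
        if used == WORD_BITS:
            words.append(cur)
            cur, used = value >> take, width - take    # spill high bits to next word
    if used or not words:
        words.append(cur)
    return words

def unpack(words, widths):
    out, word_idx, used = [], 0, 0
    for width in widths:
        value, got = 0, 0
        while got < width:
            take = min(width - got, WORD_BITS - used)
            chunk = (words[word_idx] >> used) & ((1 << take) - 1)
            value |= chunk << got
            got, used = got + take, used + take
            if used == WORD_BITS:
                word_idx, used = word_idx + 1, 0
        out.append(value)
    return out

data = [(0x2B, 30), (0x3FFFFFFFF, 34), (0x1F, 45)]     # 30-, 34-, 45-bit residues
words = pack(data)
print(unpack(words, [w for _, w in data]) == [v for v, _ in data])   # True
```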

DOI: 10.1145/3620665.3640397


BVAP: Energy and Memory Efficient Automata Processing for Regular Expressions with Bounded Repetitions

Authors: Wen, Ziyuan and Kong, Lingkun and Le Glaunec, Alexis and Mamouras, Konstantinos and Yang, Kaiyuan
Keywords: action-homogeneous nondeterministic bit vector automata, automata processor, energy efficiency

Abstract

Regular pattern matching is pervasive in applications such as text processing, malware detection, network security, and bioinformatics. Recent studies have demonstrated specialized in-memory automata processors with superior energy and memory efficiencies than existing computing platforms. Yet, they lack efficient support for the construct of bounded repetition that is widely used in regular expressions (regexes). This paper presents BVAP, a software-hardware co-designed in-memory Bit Vector Automata Processor. It is enabled by a novel theoretical model called Action-Homogeneous Non-deterministic Bit Vector Automata (AH-NBVA), its efficient hardware implementation, and a compiler that translates regexes into hardware configurations. BVAP is evaluated with a cycle-accurate simulator in a 28nm CMOS process, achieving 67-95% higher energy efficiency and 42-68% lower area, compared to state-of-the-art automata processors (CA, eAP, and CAMA), across a set of real-world benchmarks.

DOI: 10.1145/3620665.3640412


Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs

Authors: Pan, Zhewen and San Miguel, Joshua and Wu, Di
Keywords: value-level parallelism, value reuse, temporal computing, low-precision, batch processing, multiplier-free

Abstract

In recent years, hardware architectures optimized for general matrix multiplication (GEMM) have been well studied to deliver better performance and efficiency for deep neural networks. With trends towards batched, low-precision data, e.g., FP8 format in this work, we observe that there is growing untapped potential for value reuse. We propose a novel computing paradigm, value-level parallelism, whereby unique products are computed only once, and different inputs subscribe to (select) their products via temporal coding. Our architecture, Carat, employs value-level parallelism and transforms multiplication into accumulation, performing GEMMs with efficient multiplier-free hardware. Experiments show that, on average, Carat improves iso-area throughput and energy efficiency by 1.02×.
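
A sketch of the value-reuse intuition: with low-precision operands (e.g., FP8-like quantization) there are only a few distinct values, so each unique product can be computed once and every occurrence of the same (weight, activation) pair reuses it. The dictionary below stands in for Carat's hardware subscription/temporal-coding mechanism; the quantization levels are hypothetical.

```python
# Dot product with value reuse: only unique (w, a) pairs reach the "multiplier".
def dot_with_value_reuse(weights, activations):
    product_cache = {}          # (w, a) -> w * a, computed at most once
    multiplies = 0
    acc = 0.0
    for w, a in zip(weights, activations):
        key = (w, a)
        if key not in product_cache:
            product_cache[key] = w * a
            multiplies += 1
        acc += product_cache[key]
    return acc, multiplies

# 8 distinct quantized levels -> many repeated (w, a) pairs in a long dot product
levels = [i / 4 for i in range(-4, 4)]
weights = [levels[(3 * i) % 8] for i in range(256)]
activations = [levels[(5 * i) % 8] for i in range(256)]
result, unique_mults = dot_with_value_reuse(weights, activations)
print(result, unique_mults, "multiplies instead of", len(weights))
```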

DOI: 10.1145/3620665.3640364


CIM-MLC: A Multi-level Compilation Stack for Computing-In-Memory Accelerators

Authors: Qu, Songyun and Zhao, Shixin and Li, Bing and He, Yintao and Cai, Xuyi and Zhang, Lei and Wang, Ying
Keywords: No keywords

Abstract

In recent years, various computing-in-memory (CIM) processors have been presented, showing superior performance over traditional architectures. To unleash the potential of various CIM architectures, which differ in device precision, crossbar size, and crossbar number, it is necessary to develop compilation tools that are fully aware of the CIM architectural details and implementation diversity. However, due to the lack of architectural support in current popular open-source compiling stacks such as TVM, existing CIM designs either manually deploy networks or build their own compilers, which is time-consuming and labor-intensive. Although some works expose the specific CIM device programming interfaces to compilers, they are often bound to a fixed CIM architecture, lacking the flexibility to support the CIM architectures with different computing granularity. On the other hand, existing compilation works usually consider the scheduling of limited operation types (such as crossbar-bound matrix-vector multiplication). Unlike conventional processors, CIM accelerators are featured by their diverse architecture, circuit, and device, which cannot be simply abstracted by a single level if we seek to fully explore the advantages brought by CIM. Therefore, we propose CIM-MLC, a universal multi-level compilation framework for general CIM architectures. In this work, we first establish a general hardware abstraction for CIM architectures and computing modes to represent various CIM accelerators. Based on the proposed abstraction, CIM-MLC can compile tasks onto a wide range of CIM accelerators having different devices, architectures, and programming interfaces. More importantly, compared with existing compilation work, CIM-MLC can explore the mapping and scheduling strategies across multiple architectural tiers in CIM, which form a tractable yet effective design space, to achieve better scheduling and instruction generation results. Experimental results show that CIM-MLC achieves 3.2×.

DOI: 10.1145/3620665.3640359


CMC: Video Transformer Acceleration via CODEC Assisted Matrix Condensing

作者: Song, Zhuoran and Qi, Chunyu and Liu, Fangxin and Jing, Naifeng and Liang, Xiaoyao
关键词: video transformer, CODEC, deep neural network accelerator

Abstract

Video Transformers (VidTs) have reached the forefront of accuracy in various video understanding tasks. Despite their remarkable achievements, the processing requirements for a large number of video frames still present a significant performance bottleneck, impeding their deployment to resource-constrained platforms. While accelerators meticulously designed for Vision Transformers (ViTs) have emerged, they may not be the optimal solution for VidTs, primarily due to two reasons. These accelerators tend to overlook the inherent temporal redundancy that characterizes VidTs, limiting their chance for further performance enhancement. Moreover, incorporating a sparse attention prediction module within these accelerators incurs a considerable overhead. To this end, we turn our attention to the video CODEC, which is essential for video preprocessing and can be utilized to detect the temporal and spatial similarity in raw video frames, showcasing the potential of exploring the temporal and spatial redundancy in VidTs while avoiding significant costs on prediction. This paper proposes CMC, the first CODEC-assisted algorithm-accelerator co-design framework for VidT acceleration. Specifically, from the algorithm aspects, we offer CODEC-friendly inter- and intra-matrix prediction algorithms to identify the informative data on-the-fly. We then design a recovery algorithm so that we can safely skip the computation on non-informative data in the temporal and spatial domains and recover their results by copying the informative data’s features to preserve accuracy. From the hardware aspects, we propose to augment the video CODEC to make it efficiently implement inter- and intra-matrix prediction algorithms with negligible costs. Additionally, we propose a specialized CMC architecture that includes a recovery engine with fine-grained buffer management to translate the computational saving in the algorithm to real speedup. Experiments show that CMC can achieve 2.1×

DOI: 10.1145/3620665.3640393


Codesign of quantum error-correcting codes and modular chiplets in the presence of defects

作者: Lin, Sophia Fuhui and Viszlai, Joshua and Smith, Kaitlin N. and Ravi, Gokul Subramanian and Yuan, Charles and Chong, Frederic T. and Brown, Benjamin J.
关键词: No keywords

Abstract

Fabrication errors pose a significant challenge in scaling up solid-state quantum devices to the sizes required for fault-tolerant (FT) quantum applications. To mitigate the resource overhead caused by fabrication errors, we combine two approaches: (1) leveraging the flexibility of a modular architecture, (2) adapting the procedure of quantum error correction (QEC) to account for fabrication defects. We simulate the surface code adapted to defective qubit arrays to find metrics that characterize how defects affect fidelity. We then use our simulations to determine the impact of defects on the resource overhead of realizing a fault-tolerant quantum computer on a chiplet-based modular architecture. Our QEC simulation adapts the syndrome readout circuit for the surface code to account for an arbitrary distribution of defects. Our simulations show that our strategy for dealing with fabrication defects demonstrates an exponential suppression of logical failure, where error rates of non-defective physical qubits are ~ 0.1% for a circuit-based noise model. This is a typical regime on which we imagine running the defect-free surface code. We use our numerical results to establish post-selection criteria for assembling a device with defective chiplets. Using our criteria, we then evaluate the resource overhead in terms of the average number of physical qubits fabricated for a logical qubit to obtain a target logical error rate. We find that an optimal choice of chiplet size, based on the defect rate and target performance, is essential to limiting any additional error correction overhead due to defects. When the optimal chiplet size is chosen, at a defect rate of 1% the resource overhead can be reduced to below 3X and 6X respectively for the two defect models we use, for a wide range of target performance. Without tolerance to defects, the overhead grows exponentially as we increase the number of physical qubits in each logical qubit to achieve better performance, and also grows faster with an increase in the defect rate. When the target code distance is 27, the resource overhead of the defect-intolerant, modular approach is 45X and more than 105X higher than the super-stabilizer approach, respectively, at a defect rate of 0.1% and 0.3%. We also determine cutoff fidelity values that help identify whether a qubit should be disabled or kept as part of the QEC code.

DOI: 10.1145/3620665.3640362


Compiling Loop-Based Nested Parallelism for Irregular Workloads

作者: Su, Yian and Rainey, Mike and Wanninger, Nick and Dhiantravan, Nadharm and Liang, Jasper and Acar, Umut A. and Dinda, Peter and Campanoni, Simone
关键词: No keywords

Abstract

Modern programming languages offer special syntax and semantics for logical fork-join parallelism in the form of parallel loops, allowing them to be nested, e.g., a parallel loop within another parallel loop. This expressiveness comes at a price, however: on modern multicore systems, realizing logical parallelism results in overheads due to the creation and management of parallel tasks, which can wipe out the benefits of parallelism. Today, we expect application programmers to cope with it by manually tuning and optimizing their code. Such tuning requires programmers to reason about architectural factors hidden behind layers of software abstractions, such as task scheduling and load balancing. Managing these factors is particularly challenging when workloads are irregular because their performance is input-sensitive. This paper presents HBC, the first compiler that translates C/C++ programs with high-level, fork-join constructs (e.g., OpenMP) to binaries capable of automatically controlling the cost of parallelism and dealing with irregular, input-sensitive workloads. The basis of our approach is Heartbeat Scheduling, a recent proposal for automatic granularity control, which is backed by formal guarantees on performance. HBC binaries outperform OpenMP binaries for workloads for which even entirely manual solutions struggle to find the right balance between parallelism and its costs.

DOI: 10.1145/3620665.3640405


Cornucopia Reloaded: Load Barriers for CHERI Heap Temporal Safety

作者: Filardo, Nathaniel Wesley and Gutstein, Brett F. and Woodruff, Jonathan and Clarke, Jessica and Rugg, Peter and Davis, Brooks and Johnston, Mark and Norton, Robert and Chisnall, David and Moore, Simon W. and Neumann, Peter G. and Watson, Robert N. M.
关键词: capability revocation, CHERI, temporal safety, use after free

Abstract

Violations of temporal memory safety (“use after free”, “UAF”) continue to pose a significant threat to software security. The CHERI capability architecture has shown promise as a technology for C and C++ language reference integrity and spatial memory safety. Building atop CHERI, prior works - CHERIvoke and Cornucopia - have explored adding heap temporal safety. The most pressing limitation of Cornucopia was its impractical “stop-the-world” pause times. We present Cornucopia Reloaded, a re-designed drop-in replacement implementation of CHERI temporal safety, using a novel architectural feature - a per-page capability load barrier, added in Arm’s Morello prototype CPU and CHERI-RISC-V - to nearly eliminate application pauses. We analyze the performance of Reloaded as well as Cornucopia and CHERIvoke on Morello, using the CHERI-compatible SPEC CPU2006 INT workloads to assess its impact on batch workloads and using pgbench and gRPC QPS as surrogate interactive workloads. Under Reloaded, applications no longer experience significant revocation-induced stop-the-world periods, without additional wall- or CPU-time cost over Cornucopia and with median 87% of Cornucopia’s DRAM traffic overheads across SPEC CPU2006 and < 50% for pgbench.

DOI: 10.1145/3620665.3640416


Design of Novel Analog Compute Paradigms with Ark

作者: Wang, Yu-Neng and Cowan, Glenn and Rü
关键词: analog computing, unconventional computing paradigms, hardware-algorithm co-design

Abstract

Previous efforts on reconfigurable analog circuits mostly focused on specialized analog circuits, produced through careful co-design, or on highly reconfigurable, but relatively resource inefficient, accelerators that implement analog compute paradigms. This work deals with an intermediate point in the design space: specialized reconfigurable circuits for analog compute paradigms. This class of circuits requires new methodologies for performing co-design, as prior techniques are typically highly specialized to conventional circuit classes (e.g., filters, ADCs). In this context, we present Ark, a programming language for describing analog compute paradigms. Ark enables progressive incorporation of analog behaviors into computations, and deploys a validator and dynamical system compiler for verifying and simulating computations. We use Ark to codify the design space for three different exemplary circuit design problems, and demonstrate that Ark helps explore design trade-offs and evaluate the impact of non-idealities on the computation.

DOI: 10.1145/3620665.3640419


Direct Memory Translation for Virtualized Clouds

作者: Zhang, Jiyuan and Jia, Weiwei and Chai, Siyuan and Liu, Peizhe and Kim, Jongyul and Xu, Tianyin
关键词: cloud computing, virtualization, virtual memory, address translation

Abstract

Virtual memory translation has become a key performance bottleneck of memory-intensive workloads in virtualized cloud environments. On the x86 architecture, a nested translation needs to sequentially fetch up to 24 page table entries (PTEs). This paper presents Direct Memory Translation (DMT), a hardware-software extension for x86-based virtual memory that minimizes translation overhead while maintaining backward compatibility with x86. In DMT, the OS manages last-level PTEs in a contiguous physical memory region, termed Translation Entry Areas (TEAs). DMT establishes a direct mapping from each virtual page in a Virtual Memory Area (VMA) to the corresponding PTE in a TEA. Since processes manage memory with a handful of major VMAs, the mapping can be maintained per VMA and effectively stored in a few dedicated registers. DMT further optimizes virtualized memory translation via guest-host cooperation by directly allocating guest TEAs in physical memory, bypassing intermediate virtualization layers. DMT is inherently scalable—it takes one, two, and three memory references in native, virtualized, and nested virtualized setups. Its scalability enables hardware-assisted translation for nested virtualization. Our evaluation shows that DMT significantly speeds up page walks by an average of 1.58x (1.65x with THP) in a virtualized setup, resulting in 1.20x (1.14x with THP) speedup of application execution on average.
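To make the abstract's direct-mapping idea concrete, the sketch below shows the address arithmetic of resolving a virtual address through a per-VMA register that points at a contiguous Translation Entry Area. The register fields, 4 KiB pages, and 8-byte PTEs are illustrative assumptions, not the paper's exact hardware encoding.

```python
PAGE_SHIFT = 12   # assume 4 KiB pages
PTE_SIZE = 8      # assume 8-byte last-level PTEs

class VMARegister:
    """Hypothetical per-VMA translation register: covers [vma_base, vma_end)
    and points at a contiguous Translation Entry Area (TEA) in physical memory."""
    def __init__(self, vma_base, vma_end, tea_base):
        self.vma_base, self.vma_end, self.tea_base = vma_base, vma_end, tea_base

def direct_translate(vaddr, vma_regs):
    """One memory reference instead of a multi-level walk: pick the matching
    VMA register, then index the TEA with the page offset inside the VMA."""
    for reg in vma_regs:                      # a handful of major VMAs
        if reg.vma_base <= vaddr < reg.vma_end:
            page_index = (vaddr - reg.vma_base) >> PAGE_SHIFT
            pte_addr = reg.tea_base + page_index * PTE_SIZE
            return pte_addr                   # a single PTE fetch resolves vaddr
    return None                               # fall back to a conventional walk

regs = [VMARegister(0x7f0000000000, 0x7f0000400000, 0x20_0000)]
print(hex(direct_translate(0x7f0000003abc, regs)))  # TEA slot for page 3
```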

DOI: 10.1145/3620665.3640358


Efficient Microsecond-scale Blind Scheduling with Tiny Quanta

作者: Luo, Zhihong and Son, Sam and Bali, Dev and Amaro, Emmanuel and Ousterhout, Amy and Ratnasamy, Sylvia and Shenker, Scott
关键词: blind scheduling, microsecond scale, tail latency

Abstract

A longstanding performance challenge in datacenter-based applications is how to efficiently handle incoming client requests that spawn many very short (μs scale) jobs that must be handled with high throughput and low tail latency. When no assumptions are made about the duration of individual jobs, or even about the distribution of their durations, this requires blind scheduling with frequent and efficient preemption, which is not scalably supported for μs-level tasks. We present Tiny Quanta (TQ), a system that enables efficient blind scheduling of μs-level workloads. TQ performs fine-grained preemptive scheduling and does so with high performance via a novel combination of two mechanisms: forced multitasking and two-level scheduling. Evaluations with a wide variety of μs-level workloads show that TQ achieves low tail latency while sustaining 1.2x to 6.8x the throughput of prior blind scheduling systems.

DOI: 10.1145/3620665.3640381


Eliminating Storage Management Overhead of Deduplication over SSD Arrays Through a Hardware/Software Co-Design

作者: Wen, Yuhong and Zhao, Xiaogang and Zhou, You and Zhang, Tong and Yang, Shangjun and Xie, Changsheng and Wu, Fei
关键词: SSD array, hardware/software co-design, deduplication, storage systems

Abstract

This paper presents a hardware/software co-design solution to efficiently implement block-layer deduplication over SSD arrays. By introducing complex and varying dependency over the entire storage space, deduplication is infamously subject to high storage management overheads in terms of CPU/memory resource usage and I/O performance degradation. To fundamentally address this problem, one intuitive idea is to offload deduplication storage management from host into SSDs, which is motivated by the redundant dual address mapping in host-side deduplication layer and intra-SSD flash translation layer (FTL). The practical implementation of this idea is nevertheless challenging because of the array-wide deduplication vs. per-SSD FTL management scope mismatch. Aiming to tackle this challenge, this paper presents a solution, called ARM-Dedup, that makes SSD FTL deduplication-oriented and array-aware and accordingly re-architects deduplication software to achieve lightweight and high-performance deduplication over an SSD array. We implemented an ARM-Dedup prototype based on the Linux Dmdedup engine and mdraid software RAID over FEMU SSD emulators. Experimental results show that ARM-Dedup has good scalability and can improve system performance significantly, such as by up to 272% and 127% higher IOPS in synthetic and real-world workloads, respectively.

DOI: 10.1145/3620665.3640368


Elivagar: Efficient Quantum Circuit Search for Classification

作者: Anagolum, Sashwat and Alavisamani, Narges and Das, Poulami and Qureshi, Moinuddin and Shi, Yunong
关键词: No keywords

Abstract

Designing performant and noise-robust circuits for Quantum Machine Learning (QML) is challenging — the design space scales exponentially with circuit size, and there are few well-supported guiding principles for QML circuit design. Although recent Quantum Circuit Search (QCS) methods attempt to search for such circuits, they directly adopt designs from classical Neural Architecture Search (NAS) that are misaligned with the unique constraints of quantum hardware, resulting in high search overheads and severe performance bottlenecks. We present Élivágar

DOI: 10.1145/3620665.3640354


Energy Efficient Convolutions with Temporal Arithmetic

作者: Gretsch, Rhys and Song, Peiyang and Madhavan, Advait and Lau, Jeremy and Sherwood, Timothy
关键词: No keywords

Abstract

Convolution is an important operation at the heart of many applications, including image processing, object detection, and neural networks. While data movement and coordination operations continue to be important areas for optimization in general-purpose architectures, for computation fused with sensor operation, the underlying multiply-accumulate (MAC) operations dominate power consumption. Non-traditional data encoding has been shown to reduce the energy consumption of this arithmetic, with options including everything from reduced-precision floating point to fully stochastic operation, but all of these approaches start with the assumption that a complete analog-to-digital conversion (ADC) has already been done for each pixel. While analog-to-time converters have been shown to use less energy, arithmetically manipulating temporally encoded signals beyond simple min, max, and delay operations has not previously been possible, meaning operations such as convolution have been out of reach. In this paper we show that arithmetic manipulation of temporally encoded signals is possible, practical to implement, and extremely energy efficient. The core of this new approach is a negative log transformation of the traditional numeric space into a ‘delay space’ where scaling (multiplication) becomes delay (addition in time). The challenge lies in dealing with addition and subtraction. We show these operations can also be done directly in this negative log delay space, that the associative and commutative properties still apply to the transformed operations, and that accurate approximations can be built efficiently in hardware using delay elements and basic CMOS logic elements. Furthermore, we show that these operations can be chained together in space or operated recurrently in time. This approach fits naturally into the staged ADC readout inherent to most modern cameras. To evaluate our approach, we develop a software system that automatically transforms traditional convolutions into delay space architectures. The resulting system is used to analyze and balance error from both a new temporal equivalent of quantization and delay element noise, resulting in designs that improve the energy per pixel of each convolution frame by more than 2×
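The key transform described in the abstract is easy to state numerically: mapping values into a negative-log "delay space" turns multiplication into addition of delays, while addition becomes a log-sum-exp that the hardware approximates with delay elements. The sketch below shows only the ideal math with hypothetical helper names; it does not model the paper's hardware approximation or its noise analysis.

```python
import math

def to_delay(x):
    """Negative-log transform into 'delay space' (x > 0)."""
    return -math.log(x)

def from_delay(d):
    return math.exp(-d)

# Multiplication: scaling becomes a pure delay (addition in time).
a, b = 0.5, 0.25
d_mul = to_delay(a) + to_delay(b)
assert abs(from_delay(d_mul) - a * b) < 1e-12

# Addition must also happen in delay space; mathematically it is a
# log-sum-exp, which the paper approximates with delay elements and
# simple CMOS logic (this formula is the ideal reference, not the
# hardware approximation).
d_add = -math.log(math.exp(-to_delay(a)) + math.exp(-to_delay(b)))
assert abs(from_delay(d_add) - (a + b)) < 1e-12

# A single convolution tap a*x + b*y is therefore two delays plus one
# delay-space addition.
x, y = 0.8, 0.1
d_tap = -math.log(math.exp(-(to_delay(a) + to_delay(x))) +
                  math.exp(-(to_delay(b) + to_delay(y))))
print(from_delay(d_tap), a * x + b * y)   # both 0.425
```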

DOI: 10.1145/3620665.3640395


ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference

作者: Oh, Hyungjun and Kim, Kihong and Kim, Jaemin and Kim, Sungkyun and Lee, Junyeol and Chang, Du-seong and Seo, Jiwon
关键词: LLM inference, scheduling optimization

Abstract

This paper presents ExeGPT, a distributed system designed for constraint-aware LLM inference. ExeGPT finds and runs with an optimal execution schedule to maximize inference throughput while satisfying a given latency constraint. By leveraging the distribution of input and output sequences, it effectively allocates resources and determines optimal execution configurations, including batch sizes and partial tensor parallelism. We also introduce two scheduling strategies based on Round-Robin Allocation and Workload-Aware Allocation policies, suitable for different NLP workloads. We evaluate ExeGPT on six LLM instances of T5, OPT, and GPT-3 and five NLP tasks, each with four distinct latency constraints. Compared to FasterTransformer, ExeGPT achieves up to 15.2×

DOI: 10.1145/3620665.3640383


FaaSGraph: Enabling Scalable, Efficient, and Cost-Effective Graph Processing with Serverless Computing

作者: Liu, Yushi and Sun, Shixuan and Li, Zijun and Chen, Quan and Gao, Sen and He, Bingsheng and Li, Chao and Guo, Minyi
关键词: serverless computing, graph processing, resource sharing, cold start

Abstract

Graph processing is widely used in cloud services; however, current frameworks face challenges in efficiency and cost-effectiveness when deployed under the Infrastructure-as-a-Service model due to its limited elasticity. In this paper, we present FaaSGraph, a serverless-native graph computing scheme that enables efficient and economical graph processing through the co-design of graph processing frameworks and serverless computing systems. Specifically, we design a data-centric serverless execution model to efficiently power heavy computing tasks. Furthermore, we carefully design a graph processing paradigm to seamlessly cooperate with the data-centric model. Our experiments show that FaaSGraph improves end-to-end performance by up to 8.3X and reduces memory usage by up to 52.4% compared to state-of-the-art IaaS-based methods. Moreover, FaaSGraph delivers steady 99%-ile performance in highly fluctuating workloads and reduces monetary cost by 85.7%.

DOI: 10.1145/3620665.3640361


FOCAL: A First-Order Carbon Model to Assess Processor Sustainability

作者: Eeckhout, Lieven
关键词: computer architecture, sustainability, modeling

Abstract

Sustainability in general and global warming in particular are grand societal challenges. Computer systems demand substantial materials and energy resources throughout their entire lifetime. A key question is how computer engineers and scientists can reduce the environmental impact of computing. To overcome the inherent data uncertainty, this paper proposes FOCAL, a parameterized first-order carbon model to assess processor sustainability using first principles. FOCAL’s normalized carbon footprint (NCF) metric guides computer architects to holistically optimize chip area, energy and power consumption to reduce a processor’s environmental footprint. We use FOCAL to analyze and categorize a broad set of archetypal processor mechanisms into strongly, weakly or less sustainable design choices, providing insight and intuition into how to reduce a processor’s environmental footprint with implications to both hardware and software. A case study illustrates a pathway for designing strongly sustainable multicore processors delivering high performance while at the same time reducing their environmental footprint.

DOI: 10.1145/3620665.3640415


FPGA Technology Mapping Using Sketch-Guided Program Synthesis

作者: Smith, Gus Henry and Kushigian, Benjamin and Canumalla, Vishal and Cheung, Andrew and Lyubomirsky, Steven and Porncharoenwase, Sorawee and Just, René
关键词: No keywords

Abstract

FPGA technology mapping is the process of implementing a hardware design expressed in high-level HDL (hardware design language) code using the low-level, architecture-specific primitives of the target FPGA. As FPGAs become increasingly heterogeneous, achieving high performance requires hardware synthesis tools that better support mapping to complex, highly configurable primitives like digital signal processors (DSPs). Current tools support DSP mapping via handwritten special-case mapping rules, which are laborious to write, error-prone, and often overlook mapping opportunities. We introduce Lakeroad, a principled approach to technology mapping via sketch-guided program synthesis. Lakeroad leverages two techniques—architecture-independent sketch templates and semantics extraction from HDL—to provide extensible technology mapping with stronger correctness guarantees and higher coverage of mapping opportunities than state-of-the-art tools. Across representative microbenchmarks, Lakeroad produces 2–3.5×

DOI: 10.1145/3620665.3640387


GIANTSAN: Efficient Memory Sanitization with Segment Folding

作者: Ling, Hao and Huang, Heqing and Wang, Chengpeng and Cai, Yuandao and Zhang, Charles
关键词: No keywords

Abstract

Memory safety sanitizers, the sharp weapon for detecting invalid memory operations during execution, employ runtime metadata to model the memory and help find memory errors hidden in the programs. However, location-based methods, the most widely deployed memory sanitization methods thanks to their high compatibility, face the low protection density issue: the number of bytes safeguarded by one metadata is limited. As a result, numerous memory accesses require loading excessive metadata, leading to a high runtime overhead. To address this issue, we propose a new shadow encoding with segment folding to increase the protection density. Specifically, we characterize neighboring bytes with identical metadata by building novel summaries, called folded segments, on those bytes to reduce unnecessary metadata loadings. The new encoding uses less metadata to safeguard large memory regions, speeding up memory sanitization. We implement our designed technique as GiantSan. Our evaluation using the SPEC CPU 2017 benchmark shows that GiantSan outperforms the state-of-the-art methods with 59.10% and 38.52% less runtime overhead than ASan and ASan-, respectively. Moreover, under the same redzone setting, GiantSan detects 463 fewer false negative cases than ASan and ASan- in testing the real-world project PHP.
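As a rough illustration of segment folding, the toy sketch below folds runs of bytes that share identical shadow metadata into (start, length, value) summaries, so a range check touches a handful of summaries instead of one metadata entry per byte or per small granule. The encoding here is a simplification for exposition and is not GiantSan's actual shadow encoding.

```python
def fold_segments(shadow):
    """Fold runs of identical per-byte shadow metadata into (start, length,
    value) summaries. Toy stand-in for GiantSan-style folding; the real
    shadow encoding is more compact and hierarchical."""
    segments, i = [], 0
    while i < len(shadow):
        j = i
        while j < len(shadow) and shadow[j] == shadow[i]:
            j += 1
        segments.append((i, j - i, shadow[i]))
        i = j
    return segments

def range_is_addressable(segments, start, size, ok=0):
    """Check [start, start+size) by scanning folded segments instead of one
    metadata load per byte; returns (verdict, metadata loads performed)."""
    end, loads = start + size, 0
    for seg_start, seg_len, value in segments:
        if seg_start + seg_len <= start or seg_start >= end:
            continue
        loads += 1
        if value != ok:
            return False, loads
    return True, loads

# 0 = addressable, 0xFA = redzone. A 4096-byte object folds into a single
# segment, so a large memcpy-style check needs one summary load instead of
# thousands of per-byte metadata loads.
shadow = [0] * 4096 + [0xFA] * 16
segs = fold_segments(shadow)
print(range_is_addressable(segs, 0, 4096))    # (True, 1)
print(range_is_addressable(segs, 4090, 16))   # (False, 2): overflows into the redzone
```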

DOI: 10.1145/3620665.3640391


GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

作者: Guo, Cong and Zhang, Rui and Xu, Jiale and Leng, Jingwen and Liu, Zihan and Huang, Ziyu and Guo, Minyi and Wu, Hao and Zhao, Shouren and Zhao, Junping and Zhang, Ke
关键词: memory defragmentation, GPU, deep learning, virtual memory stitching

Abstract

Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, where the memory capacity of a single acceleration device like a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., 10×

DOI: 10.1145/3620665.3640423


Grafu: Unleashing the Full Potential of Future Value Computation for Out-of-core Synchronous Graph Processing

作者: Yang, Tsun-Yu and England, Cale and Li, Yi and Li, Bingzhe and Yang, Ming-Chang
关键词: graph systems, out-of-core, future value computation

Abstract

As graphs have grown exponentially in recent years, out-of-core graph systems have been developed to process large-scale graphs by keeping massive data in storage. Among them, many systems process the graphs iteration-by-iteration and provide synchronous semantics that allows easy programmability by forcing the computation dependency of vertex values between iterations. On the other hand, although future value computation is an effective IO optimization for out-of-core graph systems by computing vertex values of future iterations in advance, it is challenging to take full advantage of future value computation while guaranteeing iteration-based dependency. In fact, based on our investigation, even state-of-the-art work along this direction has a wide gap from optimality in IO reduction and further requires substantial overhead in computation as well as extra memory consumption. This paper presents Grafu, an out-of-core graph system unleashing the full potential of future value computation while providing synchronous semantics. For this goal, three main designs are proposed to optimize future value computation from different perspectives. First, we propose a new elevator execution order to significantly increase the number of future-computed vertices for considerable IO reduction. Second, unlike existing work that uses high-cost barriers to ensure dependency under future value computation, we present a conditional barrier to alleviate this computational overhead by adaptively removing the barriers while guaranteeing dependency. Third, we introduce a new graph reordering technique, greedy coloring, to reduce the extra memory consumption required for future value computation. Our evaluation of various billion-scale graphs reveals that Grafu significantly outperforms other state-of-the-art systems.

DOI: 10.1145/3620665.3640409


Greybox Fuzzing for Concurrency Testing

作者: Wolff, Dylan and Shi, Zheng and Duck, Gregory J. and Mathur, Umang and Roychoudhury, Abhik
关键词: No keywords

Abstract

Uncovering bugs in concurrent programs is a challenging problem owing to the exponentially large search space of thread interleavings. Past approaches towards concurrency testing are either optimistic — relying on random sampling of these interleavings — or pessimistic — relying on systematic exploration of a reduced (bounded) search space. In this work, we suggest a fresh, pragmatic solution neither focused only on formal, systematic testing, nor solely on unguided sampling or stress-testing approaches. We employ a biased random search which guides exploration towards neighborhoods which will likely expose new behavior. As such it is thematically similar to greybox fuzz testing, which has proven to be an effective technique for finding bugs in sequential programs. To identify new behaviors in the domain of interleavings, we prune and navigate the search space using the “reads-from” relation. Our approach is significantly more efficient at finding bugs per schedule exercised than other state-of-the-art concurrency testing tools and approaches. Experiments on widely used concurrency datasets also show that our greybox fuzzing inspired approach gives a strict improvement over a randomized baseline scheduling algorithm in practice via a more uniform exploration of the schedule space. We make our concurrency testing infrastructure “Reads-From Fuzzer” (RFF) available for experimentation and usage by the wider community to aid future research.

DOI: 10.1145/3620665.3640389


Societal infrastructure in the age of Artificial General Intelligence

作者: Vahdat, Amin
关键词: No keywords

Abstract

Today, we are at an inflection point in computing where emerging Generative AI services are placing unprecedented demand for compute while the existing architectural patterns for improving efficiency have stalled. In this talk, we will discuss the likely needs of the next generation of computing infrastructure and use recent examples at Google from networks to accelerators to servers to illustrate the challenges and opportunities ahead. Taken together, we chart a course where computing must be increasingly specialized and co-optimized with algorithms and software, all while fundamentally focusing on security and sustainability.

DOI: 10.1145/3620666.3655589


Challenges and Opportunities for Systems Using CXL Memory

作者: Witchel, Emmett
关键词: No keywords

Abstract

We are at the start of the technology cycle for compute express link (CXL) memory, which is a significant opportunity and challenge for architecture, operating systems, and programming languages. The 3.0 CXL specification allows multiple, physically attached hosts to dynamically share memory. We call such a configuration a CXL pod. Pods provide an intermediate hardware configuration between a network of machines, each with their private memory, and a shared memory multiprocessor with a unified memory, accessible to all machines. This talk will discuss system support for single-node applications to attain scalable performance and high availability across a CXL pod as well as pointing out likely technical challenges for future systems. Along with the technical content, the talk categorizes computer systems research using the model of storytelling with a beginning, a middle and an end. We also examine the fascination popular culture has with personal aha moments and weigh their importance for a group working on an impending submission deadline.

DOI: 10.1145/3620666.3655590


Harnessing the Power of Specialization for Sustainable Computing

作者: Eilam, Tamar
关键词: No keywords

Abstract

Computing is critical to address some of the most pressing needs of humanity today, including climate change mitigation and adaptation. However, it is also the source of a significant and steadily increasing carbon toll, attributed in part to the exponential growth in energy-demanding workloads, such as artificial intelligence (AI). Due to the demise of Dennard scaling, we can no longer count on exponentially improving energy efficiency of general-purpose processors. Therefore, today’s operational efficiency gains rely on specialized hardware. In this talk I will discuss the promise and the perils of harnessing specialization on the road to sustainable computing.

DOI: 10.1145/3620666.3655591


AWS Trainium: The Journey for Designing and Optimization Full Stack ML Hardware

作者: Bshara, Nafea
关键词: No keywords

Abstract

Machine learning accelerators present a unique set of design challenges across chip architecture, instruction set, server design, compiler, and both inter- and intra-chip connectivity. With AWS Trainium, we’ve utilized AWS’s end-to-end ownership from chip to server, network, compilers, and runtime tools to collaboratively design and optimize across all layers, emphasizing simplicity and ease of use. This talk will illustrate the design principles, tradeoffs, and lessons learned during the development of three generations of AWS ML products, from conceptualization to placing systems in the hands of AWS customers.

DOI: 10.1145/3620666.3655592


8-bit Transformer Inference and Fine-tuning for Edge Accelerators

作者: Yu, Jeffrey and Prabhu, Kartik and Urman, Yonatan and Radway, Robert M. and Han, Eric and Raina, Priyanka
关键词: No keywords

Abstract

Transformer models achieve state-of-the-art accuracy on natural language processing (NLP) and vision tasks, but demand significant computation and memory resources, which makes it difficult to perform inference and training (fine-tuning) on edge accelerators. Quantization to lower precision data types is a promising way to reduce computation and memory resources. Prior work has employed 8-bit integer (int8) quantization for Transformer inference, but int8 lacks the precision and range required for training. 8-bit floating-point (FP8) quantization has been used for Transformer training, but prior work only quantizes the inputs to matrix multiplications and leaves the rest of the operations in high precision. This work conducts an in-depth analysis of Transformer inference and fine-tuning at the edge using two 8-bit floating-point data types: FP8 and 8-bit posit (Posit8). Unlike FP8, posit has variable length exponent and fraction fields, leading to higher precision for values around 1, making it well suited for storing Transformer weights and activations. As opposed to prior work, we evaluate the impact of quantizing all operations in both the forward and backward passes, going beyond just matrix multiplications. Specifically, our work makes the following contributions: (1) We perform Transformer inference in FP8 and Posit8, achieving less than 1% accuracy loss compared to BFloat16 through operation fusion, without the need for scaling factors. (2) We perform Transformer fine-tuning in 8 bits by adapting low-rank adaptation (LoRA) to Posit8 and FP8, enabling 8-bit GEMM operations with increased multiply-accumulate efficiency and reduced memory accesses. (3) We design an area- and power-efficient posit softmax, which employs bitwise operations to approximate the exponential and reciprocal functions. The resulting vector unit in the Posit8 accelerator, that performs both softmax computation and other element-wise operations in Transformers, is on average 33% smaller and consumes 35% less power than the vector unit in the FP8 accelerator, while maintaining the same level of accuracy. Our work demonstrates that both Posit8 and FP8 can achieve inference and fine-tuning accuracy comparable to BFloat16, while reducing accelerator’s area by 30% and 34%, and power consumption by 26% and 32%, respectively.
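To make the posit format's "variable length exponent and fraction fields" concrete, here is a small decoder for an 8-bit posit, assuming es = 2 (a common configuration; the paper's exact choice may differ). Near 1.0 the regime is short, leaving more fraction bits and finer spacing; near the extremes the regime consumes most of the word and precision tapers off.

```python
def decode_posit8(bits: int, es: int = 2) -> float:
    """Decode an 8-bit posit (assuming es exponent bits) into a float."""
    bits &= 0xFF
    if bits == 0x00:
        return 0.0
    if bits == 0x80:
        return float("nan")                    # NaR (not a real)
    sign = -1.0 if bits & 0x80 else 1.0
    if bits & 0x80:
        bits = (-bits) & 0xFF                  # negative posits: 2's complement
    rest = bits & 0x7F                         # 7 bits after the sign
    lead = (rest >> 6) & 1
    run = 0                                    # regime: run of identical bits
    for i in range(6, -1, -1):
        if (rest >> i) & 1 == lead:
            run += 1
        else:
            break
    k = run - 1 if lead else -run
    remaining = max(0, 8 - (1 + run + 1))      # bits left after the regime terminator
    tail = rest & ((1 << remaining) - 1)
    e_bits = min(es, remaining)
    exp = (tail >> (remaining - e_bits)) << (es - e_bits) if e_bits else 0
    frac_bits = remaining - e_bits
    frac = tail & ((1 << frac_bits) - 1)
    fraction = 1.0 + (frac / (1 << frac_bits) if frac_bits else 0.0)
    return sign * fraction * 2.0 ** (k * (1 << es) + exp)

# Tapered precision: fine steps near 1.0, coarse steps near the extremes.
print(decode_posit8(0x40), decode_posit8(0x41))   # 1.0, 1.125
print(decode_posit8(0x7E), decode_posit8(0x7F))   # 1048576.0, 16777216.0
print(decode_posit8(0xC0))                        # -1.0
```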

DOI: 10.1145/3620666.3651368


A Midsummer Night’s Tree: Efficient and High Performance Secure SCM

作者: Thomas, Samuel and Workneh, Kidus and McCarty, Jac and Izraelevitz, Joseph and Lehman, Tamara and Bahar, R. Iris
关键词: No keywords

Abstract

Secure memory is a highly desirable property to prevent memory corruption-based attacks. The emergence of nonvolatile, storage class memory (SCM) devices presents new challenges for secure memory. Metadata for integrity verification, organized in a Bonsai Merkle Tree (BMT), is cached on-chip in volatile caches, and may be lost on a power failure. As a consequence, care is required to ensure that metadata updates are always propagated into SCM. To optimize metadata updates, state-of-the-art approaches propose lazy update crash consistent metadata schemes. However, few consider the implications of their optimizations on on-chip area, which leads to inefficient utilization of scarce on-chip space. In this paper, we propose A Midsummer Night’s Tree (AMNT), a novel “tree within a tree” approach to provide crash consistent integrity with low run-time overhead while limiting on-chip area for security metadata. Our approach offloads the potential hardware complexity of our technique to software to keep area overheads low. Our proposed mechanism results in significant improvements (a 41% reduction in execution overhead on average versus the state-of-the-art) for in-memory storage applications while significantly reducing the required on-chip area to implement our protocol.

DOI: 10.1145/3620666.3651354


A shared compilation stack for distributed-memory parallelism in stencil DSLs

作者: Bisbas, George and Lydike, Anton and Bauer, Emilien and Brown, Nick and Fehr, Mathieu and Mitchell, Lawrence and Rodriguez-Canal, Gabriel and Jamieson, Maurice and Kelly, Paul H. J. and Steuwer, Michel and Grosser, Tobias
关键词: message passing, MPI, MLIR, SSA, domain-specific languages, intermediate representations, stencil computations

Abstract

Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloed design of DSL compilers and the resulting inability to benefit from shared infrastructure cause uncertainties around longevity and the adoption of DSLs at scale. By tailoring the broadly-adopted MLIR compiler framework to HPC, we bring the same synergies that the machine learning community already exploits across their DSLs (e.g. Tensorflow, PyTorch) to the finite-difference stencil HPC community. We introduce new HPC-specific abstractions for message passing targeting distributed stencil computations. We demonstrate the sharing of common components across three distinct HPC stencil-DSL compilers: Devito, PSyclone, and the Open Earth Compiler, showing that our framework generates high-performance executables based upon a shared compiler ecosystem.

DOI: 10.1145/3620666.3651344


Accelerating Multi-Scalar Multiplication for Efficient Zero Knowledge Proofs with Multi-GPU Systems

作者: Ji, Zhuoran and Zhang, Zhiyuan and Xu, Jiming and Ju, Lei
关键词: zero knowledge proof, multi-scalar multiplication, multi-GPU systems, pippenger’s algorithm

Abstract

Zero-knowledge proof is a cryptographic primitive that allows for the validation of statements without disclosing any sensitive information, foundational in applications like verifiable outsourcing and digital currency. However, the extensive proof generation time limits its widespread adoption. Even with GPU acceleration, proof generation can still take minutes, with Multi-Scalar Multiplication (MSM) accounting for about 78.2% of the workload. To address this, we present DistMSM, a novel MSM algorithm tailored for distributed multi-GPU systems. At the algorithmic level, DistMSM adapts Pippenger’s algorithm for multi-GPU setups, effectively identifying and addressing bottlenecks that emerge during scaling. At the GPU kernel level, DistMSM introduces an elliptic curve arithmetic kernel tailored for contemporary GPU architectures. It optimizes register pressure with two innovative techniques and leverages tensor cores for specific big integer multiplications. Compared to state-of-the-art MSM implementations, DistMSM offers an average 6.39×

DOI: 10.1145/3620666.3651364


ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache Optimizations

作者: Lu, Xiaoyang and Long, Boyu and Chen, Xiaoming and Han, Yinhe and Sun, Xian-He
关键词: No keywords

Abstract

Sparse matrix-matrix multiplication (SpMM) is a critical computational kernel in numerous scientific and machine learning applications. SpMM involves massive irregular memory accesses and poses great challenges to conventional cache-based computer architectures. Recently dedicated SpMM accelerators have been proposed to enhance SpMM performance. However, current SpMM accelerators still face challenges in adapting to varied sparse patterns, fully exploiting inherent parallelism, and optimizing cache performance. To address these issues, we introduce ACES, a novel SpMM accelerator in this study. First, ACES features an adaptive execution flow that dynamically adjusts to diverse sparse patterns. The adaptive execution flow balances parallel computing efficiency and data reuse. Second, ACES incorporates locality-concurrency co-optimizations within the global cache. ACES utilizes a concurrency-aware cache management policy, which considers data locality and concurrency for optimal replacement decisions. Additionally, the integration of a non-blocking buffer with the global cache enhances concurrency and reduces computational stalls. Third, the hardware architecture of ACES is designed to integrate all innovations. The architecture ensures efficient support across the adaptive execution flow, advanced cache optimizations, and fine-grained parallel processing. Our performance evaluation demonstrates that ACES significantly outperforms existing solutions, providing a 2.1×

DOI: 10.1145/3620666.3651381


AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning

作者: Sun, Zhenbo and Cao, Huanqi and Wang, Yuanwei and Feng, Guanyu and Chen, Shengqi and Wang, Haojie and Chen, Wenguang
关键词: No keywords

Abstract

Large language models (LLMs) have demonstrated powerful capabilities, requiring huge memory with their increasing sizes and sequence lengths, thus demanding larger parallel systems. The broadly adopted pipeline parallelism introduces even heavier and unbalanced memory consumption. Recomputation is a widely employed technique to mitigate the problem but introduces extra computation overhead. This paper proposes AdaPipe, which aims to find the optimized recomputation and pipeline stage partitioning strategy. AdaPipe employs adaptive recomputation to maximize memory utilization and reduce the computation cost of each pipeline stage. A flexible stage partitioning algorithm is also adopted to balance the computation between different stages. We evaluate AdaPipe by training two representative models, GPT-3 (175B) and Llama 2 (70B), achieving up to 1.32×

DOI: 10.1145/3620666.3651359


AERO: Adaptive Erase Operation for Improving Lifetime and Performance of Modern NAND Flash-Based SSDs

作者: Cho, Sungjun and Kim, Beomjun and Cho, Hyunuk and Seo, Gyeongseob and Mutlu, Onur and Kim, Myungsuk and Park, Jisung
关键词: solid state drives (SSDs), NAND flash memory, erase operation, SSD lifetime, I/O performance

Abstract

This work investigates a new erase scheme in NAND flash memory to improve the lifetime and performance of modern solid-state drives (SSDs). In NAND flash memory, an erase operation applies a high voltage (e.g., > 20 V) to flash cells for a long time (e.g., > 3.5 ms), which degrades cell endurance and potentially delays user I/O requests. While a large body of prior work has proposed various techniques to mitigate the negative impact of erase operations, no work has yet investigated how erase latency should be set to fully exploit the potential of NAND flash memory; most existing techniques use a fixed latency for every erase operation which is set to cover the worst-case operating conditions. To address this, we propose Aero (Adaptive ERase Operation), a new erase scheme that dynamically adjusts erase latency to be just long enough for reliably erasing target cells, depending on the cells’ current erase characteristics. Aero accurately predicts such near-optimal erase latency based on the number of fail bits during an erase operation. To maximize its benefits, we further optimize Aero in two aspects. First, at the beginning of an erase operation, Aero attempts to erase the cells for a short time (e.g., 1 ms), which enables Aero to always obtain the number of fail bits necessary to accurately predict the near-optimal erase latency. Second, Aero aggressively yet safely reduces erase latency by leveraging a large reliability margin present in modern SSDs. We demonstrate the feasibility and reliability of Aero using 160 real 3D NAND flash chips, showing that it enhances SSD lifetime over the conventional erase scheme by 43% without change to existing NAND flash chips. Our system-level evaluation using eleven real-world workloads shows that an AERO-enabled SSD reduces read tail latency by 34% on average over a state-of-the-art technique.
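The control idea in the abstract (issue a short probe pulse, read the fail-bit count, then apply a just-long-enough erase pulse instead of the fixed worst-case latency) can be sketched as a small control loop. Everything below is schematic: FlashBlockSim, the linear fail-bit predictor, and all constants are illustrative assumptions, not values or interfaces from the paper.

```python
import random

class FlashBlockSim:
    """Toy stand-in for a flash block: each cell needs a different total
    erase time (ms) before it reads as erased."""
    def __init__(self, n_cells=4096):
        self.remaining = [random.uniform(0.5, 2.5) for _ in range(n_cells)]
    def erase_pulse(self, ms):
        self.remaining = [max(0.0, r - ms) for r in self.remaining]
    def count_fail_bits(self):
        return sum(1 for r in self.remaining if r > 0.0)

def adaptive_erase(block, probe_ms=1.0, worst_case_ms=3.5,
                   floor_ms=0.2, slope_ms_per_kfail=0.5):
    """Schematic AERO-style loop: probe, predict a latency from the fail-bit
    count, apply the shorter pulse, and fall back to the worst case if needed."""
    block.erase_pulse(probe_ms)                       # short initial pulse
    fails = block.count_fail_bits()                   # cells not yet erased
    predicted = min(worst_case_ms - probe_ms,
                    floor_ms + slope_ms_per_kfail * fails / 1000.0)
    block.erase_pulse(predicted)                      # just-long-enough pulse
    spent = probe_ms + predicted
    if block.count_fail_bits():                       # safety fallback
        block.erase_pulse(worst_case_ms - spent)
        spent = worst_case_ms
    return spent

random.seed(0)
print(f"{adaptive_erase(FlashBlockSim()):.2f} ms instead of a fixed 3.50 ms")
```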

DOI: 10.1145/3620666.3651341


AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines

作者: Jokar Jandaghi, Seyedali and Mahdaviani, Kaveh and Mirhosseini, Amirhossein and Elnikety, Sameh and Amza, Cristiana and Schroeder, Bianca
关键词: No keywords

Abstract

In an effort to increase the utilization of data center resources cloud providers have introduced a new type of virtual machine (VM) offering, called a burstable VM (BVM). Our work is the first to study the characteristics of burstable VMs (based on traces from production systems at a major cloud provider) and resource allocation approaches for BVM workloads. We propose new approaches for BVM resource allocation and use extensive simulations driven by field data to compare them with two baseline approaches used in practice. We find that traditional approaches based on using a fixed oversubscription ratio or based on the Central Limit Theorem do not work well for BVMs: They lead to either low utilization or high server capacity violation rates. Based on the lessons learned from our workload study, we develop a new approach to BVM scheduling, called Audible, using a non-parametric statistical model, which makes the approach light-weight and workload independent, and obviates the need for training machine learning models and for tuning their parameters. We show that Audible achieves high system utilization while being able to enforce stringent requirements on server capacity violations.
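The abstract describes AUDIBLE only as convolution-based and non-parametric. One plausible reading (an assumption on my part, not the paper's stated algorithm) is that per-VM empirical usage histograms are convolved to obtain the distribution of aggregate demand, and a placement is accepted only while the predicted probability of exceeding server capacity stays below a target. The sketch below shows that mechanism.

```python
import numpy as np

def usage_pmf(samples, max_units):
    """Empirical (non-parametric) PMF of a BVM's CPU usage, in integer units."""
    pmf = np.zeros(max_units + 1)
    for s in samples:
        pmf[min(int(round(s)), max_units)] += 1
    return pmf / pmf.sum()

def violation_probability(pmfs, server_capacity):
    """Convolve per-VM usage PMFs into the aggregate-demand distribution,
    then return P(aggregate demand > server capacity)."""
    agg = np.array([1.0])
    for pmf in pmfs:
        agg = np.convolve(agg, pmf)
    return agg[server_capacity + 1:].sum()

def admit(candidate_pmf, placed_pmfs, server_capacity, target=0.01):
    """Admit the candidate only if the predicted capacity-violation
    probability stays below the target (illustrative policy)."""
    return violation_probability(placed_pmfs + [candidate_pmf],
                                 server_capacity) <= target

# Example: bursty VMs that mostly use 1 unit but occasionally spike to 4.
rng = np.random.default_rng(0)
vm = usage_pmf(rng.choice([1, 1, 1, 4], size=1000), 8)
placed = [vm] * 6
print(violation_probability(placed, server_capacity=16))
print(admit(vm, placed, server_capacity=16))
```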

DOI: 10.1145/3620666.3651376


BeeZip: Towards An Organized and Scalable Architecture for Data Compression

作者: Gao, Ruihao and Li, Zhichun and Tan, Guangming and Li, Xueqi
关键词: No keywords

Abstract

Data compression plays a critical role in operating systems and large-scale computing workloads. Its primary objective is to reduce network bandwidth consumption and memory/storage capacity utilization. Given the need to manipulate hash tables, and execute matching operations on extensive data volumes, data compression software has transformed into a resource-intensive CPU task. To tackle this challenge, numerous prior studies have introduced hardware acceleration methods. For example, they have utilized Content-Addressable Memory (CAM) for string matches, incorporated redundant historical copies for each matching component, and so on. While these methods amplify the compression throughput, they often compromise an essential aspect of compression performance: the compression ratio (C.R.). Moreover, hardware accelerators face significant resource costs, especially in memory, when dealing with new large sliding window algorithms. We introduce BeeZip, the first hardware acceleration system designed explicitly for compression with a large sliding window. BeeZip tackles the hardware-level challenge of optimizing both compression ratio and throughput. BeeZip offers architectural support for compression algorithms with the following distinctive attributes: 1) A two-stage compression algorithm adapted for accelerator parallelism, decoupling hash parallelism and match execution dependencies; 2) An organized hash hardware accelerator named BeeHash engine enhanced with dynamic scheduling, which orchestrates hash processes with structured parallelism; 3) A hardware accelerator named HiveMatch engine for the match process, which employs a new scalable parallelism approach and a heterogeneous scale processing unit design to reduce memory resource overhead. Experimental results show that on the Silesia dataset, BeeZip achieves an optimal throughput of 10.42GB/s (C.R. 2.96) and the best C.R. of 3.14 (throughput of 5.95GB/s). Under similar compression ratios, compared to single-threaded/36-threaded software implementations, BeeZip offers accelerator speedups of 23.2×

DOI: 10.1145/3620666.3651323


Boost Linear Algebra Computation Performance via Efficient VNNI Utilization

作者: Zhou, Hao and Han, Qiukun and Shi, Heng and Zhang, Yalin and Yao, Jianguo
关键词: No keywords

Abstract

Intel’s Vector Neural Network Instruction (VNNI) provides higher efficiency on calculating dense linear algebra (DLA) computations than conventional SIMD instructions. However, existing auto-vectorizers frequently deliver suboptimal utilization of VNNI by either failing to recognize VNNI’s unique computation pattern at the innermost loops/basic blocks, or producing inferior code through constrained and rudimentary peephole optimizations/pattern matching techniques. Auto-tuning frameworks might generate proficient code but are hampered by the necessity for sophisticated pattern templates and extensive search processes. This paper introduces a novel compilation methodology that generates high-performance VNNI-enabled code. By leveraging DLA’s salient characteristics to identify VNNI’s utilization opportunities, it proceeds to pinpoint the most effective strategies for feeding VNNI’s inputs and hiding VNNI’s execution latency via efficient memory access, register/cache reuse, and exploited instruction-level parallelism. A tailored static cost analysis model guides this exploration, determining critical parameters for effective code transformation and generation. The evaluation on DLA and DNN workloads shows that our framework outperforms state-of-the-art industrial compilers and research works.

DOI: 10.1145/3620666.3651333


C4CAM: A Compiler for CAM-based In-memory Accelerators

作者: Farzaneh, Hamid and De Lima, Joao Paulo Cardoso and Li, Mengyuan and Khan, Asif Ali and Hu, Xiaobo Sharon and Castrillon, Jeronimo
关键词: No keywords

Abstract

Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von Neumann bottleneck. Platforms based on content-addressable memories (CAMs) are particularly interesting due to their efficient support for the search-based operations that form the foundation for many applications, including K-nearest neighbors (KNN), high-dimensional computing (HDC), recommender systems, and one-shot learning among others. Today, these platforms are designed by hand and can only be programmed with low-level code, accessible only to hardware experts. In this paper, we introduce C4CAM, the first compiler framework to quickly explore CAM configurations and seamlessly generate code from high-level Torch-Script code. C4CAM employs a hierarchy of abstractions that progressively lowers programs, allowing code transformations at the most suitable abstraction level. Depending on the type and technology, CAM arrays exhibit varying latencies and power profiles. Our framework allows analyzing the impact of such differences in terms of system-level performance and energy consumption, and thus supports designers in selecting appropriate designs for a given application.

DOI: 10.1145/3620666.3651386


Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning

作者: Chen, Chang and Li, Xiuhong and Zhu, Qianchao and Duan, Jiangfei and Sun, Peng and Zhang, Xingcheng and Yang, Chao
关键词: No keywords

Abstract

Efficiently training large language models (LLMs) necessitates the adoption of hybrid parallel methods, integrating multiple communications collectives within distributed partitioned graphs. Overcoming communication bottlenecks is crucial and is often achieved through communication and computation overlaps. However, existing overlap methodologies tend to lean towards either fine-grained kernel fusion or limited operation scheduling, constraining performance optimization in heterogeneous training environments. This paper introduces Centauri, an innovative framework that encompasses comprehensive communication partitioning and hierarchical scheduling schemes for optimized overlap. We propose a partition space comprising three inherent abstraction dimensions: primitive substitution, topology-aware group partitioning, and workload partitioning. These dimensions collectively create a comprehensive optimization space for efficient overlap. To determine the efficient overlap of communication and computation operators, we decompose the scheduling tasks in hybrid parallel training into three hierarchical tiers: operation, layer, and model. Through these techniques, Centauri effectively overlaps communication latency and enhances hardware utilization. Evaluation results demonstrate that Centauri achieves up to 1.49×

DOI: 10.1145/3620666.3651379


Characterizing a Memory Allocator at Warehouse Scale

作者: Zhou, Zhuangzhuang and Gogte, Vaibhav and Vaish, Nilay and Kennelly, Chris and Xia, Patrick and Kanev, Svilen and Moseley, Tipp and Delimitrou, Christina and Ranganathan, Parthasarathy
关键词: datacenter, warehouse-scale computing, memory allocator, memory management

Abstract

Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings. We present the first comprehensive characterization study of TCMalloc, a memory allocator used by warehouse-scale applications in Google’s production fleet. Our characterization reveals a profound diversity in the memory allocation patterns, allocated object sizes and lifetimes, for large-scale datacenter workloads, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we optimize TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, leveraging hardware topology to mitigate inter-core communication overhead, and improving allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in our production fleet, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. For the applications with the highest memory allocation usage, we observe up to 8.1% and 6.3% improvement in throughput and memory usage respectively. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs.

DOI: 10.1145/3620666.3651350


Characterizing Power Management Opportunities for LLMs in the Cloud

作者: Patel, Pratyush and Choukse, Esha and Zhang, Chaojie and Goiri, Í
关键词: large language models, power usage, cloud, datacenters, GPUs, power oversubscription, profiling

Abstract

Recent innovations in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support these new workloads. A key bottleneck resource in datacenters is power, which LLMs are quickly saturating due to their rapidly increasing model sizes. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the training and inference power consumption patterns. Based on our analysis, we claim that the average and peak power utilization in LLM inference clusters should not be very high. Our deductions align with data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment make it challenging to build a reliable and robust power management framework. We leverage the insights from our characterization to identify opportunities for better power management. As a detailed use case, we propose a new framework called POLCA, which enables power oversubscription in LLM inference clouds. POLCA is robust, reliable, and readily deployable. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in existing clusters with minimal performance loss.

DOI: 10.1145/3620666.3651329


CSSTs: A Dynamic Data Structure for Partial Orders in Concurrent Execution Analysis

作者: Tunç
关键词: concurrency, happens-before, vector clocks, dynamic concurrency analyses, dynamic reachability

Abstract

Dynamic analyses are a standard approach to analyzing and testing concurrent programs. Such techniques observe program traces σ and analyze them to infer the presence or absence of bugs. At its core, each analysis maintains a partial order P that represents order dependencies between the events of σ. Naturally, the scalability of the analysis largely depends on maintaining P efficiently. The standard data structure for this task has thus far been Vector Clocks. These, however, are slow for analyses that follow a non-streaming style, costing O(n) time for inserting (and propagating) each new ordering in P, where n is the size of σ, while they cannot handle the deletion of existing orderings. In this paper we develop Collective Sparse Segment Trees (CSSTs), a simple but elegant data structure for maintaining a partial order P. CSSTs thrive when the width k of P is much smaller than the size n of its domain, allowing inserting, deleting, and querying for orderings in P to run in O(log n) time. For a concurrent trace, k normally equals the number of its threads, and is orders of magnitude smaller than its size n, making CSSTs fitting for this setting. Our experiments confirm that CSSTs are the best data structure currently to handle a range of dynamic analyses from existing literature.
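For readers unfamiliar with how such a structure is used, the sketch below shows the three operations a dynamic partial-order structure exposes to a concurrency analysis: insert an ordering, delete one, and query whether two events are ordered. The naive BFS implementation is only an interface illustration; it does not achieve the O(log n) bounds that CSSTs provide.

```python
from collections import defaultdict, deque

class NaivePartialOrder:
    """Events are (thread, index) pairs; program order within a thread is
    implicit. Queries here run a BFS, so this is NOT the CSST structure,
    just a reference for the operations it must support."""

    def __init__(self, events_per_thread):
        self.n = dict(events_per_thread)   # thread -> number of events in the trace
        self.succ = defaultdict(set)       # explicit cross-thread orderings

    def insert(self, a, b):                # record a -> b ("a happens before b")
        self.succ[a].add(b)

    def delete(self, a, b):                # supported by CSSTs, not by vector clocks
        self.succ[a].discard(b)

    def ordered(self, a, b):
        """Is a ordered before b under program order plus inserted orderings?"""
        seen, work = {a}, deque([a])
        while work:
            t, i = work.popleft()
            if t == b[0] and i <= b[1]:
                return True
            step = [(t, i + 1)] if i + 1 < self.n[t] else []
            for e in step + list(self.succ[(t, i)]):
                if e not in seen:
                    seen.add(e)
                    work.append(e)
        return False

po = NaivePartialOrder({"T1": 3, "T2": 3})
po.insert(("T1", 1), ("T2", 0))            # e.g. a release/acquire edge
print(po.ordered(("T1", 0), ("T2", 2)))    # True: program order plus the edge
print(po.ordered(("T2", 0), ("T1", 2)))    # False
po.delete(("T1", 1), ("T2", 0))
print(po.ordered(("T1", 0), ("T2", 2)))    # False after the deletion
```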

DOI: 10.1145/3620666.3651358


Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations

作者: Ma, Dongning and Lin, Fred and Desmaison, Alban and Coburn, Joel and Moore, Daniel and Sankar, Sriram and Jiao, Xun
关键词: No keywords

Abstract

Deep neural networks (DNNs) have been widely adopted in various safety-critical applications such as computer vision and autonomous driving. However, as technology scales and applications diversify, coupled with the increasing heterogeneity of underlying hardware architectures, silent data corruption (SDC) has been emerging as a pronounced threat to the reliability of DNNs. Recent reports from industry hyperscalers underscore the difficulty in addressing SDC due to its “stealthy” nature and elusive manifestation. In this paper, we propose Dr. DNA, a novel approach to enhance the reliability of DNN systems by detecting and mitigating SDCs. Specifically, we formulate and extract a set of unique SDC signatures from the Distribution of Neuron Activations (DNA), based on which we propose early-stage detection and mitigation of SDCs during DNN inference. We perform an extensive evaluation across 3 vision tasks, 5 different datasets, and 10 different models, under 4 different error models. Results show that Dr. DNA achieves a 100% SDC detection rate for most cases, a 95% detection rate on average, and a >90% detection rate across all cases, representing a 20% - 70% improvement over baselines. Dr. DNA can also mitigate the impact of SDCs by effectively recovering DNN model performance with <1% memory overhead and <2.5% latency overhead.
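
As a rough illustration of detection based on activation distributions, here is a minimal sketch assuming a very simple signature (the fraction of activations outside calibrated quantile bounds); the thresholds, quantiles, and synthetic data are made up and this is not the paper's exact formulation.

```python
# Illustrative detector in the spirit of activation-distribution signatures:
# calibrate per-layer activation ranges on clean inputs, then flag an inference
# as suspicious when too many activations fall outside the calibrated range.

import numpy as np

def calibrate(clean_activations, quantile=0.999):
    """clean_activations: list of 1-D arrays, one per clean input."""
    flat = np.concatenate(clean_activations)
    lo, hi = np.quantile(flat, [1 - quantile, quantile])
    return lo, hi

def suspicious(activations, bounds, tolerance=0.01):
    lo, hi = bounds
    outside = np.mean((activations < lo) | (activations > hi))
    return outside > tolerance      # signature deviates -> possible SDC

rng = np.random.default_rng(0)
clean = [rng.normal(0.0, 1.0, 4096) for _ in range(32)]
bounds = calibrate(clean)

healthy = rng.normal(0.0, 1.0, 4096)
corrupted = healthy.copy()
corrupted[::64] = 1e6               # a bit flip blowing up some activations
print(suspicious(healthy, bounds))      # False
print(suspicious(corrupted, bounds))    # True
```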

DOI: 10.1145/3620666.3651349


DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores

作者: Fan, Ruibo and Wang, Wei and Chu, Xiaowen
关键词: sparse matrix-matrix multiplication, SpMM, unstructured sparsity, GPU, tensor core

Abstract

Sparse Matrix-Matrix Multiplication (SpMM) is a building-block operation in scientific computing and machine learning applications. Recent advancements in hardware, notably Tensor Cores (TCs), have created promising opportunities for accelerating SpMM. However, harnessing these hardware accelerators to speed up general SpMM necessitates considerable effort. In this paper, we undertake a comprehensive analysis of the state-of-the-art techniques for accelerating TC-based SpMM and identify crucial performance gaps. Drawing upon these insights, we propose DTC-SpMM, a novel approach with systematic optimizations tailored for accelerating general SpMM on TCs. DTC-SpMM encapsulates diverse aspects, including efficient compression formats, reordering methods, and runtime pipeline optimizations. Our extensive experiments on modern GPUs with a diverse range of benchmark matrices demonstrate remarkable performance improvements in SpMM acceleration by TCs in conjunction with our proposed optimizations. The case study also shows that DTC-SpMM speeds up end-to-end GNN training by up to 1.91×.

DOI: 10.1145/3620666.3651378


Energy-Adaptive Buffering for Efficient, Responsive, and Persistent Batteryless Systems

作者: Williams, Harrison and Hicks, Matthew
关键词: No keywords

Abstract

Batteryless energy harvesting systems enable a wide array of new sensing, computation, and communication platforms untethered by power delivery or battery maintenance demands. Energy harvesters charge a buffer capacitor from an unreliable environmental source until enough energy is stored to guarantee a burst of operation despite changes in power input. Current platforms use a fixed-size buffer chosen at design time to meet constraints on charge time or application longevity, but static energy buffers are a poor fit for the highly volatile power sources found in real-world deployments: fixed buffers waste energy both as heat when they reach capacity during a power surplus and as leakage when they fail to charge the system during a power deficit. To maximize batteryless system performance in the face of highly dynamic input power, we propose REACT: a responsive buffering circuit which varies total capacitance according to net input power. REACT uses a variable capacitor bank to expand capacitance to capture incoming energy during a power surplus and reconfigures internal capacitors to reclaim additional energy from each capacitor as power input falls. Compared to fixed-capacity systems, REACT captures more energy, maximizes usable energy, and efficiently decouples system voltage from stored charge—enabling low-power and high-performance designs previously limited by ambient power. Our evaluation on real-world platforms shows that REACT eliminates the tradeoff between responsiveness, efficiency, and longevity, increasing the energy available for useful work by an average of 25.6% over static buffers optimized for reactivity and capacity, improving event responsiveness by an average of 7.7x without sacrificing capacity, and enabling programmer-directed longevity guarantees.

DOI: 10.1145/3620666.3651370


Enforcing C/C++ Type and Scope at Runtime for Control-Flow and Data-Flow Integrity

作者: Ismail, Mohannad and Jelesnianski, Christopher and Jang, Yeongjin and Min, Changwoo and Xiong, Wenjie
关键词: No keywords

Abstract

Control-flow hijacking and data-oriented attacks are becoming more sophisticated. These attacks, especially data-oriented attacks, can result in critical security threats, such as leaking an SSL key. Data-oriented attacks are hard to defend against with acceptable performance due to the sheer number of data pointers present. The root cause of such attacks is using pointers in unintended ways; fundamentally, these attacks rely on abusing pointers to violate the original scope they were used in or the original types that they were declared as. This paper proposes Scope Type Integrity (STI), a new defense policy that enforces that all pointers (both code and data pointers) conform to the original programmer’s intent, as well as Runtime Scope Type Integrity (RSTI) mechanisms to enforce STI at runtime leveraging ARM Pointer Authentication. STI gathers information about the scope, type, and permissions of pointers. This information is then leveraged by RSTI to ensure pointers are legitimately utilized at runtime. We implemented three defense mechanisms of RSTI, with varying levels of security and performance tradeoffs to showcase the versatility of RSTI. We employ these three variants on a variety of benchmarks and real-world applications for a full security and performance evaluation of these mechanisms. Our results show that they have overheads of 5.29%, 2.97%, and 11.12%, respectively.

DOI: 10.1145/3620666.3651342


EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree

作者: Chen, Zhaodong and Kerr, Andrew and Cai, Richard and Kosaian, Jack and Wu, Haicheng and Ding, Yufei and Xie, Yuan
关键词: No keywords

Abstract

As deep learning models become increasingly complex, deep learning compilers are critical for enhancing system efficiency and unlocking hidden optimization opportunities. Although excellent speedups have been achieved in inference workloads, existing compilers face significant limitations in training. First, the training computation graph involves intricate operations that are challenging to fuse, such as normalization, loss functions, and reductions, which limits optimization opportunities like kernel fusion. Second, the training graph’s additional edges connecting forward and backward operators pose challenges in finding optimal and feasible partitions for kernel fusion. More importantly, existing compilers can neither generate kernels with state-of-the-art performance on modern GPUs nor accommodate diverse fusion patterns. In this paper, we introduce Epilogue Visitor Tree (EVT), a novel compiler that overcomes these limitations. EVT employs novel graph-level compilation passes to unlock hidden fusion and optimization opportunities. It also incorporates a novel integer linear programming-based partitioner that efficiently solves for optimal and feasible partitions in complex joint forward-backward graphs. Moreover, we present the Epilogue Visitor Abstraction and introduce the EVT operator compiler that automatically generates flexible epilogues that can be integrated with high-performance main loop implementations from CUTLASS and other SOTA libraries. EVT is evaluated on diverse training workloads across domains and achieves 1.26×–3.1× speedups.

DOI: 10.1145/3620666.3651369


Explainable Port Mapping Inference with Sparse Performance Counters for AMD’s Zen Architectures

作者: Ritter, Fabian and Hack, Sebastian
关键词: No keywords

Abstract

Performance models are instrumental for optimizing performance-sensitive code. When modeling the use of functional units of out-of-order x86-64 CPUs, data availability varies by manufacturer: instruction-to-port mappings for Intel’s processors are available, whereas information for AMD’s designs is lacking. The reason for this disparity is that standard techniques to infer exact port mappings require hardware performance counters that AMD does not provide. In this work, we modify the port mapping inference algorithm of the widely used uops.info project to not rely on Intel’s performance counters. The modifications are based on a formal port mapping model with a counter-example-guided algorithm powered by an SMT solver. We investigate to what extent AMD’s processors comply with this model and where unexpected performance characteristics prevent an accurate port mapping. Our results provide valuable insights for creators of CPU performance models as well as for software developers who want to achieve peak performance on recent AMD CPUs.

DOI: 10.1145/3620666.3651363


FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture

作者: Xu, Chuhao and Liu, Yiyu and Li, Zijun and Chen, Quan and Zhao, Han and Zeng, Deze and Peng, Qian and Wu, Xueqi and Zhao, Haifeng and Fu, Senbo and Guo, Minyi
关键词: FaaS, serverless computing, memory pool architecture, memory offloading

Abstract

In serverless computing, an idle container is not recycled directly, in order to mitigate time-consuming cold container startups. These idle containers still occupy memory, exacerbating the memory shortage of today’s data centers. Offloading their cold memory to a remote memory pool could potentially resolve this problem. However, existing offloading policies either hurt the Quality of Service (QoS) or are too coarse-grained for serverless computing scenarios. We therefore propose FaaSMem, a dedicated memory offloading mechanism tailored for serverless computing with memory pool architecture. It builds on our finding that the memory of a serverless container allocated in different stages has different usage patterns. Specifically, FaaSMem proposes Page Bucket (Pucket) to segregate the memory pages in different segments, and applies segment-wise offloading policies for them. FaaSMem also proposes a semi-warm period during the keep-alive stage, to seek a sweet spot between the offloading effort and the remote access penalty. Experimental results show that FaaSMem reduces the average local memory footprint by 9.9% - 79.8% and improves the container deployment density to 108% - 218%, with negligible 95%-ile latency increase.

DOI: 10.1145/3620666.3651355


FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning

作者: Zhong, Kai and Zhu, Zhenhua and Dai, Guohao and Wang, Hongyi and Yang, Xinhao and Zhang, Haoyu and Si, Jin and Mao, Qiuli and Zeng, Shulin and Hong, Ke and Zhang, Genghan and Yang, Huazhong and Wang, Yu
关键词: sparse computation, tensor algebra, instruction set architecture, hardware accelerator

Abstract

Recently, sparse tensor algebra (SpTA) has played an increasingly important role in machine learning. However, due to the unstructured sparsity of SpTA, general-purpose processors (e.g., GPUs and CPUs) are inefficient because of underutilized hardware resources. Sparse kernel accelerators are optimized for specific tasks, but their dedicated processing units and data paths cannot effectively support other SpTA tasks with different dataflow and various sparsity, resulting in performance degradation. This paper proposes FEASTA, a Flexible and Efficient Accelerator for Sparse Tensor Algebra. To process general SpTA tasks with various sparsity efficiently, we design FEASTA meticulously at three levels. At the dataflow abstraction level, we apply Einstein summation on the sparse fiber tree data structure to model the unified execution flow of general SpTA as joining and merging the fiber tree. At the instruction set architecture (ISA) level, a general SpTA ISA is proposed based on the execution flow. It includes different types of instructions for dense and sparse data, achieving flexibility and efficiency at the instruction level. At the architecture level, an instruction-driven architecture consisting of configurable and high-performance function units is designed, supporting the flexible and efficient ISA. Evaluations show that FEASTA achieves a 5.40× speedup.
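
The "join" of two fibers that underlies this execution model can be shown with a tiny software sketch: intersecting two sorted coordinate lists, as in a sparse-sparse dot product. This is only an illustration of the operation FEASTA's ISA and hardware accelerate, not the accelerator itself.

```python
# Minimal sketch of the "join" step in sparse tensor algebra on fiber trees:
# intersect two sorted (coordinate, value) fibers, as in a sparse-sparse dot
# product. Software illustration only.

def join_fibers(a, b):
    """a, b: lists of (coord, value) sorted by coord. Yields matched triples."""
    i = j = 0
    while i < len(a) and j < len(b):
        ca, va = a[i]
        cb, vb = b[j]
        if ca == cb:
            yield ca, va, vb
            i += 1
            j += 1
        elif ca < cb:
            i += 1
        else:
            j += 1

def sparse_dot(a, b):
    return sum(va * vb for _, va, vb in join_fibers(a, b))

x = [(1, 2.0), (4, 1.0), (7, 3.0)]
y = [(0, 5.0), (4, 2.0), (7, 1.0)]
print(sparse_dot(x, y))   # 1.0*2.0 + 3.0*1.0 = 5.0
```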

DOI: 10.1145/3620666.3651336


Felix: Optimizing Tensor Programs with Gradient Descent

作者: Zhao, Yifan and Sharif, Hashim and Adve, Vikram and Misailovic, Sasa
关键词: No keywords

Abstract

Obtaining high-performance implementations of tensor programs such as deep neural networks on a wide range of hardware remains a challenging task. Search-based tensor program optimizers can automatically find high-performance programs on a given hardware platform, but the search process in existing tools suffers from low efficiency, requiring hours or days to discover good programs due to the size of the search space. We present Felix, a novel gradient-based compiler optimization framework for tensor-based programs. Felix creates a differentiable space of tensor programs that is amenable to search by gradient descent. Felix applies continuous relaxation to the space of programs and creates a differentiable estimator of program latency, allowing efficient search of program candidates using gradient descent, in contrast to conventional approaches that search a discrete space under a non-differentiable objective function. We perform an extensive evaluation on six deep neural networks for vision and natural language processing tasks on three GPU-based platforms. Our experiments show that Felix surpasses the performance of off-the-shelf inference frameworks - PyTorch, TensorFlow, and TensorRT - within 7 minutes of search time on average. Felix also finds optimized programs significantly faster than TVM Ansor, a state-of-the-art search-based optimizer for tensor programs.
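
A toy version of the continuous-relaxation idea is sketched below: treat a discrete knob (a tile size) as a continuous variable, descend on a smooth latency surrogate, then snap back to a legal discrete value. The cost model, coefficients, and candidate tile sizes are made up and bear no relation to Felix's actual latency estimator or search space.

```python
# Toy illustration of gradient-based tuning over a relaxed discrete knob.
# latency(t) = a/t + b*t is a purely hypothetical compute-vs-memory tradeoff.

import math

a, b = 1024.0, 0.25                     # hypothetical cost coefficients

def latency(t):
    return a / t + b * t                # smooth surrogate for program latency

u = math.log(4.0)                       # optimize log(tile) so the tile stays positive
lr = 0.01
for _ in range(300):
    t = math.exp(u)
    dlatency_du = -a / t + b * t        # chain rule: d latency / d log(t)
    u -= lr * dlatency_du

best_continuous = math.exp(u)           # converges near sqrt(a/b) = 64
legal = min([8, 16, 32, 64, 128], key=lambda s: abs(s - best_continuous))
print(round(best_continuous, 2), legal, latency(legal))
```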

DOI: 10.1145/3620666.3651348


Fermihedral: On the Optimal Compilation for Fermion-to-Qubit Encoding

作者: Liu, Yuhao and Che, Shize and Zhou, Junyu and Shi, Yunong and Li, Gushu
关键词: quantum computing, fermion-to-qubit encoding, formal methods, boolean satisfiability

Abstract

This paper introduces Fermihedral, a compiler framework focused on discovering the optimal Fermion-to-qubit encoding for targeted Fermionic Hamiltonians. Fermion-to-qubit encoding is a crucial step in harnessing quantum computing for efficient simulation of Fermionic quantum systems. Utilizing Pauli algebra, Fermihedral redefines the complex constraints and objectives of Fermion-to-qubit encoding as a Boolean Satisfiability problem, which can then be solved with high-performance solvers. To accommodate larger-scale scenarios, this paper proposes two new strategies that yield approximately optimal solutions, mitigating the overhead from the exponentially large number of clauses. Evaluation across diverse Fermionic systems highlights the superiority of Fermihedral, showcasing substantial reductions in implementation costs, gate counts, and circuit depth in the compiled circuits. Real-system experiments on IonQ’s device affirm its effectiveness, notably enhancing simulation accuracy.

DOI: 10.1145/3620666.3651371


Flexible Non-intrusive Dynamic Instrumentation for WebAssembly

作者: Titzer, Ben L. and Gilbert, Elizabeth and Teo, Bradley Wei Jie and Anand, Yash and Takayama, Kazuyuki and Miller, Heather
关键词: No keywords

Abstract

A key strength of managed runtimes over hardware is the ability to gain detailed insight into the dynamic execution of programs with instrumentation. Analyses such as code coverage, execution frequency, tracing, and debugging are all made easier in a virtual setting. As a portable, low-level bytecode, WebAssembly offers inexpensive in-process sandboxing with high performance. Yet to date, Wasm engines have not offered much insight into executing programs, supporting at best bytecode-level stepping and basic source maps, but no instrumentation capabilities. In this paper, we present the first non-intrusive dynamic instrumentation system for WebAssembly in the open-source Wizard Research Engine. Our innovative design offers a flexible, complete hierarchy of instrumentation primitives that support building high-level, complex analyses in terms of low-level, programmable probes. In contrast to emulation or machine code instrumentation, injecting probes at the bytecode level increases expressiveness and vastly simplifies the implementation by reusing the engine’s JIT compiler, interpreter, and deoptimization mechanism rather than building new ones. Wizard supports both dynamic instrumentation insertion and removal while providing consistency guarantees, which is key to composing multiple analyses without interference. We detail a fully-featured implementation in a high-performance multi-tier Wasm engine, show novel optimizations specifically designed to minimize instrumentation overhead, and evaluate performance characteristics under load from various analyses. This design is well-suited for production engine adoption as probes can be implemented to have no impact on production performance when not in use.

DOI: 10.1145/3620666.3651338


Amanda: Unified Instrumentation Framework for Deep Neural Networks

作者: Guan, Yue and Qiu, Yuxian and Leng, Jingwen and Yang, Fan and Yu, Shuo and Liu, Yunxin and Feng, Yu and Zhu, Yuhao and Zhou, Lidong and Liang, Yun and Zhang, Chen and Li, Chao and Guo, Minyi
关键词: deep neural network, instrumentation

Abstract

The success of deep neural networks (DNNs) has sparked efforts to analyze (e.g., tracing) and optimize (e.g., pruning) them. These tasks have specific requirements and ad-hoc implementations in current execution backends like TensorFlow/PyTorch, which require developers to manage fragmented interfaces and adapt their code to diverse models. In this study, we propose a new framework called Amanda to streamline the development of these tasks. We formalize the implementation of these tasks as neural network instrumentation, which involves introducing instrumentation at the operator level of DNNs. This allows us to abstract DNN analysis and optimization tasks as instrumentation tools on various DNN models. We build Amanda with two levels of APIs to achieve a unified, extensible, and efficient instrumentation design. The user-level API provides a unified operator-grained instrumentation API for different backends. Meanwhile, internally, we design a set of callback-centric APIs for managing and optimizing the execution of original and instrumentation code in different backends. Through these design principles, the Amanda framework can accommodate a broad spectrum of use cases, such as tracing, profiling, pruning, and quantization, across different backends (e.g., TensorFlow/PyTorch) and execution modes (graph/eager mode). Moreover, our efficient execution management ensures that the performance overhead is typically kept within 5%.

DOI: 10.1145/3617232.3624864


Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors

作者: Thomas, Samuel and Bornholt, James
关键词: vectorization, DSPs, equality saturation, rewrite rules, program synthesis

Abstract

Embedded applications extract the best power-performance trade-off from digital signal processors (DSPs) by making extensive use of vectorized execution. Rather than handwriting the many customized kernels these applications use, DSP engineers rely on auto-vectorizing compilers to quickly produce effective code. Building these compilers is a large and error-prone investment, and each new DSP architecture or application-specific ISA customization must repeat this effort to derive a new high-performance compiler. We present Isaria, a framework for automatically generating vectorizing compilers for DSP architectures. Isaria uses equality saturation to search for vectorized DSP code using a system of rewrite rules. Rather than hand-crafting these rules, Isaria automatically synthesizes sound rewrite rules from an ISA specification, discovers phase structure within these rules that improves compilation performance, and schedules their application at compile time while pruning intermediate states of the search. We use Isaria to generate a compiler for an industrial DSP architecture, and show that the resulting kernels outperform existing DSP libraries by up to 6.9×.

DOI: 10.1145/3617232.3624873


BypassD: Enabling fast userspace access to shared SSDs

作者: Yadalam, Sujay and Alverti, Chloe and Karakostas, Vasileios and Gandhi, Jayneel and Swift, Michael
关键词: I/O performance, low latency, direct access, sharing, userspace, SSD, storage systems

Abstract

Modern storage devices, such as Optane NVMe SSDs, offer ultra-low latency of a few microseconds and high bandwidth of multiple gigabytes per second. At these speeds, the kernel software I/O stack is a substantial source of overhead. Userspace approaches avoid kernel software overheads but face challenges in supporting shared storage without major changes to file systems, the OS, or the hardware. We propose a new I/O architecture, BypassD, for fast, userspace access to shared storage devices. BypassD takes inspiration from virtual memory: it uses virtual addresses to access a device and relies on hardware for translation and protection. Like memory-mapping a file, the OS kernel constructs a mapping for file contents in the page table. Userspace I/O requests then use virtual addresses from these mappings to specify which file and file offset to access. BypassD extends the IOMMU hardware to translate file offsets into device Logical Block Addresses. Existing applications require no modifications to use BypassD. Our evaluation shows that BypassD reduces latency for 4KB accesses by 42% compared to the standard Linux kernel and performs close to userspace techniques like SPDK that do not support device sharing. By eliminating software overheads, BypassD improves performance of real workloads, such as the WiredTiger storage engine, by ~20%.

DOI: 10.1145/3617232.3624854


CC-NIC: a Cache-Coherent Interface to the NIC

作者: Schuh, Henry N. and Krishnamurthy, Arvind and Culler, David and Levy, Henry M. and Rizzo, Luigi and Khan, Samira and Stephens, Brent E.
关键词: No keywords

Abstract

Emerging interconnects make peripherals, such as the network interface controller (NIC), accessible through the processor’s cache hierarchy, allowing these devices to participate in the CPU cache coherence protocol. This is a fundamental change from the separate I/O data paths and read-write transaction primitives of today’s PCIe NICs. Our experiments show that the I/O data path characteristics cause NICs to prioritize CPU efficiency at the expense of inflated latency, an issue that can be mitigated by the emerging low-latency coherent interconnects. But, the coherence abstraction is not suited to current host-NIC access patterns. Applying existing signaling mechanisms and data structure layouts in a cache-coherent setting results in extraneous communication and cache retention, limiting performance. Redesigning the interface is necessary to minimize overheads and benefit from the new interactions coherence enables. This work contributes CC-NIC, a host-NIC interface design for coherent interconnects. We model CC-NIC using Intel’s Ice Lake and Sapphire Rapids UPI interconnects, demonstrating the potential of optimizing for coherence. Our results show a maximum packet rate of 1.5Gpps and 980Gbps packet throughput. CC-NIC has 77% lower minimum latency, and 88% lower at 80% load, than today’s PCIe NICs. We also demonstrate application-level core savings. Finally, we show that CC-NIC’s benefits hold across a range of interconnect performance characteristics.

DOI: 10.1145/3617232.3624868


Cocco: Hardware-Mapping Co-Exploration towards Memory Capacity-Communication Optimization

作者: Tan, Zhanhong and Zhu, Zijian and Ma, Kaisheng
关键词: design space exploration, memory, graph analysis, subgraph, genetic algorithm, deep learning accelerator

Abstract

Memory is a critical design consideration in current data-intensive DNN accelerators, as it profoundly determines energy consumption, bandwidth requirements, and area costs. As DNN structures become more complex, a larger on-chip memory capacity is required to reduce data movement overhead, but at the expense of silicon costs. Some previous works have proposed memory-oriented optimizations, such as different data reuse and layer fusion schemes. However, these methods are not general and potent enough to cope with various graph structures. In this paper, we explore the intrinsic connection between network structures and memory features to optimize both hardware and mapping. First, we introduce a graph-level execution scheme with a corresponding dataflow and memory management method. This scheme enables the execution of arbitrary graph patterns with high data reuse and low hardware overhead. Subsequently, we propose Cocco, a hardware-mapping co-exploration framework leveraging graph-level features of networks. It aims to minimize communication overhead, such as energy consumption and bandwidth requirements, with a smaller memory capacity. We formulate the graph-partition scheduling and memory configuration search as an optimization problem and employ a genetic-based method to achieve efficient co-exploration for large and irregular networks. Experiments demonstrate that Cocco obtains lower external memory access, lower bandwidth requirements, and more stable optimization for graph partition compared to the greedy algorithm and dynamic programming introduced in prior works. Cocco also reduces the costs by 1.89% to 50.33% using co-exploration compared to other typical methods.

DOI: 10.1145/3617232.3624865


CodeCrunch: Improving Serverless Performance via Function Compression and Cost-Aware Warmup Location Optimization

作者: Basu Roy, Rohan and Patel, Tirthak and Garg, Rohan and Tiwari, Devesh
关键词: serverless computing, function compression

Abstract

Serverless computing has a critical problem of function cold starts. To minimize cold starts, state-of-the-art techniques predict function invocation times to warm them up. Warmed-up functions occupy space in memory and incur a keep-alive cost, which can become exceedingly prohibitive under bursty load. To address this issue, we design CodeCrunch, which introduces the concept of serverless function compression and exploits server heterogeneity to make serverless computing more efficient, especially under high memory pressure.

DOI: 10.1145/3617232.3624866


CrossPrefetch: Accelerating I/O Prefetching for Modern Storage

作者: Garg, Shaleen and Zhang, Jian and Pitchumani, Rekha and Parashar, Manish and Xie, Bing and Kannan, Sudarsun
关键词: No keywords

Abstract

We introduce CrossPrefetch, a novel cross-layered I/O prefetching mechanism that operates across the OS and a user-level runtime to achieve optimal performance. Existing OS prefetching mechanisms suffer from rigid interfaces that do not provide information to applications on prefetch effectiveness, suffer from high concurrency bottlenecks, and are inefficient in utilizing available system memory. CrossPrefetch addresses these limitations by dividing responsibilities between the OS and the runtime, minimizing overhead, and achieving fewer cache misses, lower lock contention, and higher I/O performance. CrossPrefetch tackles the limitations of rigid OS prefetching interfaces by maintaining and exporting cache state and prefetch effectiveness to user-level runtimes. It also addresses scalability and concurrency bottlenecks by distinguishing between regular I/O and prefetch operation paths and introduces fine-grained prefetch indexing for shared files. Finally, CrossPrefetch designs low-interference access pattern prediction combined with support for adaptive and aggressive techniques to exploit memory capacity and storage bandwidth. Our evaluation of CrossPrefetch, encompassing microbenchmarks, macrobenchmarks, and real-world workloads, illustrates performance gains of 1.22x-3.7x in I/O throughput. We also evaluate CrossPrefetch across different file systems and local and remote storage configurations.

DOI: 10.1145/3617232.3624872


EagleEye: Nanosatellite constellation design for high-coverage, high-resolution sensing

作者: Cheng, Zhuo and Denby, Bradley and McCleary, Kyle and Lucia, Brandon
关键词: orbital edge computing, nanosatellites, constellation design

Abstract

Advances in nanosatellite technology and low launch costs have led to more Earth-observation satellites in low-Earth orbit. Prior work shows that satellite images are useful for geospatial analysis applications (e.g., ship detection, lake monitoring, and oil tank volume estimation). To maximize its value, a satellite constellation should achieve high coverage and provide high-resolution images of the targets. Existing homogeneous constellation designs cannot meet both requirements: a constellation with low-resolution cameras provides high coverage but only delivers low-resolution images; a constellation with high-resolution cameras images smaller geographic areas. We develop EagleEye, a novel mixed-resolution, leader-follower constellation design. The leader satellite has a low-resolution, high-coverage camera to detect targets with onboard image processing. The follower satellite(s), equipped with a high-resolution camera, receive commands from the leader and take high-resolution images of the targets. The leader must consider actuation time constraints when scheduling follower target acquisitions. Additionally, the leader must complete both target detection and follower scheduling in a limited time. We propose an ILP-based algorithm to schedule follower satellite target acquisition, based on positions identified by a leader satellite. We evaluate on four datasets and show that EagleEye achieves 11–194% more coverage compared to existing solutions.

DOI: 10.1145/3617232.3624851


Everywhere All at Once: Co-Location Attacks on Public Cloud FaaS

作者: Zhao, Zirui Neil and Morrison, Adam and Fletcher, Christopher W. and Torrellas, Josep
关键词: cloud computing, function-as-a-service (FaaS), co-location vulnerability, timestamp counter

Abstract

Microarchitectural side-channel attacks exploit shared hardware resources, posing significant threats to modern systems. A pivotal step in these attacks is achieving physical host co-location between attacker and victim. This step is especially challenging in public cloud environments due to the widespread adoption of the virtual private cloud (VPC) and the ever-growing size of data centers. Furthermore, the shift towards Function-as-a-Service (FaaS) environments, characterized by dynamic function instance placements and limited control for attackers, compounds this challenge. In this paper, we present the first comprehensive study of the risks of and techniques for co-location attacks in public cloud FaaS environments. We develop two physical host fingerprinting techniques and propose a new, inexpensive methodology for large-scale instance co-location verification. Using these techniques, we analyze how Google Cloud Run places function instances on physical hosts and identify exploitable placement behaviors. Leveraging our findings, we devise an effective strategy for instance launching that achieves 100% probability of co-locating the attacker with at least one victim instance. Moreover, the attacker co-locates with 61%–100% of victim instances in three major Cloud Run data centers.

DOI: 10.1145/3617232.3624867


Expanding Datacenter Capacity with DVFS Boosting: A safe and scalable deployment experience

作者: Piga, Leonardo and Narayanan, Iyswarya and Sundarrajan, Aditya and Skach, Matt and Deng, Qingyuan and Maity, Biswadip and Chakkaravarthy, Manoj and Huang, Alison and Dhanotia, Abhishek and Malani, Parth
关键词: No keywords

Abstract

The COVID-19 pandemic created unexpected demand for our physical infrastructure. We increased our computing supply by growing our infrastructure footprint as well as by expanding existing capacity using various techniques, among them DVFS boosting. This paper describes our experience in deploying DVFS boosting to expand capacity. There are several challenges in deploying DVFS boosting at scale. First, frequency scaling incurs additional power demand, which can exacerbate power over-subscription and incur unexpected capacity loss for the services due to power capping. Second, heterogeneity is commonplace in any large-scale infrastructure. We need to deal with service and hardware heterogeneity to determine the optimal setting for each service and hardware type. Third, there exists a long tail of services with scarce resources and support for performance evaluation. Finally, and most importantly, we need to ensure that large-scale changes to CPU frequency do not risk the reliability of the services and the infrastructure. We present our solution, which has overcome the above challenges and has been running in production for over 3 years. It created 12 MW of supply, which is equivalent to building and populating half a datacenter in our fleet. In addition to the real-world performance of our solution, we also share our key takeaways for improving fleet-wide efficiency via DVFS boosting in a safe manner.

DOI: 10.1145/3617232.3624853


Exploiting Human Color Discrimination for Memory- and Energy-Efficient Image Encoding in Virtual Reality

作者: Ujjainkar, Nisarg and Shahan, Ethan and Chen, Kenneth and Duinkharjav, Budmonde and Sun, Qi and Zhu, Yuhao
关键词: No keywords

Abstract

Virtual Reality (VR) has the potential of becoming the next ubiquitous computing platform. Continued progress in the burgeoning field of VR depends critically on an efficient computing substrate. In particular, DRAM access energy is known to contribute to a significant portion of system energy. Today’s framebuffer compression systems alleviate DRAM traffic by using a numerically lossless compression algorithm. Being numerically lossless, however, is unnecessary to preserve perceptual quality for humans. This paper proposes a perceptually lossless, but numerically lossy, system to compress DRAM traffic. Our idea builds on top of long-established psychophysical studies showing that humans cannot discriminate colors that are close to each other. The discrimination ability becomes even weaker (i.e., more colors are perceptually indistinguishable) in our peripheral vision. Leveraging this color discrimination (in)ability, we propose an algorithm that adjusts pixel colors to minimize the bit encoding cost without introducing visible artifacts. The algorithm is coupled with lightweight architectural support that, in real time, reduces DRAM traffic by 66.9% and outperforms existing framebuffer compression mechanisms by up to 20.4%. Psychophysical studies on human participants show that our system introduces little to no perceptual fidelity degradation.
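
The flavor of eccentricity-dependent encoding can be shown with a minimal sketch: quantize a color channel more coarsely the farther a tile is from the gaze point, so fewer distinct codes need to be stored. The step-size function and all numbers are invented, not psychophysical data or the paper's algorithm.

```python
# Toy sketch of eccentricity-aware quantization (thresholds are made up):
# coarser color steps in the periphery, where human discrimination is weaker,
# mean fewer distinct codes per pixel.

import numpy as np

def quant_step(eccentricity_deg):
    """Hypothetical: allow a larger color step farther from the gaze point."""
    return 1 + eccentricity_deg // 10

def encode(pixels, eccentricity_deg):
    step = quant_step(eccentricity_deg)
    return pixels // step, step               # fewer distinct codes to store

def decode(codes, step):
    return codes * step + step // 2           # mid-rise reconstruction

rng = np.random.default_rng(1)
tile = rng.integers(0, 256, size=(8, 8))      # one color channel of a tile

codes, step = encode(tile, eccentricity_deg=40)   # a peripheral tile
recon = decode(codes, step)
print(step, int(np.abs(recon - tile).max()))  # error stays within half a step
```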

DOI: 10.1145/3617232.3624860


Formal Mechanised Semantics of CHERI C: Capabilities, Undefined Behaviour, and Provenance

作者: Zaliva, Vadim and Memarian, Kayvan and Almeida, Ricardo and Clarke, Jessica and Davis, Brooks and Richardson, Alexander and Chisnall, David and Campbell, Brian and Stark, Ian and Watson, Robert N. M. and Sewell, Peter
关键词: No keywords

Abstract

Memory safety issues are a persistent source of security vulnerabilities, with conventional architectures and the C codebase chronically prone to exploitable errors. The CHERI research project has shown how one can provide radically improved security for that existing codebase with minimal modification, using unforgeable hardware capabilities in place of machine-word pointers in CHERI dialects of C, implemented as adaptations of Clang/LLVM and GCC. CHERI was first prototyped as extensions of MIPS and RISC-V; it is currently being evaluated by Arm and others with the Arm Morello experimental architecture, processor, and platform, to explore its potential for mass-market adoption, and by Microsoft in their CHERIoT design for embedded cores. There is thus considerable practical experience with CHERI C implementation and use, but exactly what CHERI C’s semantics is (or should be) remains an open question. In this paper, we present the first attempt to rigorously and comprehensively define CHERI C semantics, discuss key semantic design questions relating to capabilities, provenance, and undefined behaviour, and clarify them with semantics in multiple complementary forms: in prose, as an executable semantics adapting the Cerberus C semantics, and mechanised in Coq. This establishes a solid foundation for CHERI C, for those porting code to it, for compiler implementers, and for future semantics and verification.

DOI: 10.1145/3617232.3624859


GPU-based Private Information Retrieval for On-Device Machine Learning Inference

作者: Lam, Maximilian and Johnson, Jeff and Xiong, Wenjie and Maeng, Kiwan and Gupta, Udit and Li, Yang and Lai, Liangzhen and Leontiadis, Ilias and Rhu, Minsoo and Lee, Hsien-Hsin S. and Reddi, Vijay Janapa and Wei, Gu-Yeon and Brooks, David and Suh, Edward
关键词: privacy, security, cryptography, machine learning, GPU, performance

Abstract

On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1–10 GB of data, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20×.
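
The linear-algebra core that makes PIR GPU-friendly can be seen in the clear: fetching row i of a table is a matrix-vector product with a one-hot query. In real PIR schemes the query vector is encrypted (e.g., homomorphically) so the server learns nothing about i; the sketch below omits all cryptography and is only meant to show the shape of the computation.

```python
# PIR stripped to its arithmetic skeleton (no cryptography): retrieving a row
# of an embedding table is a mat-vec with a selection vector that would be
# encrypted in an actual PIR protocol.

import numpy as np

rng = np.random.default_rng(0)
num_rows, dim = 1_000, 64
table = rng.standard_normal((num_rows, dim))   # server-side embedding table

def pir_query(index, n):
    q = np.zeros(n)
    q[index] = 1.0                             # would be encrypted in real PIR
    return q

def pir_answer(table, query):
    return query @ table                       # one big mat-vec per retrieval

i = 417
assert np.allclose(pir_answer(table, pir_query(i, num_rows)), table[i])
print("row", i, "recovered without the server being told i (in a real scheme)")
```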

DOI: 10.1145/3617232.3624855


HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis

作者: Ye, Hanchen and Jun, Hyegang and Chen, Deming
关键词: No keywords

Abstract

Dataflow architectures are growing in popularity due to their potential to mitigate the challenges posed by the memory wall inherent to the Von Neumann architecture. At the same time, high-level synthesis (HLS) has demonstrated its efficacy as a design methodology for generating efficient dataflow architectures within a short development cycle. However, existing HLS tools rely on developers to explore the vast dataflow design space, ultimately leading to suboptimal designs. This phenomenon is especially concerning as the size of the HLS design grows. To tackle these challenges, we introduce HIDA, a new scalable and hierarchical HLS framework that can systematically convert an algorithmic description into a dataflow implementation on hardware. We first propose a collection of efficient and versatile dataflow representations for modeling the hierarchical dataflow structure. Capitalizing on these representations, we develop an automated optimizer that decomposes the dataflow optimization problem into multiple levels based on the inherent dataflow hierarchy. Using FPGAs as an evaluation platform and working with a set of neural networks modeled in PyTorch, HIDA achieves speedups of up to 8.54×.

DOI: 10.1145/3617232.3624850


Lightweight, Modular Verification for WebAssembly-to-Native Instruction Selection

作者: VanHattum, Alexa and Pardeshi, Monica and Fallin, Chris and Sampson, Adrian and Brown, Fraser
关键词: instruction selection, source code generation, compiler verification, webassembly, sandboxing

Abstract

Language-level guarantees—like module runtime isolation for WebAssembly (Wasm)—are only as strong as the compiler that produces a final, native-machine-specific executable. The process of lowering language-level constructions to ISA-specific instructions can introduce subtle bugs that violate security guarantees. In this paper, we present Crocus, a system for lightweight, modular verification of instruction-lowering rules within Cranelift, a production retargetable Wasm native code generator. We use Crocus to verify lowering rules that cover WebAssembly 1.0 support for integer operations in the ARM aarch64 backend. We show that Crocus can reproduce 3 known bugs (including a 9.9/10 severity CVE), identify 2 previously-unknown bugs and an underspecified compiler invariant, and help analyze the root causes of a new bug.

DOI: 10.1145/3617232.3624862


Loupe: Driving the Development of OS Compatibility Layers

作者: Lefeuvre, Hugo and Gain, Gaulthier and Bădoiu, Vlad-Andrei et al.
关键词: operating systems

Abstract

Supporting mainstream applications is fundamental for a new OS to have impact. It is generally achieved by developing a compatibility layer that allows applications developed for a mainstream OS like Linux to run unmodified on the new OS. Building such a layer, as we show, results in large engineering inefficiencies due to the lack of efficient methods to precisely measure the OS features required by a set of applications. We propose Loupe, a novel method based on dynamic analysis that determines the OS features that need to be implemented in a prototype OS to bring support for a target set of applications and workloads. Loupe guides and boosts OS developers as they build compatibility layers, prioritizing which features to implement in order to quickly support many applications as early as possible. We apply our methodology to 100+ applications and several OSes currently under development, demonstrating high engineering effort savings vs. existing approaches: for example, for the 62 applications supported by the OSv kernel, we show that using Loupe would have required implementing only 37 system calls vs. 92 for the non-systematic process followed by OSv developers. We study our measurements and extract novel key insights. Overall, we show that the burden of building compatibility layers is significantly less than what previous works suggest: in some cases, as few as 20% of the system calls reported by static analysis, and 50% of those reported by naive dynamic analysis, need an implementation for an application to successfully run standard benchmarks.
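
For intuition, the naive dynamic measurement that Loupe improves on can be approximated by counting the distinct system calls an application issues under a workload, e.g. from an strace log; Loupe goes much further by testing which of those calls can be stubbed or faked. The log path below is hypothetical, and the regex assumes plain strace output.

```python
# Naive baseline: count distinct syscalls observed in an strace log. This only
# estimates an upper bound on the compatibility surface; Loupe's analysis
# shrinks it by identifying syscalls that can be stubbed or faked.

import re
from collections import Counter

SYSCALL_RE = re.compile(r"^(\w+)\(")   # strace lines start with "name(args..."

def syscalls_used(strace_log_path):
    counts = Counter()
    with open(strace_log_path) as f:
        for line in f:
            m = SYSCALL_RE.match(line.strip())
            if m:
                counts[m.group(1)] += 1
    return counts

counts = syscalls_used("nginx.strace")         # hypothetical trace file
print(len(counts), "distinct syscalls; most frequent:", counts.most_common(5))
```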

DOI: 10.1145/3617232.3624861


ngAP: Non-blocking Large-scale Automata Processing on GPUs

作者: Ge, Tianao and Zhang, Tong and Liu, Hongyuan
关键词: finite state machine, GPU, parallel computing

Abstract

Finite automata serve as compute kernels for various applications that require high throughput. However, despite the increasing compute power of GPUs, their potential in processing automata remains underutilized. In this work, we identify three major challenges that limit GPU throughput. 1) The available parallelism is insufficient, resulting in underutilized GPU threads. 2) Automata workloads involve significant redundant computations since a portion of states matches with repeated symbols. 3) The mapping between threads and states is switched dynamically, leading to poor data locality. Our key insight is that processing automata “one-symbol-at-a-time” serializes the execution, and thus needs to be revamped. To address these challenges, we propose Non-blocking Automata Processing, which allows parallel processing of different symbols in the input stream and also enables further optimizations: 1) We prefetch a portion of computations to increase the chances of processing multiple symbols simultaneously, thereby utilizing GPU threads better. 2) To reduce redundant computations, we store repeated computations in a memoization table, enabling us to substitute them with table lookups. 3) We privatize some computations to preserve the mapping between threads and states, thus improving data locality. Experimental results demonstrate that our approach outperforms the state-of-the-art GPU automata processing engine by an average of 7.9×.
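
The memoization idea (point 2 above) has a simple software analogue: automata processing repeatedly computes the successor set of the same (active states, symbol) pair, so those transitions can be cached and replaced by table lookups. The sketch below is a CPU-side illustration, not the GPU implementation, and the tiny NFA is invented.

```python
# Memoized NFA stepping: repeated (state-set, symbol) transitions become table
# lookups instead of recomputation.

def step(transitions, states, symbol):
    return frozenset(t for s in states for t in transitions.get((s, symbol), ()))

def run_memoized(transitions, start, text):
    memo = {}                           # (frozenset(states), symbol) -> next states
    states = frozenset(start)
    for sym in text:
        key = (states, sym)
        if key not in memo:
            memo[key] = step(transitions, states, sym)
        states = memo[key]
    return states, len(memo)

# Tiny NFA matching strings ending in "ab": 0 = start, 1 = saw "a", 2 = accept.
nfa = {(0, "a"): (0, 1), (0, "b"): (0,), (1, "a"): (0, 1), (1, "b"): (2,),
       (2, "a"): (0, 1), (2, "b"): (0,)}
final, table_size = run_memoized(nfa, {0}, "ab" * 400)
print(2 in final, table_size)           # True, and only 3 memo entries for 800 symbols
```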

DOI: 10.1145/3617232.3624848


Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions

作者: Xia, Chunwei and Zhao, Jiacheng and Sun, Qianqi and Wang, Zheng and Wen, Yuan and Yu, Teng and Feng, Xiaobing and Cui, Huimin
关键词: deep neural network, compiler optimization, tensor expression, GPU

Abstract

Optimizing deep neural network (DNN) execution is important but becomes increasingly difficult as DNN complexity grows. Existing DNN compilers cannot effectively exploit optimization opportunities across operator boundaries, leaving room for improvement. To address this challenge, we present Souffle, an open-source compiler that optimizes DNN inference across operator boundaries. Souffle creates a global tensor dependency graph using tensor expressions, traces data flow and tensor information, and partitions the computation graph into subprograms based on dataflow analysis and resource constraints. Within a subprogram, Souffle performs local optimization via semantic-preserving transformations, finds an optimized program schedule, and improves instruction-level parallelism and data reuse. We evaluated Souffle using six representative DNN models on an NVIDIA A100 GPU. Experimental results show that Souffle consistently outperforms six state-of-the-art DNN optimizers by delivering a geometric mean speedup of up to 3.7×.

DOI: 10.1145/3617232.3624858


Performance-aware Scale Analysis with Reserve for Homomorphic Encryption

作者: Lee, Yongwoo and Cheon, Seonyoung and Kim, Dongkwan and Lee, Dongyoon and Kim, Hanjun
关键词: homomorphic encryption, CKKS, scale management, static analysis, reserve, compiler, privacy-preserving machine learning

Abstract

Thanks to its ability to compute on encrypted data and its efficient fixed-point execution, the RNS-CKKS fully homomorphic encryption (FHE) scheme is a promising solution for privacy-preserving machine learning services. However, writing an efficient RNS-CKKS program is challenging due to its manual scale management requirement. Each ciphertext has a scale value with its maximum scale capacity. Since each RNS-CKKS multiplication increases the scale, programmers should properly rescale a ciphertext by reducing the scale and capacity together. Existing compilers reduce the programming burden by automatically analyzing and managing the scales of ciphertexts, but they either conservatively rescale ciphertexts and thus give up further optimization opportunities, or require time-consuming scale management space exploration. This work proposes a new performance-aware static scale analysis for RNS-CKKS programs, which generates an efficient scale management plan without expensive space exploration. This work analyzes the scale budget, called “reserve”, of each ciphertext in a backward manner from the end of a program and redistributes the budgets to the ciphertexts, thus enabling performance-aware scale management. This work also designs a new type system for the proposed scale analysis and ensures the correctness of the analysis result. This work achieves a 41.8% performance improvement over EVA, which uses conservative static scale analysis. It also shows performance improvement similar to exploration-based Hecate, yet with 15526× faster compilation.

DOI: 10.1145/3617232.3624870


Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling

作者: Ahmad, Sohaib and Guan, Hui and Friedman, Brian D. and Williams, Thomas and Sitaraman, Ramesh K. and Woo, Thomas
关键词: inference serving, model serving, machine learning, autoscaling

Abstract

Existing machine learning inference-serving systems largely rely on hardware scaling by adding more devices or using more powerful accelerators to handle increasing query demands. However, hardware scaling might not be feasible for fixed-size edge clusters or private clouds due to their limited hardware resources. A viable alternative is accuracy scaling, which adapts the accuracy of ML models instead of hardware resources to handle varying query demands. This work studies the design of a high-throughput inference-serving system with accuracy scaling that can meet throughput requirements while maximizing accuracy. To achieve this goal, this work proposes to identify the right amount of accuracy scaling by jointly optimizing three sub-problems: how to select model variants, how to place them on heterogeneous devices, and how to assign query workloads to each device. It also proposes a new adaptive batching algorithm to handle variations in query arrival times and minimize SLO violations. Based on the proposed techniques, we build an inference-serving system called Proteus and empirically evaluate it on real-world and synthetic traces. We show that Proteus reduces accuracy drop by up to 3×.

DOI: 10.1145/3617232.3624849


RainbowCake: Mitigating Cold-starts in Serverless with Layer-wise Container Caching and Sharing

作者: Yu, Hanfei and Basu Roy, Rohan and Fontenot, Christian and Tiwari, Devesh and Li, Jian and Zhang, Hong and Wang, Hao and Park, Seung-Jong
关键词: serverless computing, container caching, container sharing, cold-start

Abstract

Serverless computing has grown rapidly as a new cloud computing paradigm that promises ease of management, cost efficiency, and auto-scaling by shipping functions via self-contained virtualized containers. Unfortunately, serverless computing suffers from severe cold-start problems—starting containers incurs non-trivial latency. Full container caching is widely applied to mitigate cold-starts, yet has recently been outperformed by two lines of research: partial container caching and container sharing. However, both partial container caching and container sharing techniques have drawbacks. Partial container caching effectively deals with burstiness while leaving cold-start mitigation halfway; container sharing reduces cold-starts by enabling containers to serve multiple functions while suffering from excessive memory waste due to over-packed containers. This paper proposes RainbowCake, a layer-wise container pre-warming and keep-alive technique that effectively mitigates cold-starts with sharing awareness at minimal waste of memory. With structured container layers and sharing-aware modeling, RainbowCake is robust and tolerant to invocation bursts. We seize the container-sharing opportunity behind the startup process of standard container techniques. RainbowCake breaks the startup process of a container into three stages and manages different container layers individually. We develop a sharing-aware algorithm that makes event-driven, layer-wise caching decisions in real time. Experiments on OpenWhisk clusters with real-world workloads show that RainbowCake reduces function startup latency by 68% and memory waste by 77% compared to state-of-the-art solutions.
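
To make "layer-wise keep-alive" concrete, here is a minimal sketch assuming three illustrative layers with made-up memory costs and lifetimes: the more widely shareable a layer is, the longer it is worth keeping. Layer names, TTLs, and sizes are invented and are not RainbowCake's actual policy.

```python
# Illustrative layer-wise keep-alive: spend cache memory where reuse is likely.

LAYERS = [
    # (name, memory_mb, shareable_across, ttl_seconds)
    ("bare",    10, "all functions",           600),
    ("runtime", 60, "same-language functions", 120),
    ("user",   150, "this function only",       15),
]

def layers_to_keep(idle_seconds):
    """Layers still cached after a container has been idle this long."""
    return [name for name, _, _, ttl in LAYERS if idle_seconds < ttl]

def cached_memory(idle_seconds):
    return sum(mem for _, mem, _, ttl in LAYERS if idle_seconds < ttl)

for idle in (5, 60, 300, 900):
    print(idle, layers_to_keep(idle), cached_memory(idle), "MB")
```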

DOI: 10.1145/3617232.3624871


Scaling Up Memory Disaggregated Applications with SMART

作者: Ren, Feng and Zhang, Mingxing and Chen, Kang and Xia, Huaxia and Chen, Zuoning and Wu, Yongwei
关键词: disaggregated memory, one-sided RDMA, scale-up

Abstract

Recent developments in RDMA networks are driving the trend toward memory disaggregation. However, the performance of each compute node is still limited by the network, especially when it needs to perform a large number of concurrent fine-grained remote accesses. According to our evaluations, existing IOPS-bound disaggregated applications do not scale well beyond 32 cores, and therefore do not take full advantage of today’s many-core machines. After an in-depth analysis of the internal architecture of the RNIC, we found three major scale-up bottlenecks that limit the throughput of today’s disaggregated applications: (1) implicit contention on doorbell registers, (2) cache thrashing caused by excessive outstanding work requests, and (3) wasted IOPS from unsuccessful CAS retries. However, the solutions to these problems involve many low-level details that are not familiar to application developers. To ease the burden on developers, we propose Smart, an RDMA programming framework that hides the above details by providing an interface similar to one-sided RDMA verbs. We need only 44 and 16 lines of code to refactor the state-of-the-art disaggregated hash table (RACE) and persistent transaction processing system (FORD) with Smart, improving their throughput by up to 132.4×.

DOI: 10.1145/3617232.3624857


SoCFlow: Efficient and Scalable DNN Training on SoC-Clustered Edge Servers

作者: Xu, Daliang and Xu, Mengwei and Lou, Chiheng and Zhang, Li and Huang, Gang and Jin, Xin and Liu, Xuanzhe
关键词: SoC-cluster, distributed machine learning, mixed precision training

Abstract

SoC-Cluster, a novel server architecture composed of massive numbers of mobile system-on-chips (SoCs), is gaining popularity in industrial edge computing due to its energy efficiency and compatibility with existing mobile applications. However, we observe that deployed SoC-Cluster servers are not fully utilized, because the hosted workloads are mostly user-triggered and exhibit significant tidal patterns. To harvest the free cycles, we propose to co-locate deep learning tasks on them. We present SoCFlow, the first framework that can efficiently train deep learning models on SoC-Cluster. To deal with the intrinsic inadequacy of commercial SoC-Cluster servers, SoCFlow incorporates two novel techniques: (1) group-wise parallelism with delayed aggregation, which can train deep learning models fast and scalably without being bottlenecked by the network; (2) a data-parallel mixed-precision training algorithm that can fully unleash the capability of the heterogeneous processors of mobile SoCs. We have fully implemented SoCFlow and demonstrated its effectiveness through extensive experiments. The experiments show that SoCFlow significantly and consistently outperforms all baselines in training speed while preserving convergence accuracy, e.g., by 1.6×.

DOI: 10.1145/3617232.3624847


SoD2: Statically Optimizing Dynamic Deep Neural Network Execution

作者: Niu, Wei and Agrawal, Gagan and Ren, Bin
关键词: dynamic neural network, compiler optimization, mobile device

Abstract

Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used are dependent upon the input and/or execution, are becoming common. This paper presents SoD2, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a classification of common operators that form DNNs, and the use of this classification in a Rank and Dimension Propagation (RDP) method. This framework statically determines the shapes of operators as known constants, symbolic constants, or operations on these. Next, using RDP, we enable a series of optimizations, like fused code generation, execution (order) planning, and even runtime memory allocation plan generation. By evaluating the framework on 10 emerging Dynamic DNNs and comparing it against several existing systems, we demonstrate both reductions in execution latency and memory requirements, with RDP-enabled key optimizations responsible for much of the gains. Our evaluation results show that SoD2 runs up to 3.9× faster than these systems.
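
The flavor of propagating shapes as constants, symbolic constants, or operations on these can be sketched in a few lines; the operator rules and symbol names below are illustrative, not SoD2's actual RDP rules.

```python
# Minimal rank-and-dimension propagation: a dimension is either an int (known
# constant) or a str (symbolic), and each operator derives its output shape
# from its inputs, folding to constants where possible.

def matmul_shape(a, b):
    """(.., m, k) x (k, n) -> (.., m, n); the contraction dims must agree."""
    *batch, m, k1 = a
    k2, n = b
    assert k1 == k2, f"contraction mismatch: {k1} vs {k2}"
    return (*batch, m, n)

def concat_shape(a, b, axis):
    out = list(a)
    if isinstance(a[axis], int) and isinstance(b[axis], int):
        out[axis] = a[axis] + b[axis]          # fold to a constant when possible
    else:
        out[axis] = f"({a[axis]}+{b[axis]})"   # otherwise keep it symbolic
    return tuple(out)

x = ("batch", "seq_len", 768)                  # dynamic batch and sequence length
w = (768, 3072)
h = matmul_shape(x, w)
print(h)                                       # ('batch', 'seq_len', 3072)
print(concat_shape(h, h, axis=1))              # ('batch', '(seq_len+seq_len)', 3072)
```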

DOI: 10.1145/3617232.3624869


TrackFM: Far-out Compiler Support for a Far Memory World

作者: Tauro, Brian R. and Suchy, Brian and Campanoni, Simone and Dinda, Peter and Hale, Kyle C.
关键词: disaggregated memory, compilers, far memory

Abstract

Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which sacrifices neither performance nor programmer transparency. Modern compiler analysis and transformation techniques, combined with a suitable tightly-coupled runtime system, enable this approach. We describe the design, implementation, and evaluation of TrackFM, a new compiler-based far memory system. Through extensive benchmarking, we demonstrate that TrackFM outperforms kernel-based approaches by up to 2×.

DOI: 10.1145/3617232.3624856


Training Job Placement in Clusters with Statistical In-Network Aggregation

作者: Zhao, Bohan and Xu, Wei and Liu, Shuo and Tian, Yang and Wang, Qiaoling and Wu, Wenfei
关键词: in-network aggregation, distributed training, placement

Abstract

In-Network Aggregation (INA) offloads gradient aggregation in distributed training (DT) onto programmable switches, where switch memory can be allocated to jobs in either synchronous or statistical multiplexing mode. Statistical INA has advantages in switch memory utilization, control-plane simplicity, and management safety, but it faces the problem of cross-layer resource efficiency in job placement. This paper presents NetPack, a job placement system for clusters with statistical INA that aims to maximize the utilization of both computation and network resources. NetPack periodically batches and places jobs into the cluster. When placing a job, NetPack runs a steady-state estimation algorithm to determine the resources available in the cluster, heuristically values each server according to its available resources (GPUs and bandwidth), and runs a dynamic programming algorithm to efficiently search for the servers with the highest value for the job. Experiments with our NetPack prototype demonstrate that it outperforms prior job placement methods by 45% in average job completion time on production traces.
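
The placement step can be sketched roughly as follows; the steady-state estimation is omitted, and the scoring function and DP formulation below are illustrative assumptions rather than NetPack's actual algorithm. Each server is valued from its free GPUs and bandwidth, and a small dynamic program assigns the job's workers to servers so that total value is maximized.

```python
def value(free_bw, k):
    # heuristic worth of placing k workers on one server; they share its bandwidth
    return (free_bw * k) ** 0.5 if k else 0.0

def place_job(servers, workers):
    """servers: list of (name, free_gpus, free_bw); place `workers` one-GPU workers."""
    # dp[w] = (best total value, assignment) with w workers placed so far
    dp = {0: (0.0, [])}
    for name, gpus, bw in servers:
        nxt = dict(dp)                       # option: skip this server entirely
        for w, (v, assignment) in dp.items():
            for k in range(1, min(gpus, workers - w) + 1):
                cand = (v + value(bw, k), assignment + [(name, k)])
                if cand[0] > nxt.get(w + k, (-1.0, None))[0]:
                    nxt[w + k] = cand
        dp = nxt
    return dp.get(workers)                   # None if the cluster cannot host the job

servers = [("s1", 4, 10.0), ("s2", 8, 40.0), ("s3", 2, 40.0)]
print(place_job(servers, 6))                 # (value, [(server, workers placed), ...])
```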

DOI: 10.1145/3617232.3624863


UBFuzz: Finding Bugs in Sanitizer Implementations

作者: Li, Shaohua and Su, Zhendong
关键词: undefined behavior, sanitizer, compiler, program generation, fuzzing

Abstract

In this paper, we propose a testing framework for validating sanitizer implementations in compilers. Our core components are (1) a program generator specifically designed for producing programs containing undefined behavior (UB), and (2) a novel test oracle for sanitizer testing. The program generator employs Shadow Statement Insertion, a general and effective approach for introducing UB into a valid seed program. The generated UB programs are subsequently used for differential testing of multiple sanitizer implementations. Nevertheless, discrepant sanitizer reports may stem from either compiler optimization or sanitizer bugs. To accurately determine whether a discrepancy is caused by a sanitizer bug, we introduce a new test oracle called crash-site mapping. We have incorporated our techniques into UBfuzz, a practical tool for testing sanitizers. Over a five-month testing period, UBfuzz found 31 bugs in the GCC and LLVM sanitizers. These bugs reveal serious false-negative problems in sanitizers, where certain UBs in programs went unreported. This research paves the way for further investigation in this crucial area of study.
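
A drastically simplified version of this differential-testing loop is sketched below: a signed-overflow statement is injected into a valid seed program (a stand-in for Shadow Statement Insertion), the program is built with two UBSan implementations, and a discrepancy is flagged when only one of them reports the UB at run time. The sketch assumes gcc and clang with UndefinedBehaviorSanitizer are on the PATH, and it does not model the crash-site mapping oracle that the paper uses to tell sanitizer bugs apart from optimization effects.

```python
import os
import subprocess
import tempfile
import textwrap

SEED = textwrap.dedent("""
    #include <stdio.h>
    int main(void) {
        int x = 2000000000;
        /*INJECT*/
        printf("%d\\n", x);
        return 0;
    }
""")

def inject_ub(seed):
    # shadow statement: feeds back into existing state, so the program stays valid C
    return seed.replace("/*INJECT*/", "x = x + x;   /* signed overflow: UB */")

def sanitizer_reports(cc, src_path):
    exe = src_path + "." + cc
    subprocess.run([cc, "-fsanitize=undefined", src_path, "-o", exe], check=True)
    run = subprocess.run([exe], capture_output=True, text=True)
    return "runtime error" in run.stderr   # UBSan prefixes reports with "runtime error:"

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "case.c")
        with open(src, "w") as f:
            f.write(inject_ub(SEED))
        reports = {cc: sanitizer_reports(cc, src) for cc in ("gcc", "clang")}
        if len(set(reports.values())) > 1:   # one sanitizer missed the UB
            print("discrepancy:", reports)
```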

DOI: 10.1145/3617232.3624874


ZENO: A Type-based Optimization Framework for Zero Knowledge Neural Network Inference

作者: Feng, Boyuan and Wang, Zheng and Wang, Yuke and Yang, Shu and Ding, Yufei
关键词: ZKP, neural networks, privacy

Abstract

Zero-knowledge neural networks draw increasing attention for guaranteeing the computation integrity and privacy of neural networks (NNs) based on the zero-knowledge Succinct Non-interactive ARgument of Knowledge (zkSNARK) security scheme. However, the performance of zkSNARK NNs is far from optimal due to million-scale circuit computation with heavy scalar-level dependency. In this paper, we propose a type-based optimizing framework for efficient zero-knowledge NN inference, namely ZENO (ZEro knowledge Neural network Optimizer). We first introduce ZENO language constructs that maintain high-level semantics and type information (e.g., privacy and tensor types) to allow more aggressive optimizations. We then propose privacy-type-driven and tensor-type-driven optimizations to further optimize the generated zkSNARK circuit. Finally, we design a set of NN-centric system optimizations to further accelerate zkSNARK NNs. Experimental results show that ZENO achieves up to 8.5×…
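
The privacy-type-driven idea can be illustrated with a toy constraint counter: every tensor carries a privacy type, operations that touch private data emit zkSNARK constraints, and public-by-public work is folded outside the circuit. The `Circuit`/`Tensor` classes and the constraint accounting below are illustrative assumptions, not ZENO's language constructs.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    shape: tuple
    private: bool             # the privacy type carried through the program

class Circuit:
    def __init__(self):
        self.constraints = 0

    def matmul(self, a: Tensor, b: Tensor) -> Tensor:
        m, k = a.shape
        _, n = b.shape
        if a.private or b.private:
            # each output scalar needs k multiply-add constraints inside the circuit
            self.constraints += m * n * k
        # else: both operands are public, so the product is precomputed off-circuit
        return Tensor((m, n), a.private or b.private)

c  = Circuit()
x  = Tensor((1, 784),   private=True)    # the user's private input
w1 = Tensor((784, 128), private=False)   # public model weights
w2 = Tensor((128, 128), private=False)

w12 = c.matmul(w1, w2)   # public x public: folded, adds 0 constraints
y   = c.matmul(x, w12)   # private x public: 1*128*784 constraints in-circuit
print(c.constraints)     # 100352, versus 116736 if both layers ran inside the circuit
```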

DOI: 10.1145/3617232.3624852



