TaskStream: accelerating task-parallel workloads by recovering program structure
Authors: Dadu, Vidushi and Nowatzki, Tony
Keywords: Irregularity, accelerators, dataflow, generality, load-balance, reconfigurable, streaming, tasks
Abstract
Reconfigurable accelerators, like CGRAs and dataflow architectures, have come to prominence for addressing data-processing problems. However, they are largely limited to workloads with regular parallelism, precluding their applicability to prevalent task-parallel workloads. Reconfigurable architectures and task parallelism seem to be at odds, as the former requires repetitive and simple program structure, and the latter breaks program structure to create small, individually scheduled program units. Our insight is that if tasks and their potential communication structure are first-class primitives in the hardware, it is possible to recover program structure with extremely low overhead. We propose a task execution model for accelerators called TaskStream, which annotates task dependences with information sufficient to recover inter-task structure. TaskStream enables work-aware load balancing, recovery of pipelined inter-task dependences, and recovery of inter-task read sharing through multicasting. We apply TaskStream to a reconfigurable dataflow architecture, creating a seamless hierarchical dataflow model for task-parallel workloads. We compare our accelerator, Delta, with an equivalent static-parallel design. Overall, we find that our execution model can improve performance by 2.2×.
DOTA: detect and omit weak attentions for scalable transformer acceleration
Authors: Qu, Zheng and Liu, Liu and Tu, Fengbin and Chen, Zhaodong and Ding, Yufei and Xie, Yuan
Keywords: SW-HW Co-design, Sparse Architecture, Transformer Acceleration
Abstract
Transformer Neural Networks have demonstrated leading performance in many applications spanning language understanding, image processing, and generative modeling. Despite the impressive performance, long-sequence Transformer processing is expensive due to the quadratic computation complexity and memory consumption of self-attention. In this paper, we present DOTA, an algorithm-architecture co-design that effectively addresses the challenges of scalable Transformer inference. Based on the insight that not all connections in an attention graph are equally important, we propose to jointly optimize a lightweight Detector with the Transformer model to accurately detect and omit weak connections during runtime. Furthermore, we design a specialized system architecture for end-to-end Transformer acceleration using the proposed attention detection mechanism. Experiments on a wide range of benchmarks demonstrate the superior performance of DOTA over other solutions. In summary, DOTA achieves 152.6x and 4.5x performance speedup and orders of magnitude energy-efficiency improvements over GPU and customized hardware, respectively.
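As an illustration of the detect-and-omit idea, the following sketch (a toy for a single attention head; the detector here is a simple low-precision pass, not the paper's learned Detector, and all names are illustrative) masks out attention connections whose estimated weight falls below a threshold before computing full-precision attention:

```python
import numpy as np

def detect_and_omit_attention(Q, K, V, keep_threshold=0.01):
    d = Q.shape[-1]
    # Detector pass: cheap low-precision estimate of the attention graph.
    S_est = (Q.astype(np.float16) @ K.astype(np.float16).T) / np.sqrt(d)
    P_est = np.exp(S_est - S_est.max(axis=-1, keepdims=True))
    P_est /= P_est.sum(axis=-1, keepdims=True)
    keep = P_est >= keep_threshold           # weak connections are omitted

    # Full-precision attention restricted to the detected connections.
    S = (Q @ K.T) / np.sqrt(d)
    S = np.where(keep, S, -np.inf)           # omitted edges get zero weight
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
out = detect_and_omit_attention(Q, K, V)     # shape (8, 16)
```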
A full-stack search technique for domain optimized deep learning accelerators
Authors: Zhang, Dan and Huda, Safeen and Songhori, Ebrahim and Prabhu, Kartik and Le, Quoc and Goldie, Anna and Mirhoseini, Azalia
Keywords: design space exploration, hardware-software codesign, machine learning, operation fusion, tensor processing unit
Abstract
The rapidly-changing deep learning landscape presents a unique opportunity for building inference accelerators optimized for specific datacenter-scale workloads. We propose Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. In this paper, we analyze bottlenecks in state-of-the-art vision and natural language processing (NLP) models, including EfficientNet and BERT, and use FAST to design accelerators capable of addressing these bottlenecks. FAST-generated accelerators optimized for single workloads improve Perf/TDP by 3.7×.
FINGERS: exploiting fine-grained parallelism in graph mining accelerators
Authors: Chen, Qihang and Tian, Boyu and Gao, Mingyu
Keywords: graph mining, hardware acceleration, parallelism
Abstract
Graph mining is an emerging application of high importance and also with high complexity, thus requiring efficient hardware acceleration. Current accelerator designs only utilize coarse-grained parallelism, leaving large room for further optimizations. Our key insight is to fully exploit fine-grained parallelism to overcome the existing issues of hardware underutilization, inefficient resource provision, and limited single-thread performance under imbalanced loads. Targeting pattern-aware graph mining algorithms, we first comprehensively identify and analyze the abundant fine-grained parallelism at the branch, set, and segment levels during search tree exploration and set operations. We then propose a novel graph mining accelerator, FINGERS, which effectively exploits these multiple levels of fine-grained parallelism to achieve significant performance improvements. FINGERS mainly enhances the design of each single processing element with parallel compute units for set operations, and efficient techniques for task scheduling, load balancing, and data aggregation. FINGERS outperforms the state-of-the-art design by 2.8×.
BiSon-e: a lightweight and high-performance accelerator for narrow integer linear algebra computing on the edge
Authors: Reggiani, Enrico and Lazo, Cristóbal
Keywords: Binary Segmentation, Convolutional Neural Network, Edge Computing, Hardware Accelerator, Low-power design, Narrow Integer Arithmetic, Number Representation, RISC-V, String Matching
Abstract
Linear algebra computational kernels based on byte and sub-byte integer data formats are at the base of many classes of applications, ranging from Deep Learning to Pattern Matching. Porting the computation of these applications from cloud to edge and mobile devices would enable significant improvements in terms of security, safety, and energy efficiency. However, despite their low memory and energy demands, their intrinsically high computational intensity makes the execution of these workloads challenging on highly resource-constrained devices. In this paper, we present BiSon-e, a novel RISC-V based architecture that accelerates linear algebra kernels based on narrow integer computations on edge processors by performing Single Instruction Multiple Data (SIMD) operations on off-the-shelf scalar Functional Units (FUs). Our novel architecture is built upon the binary segmentation technique, which significantly reduces the memory footprint and arithmetic intensity of linear algebra kernels requiring narrow data sizes. We integrate BiSon-e into a complete System-on-Chip (SoC) based on RISC-V, synthesized and placed-and-routed in 65nm and 22nm technologies, introducing a negligible 0.07% area overhead with respect to the baseline architecture. Our experimental evaluation shows that, when computing the Convolution and Fully-Connected layers of the AlexNet and VGG-16 Convolutional Neural Networks (CNNs) with 8-, 4-, and 2-bit data, our solution gains up to 5.6×.
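The binary segmentation trick that BiSon-e builds on can be illustrated in a few lines: pack narrow operands into one wide machine word with guard bits so that a single scalar multiply yields several narrow products at once. This sketch is illustrative (Python integers stand in for a 64-bit scalar FU), not the paper's exact encoding:

```python
SEG = 16                                   # segment width: 8-bit data + 8 guard bits

def pack(values, seg=SEG):
    word = 0
    for i, v in enumerate(values):         # place each operand in its own segment
        word |= v << (i * seg)
    return word

def unpack(word, n, seg=SEG):
    return [(word >> (i * seg)) & ((1 << seg) - 1) for i in range(n)]

a = [7, 12, 200, 31]                       # unsigned 8-bit inputs
c = 9                                      # unsigned 8-bit coefficient
products = unpack(pack(a) * c, len(a))     # ONE scalar multiply, four products
assert products == [v * c for v in a]      # each product fits its 16-bit segment
```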
Software-defined address mapping: a case on 3D memory
Authors: Zhang, Jialiang and Swift, Michael and Li, Jing (Jane)
Keywords: 3D memory, Address mapping, Software defined memory
Abstract
3D-stacking memory such as High-Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) provides orders of magnitude more bandwidth and significantly increased channel-level parallelism (CLP) due to its new parallel memory architecture. However, it is challenging to fully exploit the abundant CLP for performance, as bandwidth utilization is highly dependent on the address mapping in the memory controller. Unfortunately, CLP is very sensitive to a program’s data access pattern, which is not made available to the OS/hardware by existing mechanisms. In this work, we address these challenges with software-defined address mapping (SDAM), which, for the first time, enables user programs to obtain direct control of the low-level memory hardware in an intelligent and fine-grained manner. In particular, we develop new mechanisms that can effectively communicate a program’s data access properties to the OS and hardware and use them to control data placement in hardware. To guarantee correctness and reduce overhead in storage and performance, we extend the Linux kernel and C-language memory allocators to support multiple address mappings. For advanced system optimization, we develop machine learning methods that can automatically identify access patterns of major variables in a program and cluster those with similar access patterns to reduce the overhead of SDAM. We demonstrate the benefits of our design on a real system prototype, comprising (1) a RISC-V processor, near-memory accelerators, and HBM modules on a Xilinx FPGA platform, and (2) modified Linux and glibc. Our evaluation on standard CPU benchmarks and data-intensive benchmarks (for both CPU and accelerators) demonstrates 1.41×.
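A minimal sketch of the idea, with hypothetical names (alloc_with_hint, channel_of) and illustrative bit positions rather than SDAM's actual API: the program declares an allocation's access pattern, and that hint selects which address bits pick the memory channel:

```python
CHANNELS = 8

def channel_of(addr, mapping):
    if mapping == "fine":                # consecutive 64 B blocks alternate
        return (addr >> 6) % CHANNELS    # channels: good for sequential streams
    return (addr >> 20) % CHANNELS       # 1 MB regions per channel: keeps
                                         # locality for random access patterns

def alloc_with_hint(size, access_pattern):
    # A hint-aware allocator would place the buffer in a region whose
    # controller mapping matches the declared pattern.
    mapping = "fine" if access_pattern == "sequential" else "coarse"
    return {"size": size, "mapping": mapping}

buf = alloc_with_hint(1 << 20, "sequential")
print(buf["mapping"], channel_of(0x40, buf["mapping"]))   # -> fine 1
```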
Parallel virtualized memory translation with nested elastic cuckoo page tables
Authors: Stojkovic, Jovan and Skarlatos, Dimitrios and Kokolis, Apostolos and Xu, Tianyin and Torrellas, Josep
Keywords: Page Tables, Virtual Memory, Virtualization
Abstract
A major reason why nested or virtualized address translations are slow is because current systems organize page tables in a multi-level tree that is accessed in a sequential manner. A nested translation may potentially require up to twenty-four sequential memory accesses. To address this problem, this paper presents the first page table design that supports parallel nested address translation. The design is based on using hashed page tables (HPTs) for both guest and host. However, directly extending a native HPT design to a nested environment leads to minor gains. Instead, our design solves a new set of challenges that appear in nested environments. Our scheme eliminates all but three of the potentially twenty-four sequential steps of a nested translation—while judiciously limiting the number of parallel memory accesses issued to avoid over-consuming cache bandwidth. As a result, compared to conventional nested radix tables, our design speeds up the execution of a set of applications by an average of 1.19x (for 4KB pages) and 1.24x (when huge pages are used). In addition, we show a migration path from current nested radix page tables to our design.
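The twenty-four-access figure follows from simple counting: each of the four guest page-table levels needs a four-step host walk plus its own access, and the final guest-physical address needs one more host walk:

```python
G, H = 4, 4                   # guest and host radix page-table levels
sequential_accesses = G * H + G + H
print(sequential_accesses)    # -> 24, matching the paper's worst case
# Hashed guest and host tables let most of these probes be issued in
# parallel; the paper's design leaves a sequential chain of only 3.
```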
Source Kernel for Conference Paper: CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
Authors: Suchy, Brian and Ghosh, Souradip and Kersnar, Drew and Chai, Siyuan and Huang, Zhen and Nelson, Aaron and Cuevas, Michael and Bernat, Alex and Chaudhary, Gaurav and Hardavellas, Nikos and Campanoni, Simone and Dinda, Peter
Keywords: compiler, kernel, nautilus, operating system, paging, virtual memory
Abstract
Nautilus is an example of an Aerokernel, a very thin kernel layer exposed (much like a unikernel) directly to a runtime system and/or application. An Aerokernel does not, by default, have a user mode! There are several reasons for this, simplicity and performance among the most important. Furthermore, there are no heavy-weight processes—only threads, all of which share an address space. Therefore, Nautilus is also an example of a single address-space OS (SASOS). The runtime can implement user-mode features or address space isolation if this is required for its execution model.
This version of Nautilus has been modified to accommodate running with a CARAT address space abstraction (CARAT CAKE). CARAT CAKE is an extension of work found in PLDI ’20; the paper describing this work appears in ASPLOS ’22.
The concept of CARAT CAKE is to replace paging with a system that can operate using only physical addresses. Doing so enables significant energy savings in the underlying system and allows new performance-minded optimizations in both the microarchitecture and in software.
NVAlloc: rethinking heap metadata management in persistent memory allocators
Authors: Dang, Zheng and He, Shuibing and Hong, Peiyi and Li, Zhenxin and Zhang, Xuechen and Sun, Xian-He and Chen, Gang
Keywords: dynamic memory allocation, memory fragmentation, persistent memory
Abstract
Persistent memory allocation is a fundamental building block for developing high-performance and in-memory applications. Existing persistent memory allocators suffer from suboptimal heap organizations that introduce repeated cache line flushes and small random accesses in persistent memory. Worse, many allocators use static slab segregation, resulting in a dramatic increase in memory consumption when the allocation request size changes. In this paper, we design a novel allocator, named NVAlloc, to solve the above issues simultaneously. First, NVAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. Second, it writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. Third, instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. NVAlloc is complementary to existing consistency models. Results on 6 benchmarks demonstrate that NVAlloc improves the performance of state-of-the-art persistent memory allocators by up to 6.4x and 57x for small and large allocations, respectively. Using NVAlloc reduces memory usage by up to 57.8%. In addition, we integrate NVAlloc into a persistent FPTree. Compared to the state-of-the-art allocators, NVAlloc improves the performance of this application by up to 3.1x.
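A minimal sketch of the interleaved-metadata idea (an illustrative layout, not NVAlloc's exact one): consecutive slab blocks map to metadata entries in different cache lines, so back-to-back allocations do not repeatedly flush the same metadata line:

```python
ENTRIES_PER_LINE = 8                 # e.g. 8-byte entries in a 64 B cache line
LINES = 4                            # metadata area spans 4 cache lines

def interleaved_slot(block_id):
    line = block_id % LINES          # stride across cache lines first
    offset = block_id // LINES       # then fill within each line
    return line * ENTRIES_PER_LINE + offset

# Blocks 0,1,2,3 land in lines 0,1,2,3 instead of all in line 0.
print([interleaved_slot(b) // ENTRIES_PER_LINE for b in range(8)])
# -> [0, 1, 2, 3, 0, 1, 2, 3]
```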
Every walk’s a hit: making page walks single-access cache hits
Authors: Park, Chang Hyun and Vougioukas, Ilias and Sandberg, Andreas and Black-Schaffer, David
Keywords: Flattened page table, page table cache prioritization
Abstract
As memory capacity has outstripped TLB coverage, large data applications suffer from frequent page table walks. We investigate two complementary techniques for addressing this cost: reducing the number of accesses required and reducing the latency of each access. The first approach is accomplished by opportunistically “flattening” the page table: merging two levels of traditional 4 KB page table nodes into a single 2 MB node, thereby reducing the table’s depth and the number of indirections required to traverse it. The second is accomplished by biasing the cache replacement algorithm to keep page table entries during periods of high TLB miss rates, as these periods also see high data miss rates and are therefore more likely to benefit from having the smaller page table in the cache than to suffer from increased data cache misses. We evaluate these approaches for both native and virtualized systems and across a range of realistic memory fragmentation scenarios, describe the limited changes needed in our kernel implementation and hardware design, identify and address challenges related to self-referencing page tables and kernel memory allocation, and compare results across server and mobile systems using both academic and industrial simulators for robustness. We find that flattening does reduce the number of accesses required on a page walk (to 1.0), but its performance impact (+2.3%) is small due to Page Walker Caches (already 1.5 accesses). Prioritizing caching has a larger effect (+6.8%), and the combination improves performance by +9.2%. Flattening is more effective on virtualized systems (4.4 to 2.8 accesses, +7.1% performance), due to 2D page walks. By combining the two techniques we demonstrate a state-of-the-art +14.0% performance gain and -8.7% dynamic cache energy and -4.7% dynamic DRAM energy for virtualized execution with very simple hardware and software changes.
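The index arithmetic behind flattening is easy to check: a 4 KB node holds 2^9 eight-byte entries, so two adjacent levels resolve 9 + 9 virtual-address bits, which is exactly what a single 2 MB node resolves in one access:

```python
VA = 0x7f1234567000
l2 = (VA >> 30) & 0x1FF              # upper 9-bit index (traditional level)
l3 = (VA >> 21) & 0x1FF              # lower 9-bit index (traditional level)
flat = (VA >> 21) & 0x3FFFF          # one 18-bit index into a 2 MB node
assert flat == (l2 << 9) | l3        # same bits, one indirection fewer
```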
Replication Package for Article: GPM: Leveraging Persistent Memory from a GPU
Authors: Pandey, Shweta and Kamath, Aditya K and Basu, Arkaprava
Keywords: GPU, Persistent Memory
Abstract
GPM is a system that allows a GPU to leverage Persistent Memory and enables writing highly performant recoverable GPU applications. The repository contains the source of our benchmark suite, GPMBench, and a CUDA library, LibGPM. GPMBench comprises 9 benchmarks categorized as transactional, native, and checkpointing. LibGPM provides a user-friendly interface for GPU-accelerated recoverable applications. More details can be found in our ASPLOS ’22 paper: GPM: Leveraging Persistent Memory from a GPU. The artifact also allows a user to reproduce some of the key results published in the paper.
GPUReplay: a 50-KB GPU stack for client ML
Authors: Park, Heejin and Lin, Felix Xiaozhu
Keywords: GPU stack, client ML, record and replay, secure GPU computation
Abstract
GPUReplay (GR) is a novel way of deploying GPU-accelerated computation on mobile and embedded devices. It addresses the high complexity of a modern GPU stack for deployment ease and security. The idea is to record GPU executions on the full GPU stack ahead of time and replay the executions on new input at run time. We address key challenges towards making GR feasible, sound, and practical to use. The resultant replayer is a drop-in replacement of the original GPU stack. It is tiny (50 KB of executable), robust (replaying long executions without divergence), portable (running in a commodity OS, in TEE, and baremetal), and quick to launch (speeding up startup by up to two orders of magnitude). We show that GPUReplay works with a variety of integrated GPU hardware, GPU APIs, ML frameworks, and 33 neural network (NN) implementations for inference or training. The code is available at https://github.com/bakhi/GPUReplay.
Replication Package for Article: ValueExpert: Exploring Value Patterns in GPU-Accelerated Applications
Authors: Zhou, Keren and Hao, Yueming and Mellor-Crummey, John and Meng, Xiaozhu and Liu, Xu
Keywords: GPU profilers, GPUs, Profiling Tools, Value Analysis, Value Patterns
Abstract
Our artifact includes ValueExpert and benchmark code in this paper, along with instructions to use benchmarks to generate results for Figure 2, Figure 6, and Table 3 on NVIDIA A100 and RTX 2080 Ti GPUs. The speedup and overhead of each benchmark are averaged among 10 runs.
We provide a docker image with pre-installed prerequisites to simplify the experiment workflow. Users can also use a script to install all software from scratch.
SparseCore: stream ISA and processor specialization for sparse computation
Authors: Rao, Gengyu and Chen, Jingji and Yik, Jason and Qian, Xuehai
Keywords: Deep learning, Graph analytics, Sparse computation acceleration, Stream ISA
Abstract
Computation on sparse data is becoming increasingly important for many applications. Recent sparse computation accelerators are designed for specific algorithms/applications, making them inflexible with software optimizations. This paper proposes SparseCore, the first general-purpose processor extension for sparse computation that can flexibly accelerate complex code patterns and fast-evolving algorithms. We extend the instruction set architecture (ISA) to make streams, or sparse vectors, first-class citizens, and develop efficient architectural components to support the stream ISA. The novel ISA extension intrinsically operates on streams, realizing both efficient data movement and computation. The simulation results show that SparseCore achieves significant speedups for sparse tensor computation and graph pattern computation.
JSONSki: streaming semi-structured data with bit-parallel fast-forwarding
Authors: Jiang, Lin and Zhao, Zhijia
Keywords: Bit-Parallel Algorithm, JSON, Parser, SIMD, Semi-structured Data
Abstract
Semi-structured data, such as JSON, are fundamental to the Web and document data stores. Streaming analytics on semi-structured data combines parsing and query evaluation into one pass to avoid generating parse trees. Though promising, its conventional design requires parsing the data stream in detail, character by character, which limits the efficiency of streaming analytics. This work reveals a wide range of opportunities to fast-forward the streaming over data substructures irrelevant to the query evaluation. However, identifying these substructures itself may need detailed parsing. To resolve this dilemma, this work designs a highly bit-parallel solution that intensively utilizes bitwise and SIMD operations to identify the irrelevant substructures during streaming. It includes a new streaming model—recursive-descent streaming—for easy adoption of fast-forward optimizations; a concept—structural intervals—for partitioning the data stream; and a group of bit-parallel algorithms implementing various fast-forward cases. The solution is implemented as a JSON streaming framework, called JSONSki. It offers a set of APIs that can be invoked during streaming to dynamically fast-forward over different cases of irrelevant substructures. Evaluation using real-world datasets and standard path queries shows that JSONSki achieves significant speedups over state-of-the-art JSON processing tools while maintaining a minimal memory footprint.
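The bit-parallel flavor of fast-forwarding can be sketched with plain integer bitmaps (a toy, not JSONSki's actual algorithms): build one bitmap per structural character for a chunk, then jump to the next relevant position with a single lowest-set-bit extraction instead of a character-by-character scan:

```python
def bitmap(chunk: bytes, ch: bytes) -> int:
    bits = 0
    for i, b in enumerate(chunk):          # a SIMD compare covers 32-64 bytes
        if b == ch[0]:                     # of this loop in one instruction
            bits |= 1 << i
    return bits

chunk = b'{"a":1,"skip":[2,3],"b":4}'
brackets = bitmap(chunk, b']')
# Fast-forward past the irrelevant array: jump straight to the first ']'.
next_close = (brackets & -brackets).bit_length() - 1   # index of lowest set bit
print(chunk[next_close:])                              # -> b'],"b":4}'
```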
Research data supporting “MineSweeper: a “clean sweep” for drop-in use-after-free prevention”
Authors: Erdős, Márton
Keywords: programming language security, temporal safety, use-after-free
Abstract
This artifact contains our MineSweeper implementation, an allocator extension implemented on top of JeMalloc to mitigate use-after-free attacks, together with scripts to evaluate its running time and memory overheads on the SPEC CPU2006 benchmarks. The base implementation itself and a minimally modified JeMalloc memory allocator are fetched from their own repositories, compiled, and dynamically loaded in the SPEC config scripts. The dynamically linked libraries can be used to evaluate SPEC CPU2006 overheads using our scripts (benchmarks not included), or they can be loaded to protect a pre-compiled program from use-after-reallocate and double-free exploits.
Replication Package for Article: Revizor - Testing Black-Box CPUs against Speculation Contracts
Authors: Oleksenko, Oleksii and Fetzer, Christof and Köpf, Boris
Keywords: contracts, Spectre, speculation, testing
Abstract
The artifact includes the source code of Revizor, a set of scripts for reproducing the results, and a description of how to use them. They help to reproduce the contract violations described in the paper and validate the claimed fuzzing speed.
Replication Package for Article: Protecting Adaptive Sampling from Information Leakage on Low-Power Sensors
Authors: Kannan, Tejas and Hoffmann, Henry
Keywords: Adaptive Sampling, Data Privacy, Embedded Systems, Lossy Data Encoding
Abstract
This artifact provides an implementation of Adaptive Group Encoding (AGE). AGE is a framework that protects adaptive sampling procedures on low-power sensors from leaking information through the size of batched messages. The system works by encoding all measurement batches as fixed-length messages, thereby breaking the relationship between the message size and the adaptive policy’s collection rate. This repository implements AGE both in a simulated environment and on a microcontroller (MCU). The simulator, written in Python, represents the sensor and server as individual processes. These components communicate using a local (encrypted) socket, and the simulator tracks the sensor’s energy consumption using traces from a TI MSP430 MCU. The hardware setting executes AGE on a TI MSP430 FR5994. The MCU transmits measurement batches to a separate server over a Bluetooth link. These experimental settings confirm AGE’s ability to maintain the low error of adaptive sampling while preventing information leakage and incurring negligible energy overhead. The repository https://github.com/tejaskannan/adaptive-group-encoding contains all the code for this work.
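The core fixed-length idea can be sketched in a few lines (illustrative, not AGE's actual group encoder, which encodes rather than merely pads): every batch is sent in the same number of bytes, so message size no longer reveals the adaptive policy's collection rate:

```python
MSG_BYTES = 64

def encode_batch(measurements, msg_bytes=MSG_BYTES):
    payload = bytes(measurements)              # assume 1-byte measurements
    assert len(payload) <= msg_bytes - 2
    header = len(payload).to_bytes(2, "big")   # real count, recovered on decode
    return header + payload + b"\x00" * (msg_bytes - 2 - len(payload))

def decode_batch(msg):
    n = int.from_bytes(msg[:2], "big")
    return list(msg[2:2 + n])

small, large = encode_batch([1, 2]), encode_batch(list(range(40)))
assert len(small) == len(large) == MSG_BYTES   # sizes are indistinguishable
assert decode_batch(large) == list(range(40))
```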
uTrimmer: Security Hardening of MIPS Embedded Systems via Static Binary Debloating for Shared Libraries
Authors: Zhang, Haotian and Ren, Mengfei and Lei, Yu and Ming, Jiang
Keywords: embedded systems, shared library debloating, static binary analysis
Abstract
This artifact evaluates the performance of our debloating framework, uTrimmer, on SPEC CPU2017, MIPS firmware applications, and a real MIPS embedded application. uTrimmer is built on top of angr to identify and wipe out unused basic blocks from shared libraries’ binary code in MIPS firmware applications. For a given MIPS binary program and its dependent shared libraries, uTrimmer can export debloated versions of the program’s shared libraries. uTrimmer itself does not need additional software to work. However, to evaluate the debloating result, it requires IDA Pro for function boundary detection and QEMU to emulate the execution environment for the programs under test. The execution scripts required to reproduce the experiment results are provided in the VM image.
We performed several experiments to evaluate uTrimmer’s performance. The first experiment evaluates the debloating capability of uTrimmer on SPEC CPU2017 and real firmware applications; the result is shown in Table 3 on page 10. The second experiment compares the debloating efficiency of uTrimmer with that of the static linker, shown in Table 4 on page 10. The third experiment demonstrates uTrimmer’s ability to reduce ROP gadgets on SPEC CPU2017 and firmware applications; the execution results are in Table 5 on page 10. We also conducted an experiment on real firmware to evaluate uTrimmer’s performance, shown in Table 6 on page 12.
ViK: practical mitigation of temporal memory safety violations through object ID inspection
Authors: Cho, Haehyun and Park, Jinbum and Oest, Adam and Bao, Tiffany and Wang, Ruoyu and Shoshitaishvili, Yan and Doupé, Adam
Keywords: Operating System Kernels, Temporal Memory Safety Violations
Abstract
Temporal memory safety violations, such as use-after-free (UAF) vulnerabilities, are a critical security issue for software written in memory-unsafe languages such as C and C++. In this paper, we introduce ViK, a novel, lightweight, and widely applicable runtime defense that can protect both operating system (OS) kernels and user-space applications against temporal memory safety violations. ViK performs object ID inspection, where it assigns a random identifier to every allocated object and stores the identifier in the unused bits of the corresponding pointer. When a pointer is used, ViK inspects its value before dereferencing, ensuring that the pointer still references the original object. To the best of our knowledge, this is the first mitigation against temporal memory safety violations that scales to OS kernels. We evaluated the software prototype of ViK on Android and Linux kernels and observed runtime overhead of around 20%. We also evaluated a hardware-assisted prototype of ViK on the Android kernel, where the runtime overhead was as low as 2%.
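A minimal model of object ID inspection, using Python integers to stand in for 64-bit tagged pointers (the ID width and shift are illustrative, not ViK's exact encoding):

```python
import random

ID_SHIFT, ID_MASK = 48, 0xFFFF             # ID lives in the pointer's top 16 bits
heap = {}                                   # address -> (object_id, payload)

def vik_alloc(addr, payload):
    oid = random.getrandbits(16)            # random per-object identifier
    heap[addr] = (oid, payload)
    return addr | (oid << ID_SHIFT)         # return a tagged pointer

def vik_deref(ptr):
    addr = ptr & ((1 << ID_SHIFT) - 1)
    oid = (ptr >> ID_SHIFT) & ID_MASK
    stored = heap.get(addr)
    if stored is None or stored[0] != oid:  # freed or reallocated object
        raise RuntimeError("temporal memory safety violation detected")
    return stored[1]

p = vik_alloc(0x1000, "data")
assert vik_deref(p) == "data"
heap[0x1000] = (random.getrandbits(16), "reuse")  # free + reallocation
# vik_deref(p) now raises, except with 2**-16 ID-collision probability.
```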
Replication package of paper: Eavesdropping User Credentials via GPU Side Channels on Smartphones
Authors: Yang, Boyuan and Chen, Ruirong and Huang, Kai and Yang, Jun and Gao, Wei
Keywords: Input Eavesdropping, Mobile GPU, Performance Counters, Side Channel, Smartphones
Abstract
This repository contains artifacts of the paper Eavesdropping User Credentials via GPU Side Channels on Smartphones. It contains 1) the source code of the smartphone app and backend server program needed to launch the eavesdropping attack; 2) the mobile user apps listed as victims of this attack; and 3) the automated scripts that operate the attacking programs to replicate the experiment results reported in the paper.
CRISP: critical slice prefetching
Authors: Litz, Heiner and Ayers, Grant and Ranganathan, Parthasarathy
Keywords: branch prediction, criticality, instruction scheduling, out-of-order execution, prefetching
Abstract
The high access latency of DRAM continues to be a performance challenge for contemporary microprocessor systems. Prefetching is a well-established technique to address this problem, however, existing implemented designs fail to provide any performance benefits in the presence of irregular memory access patterns. The hardware complexity of prior techniques that can predict irregular memory accesses such as runahead execution has proven untenable for implementation in real hardware. We propose a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling. In particular, our technique executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, we observe that the latency induced by branch mispredictions and other high latency instructions can be hidden with a similar approach. Our proposal only requires minimal hardware modifications by performing memory access classification, load and branch slice extraction, as well as priority analysis exclusively in software. As a result, our technique is feasible to implement, introducing only a simple new instruction prefix while requiring minimal modifications of the instruction scheduler. Our technique increases the IPC of memory-latency-bound applications by up to 38% and by 8.4% on average.
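The software slice extraction can be illustrated with a toy backward walk over register dependences (illustrative of the idea, not CRISP's actual analysis): starting from a delinquent load, collect the producers of its source registers so they can be tagged, e.g. via an instruction prefix, for early execution:

```python
def load_slice(trace, load_idx):
    needed = set(trace[load_idx]["srcs"])     # registers the load depends on
    slice_idxs = [load_idx]
    for i in range(load_idx - 1, -1, -1):     # walk backward over producers
        if trace[i]["dst"] in needed:
            slice_idxs.append(i)
            needed |= set(trace[i]["srcs"])   # follow transitive dependences
    return sorted(slice_idxs)

trace = [
    {"op": "mov r1, [base]", "dst": "r1", "srcs": ["base"]},
    {"op": "add r4, r9, 8",  "dst": "r4", "srcs": ["r9"]},   # not in slice
    {"op": "shl r2, r1, 3",  "dst": "r2", "srcs": ["r1"]},
    {"op": "ld  r3, [r2]",   "dst": "r3", "srcs": ["r2"]},   # delinquent load
]
print(load_slice(trace, 3))   # -> [0, 2, 3]
```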
Pinned Loads: Taming Speculative Loads in Secure Processors
Authors: Zhao, Zirui Neil and Ji, Houxiang and Morrison, Adam and Marinov, Darko and Torrellas, Josep
Keywords: Cache coherence protocol, Memory consistency, Processor design, Speculative execution defense
Abstract
Our artifact provides a complete gem5 implementation of Pinned Loads, along with scripts to evaluate Pinned Loads’ performance on the SPEC17, PARSEC, and SPLASH2X benchmark suites. We further provide access to a server with SPEC17 SimPoint checkpoints, PARSEC and SPLASH2X checkpoints, and disk images that allow a recreation of all the evaluation figures of the paper. Finally, we have open-sourced our implementation and scripts on GitHub.
Gem5/Rosette Simulation Packages for DAGguise
Authors: Deutsch, Peter W. and Yang, Yuheng and Bourgeat, Thomas and Drean, Jules and Emer, Joel S. and Yan, Mengjia
Keywords: dagguise, dramsim2, gem5, rosette
Abstract
Our artifact comprises two distinct parts: a unified gem5/DRAMSim2 model (for performance evaluation) and a Rosette model (for security verification). The unified gem5/DRAMSim2 model evaluates the performance of DAGguise and FS-BTA against an insecure baseline. We use gem5’s OoO core to perform baseline measurements, profile candidate rDAGs, and report final performance numbers. We also include the sample victim programs (DocDist and DNA) described in the paper, in addition to an rDAG generation tool and plotting scripts for Figures 7 and 9. The Rosette model symbolically executes the DAGguise system and verifies the Security Property with K-Induction as described in Section 5 of the paper.
RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
Authors: Sethi, Geet and Acun, Bilge and Agarwal, Niket and Kozyrakis, Christos and Trippel, Caroline and Wu, Carole-Jean
Keywords: AI training systems, Deep learning recommendation models, Memory optimization, Neural networks
Abstract
We propose RecShard, a fine-grained embedding table (EMB) partitioning and placement technique for deep learning recommendation models (DLRMs). RecShard is designed based on two key observations. First, not all EMBs are equal, nor are all rows within an EMB equal in terms of access patterns. EMBs exhibit distinct memory characteristics, providing performance optimization opportunities for intelligent EMB partitioning and placement across a tiered memory hierarchy. Second, in modern DLRMs, EMBs function as hash tables. As a result, EMBs display interesting phenomena, such as the birthday paradox, leaving EMBs severely under-utilized. RecShard determines an optimal EMB sharding strategy for a set of EMBs based on training data distributions and model characteristics, along with the bandwidth characteristics of the underlying tiered memory hierarchy. In doing so, RecShard achieves over 6 times higher EMB training throughput on average for capacity-constrained DLRMs. The throughput increase comes from improved EMB load balance by over 12 times and from reduced accesses to the slower memory by over 87 times.
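The hash-table observation is the birthday paradox in expectation: if k training IDs hash uniformly into a table of N rows, the expected number of distinct rows ever touched is N(1 - (1 - 1/N)^k), leaving the remainder cold (the numbers below are illustrative, not from the paper):

```python
N = 1_000_000                        # rows in the embedding table
for k in (100_000, 1_000_000):       # IDs seen during training
    touched = N * (1 - (1 - 1 / N) ** k)
    print(f"k={k:>9,}: ~{touched / N:.1%} of rows ever touched")
# k=  100,000: ~9.5% of rows ever touched
# k=1,000,000: ~63.2% of rows ever touched
```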
ASPLOS22 Artifact - AStitch Machine Learning Optimizing Compiler
Authors: Zheng, Zhen and Yang, Xuanda and Zhao, Pengzhan and Long, Guoping and Zhu, Kai and Zhu, Feiwen and Zhao, Wenyi and Liu, Xiaoyong and Yang, Jun and Zhai, Jidong and Song, Shuaiwen Leon and Lin, Wei
Keywords: Compiler Optimization, Kernel Fusion, Machine Learning System, Memory-intensive Computation
Abstract
The artifact contains the necessary software components to validate the main results in the AStitch paper. We provide a docker image to ease environment setup. The docker image contains the compiled binary of AStitch, scripts to evaluate inference and training performance, and scripts to draw the figures. It requires a Linux system with an NVIDIA driver (capable of running CUDA 10.0) on an x86_64 machine equipped with an NVIDIA V100 GPU to create the docker container. After launching the docker container, one can run a single script to collect all performance numbers. Some manual steps are then needed to fill the performance numbers into several python scripts that draw the most important figures in the paper, showing the speedup of AStitch and breakdown information.
Replication Package for Article: NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism
Authors: Zhao, Shixiong and Li, Fanxin and Chen, Xusheng and Shen, Tianxiang and Chen, Li and Wang, Sen and Zhang, Nicholas and Li, Cheng and Cui, Heming
Keywords: Distributed Training, Neural Architecture Search, Pipeline training
Abstract
The artifact provides the availability, functionality, and key reproducible results of the paper (NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism), a causal parallel training execution framework. The artifact requires a host with at least 100GB of CPU RAM and 4 Nvidia GPUs, each with at least 11GB of memory. The runtime environment is installed via Docker with a few command lines. The experiments comprise a throughput evaluation and a reproducible-training evaluation. The artifact provides one-click shell scripts to conduct the experiments.
VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling
Authors: Liu, Zihan and Leng, Jingwen and Zhang, Zhihui and Chen, Quan and Li, Chao and Guo, Minyi
Keywords: Compiling, Deep Learning Service, Multi-tenant, Scheduling
Abstract
Deep learning (DL) models have achieved great success in many application domains. As such, many industrial companies such as Google and Facebook have acknowledged the importance of multi-tenant DL services. Although multi-tenant services have been studied for conventional workloads, they have not been deeply studied for deep learning services, especially on general-purpose hardware. In this work, we systematically analyze the opportunities and challenges of providing multi-tenant deep learning services on the general-purpose CPU architecture from the aspects of scheduling granularity and code generation. We propose an adaptive granularity scheduling scheme to both guarantee resource usage efficiency and reduce the scheduling conflict rate. We also propose an adaptive compilation strategy, by which we can dynamically and intelligently pick a program with proper exclusive and shared resource usage to reduce overall interference-induced performance loss. Compared to the existing works, our design can serve more requests under the same QoS target in various scenarios (e.g., +71%, +62%, +45% for light, medium, and heavy workloads, respectively), and reduce the average query latency by 50%.
CoCoNet: Co-optimizing Computation and Communication for Distributed Neural Networks
Authors: Jangda, Abhinav and Huang, Jun and Liu, Guodong and Sabet, Amir Hossein Nodehi and Maleki, Saeed and Miao, Youshan and Musuvathi, Madanlal and Mytkowicz, Todd and Saarikivi, Olli
Keywords: CUDA, Distributed Machine Learning, MPI, NCCL
Abstract
CoCoNet is a domain specific language for expressing and optimizing distributed machine learning workloads. This artifact contains the CoCoNet implementation and our benchmark infrastructure.
Replication Package for Article: Clio: A Hardware-Software Co-Designed Disaggregated Memory System
Authors: Guo, Zhiyuan and Shan, Yizhou and Luo, Xuhao and Huang, Yutong and Zhang, Yiying
Keywords: FPGA, Hardware Software Co-design, Resource Disaggregation, Virtual Memory
Abstract
This artifact provides the source code of Clio, a hardware-software co-designed disaggregated memory system. The Clio artifact has a C-based host-side library, a C-based ARM SoC management path, and a SpinalHDL-based FPGA data path, along with a set of comprehensive FPGA building scripts. The artifact suite also has a set of microbenchmark examples and ported applications.
The Enzian Research Computer; Altium Design Sources
Authors: Cock, David and Ramdas, Abishek and Schwyn, Daniel and Giardino, Michael and Turowski, Adam and He, Zhenhao and Hossle, Nora and Korolija, Dario and Licciardello, Melissa and Martsenko, Kristina and Achermann, Reto and Alonso, Gustavo and Roscoe, Timothy
Keywords: cache coherence, FPGAs, heterogeneous systems
Abstract
CAD design sources for the Enzian research computer.
ASPLOS’22 artifact for Efficient and Scalable Core Multiplexing with M³v
Authors: Asmussen, Nils and Haas, Sebastian and Weinhold, Carsten and Miemietz, Till and Roitzsch, Michael
Keywords: communications management, operating systems, operating-systems security, process management, tiled architecture
Abstract
This is the artifact for the ASPLOS’22 paper “Efficient and Scalable Core Multiplexing with M³v”. The archive contains the source code of the software part, including the modified Linux kernel we compared M³v against, and all scripts to run the benchmarks. The archive also contains the FPGA bitfiles for the hardware platform.
FlexOS: Towards Flexible OS Isolation
Authors: Lefeuvre, Hugo and Bădoiu, Vlad-Andrei
Keywords: compartmentalization, isolation, operating system, operating system security
Abstract
This artifact contains the source code of FlexOS, the proof-of-concept of our flexible isolation approach presented at ASPLOS’22 (“FlexOS: Towards Flexible OS Isolation”), along with all scripts necessary to reproduce the paper’s measurements and plots. The goal of this artifact is to allow readers to reproduce the paper’s results and build new research on top of FlexOS.
Abstract of the paper:
At design time, modern operating systems are locked in a specific safety and isolation strategy that mixes one or more hardware/software protection mechanisms (e.g. user/kernel separation); revisiting these choices after deployment requires a major refactoring effort. This rigid approach shows its limits given the wide variety of modern applications’ safety/performance requirements, when new hardware isolation mechanisms are rolled out, or when existing ones break.
We present FlexOS, a novel OS allowing users to easily specialize the safety and isolation strategy of an OS at compilation/deployment time instead of design time. This modular LibOS is composed of fine-grained components that can be isolated via a range of hardware protection mechanisms with various data sharing strategies and additional software hardening. The OS ships with an exploration technique helping the user navigate the vast safety/performance design space it unlocks. We implement a prototype of the system and demonstrate, for several applications (Redis/Nginx/SQLite), FlexOS’ vast configuration space as well as the efficiency of the exploration technique: we evaluate 80 FlexOS configurations for Redis and show how that space can be probabilistically subset to the 5 safest ones under a given performance budget. We also show that, under equivalent configurations, FlexOS performs similarly or better than several baselines/competitors.
Adelie: Continuous Address Space Layout Re-randomization for Linux Drivers - Artifact for ASPLOS’22
Authors: Nikolaev, Ruslan and Nadeem, Hassan and Stone, Cathlyn and Ravindran, Binoy
Keywords: ASLR, PIC, ROP
Abstract
Artifact for ASPLOS’22 paper “Adelie: Continuous Address Space Layout Re-randomization for Linux Drivers”. The artifact contains source code, benchmark scripts, and preinstalled VM images that should be used with VirtualBox. The server VM image is in Adelie.zip, and the client (load generator) VM image is in Client.zip. Please see README.txt for more information. Please also see the licensing terms in LICENSE.
Suppressing ZZ crosstalk of Quantum computers through pulse and scheduling co-optimization
Authors: Xie, Lei and Zhai, Jidong and Zhang, ZhenXing and Allcock, Jonathan and Zhang, Shengyu and Zheng, Yi-Cong
Keywords: Error Suppression, Quantum Computing, ZZ Crosstalk
Abstract
Noise is a significant obstacle to quantum computing, and ZZ crosstalk is one of the most destructive types of noise affecting superconducting qubits. Previous approaches to suppressing ZZ crosstalk have mainly relied on specific chip design that can complicate chip fabrication and aggravate decoherence. To some extent, special chip design can be avoided by relying on pulse optimization to suppress ZZ crosstalk. However, existing approaches are non-scalable, as their required time and memory grow exponentially with the number of qubits involved. To address the above problems, we propose a scalable approach by co-optimizing pulses and scheduling. We optimize pulses to offer an ability to suppress ZZ crosstalk surrounding a gate, and then design scheduling strategies to exploit this ability and achieve suppression across the whole circuit. A main advantage of such co-optimization is that it does not require special hardware support. Besides, we implement our approach as a general framework that is compatible with different pulse optimization methods. We have conducted extensive evaluations by simulation and on a real quantum computer. Simulation results show that our proposal can improve the fidelity of quantum computing on 4∼12 qubits by up to 81×.
QUEST (ASPLOS’22) Code and Data
Authors: Patel, Tirthak and Younis, Ed and Iancu, Costin and de Jong, Wibe and Tiwari, Devesh
Keywords: Quantum Circuit Approximation, Quantum Circuit Synthesis, Quantum Computing
Abstract
This appendix describes the code and data artifacts related to QUEST. The artifacts are open-source at https://doi.org/10.5281/zenodo.5747894. They include the input files for the executed benchmarks, the code for partitioning, synthesis, dual annealing, and simulation, as well as a docker image set up with the code. Please see the following sections for more details, especially the Experiment Workflow section to read in detail about how the artifact directories and code files are organized.
HAMMER: boosting fidelity of noisy Quantum circuits by exploiting Hamming behavior of erroneous outcomes
Authors: Tannu, Swamit and Das, Poulami and Ayanzadeh, Ramin and Qureshi, Moinuddin
Keywords: NISQ, Quantum Compilers, Quantum Computing
Abstract
Quantum computers with hundreds of qubits will be available soon. Unfortunately, high device error rates pose a significant challenge in using these near-term quantum systems to power real-world applications. Executing a program on existing quantum systems generates both correct and incorrect outcomes, but often, the output distribution is too noisy to distinguish between them. In this paper, we show that erroneous outcomes are not arbitrary but exhibit a well-defined structure when represented in the Hamming space. Our experiments on IBM and Google quantum computers show that the most frequent erroneous outcomes are more likely to be close in the Hamming space to the correct outcome. We exploit this behavior to improve the ability to infer the correct outcome. We propose Hamming Reconstruction (HAMMER), a post-processing technique that leverages the observation of Hamming behavior to reconstruct the noisy output distribution, such that the resulting distribution has higher fidelity. We evaluate HAMMER using experimental data from Google and IBM quantum computers with more than 500 unique quantum circuits and obtain an average improvement of 1.37x in solution quality. On Google’s publicly available QAOA datasets, we show that HAMMER sharpens the gradients on the cost function landscape.
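A simplified version of the reconstruction step (my simplification, not HAMMER's exact weighting) shows the mechanism: each outcome's score is boosted by the counts of its Hamming-distance-1 neighbors, so the correct outcome at the center of a cloud of near-miss errors rises to the top:

```python
def hamming_reconstruct(counts, n_bits, neighbor_weight=0.5):
    scores = {}
    for bitstring, c in counts.items():
        s = c
        x = int(bitstring, 2)
        for b in range(n_bits):                    # distance-1 neighborhood
            nb = format(x ^ (1 << b), f"0{n_bits}b")
            s += neighbor_weight * counts.get(nb, 0)
        scores[bitstring] = s
    return max(scores, key=scores.get)

noisy = {"0110": 90, "0111": 80, "0100": 75, "1110": 70, "0010": 65}
print(hamming_reconstruct(noisy, 4))   # -> "0110": its neighbors reinforce it
```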
LILLIPUT: a lightweight low-latency lookup-table decoder for near-term Quantum error correction
Authors: Das, Poulami and Locharla, Aditya and Jones, Cody
Keywords: Decoding, Fault-tolerant quantum computing, Lookup-Table decoder, Quantum error correction, Surface codes
Abstract
The error rates of quantum devices are orders of magnitude higher than what is needed to run most quantum applications. To close this gap, Quantum Error Correction (QEC) encodes logical qubits and distributes information using several physical qubits. By periodically executing a syndrome extraction circuit on the logical qubits, information about errors (called the syndrome) is extracted while running programs. A decoder uses these syndromes to identify and correct errors in real time, which is necessary to prevent accumulation of errors. Unfortunately, software decoders are slow, and hardware decoders are fast but less accurate. Thus, almost all QEC studies so far have relied on offline decoding. To enable real-time decoding in near-term QEC, we propose LILLIPUT, a Lightweight Low Latency Look-Up Table decoder. LILLIPUT consists of two parts. First, it translates syndromes into error detection events that index into a Look-Up Table (LUT) whose entry provides the error information in real time. Second, it programs the LUTs with error assignments for all possible error events by running a software decoder offline. LILLIPUT tolerates an error on any operation in the quantum hardware, including gates and measurements, and the number of tolerated errors grows with the size of the code. LILLIPUT utilizes less than 7% of the logic on off-the-shelf FPGAs, enabling practical adoption, as FPGAs are already used to design the control and readout circuits in existing systems. LILLIPUT incurs a latency of a few nanoseconds and enables real-time decoding. We also propose Compressed LUTs (CLUTs) to reduce the memory required by LILLIPUT. By exploiting the fact that not all error events are equally likely and only storing data for the most probable error events, CLUTs reduce the memory needed by up to 107x (from 148 MB to 1.38 MB) without degrading accuracy.
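The lookup-table structure can be sketched with a toy three-bit repetition code (far simpler than LILLIPUT's surface-code tables): a slow decoder is run offline over every possible syndrome, and its answers are programmed into a table that the real-time path indexes in a single access:

```python
def offline_decoder(syndrome):            # stand-in for the slow software decoder
    return {0b00: 0b000, 0b10: 0b100,     # syndrome -> correction mask
            0b11: 0b010, 0b01: 0b001}[syndrome]

LUT = [offline_decoder(s) for s in range(4)]   # programmed once, offline

def decode_online(syndrome):
    return LUT[syndrome]                  # real-time path: one table read

data = 0b010                              # bit-flip error on the middle qubit
# Parity checks of the repetition code: s1 = q2 XOR q1, s0 = q1 XOR q0.
syndrome = (((data >> 2) ^ (data >> 1)) & 1) << 1 | ((data >> 1) ^ data) & 1
print(format(data ^ decode_online(syndrome), "03b"))   # -> 000, corrected
```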
Artifact for Article: Paulihedral: A Generalized Block-Wise Compiler Optimization Framework for Quantum Simulation Kernels
Authors: Li, Gushu and Wu, Anbang and Shi, Yunong and Javadi-Abhari, Ali and Ding, Yufei and Xie, Yuan
Keywords: compiler, quantum computing, quantum simulation
Abstract
See the appendix for the artifact description.
Astraea: towards QoS-aware and resource-efficient multi-stage GPU services
Authors: Zhang, Wei and Chen, Quan and Fu, Kaihua and Zheng, Ningxin and Huang, Zhiyi and Leng, Jingwen and Guo, Minyi
Keywords: GPU, Microservice, QoS, Resource management
Abstract
Multi-stage user-facing applications on GPUs are widely used nowadays and are often implemented as microservices. Prior research is not applicable to ensuring the QoS of GPU-based microservices due to their different communication patterns and shared resource contentions. We propose Astraea to manage GPU microservices considering the above factors. In Astraea, a microservice deployment policy is used to maximize the supported peak service load while ensuring the required QoS. To adaptively switch the communication methods between microservices according to different deployments, we propose an auto-scaling GPU communication framework. The framework automatically scales based on the currently used hardware topology and microservice location, and adopts global memory-based techniques to reduce intra-GPU communication. Astraea increases the supported peak load by up to 82.3% while achieving the desired 99%-ile latency target compared with state-of-the-art solutions.
Memory-harvesting VMs in cloud platforms
Authors: Fuerst, Alexander and Novaković, Stanko
Keywords: Cloud computing, memory management, resource harvesting
Abstract
Cloud platforms monetize their spare capacity by renting “Spot” virtual machines (VMs) that can be evicted in favor of higher-priority VMs. Recent work has shown that resource-harvesting VMs are more effective at exploiting spare capacity than Spot VMs, while also reducing the number of evictions. However, the prior work focused on harvesting CPU cores while keeping memory size fixed. This wastes a substantial monetization opportunity and may even limit the ability of harvesting VMs to leverage spare cores. Thus, in this paper, we explore memory harvesting and its challenges in real cloud platforms, namely its impact on VM creation time, NUMA spanning, and page fragmentation. We start by characterizing the amount and dynamics of the spare memory in Azure. We then design and implement memory-harvesting VMs (MHVMs), introducing new techniques for memory buffering, batching, and pre-reclamation. To demonstrate the use of MHVMs, we also extend a popular cluster scheduling framework (Hadoop) and a FaaS platform to adapt to them. Our main results show that (1) there is plenty of scope for memory harvesting in real platforms; (2) MHVMs are effective at mitigating the negative impacts of harvesting; and (3) our extensions of Hadoop and FaaS successfully hide the MHVMs’ varying memory size from the users’ data-processing jobs and functions. We conclude that memory harvesting has great potential for practical deployment and users can save up to 93% of their costs when running workloads on MHVMs.
IOCost: block IO control for containers in datacenters
Authors: Heo, Tejun and Schatzberg, Dan and Newell, Andrew and Liu, Song and Dhakshinamurthy, Saravanan and Narayanan, Iyswarya and Bacik, Josef and Mason, Chris and Tang, Chunqiang and Skarlatos, Dimitrios
Keywords: Containers, Datacenters, I/O, Operating Systems
Abstract
Resource isolation is a fundamental requirement in datacenter environments. However, our production experience in Meta’s large-scale datacenters shows that existing IO control mechanisms for block storage are inadequate in containerized environments. IO control needs to provide proportional resources to containers while taking into account the hardware heterogeneity of storage devices and the idiosyncrasies of the workloads deployed in datacenters. The speed of modern SSDs requires IO control to execute with low-overheads. Furthermore, IO control should strive for work conservation, take into account the interactions with the memory management subsystem, and avoid priority inversions that lead to isolation failures. To address these challenges, this paper presents IOCost, an IO control solution that is designed for containerized environments and provides scalable, work-conserving, and low-overhead IO control for heterogeneous storage devices and diverse workloads in datacenters. IOCost performs offline profiling to build a device model and uses it to estimate device occupancy of each IO request. To minimize runtime overhead, it separates IO control into a fast per-IO issue path and a slower periodic planning path. A novel work-conserving budget donation algorithm enables containers to dynamically share unused budget. We have deployed IOCost across the entirety of Meta’s datacenters comprised of millions of machines, upstreamed IOCost to the Linux kernel, and open-sourced our device-profiling tools. IOCost has been running in production for two years, providing IO control for Meta’s fleet. We describe the design of IOCost and share our experience deploying it at scale.
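A minimal sketch of the cost-model mechanism with illustrative numbers (not IOCost's profiled parameters): each IO is charged an estimated device occupancy from the offline-built model, and a container issues only while its budget covers the charge:

```python
DEVICE_MODEL = {"rand": 60.0, "seq": 8.0, "per_4kB": 2.0}  # microseconds

def io_cost(kind, size_bytes):
    # Estimated device occupancy of one request, per the device model.
    return DEVICE_MODEL[kind] + DEVICE_MODEL["per_4kB"] * (size_bytes / 4096)

class Container:
    def __init__(self, budget_us):
        self.budget = budget_us          # refilled by the periodic planning path

    def try_issue(self, kind, size_bytes):
        cost = io_cost(kind, size_bytes)
        if cost > self.budget:
            return False                 # throttled until refill or donation
        self.budget -= cost              # fast path: charge and issue
        return True

c = Container(budget_us=100.0)
print(c.try_issue("rand", 8192), c.try_issue("seq", 4096))  # -> True True
```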
TMO: transparent memory offloading in datacenters
Authors: Weiner, Johannes and Agarwal, Niket and Schatzberg, Dan and Yang, Leon and Wang, Hao and Sanouillet, Blaise and Sharma, Bikash and Heo, Tejun and Jain, Mayank and Tang, Chunqiang and Skarlatos, Dimitrios
Keywords: Datacenters, Memory Management, Non-volatile Memory, Operating Systems
Abstract
The unrelenting growth of the memory needs of emerging datacenter applications, along with ever increasing cost and volatility of DRAM prices, has led to DRAM being a major infrastructure expense. Alternative technologies, such as NVMe SSDs and upcoming NVM devices, offer higher capacity than DRAM at a fraction of the cost and power. One promising approach is to transparently offload colder memory to cheaper memory technologies via kernel or hypervisor techniques. The key challenge, however, is to develop a datacenter-scale solution that is robust in dealing with diverse workloads and large performance variance of different offload devices such as compressed memory, SSD, and NVM. This paper presents TMO, Meta’s transparent memory offloading solution for heterogeneous datacenter environments. TMO introduces a new Linux kernel mechanism that directly measures in realtime the lost work due to resource shortage across CPU, memory, and I/O. Guided by this information and without any prior application knowledge, TMO automatically adjusts how much memory to offload to heterogeneous devices (e.g., compressed memory or SSD) according to the device’s performance characteristics and the application’s sensitivity to memory-access slowdown. TMO holistically identifies offloading opportunities from not only the application containers but also the sidecar containers that provide infrastructure-level functions. To maximize memory savings, TMO targets both anonymous memory and file cache, and balances the swap-in rate of anonymous memory and the reload rate of file pages that were recently evicted from the file cache. TMO has been running in production for more than a year, and has saved between 20-32% of the total memory across millions of servers in our large datacenter fleet. We have successfully upstreamed TMO into the Linux kernel.
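The feedback loop can be sketched as follows (a toy control loop with hypothetical parameters, not TMO's kernel mechanism): measured lost work above an application's tolerance shrinks the offload target, and headroom grows it:

```python
def adjust_offload(offload_mb, pressure_pct, tolerance_pct, step_mb=64):
    if pressure_pct > tolerance_pct:     # app is stalling on reclaim/IO
        return max(0, offload_mb - step_mb)
    return offload_mb + step_mb          # slack: offload more cold memory

target = 1024
for measured in (0.05, 0.08, 0.30, 0.12):    # % lost work per interval
    target = adjust_offload(target, measured, tolerance_pct=0.10)
print(target)   # -> 1024 + 64 + 64 - 64 - 64 = 1024
```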
SOL: safe on-node learning in cloud platforms
Authors: Wang, Yawen and Crankshaw, Daniel and Yadwadkar, Neeraja J. and Berger, Daniel and Kozyrakis, Christos and Bianchini, Ricardo
Keywords: Cloud computing, machine learning for systems, on-node agents, systems for machine learning
Abstract
Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory placement. Our experiments show that (1) ML substantially improves our agents, and (2) SOL ensures that agents operate safely under a variety of failure conditions. We conclude that ML-based agents show significant potential and that SOL can help build them.
GenStore: a high-performance in-storage processing system for genome sequence analysis
Authors: Mansouri Ghiasi, Nika and Park, Jisung and Mustafa, Harun and Kim, Jeremie and Olgun, Ataberk and Gollwitzer, Arvid and Senol Cali, Damla and Firtina, Can and Mao, Haiyu and Almadhoun Alserr, Nour and Ausavarungnirun, Rachata and Vijaykumar, Nandita and Alser, Mohammed and Mutlu, Onur
Keywords: Filtering, Genomics, Near-Data Processing, Read Mapping, Storage
Abstract
Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, many prior works propose various approaches such as accurate filters that select the reads within a dataset of genomic reads (called a read set) that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the amount of expensive computation, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depends on the genomes that are being compared. Through rigorous analysis of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD). Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05×.
ProSE: the architecture and design of a protein discovery engine
作者: Robson, Eyes and Xu, Ceyu and Wills, Lisa Wu
关键词: Accelerators, Domain-Specific Architecture, NLP, Neural Networks, Protein Design, Transformers
Abstract
Protein language models have enabled breakthrough approaches to protein structure prediction, function annotation, and drug discovery. A primary limitation to the widespread adoption of these powerful models is the high computational cost associated with their training and inference, especially at longer sequence lengths. We present the architecture, microarchitecture, and hardware implementation of a protein design and discovery accelerator, ProSE (Protein Systolic Engine). ProSE has a collection of custom heterogeneous systolic arrays and special functions that process transfer learning model inferences efficiently. The architecture marries SIMD-style computations with systolic array architectures, optimizing coarse-grained operation sequences across model layers to achieve efficiency without sacrificing generality. ProSE performs Protein BERT inference at up to 6.9× speedup.
Replication package for paper “A One-for-All and O(V log(V))-Cost Solution for Parallel Merge Style Operations on Sorted Key-Value Arrays”
作者: Wang, Bangyan and Deng, Lei and Sun, Fei and Dai, Guohao and Liu, Liu and Wang, Yu and Xie, Yuan
关键词: GCC, Gem5, Graph, Join, Key-value array, Merge sort, SIMD, Sparse linear algebra, SpGEMM
Abstract
It contains the necessary source code to reproduce the results in the paper “A One-for-All and O(V log(V))-Cost Solution for Parallel Merge Style Operations on Sorted Key-Value Arrays”. It contains:
- A modified GCC compiler that supports the new SIMD primitives
- A modified Gem5 simulator that supports the new SIMD primitives
- A collection of kernels written in C++ that use the new SIMD primitives; they should be compiled using the modified GCC
- A Dockerfile to help you set up the environment
Client-optimized algorithms and acceleration for encrypted compute offloading
作者: van der Hagen, McKenzie and Lucia, Brandon
关键词: Compute Offloading, Hardware Acceleration, Homomorphic Encryption, IoT, Privacy-Preserving Computation
Abstract
Homomorphic Encryption (HE) enables secure cloud offload processing on encrypted data. HE schemes are limited in the complexity and type of operations they can perform, motivating client-aided implementations that distribute computation between client (unencrypted) and server (encrypted). Prior client-aided systems optimize server performance, ignoring client costs: client-aided models put encryption and decryption on the critical path and require communicating large ciphertexts. We introduce Client-aided HE for Opaque Compute Offloading (CHOCO), a client-optimized system for encrypted offload processing. CHOCO reduces ciphertext size, reducing communication and computing costs, through HE parameter minimization and through “rotational redundancy”, a new HE algorithm optimization. We present Client-aided HE for Opaque Compute Offloading Through Accelerated Cryptographic Operations (CHOCO-TACO), an accelerator for HE encryption and decryption that makes client-aided HE feasible even for resource-constrained clients. CHOCO supports two popular HE schemes (BFV and CKKS) and several applications, including DNNs, PageRank, KNN, and K-Means. CHOCO reduces communication by up to 2948×.
ASPLOS 2022 Artifact for “Finding Missed Optimizations through the Lens of Dead Code Elimination”
作者: Theodoridis, Theodoros and Rigger, Manuel and Su, Zhendong
关键词: compilers, missed optimizations, testing
Abstract
The artifact contains the code and dataset we used for our experiments, as well as scripts to generate the numbers and tables of our evaluation. Specifically, it includes (a) the corpus of randomly generated programs that we used in Section 4’s evaluation; (b) scripts for generating a new corpus and validating the existing one; (c) our LLVM-based optimization marker instrumenter; (d) scripts for generating the missed optimization statistics presented in Section 4; (e) the full list of submitted bug reports with links to the respective compiler bug trackers; (f) end-to-end examples that led to bug reports. Everything is packaged and pre-built as a docker image. A standard X86 Linux machine running docker is necessary to evaluate this artifact.
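For readers unfamiliar with the marker idea the artifact instruments, the driver below sketches it under our own simplifications (the artifact works on LLVM IR, not via a textual grep of assembly; a C compiler named `cc` on PATH is assumed): plant a call on a dead path and check whether dead code elimination removes it. A marker one compiler removes but another keeps points to a missed optimization.

```python
# Hedged sketch of DCE-marker testing (our reading, not the artifact's code).
import subprocess, tempfile, os

C_SRC = r"""
void marker(void);
int f(int x) {
    int y = x * 0;      /* y is always 0 */
    if (y) marker();    /* dead call: a good optimizer deletes it */
    return x;
}
"""

def marker_survives(cc: str, opt: str) -> bool:
    """Compile C_SRC to assembly and check if the marker call remains."""
    with tempfile.TemporaryDirectory() as d:
        src, asm = os.path.join(d, "t.c"), os.path.join(d, "t.s")
        open(src, "w").write(C_SRC)
        subprocess.run([cc, opt, "-S", src, "-o", asm], check=True)
        return "marker" in open(asm).read()

for opt in ("-O0", "-O2"):
    print(opt, "keeps marker:", marker_survives("cc", opt))
```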
Replication Package for Article: A Tree Clock Data Structure for Causal Orderings in Concurrent Executions
作者: Mathur, Umang and Pavlogiannis, Andreas and Tunç
关键词: concurrency, dynamic analyses, happens-before, vector clocks
Abstract
This artifact contains all the source code and experimental data for replicating our evaluation in the paper. We implemented the analysis programs as part of the tool Rapid. The provided experimental data contains all 153 trace logs used in our evaluation. The artifact also provides Python scripts that fully automate the process of replicating our evaluation.
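As background for what the artifact measures, the snippet below shows the flat vector-clock join that happens-before analyses perform at every synchronization event; tree clocks, the paper's data structure, aim to beat this O(k)-per-join baseline by skipping entries that cannot have advanced. The example is ours:

```python
# The O(k) vector-clock join baseline that tree clocks improve on.

def vc_join(a: list[int], b: list[int]) -> list[int]:
    """Pointwise maximum of two vector clocks (one entry per thread)."""
    return [max(x, y) for x, y in zip(a, b)]

acquire_t1 = [3, 0, 1]   # thread 1's clock at a lock acquire
release_t2 = [1, 4, 1]   # clock stored at the lock's last release
print(vc_join(acquire_t1, release_t2))  # [3, 4, 1]
```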
RSSD: defend against ransomware with hardware-isolated network-storage codesign and post-attack analysis
作者: Reidys, Benjamin and Liu, Peng and Huang, Jian
关键词: NVMe over Fabrics, Ransomware Attacks and Defenses, Solid-State Drive, Storage Forensics
Abstract
Encryption ransomware has become a notorious malware. It encrypts user data on storage devices like solid-state drives (SSDs) and demands a ransom to restore the data. To bypass existing defenses, ransomware keeps evolving and mounting new attack models. For instance, we identify and validate three new attacks: (1) a garbage-collection (GC) attack that exploits storage capacity and keeps writing data to trigger GC and force SSDs to release retained data; (2) a timing attack that intentionally slows down the pace of encrypting data and hides its I/O patterns to escape existing defenses; and (3) a trimming attack that utilizes the trim command available in SSDs to physically erase data. To enhance the robustness of SSDs against these attacks, we propose RSSD, a ransomware-aware SSD. It redesigns the flash management of SSDs to enable hardware-assisted logging, which can conservatively retain older versions of user data and received storage operations in time order with low overhead. It also employs hardware-isolated NVMe over Ethernet to expand local storage capacity by transparently offloading the logs to remote cloud servers in a secure manner. RSSD enables post-attack analysis by building a trusted evidence chain of storage operations to assist the investigation of ransomware attacks. We develop RSSD with a real-world SSD FPGA board. Our evaluation shows that RSSD can defend against new and future ransomware attacks, while introducing negligible performance overhead.
Artifact for: Creating Concise and Efficient Dynamic Analyses with ALDA
作者: Cheng, Xiang and Devecsery, David
关键词: compiler optimization, domain specific language, dynamic analysis
Abstract
This artifact description contains information about the complete workflow required to set up and reproduce the experiments in ALDA. We describe how the software can be obtained, the build process, and the necessary preprocessing steps to generate the test programs and baseline. All the programs and benchmarks are publicly available except for the SPEC 2006 benchmark. In addition, we provide a VM with all the programs and input data prepared, as well as instructions on how to build such a VM.
IceBreaker: Warming Serverless Functions Better with Heterogeneity
作者: Roy, Rohan Basu and Patel, Tirthak and Tiwari, Devesh
关键词: Cloud Computing, Cold Start, Heterogeneous Hardware, Keep-alive Cost, Serverless Computing
Abstract
IceBreaker is a technique that reduces the service time and keep-alive cost of serverless functions executed on a heterogeneous system consisting of costly and cheaper nodes. IceBreaker’s design consists of two major components: (1) a Function Invocation Prediction scheme (FIP) and (2) a Placement Decision Maker (PDM). The FIP uses a Fourier-transform-based approach to determine the invocation concurrency of a function. The PDM decides where to warm up a serverless function: on a high-end server, on a low-end server, or not at all. This decision is based on a utility score that considers several factors, such as the probability of function invocation and the speedup of the function on a high-end server. Our artifact packages the scripts for setting up and invoking IceBreaker. It also contains the data obtained in our experimentation.
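The placement logic can be pictured with a small utility calculation; the score formula and the numbers below are our own stand-ins, not IceBreaker's actual model:

```python
# Hypothetical PDM-style placement rule (the score is our stand-in): weigh
# the expected benefit of warming on each node class against its keep-alive
# cost, and skip warming when no option has positive utility.

def utility(p_invoke: float, speedup: float, keep_alive_cost: float) -> float:
    return p_invoke * speedup - keep_alive_cost   # toy expected net benefit

def place(p_invoke: float) -> str:
    options = {"high-end": (3.0, 0.5), "low-end": (1.5, 0.1), "none": (0.0, 0.0)}
    return max(options, key=lambda n: utility(p_invoke, *options[n]))

for p in (0.05, 0.2, 0.9):
    print(p, "->", place(p))   # none, low-end, high-end as p grows
```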
Replication package for INFless
作者: Yang, Yanan and Zhao, Laiping and Li, Yiming and Zhang, Huanyu and Li, Jie and Zhao, Mingyang and Chen, Xingzhen and Li, Keqiu
关键词: DNN inference, serverless platform
Abstract
The artifact includes the source code and scripts of the INFless system, a native serverless platform for high-throughput DNN inference.
Artifact for Article: FaaSFlow: Enable Efficient Workflow Execution for Function-as-a-Service
作者: Li, Zijun and Liu, Yushi and Guo, Linsong and Chen, Quan and Cheng, Jiagan and Zheng, Wenli and Guo, Minyi
关键词: FaaS, graph partition, master-worker, serverless workflows
Abstract
FaaSFlow is a serverless workflow engine that enables efficient workflow execution in two ways: a worker-side workflow schedule pattern that reduces scheduling overhead, and an adaptive storage library that uses local memory to transfer data between functions on the same node.
Artifact for Paper: Serverless Computing on Heterogeneous Computers
作者: Du, Dong and Liu, Qingyuan and Jiang, Xueqiang and Xia, Yubin and Zang, Binyu and Chen, Haibo
关键词: heterogeneous architecture, operating system, Serverless computing, smart computers
Abstract
The artifact is the main repository for the paper Serverless Computing on Heterogeneous Computers. It contains the instructions to build and run the experiments, and the top directory of the project.
CoolEdge: hotspot-relievable warm water cooling for energy-efficient edge datacenters
作者: Pei, Qiangyu and Chen, Shutong and Zhang, Qixia and Zhu, Xinhui and Liu, Fangming and Jia, Ziyang and Wang, Yishuo and Yuan, Yongjie
关键词: edge datacenter energy, hotspot relieving, vapor chamber, warm water cooling
Abstract
As the computing frontier drifts to the edge, edge datacenters play a crucial role in supporting various real-time applications. Different from cloud datacenters, the requirements of proximity to end-users, high density, and heterogeneity present new challenges to cooling edge datacenters efficiently. Although warm water cooling has become a promising cooling technique for this infrastructure, one-size-fits-all cooling control lowers the cooling efficiency considerably because of the severe thermal imbalance across servers, hardware, and even inside a single hardware component in an edge datacenter. In this work, we propose CoolEdge, a hotspot-relievable warm water cooling system for improving the cooling efficiency and saving costs of edge datacenters. Specifically, through an elaborate design of water circulations, CoolEdge can dynamically adjust the water temperature and flow rate for each heterogeneous hardware component to eliminate hardware-level hotspots. By redesigning cold plates, CoolEdge can quickly disperse chip-level hotspots without manual intervention. We further quantify the power saving achieved by warm water cooling theoretically, and propose a custom-designed cooling solution that periodically decides an appropriate water temperature and flow rate. Based on a hardware prototype and real-world traces from SURFsara, the evaluation results show that CoolEdge reduces cooling energy by 81.81% and 71.92%, compared with conventional and state-of-the-art water cooling systems, respectively.
Replication Package for Article: Yashme: Detecting Persistency Races
作者: Gorjiara, Hamed and Xu, Guoqing Harry and Demsky, Brian
关键词: CCEH, Compiler, FastFair, Memcached, Persistency Race, PMDK, RECIPE, Redis, Software Verification, Yashme
Abstract
This artifact contains a Vagrant repository that downloads and compiles the source code for Yashme, its companion compiler pass, and benchmarks. The artifact enables users to reproduce the bugs found by Yashme in PMDK, Memcached, Redis, and RECIPE, as well as the performance results comparing Yashme with Jaaru, the underlying model checker.
EXAMINER: automatically locating inconsistent instructions between real devices and CPU emulators for ARM
作者: Jiang, Muhui and Xu, Tianyi and Zhou, Yajin and Hu, Yufeng and Zhong, Ming and Wu, Lei and Luo, Xiapu and Ren, Kui
关键词: Differential Testing, Emulator, Inconsistent Instructions
Abstract
Emulators are widely used to build dynamic analysis frameworks due to their fine-grained tracing capability, full-system monitoring functionality, and scalability of running on different operating systems and architectures. However, whether emulators are consistent with real devices is unknown. To understand this problem, we aim to automatically locate inconsistent instructions, which behave differently between emulators and real devices. We target the ARM architecture, which provides machine-readable specifications. Based on the specification, we propose a sufficient test case generator by designing and implementing the first symbolic execution engine for the ARM architecture specification language (ASL). We generate 2,774,649 representative instruction streams and conduct differential testing between four ARM real devices in different architecture versions (i.e., ARMv5, ARMv6, ARMv7, and ARMv8) and three state-of-the-art emulators (i.e., QEMU, Unicorn, and Angr). We locate a large number of inconsistent instruction streams (171,858 for QEMU, 223,264 for Unicorn, and 120,169 for Angr). We find that implementation-defined behavior in the ARM manual and emulator bugs are the major causes of inconsistencies. Furthermore, we discover 12 bugs, which influence commonly used instructions (e.g., BLX). With the inconsistent instructions, we build three security applications and demonstrate their capability for emulator detection, anti-emulation, and anti-fuzzing.
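The core differential-testing loop can be summarized as follows; the runner functions here are hypothetical stand-ins for the paper's real ARM boards and emulators:

```python
# Sketch of differential testing between a device and emulators (ours):
# run the same instruction stream everywhere and flag state divergence.

def diff_test(streams, runners):
    """runners: name -> function(instr_stream) -> dict of register values."""
    inconsistent = []
    for s in streams:
        results = {name: run(s) for name, run in runners.items()}
        baseline = results.pop("device")        # real hardware as ground truth
        for name, regs in results.items():
            if regs != baseline:
                inconsistent.append((s, name))
    return inconsistent

# Toy stand-ins: a "device" and an "emulator" that mishandles one opcode.
device   = lambda s: {"r0": sum(s) & 0xFF}
emulator = lambda s: {"r0": (sum(s) & 0xFF) ^ (1 if 0x42 in s else 0)}
print(diff_test([[1, 2], [0x42]], {"device": device, "qemu-like": emulator}))
```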
Path-sensitive and alias-aware typestate analysis for detecting OS bugs
作者: Li, Tuo and Bai, Jia-Ju and Sui, Yulei and Hu, Shi-Min
关键词: bug detection, operating system, static analysis
Abstract
The operating system (OS) is the cornerstone of modern computer systems. It manages devices and provides fundamental services for user-level applications. Thus, detecting bugs in OSes is important for improving the reliability and security of computer systems. Static typestate analysis is a common technique for detecting different types of bugs, but it is often inaccurate or unscalable for large OS code, due to the imprecision of identifying alias relationships as well as the high costs of typestate tracking and path-feasibility validation. In this paper, we present PATA, a novel path-sensitive and alias-aware typestate analysis framework to detect OS bugs. To improve the precision of identifying alias relationships in OS code, PATA performs a path-based alias analysis based on control-flow paths and access paths. With these alias relationships, PATA reduces the costs of typestate tracking and path-feasibility validation, boosting the efficiency of path-sensitive typestate analysis for bug detection. We have evaluated PATA on the Linux kernel and three popular IoT OSes (Zephyr, RIOT, and TencentOS-tiny) to detect three common types of bugs (null-pointer dereferences, uninitialized variable accesses, and memory leaks). PATA finds 574 real bugs with a false positive rate of 28%; 206 of these bugs have been confirmed by the developers of the four OSes. We also compare PATA to seven state-of-the-art static approaches (Cppcheck, Coccinelle, Smatch, CSA, Infer, Saber, and SVF). PATA finds many real bugs missed by them, with a lower false positive rate.
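To illustrate what a typestate analysis tracks along a single control-flow path, here is a deliberately tiny model (ours, far simpler than PATA's path-sensitive, alias-aware machinery) for a malloc/free protocol:

```python
# Toy typestate automaton: states follow an allocation protocol; reaching
# an "error" state flags a bug such as double-free or use-after-free.

TRANSITIONS = {
    ("unalloc", "malloc"): "alloc",
    ("alloc",   "use"):    "alloc",
    ("alloc",   "free"):   "freed",
    ("freed",   "free"):   "error:double-free",
    ("freed",   "use"):    "error:use-after-free",
}

def check_path(events):
    """Walk one control-flow path's events; report the final state."""
    state = "unalloc"
    for ev in events:
        state = TRANSITIONS.get((state, ev), state)
        if state.startswith("error"):
            return state
    return state

print(check_path(["malloc", "use", "free"]))   # freed (ok)
print(check_path(["malloc", "free", "use"]))   # error:use-after-free
```

Path sensitivity means running such a check per feasible path rather than merging states at join points; alias awareness decides which pointer events belong to the same object.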
Replication Package for Article: Efficiently Detecting Concurrency Bugs in Persistent Memory Programs
作者: Chen, Zhangyu and Hua, Yu and Zhang, Yongle and Ding, Luochangqi
关键词: Concurrency, Crash Consistency, Debugging, Persistent Memory, Testing
Abstract
This is the finalized artifact of PMRace, a debugging tool for PM concurrency bugs. The artifact is maintained on GitHub and developed by Zhangyu and Luochangqi.
Replication Package for Article: Who Goes First? Detecting Go Concurrency Bugs via Message Reordering
作者: Liu, Ziheng and Xia, Shihao and Liang, Yu and Song, Linhai and Hu, Hong
关键词: bug, concurrent, fuzzing, golang
Abstract
The source code of GFuzz, an effective bug detector for Golang.
CryoWire: wire-driven microarchitecture designs for cryogenic computing
作者: Min, Dongmoon and Chung, Yujin and Byun, Ilkwon and Kim, Junpyo and Kim, Jangwoo
关键词: Chip Multi Processor, Cryogenic Computing, Multicore Architectures, Network on Chip, Pipelining, Superscalar Architectures
Abstract
Cryogenic computing, which runs a computer device at an extremely low temperature, is promising thanks to its significant reduction of wire resistance as well as leakage current. Recent studies on cryogenic computing have focused on various architectural units, including the main memory, cache, and CPU core, running at 77K. However, little research has been conducted to fully exploit the fast cryogenic wires, even though slow wires are becoming an increasingly serious performance bottleneck in modern processors. In this paper, we propose a CPU microarchitecture that extensively exploits the fast wires at 77K. For this goal, we first introduce our validated cryogenic-performance models for the CPU pipeline and network-on-chip (NoC), whose performance can be significantly limited by slow wires. Next, based on the analysis with the models, we architect CryoSP and CryoBus as our pipeline and NoC designs to fully exploit the fast wires. Our evaluation shows that our cryogenic computer equipped with both microarchitectures achieves 3.82 times higher system-level performance than a conventional computer system, thanks to the 96% higher clock frequency of CryoSP and the five times lower NoC latency of CryoBus.
REVAMP: A Systematic Framework for Heterogeneous CGRA Realization
作者: Bandara, Thilini Kaushalya and Wijerathne, Dhananjaya and Mitra, Tulika and Peh, Li-Shiuan
关键词: CGRA design space exploration, Coarse Grained Reconfigurable Arrays (CGRAs), Heterogeneous CGRAs
Abstract
The REVAMP artifact includes the complete framework, comprising the heterogeneous architecture generator, the heterogeneous CGRA mapper, parameterized RTL, and scripts for power and area calculation. We elaborate on the REVAMP tool flow with an example of generating a Pareto-optimal heterogeneous CGRA from a 4x4 homogeneous CGRA targeting five application kernels.
PLD: fast FPGA compilation to make reconfigurable acceleration compatible with modern incremental refinement software development
作者: Xiao, Yuanlong and Micallef, Eric and Butt, Andrew and Hofmann, Matthew and Alston, Marc and Goldsmith, Matthew and Merczynski-Hait, Andrew and DeHon, André
关键词: Compilation, DFX, Data Center, FPGA, Partial Reconfiguration
Abstract
FPGA-based accelerators demonstrate significant absolute performance and energy efficiency compared with general-purpose CPUs. While FPGA computations can now be described in standard programming languages like C, development for FPGA accelerators remains tedious and inaccessible to modern software engineers. Slow compiles (potentially taking tens of hours) inhibit the rapid, incremental refinement of designs that is the hallmark of modern software engineering. To address this issue, we introduce separate compilation and linkage into the FPGA design flow, providing faster design turns more familiar to software development. To realize this flow, we provide abstractions, compiler options, and a compiler flow that allow the same C source code to be compiled to processor cores in seconds and to FPGA regions in minutes, providing the missing -O0 and -O1 options familiar in software development. This raises the FPGA programming level and standardizes the programming experience, bringing FPGA-based accelerators into a more familiar software platform ecosystem for software engineers.
Replication Package for Paper: Debugging in the Brave New World of Reconfigurable Hardware
作者: Ma, Jiacheng and Zuo, Gefei and Loughlin, Kevin and Zhang, Haoyang and Quinn, Andrew and Kasikci, Baris
关键词: Bug Study, Debugging, FPGA, Reconfigurable Hardware
Abstract
20 hardware bugs and the debugging tools mentioned in the paper “Debugging in the Brave New World of Reconfigurable Hardware”.
Artifacts for article: Temporal and SFQ Pulse-Streams Encoding for Area-Efficient Superconducting Accelerators
作者: Gonzalez-Guerrero, Patricia and Bautista, Meriam Gay and Lyles, Darren and Michelogiannakis, George
关键词: FIR, Josephson junctions, netlist, processing elements, superconducting digital, WRSPICE
Abstract
This artifact contains WRSPICE circuit netlists and Octave scripts that implement key circuits and analyses in our article. In the “library” folder, you can find netlists for cells and building blocks that we use in our designs. In the “spice_netlist” directory, you can find WRSPICE netlists for some of the key circuits we propose in the article. The “octave” directory contains scripts for error analysis and other design-space exploration that we perform in the article. Finally, the “perl” directory contains auxiliary scripts. You can find more information in the README.md file.
ASPLOS 2022 Artifact for “Understanding and Exploiting Optimal Function Inlining”
作者: Theodoridis, Theodoros and Grosser, Tobias and Su, Zhendong
关键词: autotuning, compiler optimization, optimal inlining, program size
Abstract
The artifact contains the code and dataset we used for our experiments, as well as scripts to generate the numbers, figures, and tables of our evaluation. Specifically, it includes (a) the LLVM-IR files we used both for exhaustive search and autotuning; (b) a modified LLVM that we use for exhaustive search and autotuning; (c) scripts to run exhaustive search and autotuning; (d) the expected outputs; (e) scripts to generate the tables and figures of our paper; (f) scripts to perform exhaustive search and autotuning only on smaller callgraphs and to validate the results against the provided ones. Everything is packaged and pre-built as a docker image. A standard X86 Linux machine running docker is necessary to evaluate this artifact.
CirFix: Automatically Repairing Defects in Hardware Design Code (Artifact)
作者: Ahmad, Hammad and Huang, Yu and Weimer, Westley
关键词: automated program repair, hardware bugs, hardware designs, HDL benchmark
Abstract
We provide the public repository for CirFix, both on Zenodo and GitHub. The artifact includes instructions for installing and running CirFix, as well as scripts and instructions used to reproduce core results from our paper.
Please contact Hammad Ahmad (hammada@umich.edu) if you have any questions.
Vector instruction selection for digital signal processors using program synthesis
作者: Ahmad, Maaz Bin Safeer and Root, Alexander J. and Adams, Andrew and Kamil, Shoaib and Cheung, Alvin
关键词: Instruction selection, compiler optimizations, program synthesis
Abstract
Instruction selection, whereby input code represented in an intermediate representation is translated into executable instructions for the target platform, is often the most target-dependent component in optimizing compilers. Current approaches include pattern matching, which is brittle and tedious to design, and search-based methods, which are limited by the scalability of the search algorithm. In this paper, we propose a new algorithm that first abstracts the target platform instructions into high-level uber-instructions, with each uber-instruction unifying multiple concrete instructions from the target platform. Program synthesis is used to lift input code sequences into semantically equivalent sequences of uber-instructions and then to lower from uber-instructions to machine code. Using 21 real-world benchmarks, we show that our synthesis-based instruction selection algorithm can generate instruction sequences for a hardware target, with the synthesized code performing up to 2.1x faster than code generated by a professionally developed optimizing compiler for the same platform.
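The lifting step can be pictured as searching for an instruction sequence that agrees with the input code on test vectors; the two-instruction "ISA" below is hypothetical, and the enumeration stands in for the paper's program synthesis over uber-instruction semantics:

```python
# Sketch of synthesis-style instruction selection (ours): model each target
# instruction as a function and find one whose behavior matches the spec.

ISA = {
    "vadd": lambda a, b, c: [x + y for x, y in zip(a, b)],
    "vmac": lambda a, b, c: [x * y + z for x, y, z in zip(a, b, c)],
}

def spec(a, b, c):   # the input code being lifted: elementwise a*b + c
    return [x * y + z for x, y, z in zip(a, b, c)]

TESTS = [([1, 2], [3, 4], [5, 6]), ([0, 1], [7, 7], [1, 1])]

def synthesize():
    """Return the name of a single instruction matching spec on all tests."""
    for name, ins in ISA.items():
        if all(ins(*t) == spec(*t) for t in TESTS):
            return name
    return None   # real synthesis would then try longer sequences

print(synthesize())   # 'vmac': one fused instruction covers the pattern
```

A real selector must also verify equivalence beyond the test vectors, which is where the synthesis machinery (rather than testing) comes in.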
Artifact for Article: HeteroGen: Transpiling C to Heterogeneous HLS Code with Automated Test Generation and Program Repair
作者: Zhang, Qian and Wang, Jiyuan and Xu, Guoqing Harry and Kim, Miryung
关键词: heterogeneous applications, program repair, test generation
Abstract
This artifact includes an error study, a fuzzing-based test generation tool, and an automated code edit tool for error removal.
Tree Traversal Synthesis Using Domain-Specific Symbolic Compilation
作者: Chen, Yanju and Liu, Junrui and Feng, Yu and Bodik, Rastislav
关键词: program synthesis, symbolic compilation, tree traversal
Abstract
Tree Traversal Synthesis Using Domain-Specific Symbolic Compilation - Artifact for ASPLOS 2022 Submission
SRAM has no chill: exploiting power domain separation to steal on-chip secrets
作者: Mahmod, Jubayer and Hicks, Matthew
关键词: SRAM attack, cold boot, power domain separation
Abstract
The abundance of embedded systems and smart devices increases the risk of physical memory disclosure attacks. One such classic non-invasive attack exploits dynamic RAM’s temperature-dependent ability to retain information across power cycles—known as a cold boot attack. When exposed to low temperatures, DRAM cells preserve their state for a short time without power, mimicking non-volatile memories in that time frame. Attackers exploit this physical phenomenon to gain access to a system’s secrets, leading to data theft from encrypted storage. To prevent cold boot attacks, programmers hide secrets on-chip in Static Random-Access Memory (SRAM); by construction, on-chip SRAM is isolated from external probing and has little intrinsic capacitance, making it robust against cold boot attacks. While it is the case that SRAM protects against traditional cold boot attacks, we show that there is another way to retain information in on-chip SRAM across power cycles and software changes. This paper presents Volt Boot, an attack that demonstrates a vulnerability of on-chip volatile memories due to the physical separation common to modern system-on-chip power distribution networks. Volt Boot leverages asymmetrical power states (e.g., on vs. off) to force SRAM state retention across power cycles, eliminating the need for traditional cold boot attack enablers, such as low-temperature or intrinsic data retention time. Using several modern ARM Cortex-A devices, we demonstrate the effectiveness of the attack in caches, registers, and iRAMs. Unlike other forms of SRAM data retention attacks, Volt Boot retrieves data with 100% accuracy—without any complex post-processing.
Randomized Row-Swap: Mitigating Row Hammer by Breaking Spatial Correlation between Aggressor and Victim Rows
作者: Saileshwar, Gururaj and Wang, Bolin and Qureshi, Moinuddin and Nair, Prashant J.
关键词: DRAM, Fault-Injection Attacks, Memory System, Row Hammer
Abstract
This artifact presents the code and methodology to simulate Randomized Row-Swap (RRS), our defense against Rowhammer attacks. We provide the C code for the implementation of RRS, which is encapsulated within USIMM, a memory system simulator. The RRS structures and operations are implemented within the memory controller module in our artifact. We provide scripts to compile our simulator, and run the baseline and RRS. We also provide scripts to parse the results and collate the performance results.
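A simplified model of the defense is sketched below; the parameters and bookkeeping are illustrative, and the artifact's USIMM implementation differs. The idea: once a row's activation count crosses a threshold, remap it to a randomly chosen partner row, breaking the attacker's knowledge of which physical rows neighbor the victim.

```python
# Toy model of Randomized Row-Swap (our sketch, not the artifact's C code).
import random

class RRS:
    """Logical-to-physical row remapper with a swap threshold."""
    def __init__(self, num_rows: int, threshold: int):
        self.map = list(range(num_rows))   # logical row -> physical row
        self.acts = [0] * num_rows         # activations since last swap
        self.threshold = threshold

    def activate(self, row: int) -> int:
        self.acts[row] += 1
        if self.acts[row] >= self.threshold:
            partner = random.randrange(len(self.map))
            self.map[row], self.map[partner] = self.map[partner], self.map[row]
            self.acts[row] = self.acts[partner] = 0
        return self.map[row]

rrs = RRS(num_rows=8, threshold=3)
# Physical location of logical row 0; remapped after every 3rd activation.
print([rrs.activate(0) for _ in range(6)])
```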
Artifact for Article: ShEF: Shielded Enclaves for Cloud FPGAs
作者: Zhao, Mark and Gao, Mingyu and Kozyrakis, Christos
关键词: cloud computing, enclaves, FPGAs, reconfigurable computing, trusted execution
Abstract
In our artifact, we provide the entirety of the ShEF source code, including the Shield and implementations of the Secure Boot and Remote Attestation protocols. Our artifacts also include a number of reference benchmarks that we use to evaluate ShEF. We provide instructions on how to build, run, and evaluate Shield benchmarks on AWS F1 instances. Our archival and GitHub repository also provides a README containing more details on using ShEF.
Invisible bits: hiding secret messages in SRAM’s analog domain
作者: Mahmod, Jubayer and Hicks, Matthew
关键词: SRAM aging, Steganography, covert channel
Abstract
Electronic devices are increasingly the subject of inspection by authorities. While encryption hides secret messages, it does not hide the transmission of those secret messages—in fact, it calls attention to them. Thus, an adversary, seeing encrypted data, turns to coercion to extract the credentials required to reveal the secret message. Steganographic techniques hide secret messages in plain sight, providing the user with plausible deniability, removing the threat of coercion. This paper unveils Invisible Bits, a new steganographic technique that hides secret messages in the analog domain of Static Random Access Memory (SRAM) embedded within a computing device. Unlike other memory technologies, the power-on state of SRAM reveals the analog-domain properties of its individual cells. We show how to quickly and systematically change the analog-domain properties of SRAM cells to encode data in the analog domain and how to reveal those changes by capturing SRAM’s power-on state. Experiments with commercial devices show that Invisible Bits provides over 90% capacity—two orders-of-magnitude more than previous on-chip steganographic approaches, while retaining device functionality—even when the device undergoes subsequent normal operation or is shelved for months. Experiments also show that adversaries cannot differentiate between devices with encoded messages and those without. Lastly, we show how to layer encryption and error correction on top of our message encoding scheme in an end-to-end demonstration.
Reproduction Package for Article: ‘Taurus: A Data Plane Architecture for Per-Packet ML’
作者: Swamy, Tushar and Rucker, Alexander and Shahbaz, Muhammad and Gaur, Ishan and Olukotun, Kunle
关键词: Anomaly Detection, FPGA, MapReduce, P4, Per-packet ML, Self-driving Networks, Spatial
Abstract
Taurus MapReduce Block in FPGA: This repository contains the source code and instructions for building an FPGA-based implementation of Taurus’s MapReduce block. For more details, please read our Taurus: A Data Plane Architecture for Per-Packet ML paper (appearing in ASPLOS ’22).
Taurus Anomaly-Detection Application: In this repository, we share the source code for the anomaly-detection application (AD) presented in our Taurus: A Data Plane Architecture for Per-Packet ML paper (to appear in ASPLOS ’22). We also provide details on what is needed to replicate the end-to-end testbed used for evaluating the AD application.
FlexDriver: a network driver for your accelerator
作者: Eran, Haggai and Fudim, Maxim and Malka, Gabi and Shalom, Gal and Cohen, Noam and Hermony, Amit and Levi, Dotan and Liss, Liran and Silberstein, Mark
关键词: accelerator disaggregation, accelerator networking, network function acceleration
Abstract
We propose a new system design for connecting hardware and FPGA accelerators to the network, allowing the accelerator to directly control commodity Network Interface Cards (NICs) without using the CPU. This enables us to solve the key challenge of leveraging existing NIC hardware offloads such as virtualization, tunneling, and RDMA for accelerator networking. Our approach supports a diverse set of use cases, from direct network access for disaggregated accelerators to inline-acceleration of the network stack, all without the complex networking logic in the accelerator. To demonstrate the feasibility of this approach, we build FlexDriver (FLD), an on-accelerator hardware module that implements a NIC data-plane driver. Our main technical contribution is a mechanism that compresses the NIC control structures by two orders of magnitude, allowing FLD to achieve high networking scalability with low die area cost and no bandwidth interference with the accelerator logic. The prototype for NVIDIA Innova-2 FPGA SmartNICs showcases our design’s utility for three different accelerators: a disaggregated LTE cipher, an IP-defragmentation inline accelerator, and an IoT cryptographic-token authentication offload. These accelerators reach 25 Gbps line rate and leverage the NIC for RDMA processing, VXLAN tunneling, and traffic shaping without CPU involvement.
Artifact for ‘The Benefits of General-Purpose On-NIC Memory’
作者: Pismenny, Boris and Liss, Liran and Morrison, Adam and Tsafrir, Dan
关键词: NFV acceleration, NIC memory, nicmem
Abstract
This repository contains scripts for the ASPLOS’22 artifact evaluation of the paper The Benefits of General-Purpose On-NIC Memory by Boris Pismenny, Liran Liss, Adam Morrison, and Dan Tsafrir.
Morpheus: Domain Specific Run Time Optimization for Software Data Planes - Artifact for ASPLOS’22
作者: Miano, Sebastiano and Sanaee, Alireza and Risso, Fulvio and Rétvári
关键词: Data Plane Compilation, DPDK, eBPF, LLVM, XDP
Abstract
This is the artifact for the “Morpheus: Domain Specific Run Time Optimization for Software Data Planes” paper published at ASPLOS’22. This artifact contains the source code, the experimental workflow, and additional information to 1) compile and build Morpheus and 2) install the software dependencies and set up the testbed to run all the experiments, as well as 3) scripts that can be used to perform some of the experiments presented in the paper and 4) scripts to generate the plots based on the obtained results.
AQUATOPE: QoS-and-Uncertainty-Aware Resource Management for Multi-stage Serverless Workflows
作者: Zhou, Zhuangzhuang and Zhang, Yanqi and Delimitrou, Christina
关键词: serverless computing, resource management, resource efficiency, resource allocation, quality of service, machine learning for systems, function-as-a-service, datacenter, Cloud computing
Abstract
Multi-stage serverless applications, i.e., workflows with many computation and I/O stages, are becoming increasingly representative of FaaS platforms. Despite their advantages in terms of fine-grained scalability and modular development, these applications are subject to suboptimal performance, resource inefficiency, and high costs to a larger degree than previous simple serverless functions.
We present Aquatope, a QoS-and-uncertainty-aware resource scheduler for end-to-end serverless workflows that takes into account the inherent uncertainty present in FaaS platforms, and improves performance predictability and resource efficiency. Aquatope uses a set of scalable and validated Bayesian models to create pre-warmed containers ahead of function invocations, and to allocate appropriate resources at function granularity to meet a complex workflow’s end-to-end QoS, while minimizing resource cost. Across a diverse set of analytics and interactive multi-stage serverless workloads, Aquatope significantly outperforms prior systems, reducing QoS violations by 5X, and cost by 34% on average and up to 52% compared to other QoS-meeting methods.
CAFQA: A Classical Simulation Bootstrap for Variational Quantum Algorithms
作者: Ravi, Gokul Subramanian and Gokhale, Pranav and Ding, Yi and Kirby, William and Smith, Kaitlin and Baker, Jonathan M. and Love, Peter J. and Hoffmann, Henry and Brown, Kenneth R. and Chong, Frederic T.
关键词: variational quantum eigensolver, variational quantum algorithms, quantum computing, noisy intermediate-scale quantum, clifford, chemistry, bayesian optimization
Abstract
Classical computing plays a critical role in the advancement of quantum frontiers in the NISQ era. In this spirit, this work uses classical simulation to bootstrap Variational Quantum Algorithms (VQAs). VQAs rely upon the iterative optimization of a parameterized unitary circuit (ansatz) with respect to an objective function. Since quantum machines are noisy and expensive resources, it is imperative to classically choose the VQA ansatz initial parameters to be as close to optimal as possible to improve VQA accuracy and accelerate their convergence on today’s devices.
This work tackles the problem of finding a good ansatz initialization, by proposing CAFQA, a Clifford Ansatz For Quantum Accuracy. The CAFQA ansatz is a hardware-efficient circuit built with only Clifford gates. In this ansatz, the parameters for the tunable gates are chosen by searching efficiently through the Clifford parameter space via classical simulation. The resulting initial states always equal or outperform traditional classical initialization (e.g., Hartree-Fock), and enable high-accuracy VQA estimations. CAFQA is well-suited to classical computation because: a) Clifford-only quantum circuits can be exactly simulated classically in polynomial time, and b) the discrete Clifford space is searched efficiently via Bayesian Optimization.
For the Variational Quantum Eigensolver (VQE) task of molecular ground state energy estimation (up to 18 qubits), CAFQA’s Clifford Ansatz achieves a mean accuracy of nearly 99% and recovers as much as 99.99% of the molecular correlation energy that is lost in Hartree-Fock initialization. CAFQA achieves mean accuracy improvements of 6.4x and 56.8x, over the state-of-the-art, on different metrics. The scalability of the approach allows for preliminary ground state energy estimation of the challenging chromium dimer (Cr2) molecule. With CAFQA’s high-accuracy initialization, the convergence of VQAs is shown to accelerate by 2.5x, even for small molecules.
Furthermore, preliminary exploration of allowing a limited number of non-Clifford (T) gates in the CAFQA framework, shows that as much as 99.9% of the correlation energy can be recovered at bond lengths for which Clifford-only CAFQA accuracy is relatively limited, while remaining classically simulable.
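The shape of the initialization search can be sketched in a few lines; the energy function below is a toy stand-in for a polynomial-time stabilizer simulation, and the exhaustive scan stands in for the paper's Bayesian optimization:

```python
# Sketch of a CAFQA-style Clifford-space search (ours, heavily simplified).
import itertools, math

ANGLES = [0, math.pi / 2, math.pi, 3 * math.pi / 2]   # the Clifford points

def clifford_energy(params):
    """Stand-in objective; really: stabilizer-simulate the ansatz, take <H>."""
    return sum(math.cos(p) for p in params)            # toy energy surface

def cafqa_init(num_params):
    # Exhaustive scan of the discrete space for illustration; the paper
    # uses Bayesian optimization because 4**num_params grows too fast.
    return min(itertools.product(ANGLES, repeat=num_params),
               key=clifford_energy)

best = cafqa_init(3)
print(best, clifford_energy(best))   # (pi, pi, pi) -> -3.0
```

The winning Clifford assignment then seeds the real (non-Clifford) VQA run on hardware.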
Cooperative Concurrency Control for Write-Intensive Key-Value Workloads
作者: Sutherland, Mark and Falsafi, Babak and Daglis, Alexandros
关键词: tail latency, synchronization, load balancing, linearizability, key-value stores, concurrency, NIC architecture
Abstract
Key-Value Stores (KVS) are foundational infrastructure components for online services. Due to their latency-critical nature, today’s best-performing KVS contain a plethora of full-stack optimizations commonly targeting read-mostly, popularity-skewed workloads. Motivated by production studies showing the increased prevalence of write-intensive workloads, we break down the KVS workload space into four distinct classes, and argue that current designs are only sufficient for two of them. The reason is that KVS concurrency control protocols expose a fundamental tradeoff: avoiding synchronization by partitioning writes across threads is mandatory for high throughput, but necessarily creates load imbalance that grows with core count and write fraction. We break this tradeoff with C-4, a co-design between NIC hardware and KVS software that judiciously separates write requests into two classes: independent ones that can be balanced across threads, and dependent ones which must be queued. C-4 dynamically partitions independent writes with the NIC to increase the load balancing flexibility of current KVS designs, and adds a software layer to the KVS to compact dependent writes into batches. Our evaluation shows that for write-intensive workloads, C-4 reduces 99th-percentile tail latency by 1.3-5×.
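A minimal sketch of the independent/dependent split, as we read it (C-4's actual partitioning happens in NIC hardware, and completion handling is omitted): a write is independent if no in-flight request touches its key and may be balanced freely; otherwise it must be queued behind the earlier request on that key to preserve ordering.

```python
# Toy dispatcher separating independent and dependent writes (ours).
from collections import defaultdict, deque

class Dispatcher:
    def __init__(self, num_workers):
        self.inflight = set()               # keys currently being written
        self.pending = defaultdict(deque)   # key -> queued dependent writes
        self.rr = 0
        self.num_workers = num_workers

    def dispatch(self, key, value):
        if key in self.inflight:            # dependent: queue behind earlier write
            self.pending[key].append(value)
            return None
        self.inflight.add(key)              # independent: balance freely
        self.rr = (self.rr + 1) % self.num_workers
        return self.rr                      # worker chosen round-robin

d = Dispatcher(num_workers=4)
print(d.dispatch("k1", 1), d.dispatch("k2", 2), d.dispatch("k1", 3))
# -> 1 2 None  (the second write to k1 is dependent and gets queued)
```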
DecoMine: A Compilation-Based Graph Pattern Mining System with Pattern Decomposition
作者: Chen, Jingji and Qian, Xuehai
关键词: pattern decomposition, graph pattern mining, graph analytics compiler
Abstract
Graph pattern mining (GPM) is an important application that identifies structures from graphs. Despite the recent progress, the performance gap between the state-of-the-art GPM systems and an efficient algorithm—pattern decomposition—is still at least an order of magnitude. This paper clears the fundamental obstacles to adopting pattern decomposition in a GPM system.
First, the performance of pattern decomposition algorithms depends on how the whole pattern is decomposed into subpatterns. The original method performs complexity analysis of the algorithms for different choices and selects the one with the lowest complexity upper bound. Clearly, this approach is not feasible for average or even expert users. To solve this problem, we develop a GPM compiler with conventional and GPM-specific optimizations to generate algorithms for different decomposition choices, which are evaluated based on an accurate cost model. The executable of the GPM task is obtained from the algorithm with the best performance. Second, we propose a novel partial-embedding API that is sufficient to construct advanced GPM applications while preserving the advantages of pattern decomposition algorithms. Compared to state-of-the-art systems, our new GPM system, DecoMine, built on these ideas, reduces the execution time of GPM on large graphs and patterns from days to a few hours, with low programming effort.
Erms: Efficient Resource Management for Shared Microservices with SLA Guarantees
作者: Luo, Shutian and Xu, Huanle and Ye, Kejiang and Xu, Guoyao and Zhang, Liping and He, Jian and Yang, Guodong and Xu, Chengzhong
关键词: Shared Microservices, SLA Guarantees, Resource Management
Abstract
A common approach to improving resource utilization in data centers is to adaptively provision resources based on the actual workload. One fundamental challenge of doing this in microservice management frameworks, however, is that different components of a service can exhibit significant differences in their impact on end-to-end performance. Making resource management even more challenging, a single microservice can be shared by multiple online services that have diverse workload patterns and SLA requirements.
We present an efficient resource management system, namely Erms, for guaranteeing SLAs in shared microservice environments. Erms profiles microservice latency as a piece-wise linear function of the workload, resource usage, and interference. Based on this profiling, Erms builds resource scaling models to optimally determine latency targets for microservices with complex dependencies. Erms also designs new scheduling policies at shared microservices to further enhance resource efficiency. Experiments across microservice benchmarks as well as trace-driven simulations demonstrate that Erms can reduce SLA violation probability by 5×.
Glign: Taming Misaligned Graph Traversals in Concurrent Graph Processing
作者: Yin, Xizhe and Zhao, Zhijia and Gupta, Rajiv
关键词: iterative graph algorithm, graph traversal, graph system, data locality, concurrent graph processing
Abstract
In concurrent graph processing, different queries are evaluated on the same graph simultaneously, sharing the graph accesses via the memory hierarchy. However, different queries may traverse the graph differently, especially for those starting from different source vertices. When these graph traversals are “misaligned”, the benefits of graph access sharing can be seriously compromised. As more concurrent queries are added to the evaluation batch, the issue tends to become even worse. To address the above issue, this work introduces Glign, a runtime system that automatically aligns the graph traversals for concurrent queries. Glign introduces three levels of graph traversal alignment for iterative evaluation of concurrent queries. First, it synchronizes the accesses of different queries to the active parts of the graph within each iteration of the evaluation—intra-iteration alignment. On top of that, Glign leverages a key insight regarding the “heavy iterations” in query evaluation to achieve inter-iteration alignment and alignment-aware batching. The former aligns the iterations of different queries to increase the graph access sharing, while the latter tries to group queries of better graph access sharing into the same evaluation batch. Together, these alignment techniques can substantially boost the data locality of concurrent query evaluation. Based on our experiments, Glign outperforms the state-of-the-art concurrent graph processing systems Krill and GraphM by 3.6×.
Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
作者: Wang, Shibo and Wei, Jinliang and Sabne, Amit and Davis, Andy and Ilbeyi, Berkin and Hechtman, Blake and Chen, Dehao and Murthy, Karthik Srinivasa and Maggioni, Marcello and Zhang, Qiao and Kumar, Sameer and Guo, Tongfei and Xu, Yuanzhong and Zhou, Zongwei
关键词: Large scale machine learning, Compiler optimization, Collective communication hiding
Abstract
Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for the size of these models. Intra-layer model parallelism is an approach to address the issues by partitioning individual layers or operators across multiple devices in a distributed accelerator cluster. But the data communications generated by intra-layer model parallelism can contribute to a significant proportion of the overall execution time and severely hurt the computational efficiency. As intra-layer model parallelism is critical to enable large deep learning models, this paper proposes a novel technique to effectively reduce its data communication overheads by overlapping communication with computation. With the proposed technique, an identified original communication collective is decomposed along with the dependent computation operation into a sequence of finer-grained operations. By creating more overlapping opportunities and executing the newly created, finer-grained communication and computation operations in parallel, it effectively hides the data transfer latency and achieves a better system utilization. Evaluated on TPU v4 Pods using different types of large models that have 10 billion to 1 trillion parameters, the proposed technique improves system throughput by 1.14-1.38x. The highest achieved peak FLOPS utilization is 72% on 1024 TPU chips with a large language model that has 500 billion parameters.
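The decomposition can be illustrated with a software analogue; the shapes, the fetch stub, and the two-stage pipeline below are ours, not the paper's TPU implementation. Instead of one big transfer followed by one big matmul, the loop fetches the next chunk while multiplying the current one, hiding transfer latency behind compute:

```python
# Schematic of overlapping decomposed communication with computation (ours).
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fetch(chunk):   # stand-in for a collective transferring one shard
    return chunk.copy()

def overlapped_matmul(chunks, w):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_fut = pool.submit(fetch, chunks[0])
        for i in range(len(chunks)):
            cur = next_fut.result()                 # wait for this chunk
            if i + 1 < len(chunks):                 # start the next transfer...
                next_fut = pool.submit(fetch, chunks[i + 1])
            results.append(cur @ w)                 # ...while computing
    return np.concatenate(results)

chunks = [np.ones((2, 4)) for _ in range(3)]
print(overlapped_matmul(chunks, np.eye(4)).shape)   # (6, 4)
```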
Risotto: A Dynamic Binary Translator for Weak Memory Model Architectures
作者: Gouicem, Redha and Sprokholt, Dennis and Ruehl, Jasper and Rocha, Rodrigo C. O. and Spink, Tom and Chakraborty, Soham and Bhatotia, Pramod
关键词: memory models, formal verification, Binary translation
Abstract
Dynamic Binary Translation (DBT) is a powerful approach to support cross-architecture emulation of unmodified binaries. However, DBT systems face correctness and performance challenges when emulating concurrent binaries from strong to weak memory consistency architectures. Indeed, we report several translation errors in QEMU when emulating x86 binaries on Arm hosts. To address these challenges, we propose an end-to-end approach that provides correct and efficient emulation for weak memory model architectures. Our contributions are twofold. First, we formalize the memory model of QEMU’s intermediate representation, and use it to propose formally verified mapping schemes that bridge the strong-on-weak memory consistency mismatch. Second, we implement these verified mappings in Risotto, a QEMU-based DBT system that optimizes memory fence placement while ensuring correctness. Risotto further enhances emulation performance via cross-architecture dynamic linking of native shared libraries, and fast and correct translation of compare-and-swap operations. We evaluate Risotto using multi-threaded benchmark suites and real-world applications, and show that Risotto improves emulation performance by 6.7% on average over “erroneous” QEMU, while ensuring correctness.
TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators
作者: Maas, Martin and Beaugnon, Ulysse and Chauhan, Arun and Ilbeyi, Berkin
关键词: Memory Allocation, Machine Learning, ML for Systems, ILP, CP
Abstract
Memory buffer allocation for on-chip memories is a major challenge in modern machine learning systems that target ML accelerators. In interactive systems such as mobile phones, it is on the critical path of launching ML-enabled applications. In data centers, it is part of complex optimization loops that run many times and are the limiting factor for the quality of compilation results.
In contrast to the traditional memory allocation problem in languages such as C++, where allocation requests dynamically arrive as the application is executing, ML systems typically execute a static control flow graph that is known in advance. The task of the memory allocator is to choose buffer locations in device memory such that the total amount of used memory never exceeds the total memory available on-device. This is a high dimensional, NP-hard optimization problem that is challenging to solve.
Today, ML frameworks approach this problem either using ad-hoc heuristics or solver-based methods. Heuristic solutions work for simple cases but fail for more complex instances of this problem. Solver-based solutions can handle these more complex instances, but are expensive and impractical in scenarios where memory allocation is on the critical path, such as on mobile devices that compile models on-the-fly. We encountered this problem in the development of Google’s Pixel 6 phone, where some important models took prohibitively long to compile.
We introduce an approach that solves this challenge by combining constraint optimization with domain-specific knowledge to achieve the best properties of both. We combine a heuristic-based search with a solver to guide its decision making. Our approach matches heuristics for simple inputs while being significantly faster than the best Integer Linear Program (ILP) solver-based approach for complex inputs. We also show how ML can be used to continuously improve the search for the long tail of workloads. Our approach is shipping in two production systems: Google’s Pixel 6 phone and TPUv4. It achieves up to two orders of magnitude allocation-time speedup on real ML workloads compared to a highly-tuned production ILP approach that it replaces, and it enables important real-world models that could not otherwise be supported.
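The underlying allocation problem is easy to state in code; the greedy first-fit below is a toy baseline of the kind such a system's heuristic search improves on and escalates to a solver when it fails (all names are ours). Each buffer has a size and a live range, and two buffers may share address space only if their live ranges do not overlap:

```python
# Toy static on-chip buffer allocation (ours; greedy first-fit baseline).

def allocate(buffers, memory_size):
    """buffers: list of (size, start, end) with live range [start, end)."""
    placed = []                                   # (offset, size, start, end)
    offsets = []
    for size, s, e in buffers:
        offset = 0
        for off, sz, ps, pe in sorted(placed):    # scan by ascending offset
            lifetimes_overlap = s < pe and ps < e
            addresses_overlap = offset < off + sz and off < offset + size
            if lifetimes_overlap and addresses_overlap:
                offset = off + sz                 # bump past the conflict
        if offset + size > memory_size:
            return None                           # heuristic failed: escalate
        placed.append((offset, size, s, e))
        offsets.append(offset)
    return offsets

# Buffers 1 and 2 are live together and must not overlap in memory;
# buffer 3 is live only after buffer 1 dies and can reuse its space.
print(allocate([(64, 0, 4), (64, 2, 6), (64, 4, 8)], memory_size=128))
# -> [0, 64, 0]
```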