ASPLOS 2021

PacketMill: Toward Per-Core 100-Gbps Networking - Artifact for ASPLOS’21

Authors: Farshin, Alireza and Barbette, Tom and Roozbeh, Amir and Maguire Jr., Gerald Q. and Kostić, Dejan
Keywords: DPDK, FastClick, LLVM, Middleboxes, Packet Processing, PacketMill, X-Change

Abstract

This is the artifact for the “PacketMill: Toward per-core 100-Gbps Networking” paper published at ASPLOS’21.

PacketMill is a system that optimizes the performance of network functions via holistic inter-stack optimizations. More specifically, PacketMill provides a new metadata management model, called X-Change, enabling the packet processing frameworks to provide their custom buffer to DPDK and fully bypass rte_mbuf. Additionally, PacketMill performs a set of source-code & intermediate representation (IR) code optimizations.

Our paper’s artifact contains the source code, the experimental workflow, and additional information to (i) set up PacketMill & its testbed, (ii) perform some of the experiments presented in the paper, and (iii) validate the reusability & effectiveness of PacketMill.

For more information, please refer to https://github.com/aliireza/packetmill

DOI: 10.1145/3445814.3446724


Autonomous NIC offloads

Authors: Pismenny, Boris and Eran, Haggai and Yehezkel, Aviad and Liss, Liran and Morrison, Adam and Tsafrir, Dan
Keywords: operating system, hardware/software co-design, NIC

Abstract

CPUs routinely offload to NICs network-related processing tasks like packet segmentation and checksum. NIC offloads are advantageous because they free valuable CPU cycles. But their applicability is typically limited to layer≤4 protocols (TCP and lower), and they are inapplicable to layer-5 protocols (L5Ps) that are built on top of TCP. This limitation is caused by a misfeature we call “offload dependence,” which dictates that L5P offloading additionally requires offloading the underlying layer≤4 protocols and related functionality: TCP, IP, firewall, etc. The dependence of L5P offloading hinders innovation, because it implies hard-wiring the complicated, ever-changing implementation of the lower-level protocols. We propose “autonomous NIC offloads,” which eliminate offload dependence. Autonomous offloads provide a lightweight software-device architecture that accelerates L5Ps without having to migrate the entire layer≤4 TCP/IP stack into the NIC. A main challenge that autonomous offloads address is coping with out-of-sequence packets. We implement autonomous offloads for two L5Ps: (i) NVMe-over-TCP zero-copy and CRC computation, and (ii) HTTPS authentication, encryption, and decryption. Our autonomous offloads increase throughput by up to 3.3x, and they deliver CPU consumption and latency that are as low as 0.4x and 0.7x, respectively. Their implementation is already upstreamed in the Linux kernel, and they will be supported in the next generation of Mellanox NICs.

DOI: 10.1145/3445814.3446732


Dagger: efficient and fast RPCs in cloud microservices with near-memory reconfigurable NICs

Authors: Lazarev, Nikita and Xiang, Shaojie and Adit, Neil and Zhang, Zhiru and Delimitrou, Christina
Keywords: smartNICs, microservices, datacenters, cloud computing, cache-coherent FPGAs, RPC frameworks, FPGAs, end-host networking

Abstract

The ongoing shift of cloud services from monolithic designs to microservices creates high demand for efficient and high-performance datacenter networking stacks, optimized for fine-grained workloads. Commodity networking systems based on software stacks and peripheral NICs introduce high overheads when it comes to delivering small messages. We present Dagger, a hardware acceleration fabric for cloud RPCs based on FPGAs, where the accelerator is closely coupled with the host processor over a configurable memory interconnect. The three key design principles of Dagger are: (1) offloading the entire RPC stack to an FPGA-based NIC, (2) leveraging memory interconnects instead of PCIe buses as the interface with the host CPU, and (3) making the acceleration fabric reconfigurable, so it can accommodate the diverse needs of microservices. We show that the combination of these principles significantly improves the efficiency and performance of cloud RPC systems while preserving their generality. Dagger achieves 1.3–3.8× higher per-core RPC throughput compared to both highly optimized software stacks and systems using PCIe-attached FPGA NICs.

DOI: 10.1145/3445814.3446696


BCD deduplication: effective memory compression using partial cache-line deduplication

Authors: Park, Sungbo and Kang, Ingab and Moon, Yaebin and Ahn, Jung Ho and Suh, G. Edward
Keywords: deduplication, memory compression, DRAM

Abstract

In this paper, we identify new partial data redundancy among multiple cache lines that is not exploited by traditional memory compression or memory deduplication. We propose Base and Compressed Difference (BCD) deduplication that effectively utilizes the partial matches among cache lines through a novel combination of compression and deduplication to increase the effective capacity of main memory. Experimental results show that BCD achieves an average compression ratio of 1.94×.
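
To make the partial-match idea concrete, here is a minimal Python sketch of the concept (a hypothetical software model, not the paper's hardware design): a line that mostly matches an existing base line is stored as a reference to that base plus a compressed XOR difference.

    import zlib

    LINE = 64  # cache-line size in bytes

    class BCDStore:
        def __init__(self):
            self.bases = []    # full base lines
            self.entries = []  # (base_index, compressed_xor_diff or None)

        def insert(self, line: bytes):
            assert len(line) == LINE
            # Find the base with the most matching bytes (hardware would
            # use hashing rather than a linear scan).
            best, score = None, -1
            for i, b in enumerate(self.bases):
                s = sum(x == y for x, y in zip(line, b))
                if s > score:
                    best, score = i, s
            if best is not None and score >= LINE // 2:  # partial match
                diff = bytes(x ^ y for x, y in zip(line, self.bases[best]))
                self.entries.append((best, zlib.compress(diff)))
            else:                                        # new base line
                self.bases.append(line)
                self.entries.append((len(self.bases) - 1, None))

        def read(self, idx: int) -> bytes:
            base_i, cdiff = self.entries[idx]
            base = self.bases[base_i]
            if cdiff is None:
                return base
            diff = zlib.decompress(cdiff)
            return bytes(x ^ y for x, y in zip(base, diff))

    store = BCDStore()
    store.insert(bytes(64))                 # becomes a base line
    store.insert(bytes(32) + b"\x01" * 32)  # stored as a compressed diff
    assert store.read(1) == bytes(32) + b"\x01" * 32

A line identical to a base deduplicates entirely; a partially matching line costs only its compressed difference, which is the redundancy that whole-line deduplication and per-line compression both miss.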

DOI: 10.1145/3445814.3446722


KLOCs: kernel-level object contexts for heterogeneous memory systems

Authors: Kannan, Sudarsun and Ren, Yujie and Bhattacharjee, Abhishek
Keywords: Virtual Memory, OS, Nonvolatile Memory, Heterogeneous Memory

Abstract

Heterogeneous memory systems promise better performance, energy-efficiency, and cost trade-offs in emerging systems. But delivering on this promise requires efficient OS mechanisms and policies for data tiering and migration. Unfortunately, modern OSes lack efficient support for data tiering. While this problem is known for application data, the question of how best to manage kernel objects for filesystems and networking—i.e., inodes, dentry caches, journal blocks, socket buffers, etc.—has largely been ignored and presents a performance challenge for I/O-intensive workloads. We quantify the scale of this challenge and introduce a new OS abstraction, kernel-level object contexts (KLOCs), to enable efficient tiering of kernel objects. We use KLOCs to identify and group kernel objects with similar hotness, reuse, and liveness, and demonstrate their use in data placement and migration across several heterogeneous memory system configurations, including Intel’s Optane systems. Performance evaluations using RocksDB, Redis, Cassandra, and Spark show that KLOCs enable up to 2.7× speedups over current OS data-tiering approaches.
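
The grouping-then-tiering step can be sketched in a few lines of Python (hypothetical data model; the real system operates on live kernel objects): objects sharing a KLOC are placed as a group, ranked by access density.

    from collections import defaultdict

    def place(objects, fast_capacity):
        """objects: list of (kloc_id, size, access_count) tuples."""
        klocs = defaultdict(lambda: {"size": 0, "accesses": 0})
        for kloc_id, size, accesses in objects:
            klocs[kloc_id]["size"] += size
            klocs[kloc_id]["accesses"] += accesses
        # Rank whole groups, not individual objects, by access density.
        ranked = sorted(klocs.items(),
                        key=lambda kv: kv[1]["accesses"] / kv[1]["size"],
                        reverse=True)
        placement, used = {}, 0
        for kloc_id, stats in ranked:
            if used + stats["size"] <= fast_capacity:
                placement[kloc_id] = "fast-tier"   # e.g., DRAM
                used += stats["size"]
            else:
                placement[kloc_id] = "slow-tier"   # e.g., Optane
        return placement

    objs = [("inode-cache", 64, 900), ("socket-buffers", 128, 50)]
    print(place(objs, fast_capacity=64))   # the hot inode group gets DRAM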

DOI: 10.1145/3445814.3446745


Artifacts for Article: Rethinking Software Runtimes for Disaggregated Memory

Authors: Calciu, Irina and Imran, M. Talha and Puddu, Ivan and Kashyap, Sanidhya and Maruf, Hasan Al and Mutlu, Onur and Kolli, Aasheesh
Keywords: average memory access time, cache-line granularity dirty data tracking

Abstract

These artifacts have been developed for the ASPLOS 2021 article “Rethinking Software Runtimes for Disaggregated Memory”. The artifacts provide tools to track applications and determine their memory accesses: cache-line granularity memory writes and average memory access time (AMAT).

DOI: 10.1145/3445814.3446713


DiAG: a dataflow-inspired architecture for general-purpose processors

Authors: Wang, Dong Kai and Kim, Nam Sung
Keywords: parallelism, general-purpose, dataflow architecture

Abstract

The end of Dennard scaling and the decline of Moore’s law have prompted the proliferation of hardware accelerators for a wide range of application domains. Yet, at the dawn of an era of specialized computing, left behind by this trend is the general-purpose processor, which is still the most easily programmed and widely used but has seen only incremental changes for decades. This work uses an accelerator-inspired approach to rethink CPU microarchitecture to improve its energy efficiency while retaining its generality. We propose DiAG, a dataflow-based general-purpose processor architecture that can minimize latency by exploiting instruction-level parallelism or maximize throughput by exploiting data-level parallelism. DiAG is designed to support any RISC-like instruction set without explicitly requiring specialized languages, libraries, or compilers. Central to this architecture is the abstraction of the register file as register ‘lanes’ that allow implicit construction of the program’s dataflow graph in hardware. At the cost of increased area, DiAG offers three main benefits over conventional out-of-order microarchitectures: reduced front-end overhead, efficient instruction reuse, and thread-level pipelining. We implement a DiAG prototype that supports the RISC-V ISA in SystemVerilog and evaluate its performance, power consumption, and area with EDA tools. In the tested Rodinia and SPEC CPU2017 benchmarks, DiAG configured with 512 PEs achieves a 1.18x speedup and a 1.63x improvement in energy efficiency against an aggressive out-of-order CPU baseline.
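
The implicit dataflow-graph construction that DiAG's register lanes perform in hardware can be mimicked in a few lines of Python (hypothetical three-address instruction format, for illustration only):

    def dataflow_edges(program):
        """program: list of (dest, src1, src2) register triples."""
        last_writer = {}   # register -> index of the producing instruction
        edges = []         # (producer_index, consumer_index)
        for i, (dest, *srcs) in enumerate(program):
            for src in srcs:
                if src in last_writer:
                    edges.append((last_writer[src], i))
            last_writer[dest] = i   # this instruction now produces `dest`
        return edges

    # Instructions 0 and 1 are independent and may proceed in parallel;
    # instruction 2 waits on both of its producers.
    prog = [("r1", "r0", "r0"), ("r2", "r0", "r0"), ("r3", "r1", "r2")]
    print(dataflow_edges(prog))   # [(0, 2), (1, 2)]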

DOI: 10.1145/3445814.3446703


LifeStream

Authors: Jayarajan, Anand and Hau, Kimberly and Goodwin, Andrew and Pekhimenko, Gennady
Keywords: LifeStream, NumPy, Python, Scikit-learn, SciPy, stream data analytics, temporal query processing, Trill

Abstract

This artifact contains code and a synthetic data set to evaluate LifeStream, Trill, and numerical library-based data processing pipelines.

DOI: 10.1145/3445814.3446725


When application-specific ISA meets FPGAs: a multi-layer virtualization framework for heterogeneous cloud FPGAs

Authors: Zha, Yue and Li, Jing
Keywords: Virtualization, Parallel patterns, Heterogeneous cloud FPGAs, Application-specific ISA

Abstract

While field-programmable gate arrays (FPGAs) have been widely deployed into cloud platforms, the high programming complexity and the inability to manage FPGA resources in an elastic/scalable manner largely limit the adoption of FPGA acceleration. Existing FPGA virtualization mechanisms partially address these limitations. Application-specific (AS) ISA provides a nice abstraction to enable a simple software programming flow that makes FPGA acceleration accessible to mainstream software application developers. Nevertheless, existing AS ISA-based approaches can only manage FPGA resources at a per-device granularity, leading to low resource utilization. Alternatively, hardware-specific (HS) abstraction improves the resource utilization by spatially sharing one FPGA among multiple applications. But it cannot reduce the programming complexity due to the lack of a high-level programming model. In this paper, we propose a virtualization mechanism for heterogeneous cloud FPGAs that combines AS ISA and HS abstraction to fully address the aforementioned limitations. To efficiently combine these two abstractions, we provide a multi-layer virtualization framework with a new system abstraction as an indirection layer between them. This indirection layer hides the FPGA-specific resource constraints and leverages parallel patterns to effectively reduce the mapping complexity. It simplifies the mapping process into two steps, where the first step decomposes an AS ISA-based accelerator under no resource constraint to extract all fine-grained parallel patterns, and the second step leverages the extracted parallel patterns to simplify the process of mapping the decomposed accelerators onto the underlying HS abstraction. While system designers might be able to manually perform these steps for small accelerator designs, we develop a set of custom tools to automate this process and achieve a high mapping quality. By hiding FPGA-specific resource constraints, the proposed system abstraction provides a homogeneous view of the heterogeneous cloud FPGAs to simplify runtime resource management. The extracted parallel patterns could also be leveraged by the runtime system to improve the performance of scale-out acceleration by maximally hiding the inter-FPGA communication latency. We use an AS ISA similar to the one proposed in the BrainWave project and a recently proposed HS abstraction as a case study to demonstrate the effectiveness of the proposed virtualization framework. The performance is evaluated on a custom-built FPGA cluster with heterogeneous FPGA resources. Compared with the baseline system that only uses AS ISA, the proposed framework effectively combines these two abstractions and improves the aggregated system throughput by 2.54×.

DOI: 10.1145/3445814.3446699


Sage: practical and scalable ML-driven performance debugging in microservices

Authors: Gan, Yu and Liang, Mingyu and Dev, Sundar and Lo, David and Delimitrou, Christina
Keywords: variational autoencoder, performance debugging, microservices, counterfactual, cloud computing, QoS, Bayesian network

Abstract

Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely-coupled microservices. Despite the advantages of modularity and elasticity that microservices offer, they also complicate cluster management and performance debugging, as dependencies between tiers introduce backpressure and cascading QoS violations. Prior work on performance debugging for cloud services either relies on empirical techniques, or uses supervised learning to diagnose the root causes of performance issues, which requires significant application instrumentation and is difficult to deploy in practice. We present Sage, a machine learning-driven root cause analysis system for interactive cloud microservices that focuses on practicality and scalability. Sage leverages unsupervised ML models to circumvent the overhead of trace labeling, captures the impact of dependencies between microservices to determine the root cause of unpredictable performance online, and applies corrective actions to recover a cloud service’s QoS. In experiments on both dedicated local clusters and large clusters on Google Compute Engine, we show that Sage consistently achieves over 93% accuracy in correctly identifying the root cause of QoS violations, and improves performance predictability.

DOI: 10.1145/3445814.3446700


Nightcore: Efficient and Scalable Serverless Computing for Latency-Sensitive, Interactive Microservices (Artifacts)

Authors: Jia, Zhipeng and Witchel, Emmett
Keywords: Cloud computing, function-as-a-service, microservices, serverless computing

Abstract

Our artifact includes the prototype implementation of Nightcore, the DeathStarBench and HipsterShop microservices ported to Nightcore, and the experiment workflow to run these workloads on AWS EC2 instances.

DOI: 10.1145/3445814.3446701


Replication package for article: Sinan: ML-Based and QoS-Aware Resource Management for Cloud Microservices

Authors: Zhang, Yanqi and Hua, Weizhe and Zhou, Zhuangzhuang and Suh, G. Edward and Delimitrou, Christina
Keywords: cloud computing, cluster management, datacenter, machine learning for systems, microservices, quality of service, resource efficiency, tail latency

Abstract

The artifact includes code and documentation to reproduce the Google Cloud experiments presented in “Sinan: ML-Based and QoS-Aware Resource Management for Cloud Microservices”.

DOI: 10.1145/3445814.3446693


NOREBA: a compiler-informed non-speculative out-of-order commit processor

Authors: Hajiabadi, Ali and Diavastos, Andreas and Carlson, Trevor E.
Keywords: processor design, out-of-order commit, hardware-software co-design, compilers

Abstract

Modern superscalar processors execute instructions out-of-order, but commit them in program order to provide precise exception handling and safe instruction retirement. However, in-order instruction commit is highly conservative and holds on to critical resources far longer than necessary, severely limiting the reach of general-purpose processors, ultimately reducing performance. Solutions that allow for efficient, early reclamation of these critical resources could seize the opportunity to improve performance. One such solution is out-of-order commit, which has traditionally been challenging due to inefficient, complex hardware used to guarantee safe instruction retirement and provide precise exception handling. In this work, we present NOREBA, a processor for Non-speculative Out-of-order Retirement via Branch Reconvergence Analysis. In NOREBA, we enable non-speculative out-of-order commit and resource reclamation in a light-weight manner, improving performance and efficiency. We accomplish this through a combination of (1) automatic compiler annotation of true branch dependencies, and (2) an efficient re-design of the reorder buffer from traditional processors. By exploiting compiler branch dependency information, this system achieves 95% of the performance of aggressive, speculative solutions, without any additional speculation, and while maintaining energy efficiency.

DOI: 10.1145/3445814.3446726


Fast Local Page-Tables for Virtualized NUMA Servers with vMitosis

Authors: Panwar, Ashish and Achermann, Reto and Basu, Arkaprava and Bhattacharjee, Abhishek and Gopinath, K. and Gandhi, Jayneel
Keywords: ASPLOS’21, NUMA, Page-Tables, vMitosis-Linux

Abstract

This repository contains artifacts of the paper Fast Local Page-Tables for Virtualized NUMA Servers with vMitosis by Ashish Panwar, Reto Achermann, Arkaprava Basu, Abhishek Bhattacharjee, K. Gopinath, and Jayneel Gandhi to appear in the 26th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’21).

DOI: 10.1145/3445814.3446709


Artifact evaluation pack for PTEMagnet (paper #111 in ASPLOS’21)

Authors: Margaritov, Artemiy and Ustiugov, Dmitrii and Shahab, Amna and Grot, Boris
Keywords: operating system, virtual memory, virtualization

Abstract

The artifact contains a Linux kernel patch for enabling PTEMagnet, shell scripts for Linux kernel compilation, a virtual machine disk image with precompiled benchmarks, and Python/shell scripts that are expected to reproduce the results presented in Figure 6 of the paper for non-SPEC benchmarks.

DOI: 10.1145/3445814.3446704


In-fat pointer: hardware-assisted tagged-pointer spatial memory safety defense with subobject granularity protection

Authors: Xu, Shengjie and Huang, Wei and Lie, David
Keywords: Tagged-pointer, Subobject Bound Checking, Spatial Memory Errors, Object Metadata, Memory Safety, Buffer Overflows, Bound Checking

Abstract

Programming languages like C and C++ are not memory-safe because they provide programmers with low-level pointer manipulation primitives. The incorrect use of these primitives can result in bugs and security vulnerabilities: for example, spatial memory safety errors can be caused by dereferencing pointers outside the legitimate address range belonging to the corresponding object. While a range of schemes to provide protection against these vulnerabilities have been proposed, they all lack one or more of low performance overhead, compatibility with legacy code, or comprehensive protection for all objects and subobjects. We present In-Fat Pointer, the first hardware-assisted defense that can achieve spatial memory safety at subobject granularity while maintaining compatibility with legacy code and low overhead. In-Fat Pointer improves the protection granularity of tagged-pointer schemes using object metadata, which is efficient and binary-compatible for object-bound spatial safety. Unlike previous work that devotes all pointer tag bits to object metadata lookup, In-Fat Pointer uses three complementary object metadata schemes to reduce the number of pointer tag bits needed for metadata lookup, allowing it to use the left-over bits, along with in-memory type metadata, to refine the object bounds to subobject granularity. We show that this approach provides practical protection of fine-grained spatial memory safety.
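
As a rough illustration of the tagged-pointer mechanism (a purely hypothetical encoding; the real design adds subobject refinement and three metadata schemes in hardware), the Python sketch below packs a metadata-lookup tag into a pointer's unused high bits and checks bounds on dereference:

    TAG_BITS = 16
    ADDR_MASK = (1 << 48) - 1        # canonical 48-bit address space

    def pack(addr, tag):
        assert 0 <= tag < (1 << TAG_BITS)
        return (tag << 48) | (addr & ADDR_MASK)

    def unpack(ptr):
        return ptr & ADDR_MASK, ptr >> 48

    def checked_deref(ptr, bounds_table):
        addr, tag = unpack(ptr)
        lo, hi = bounds_table[tag]   # tag selects the object's metadata
        if not lo <= addr < hi:
            raise MemoryError(f"out-of-bounds access at {hex(addr)}")
        return addr                  # safe to access

    bounds = {1: (0x1000, 0x1040)}   # one 64-byte object
    p = pack(0x1008, 1)
    checked_deref(p, bounds)             # in bounds: ok
    # checked_deref(pack(0x1040, 1), bounds) would raise MemoryError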

DOI: 10.1145/3445814.3446761


Replication package for Article: Judging a Type by Its Pointer: Optimizing Virtual Function Calls on GPUs

Authors: Zhang, Mengchi and Alawneh, Ahmad and Rogers, Timothy G.
Keywords: GPU, Object-oriented Programming, Virtual Function Call

Abstract

The artifact contains the source code for SharedOA, COAL, and TypePointer as applied to all workloads. We also include the instructions to configure, build, run, and acquire the workloads’ performance. Users can reproduce the results in Figure 6. We also include a tutorial with examples of applying SharedOA, COAL, and TypePointer to show that the three techniques are reusable in other CUDA applications.

DOI: 10.1145/3445814.3446734


Enclosure: language-based restriction of untrusted libraries

Authors: Ghosn, Adrien and Kogias, Marios and Payer, Mathias and Larus, James R. and Bugnion, Edouard
Keywords: software packages, programming languages, intra-address space isolation, Security

Abstract

Programming languages and systems have failed to address the security implications of the increasingly frequent use of public libraries to construct modern software. Most languages provide tools and online repositories to publish, import, and use libraries; however, this double-edged sword can incorporate a large quantity of unknown, unchecked, and unverified code into an application. The risk is real, as demonstrated by malevolent actors who have repeatedly inserted malware into popular open-source libraries. This paper proposes a solution: enclosures, a new programming language construct for library isolation that provides a developer with fine-grain control over the resources that a library can access, even for libraries with complex inter-library dependencies. The programming abstraction is language-independent and could be added to most languages. These languages would then be able to take advantage of hardware isolation mechanisms that are effective across language boundaries. The enclosure policies are enforced at run time by LitterBox, a language-independent framework that uses hardware mechanisms to provide uniform and robust isolation guarantees, even for libraries written in unsafe languages. LitterBox currently supports both Intel VT-x (with general-purpose extended page tables) and the emerging Intel Memory Protection Keys (MPK). We describe an enclosure implementation for the Go and Python languages. Our evaluation demonstrates that the Go implementation can protect sensitive data in real-world applications constructed using complex untrusted libraries with deep dependencies. It requires minimal code refactoring and incurs acceptable performance overhead. The Python implementation demonstrates LitterBox’s ability to support dynamic languages.

DOI: 10.1145/3445814.3446728


Switches for HIRE: Resource Scheduling for Data Center In-Network Computing

Authors: Blöcher, Marcel and Wang, Lin and Eugster, Patrick and Schmidt, Max
Keywords: data center, heterogeneity, in-network computing, non-linear resource usage, scheduling

Abstract

The artifact consists of three parts: (1) the source code of the HIRE simulator, including the implementations of Yarn++, Sparrow++, K8++, and CoCo++; (2) the runner tool (a Python3 program) that runs the experiments with the configurations presented in the paper and plotting scripts; and (3) Docker configurations to ease the setup. Users can reproduce all simulation results (Fig. 7 and Fig. 8). Furthermore, the artifact can be easily extended/modified to benchmark other schedulers, INC configurations, and workloads.

DOI: 10.1145/3445814.3446760


Probabilistic profiling of stateful data planes for adversarial testing

Authors: Kang, Qiao and Xing, Jiarong and Qiu, Yiming and Chen, Ang
Keywords: symbolic execution, adversarial testing, programmable data planes

Abstract

Recently, there has been a flurry of projects that develop data plane systems in programmable switches, and these systems perform far more sophisticated processing than simply deciding a packet’s next hop (i.e., traditional forwarding). This presents challenges to existing network program profilers, which are developed primarily to handle stateless forwarding programs. We develop P4wn, a program profiler that can analyze program behaviors of stateful data plane systems; it captures the fact that these systems process packets differently based on program state, which in turn depends on the underlying stochastic traffic pattern. Whereas existing profilers can only analyze stateless network processing, P4wn can analyze stateful processing behaviors and their respective probabilities. Although program profilers have general applications, we showcase a concrete use case in detail: adversarial testing. Unlike regular program testing, adversarial testing distinguishes and specifically stresses low-probability edge cases in a program. Our evaluation shows that P4wn can analyze complex programs that existing tools cannot handle, and that it can effectively identify edge-case traces.
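
A toy version of such probabilistic profiling, written in Python for a made-up stateful program (a counter that trips an "alert" branch), shows how branch probabilities fall out of a stochastic traffic model:

    def profile(threshold, p_heavy, depth):
        """Counter increments with probability p_heavy per packet; the
        'alert' branch fires once the counter reaches `threshold`."""
        paths = {(0, "normal"): 1.0}   # (counter, branch) -> probability
        for _ in range(depth):
            nxt = {}
            for (ctr, _), prob in paths.items():
                for heavy, p in ((True, p_heavy), (False, 1 - p_heavy)):
                    c = ctr + 1 if heavy else ctr
                    branch = "alert" if c >= threshold else "normal"
                    nxt[(c, branch)] = nxt.get((c, branch), 0.0) + prob * p
            paths = nxt
        return paths

    # The low-probability 'alert' traces are exactly the edge cases that
    # adversarial testing wants to stress.
    for (ctr, branch), p in sorted(profile(3, 0.1, 4).items()):
        print(ctr, branch, round(p, 4))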

DOI: 10.1145/3445814.3446764


MERCI: efficient embedding reduction on commodity hardware via sub-query memoization

Authors: Lee, Yejin and Seo, Seong Hoon and Choi, Hyunji and Sul, Hyoung Uk and Kim, Soosung and Lee, Jae W. and Ham, Tae Jun
Keywords: Recommender Systems, Memoization, Embedding Lookup

Abstract

Deep neural networks (DNNs) with embedding layers are widely adopted to capture complex relationships among entities within a dataset. Embedding layers aggregate multiple embeddings—dense vectors used to represent the complicated nature of a data feature—into a single embedding; such an operation is called embedding reduction. Embedding reduction spends a significant portion of its runtime on reading embeddings from memory and thus is known to be heavily memory-bandwidth-bound. Recent works attempt to accelerate this critical operation, but they often require either hardware modifications or emerging memory technologies, which makes them hard to deploy on commodity hardware. Thus, we propose MERCI, Memoization for Embedding Reduction with ClusterIng, a novel memoization framework for efficient embedding reduction. MERCI provides a mechanism for memoizing partial aggregations of correlated embeddings and retrieving the memoized partial result at a low cost. MERCI substantially reduces the number of memory accesses by 44% (29%), leading to 102% (74%) throughput improvement on real machines and 40.2% (28.6%) energy savings, at the expense of 8× additional memory usage.
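
The core memoization idea can be sketched in Python with NumPy (hypothetical class and clustering; the paper's contribution lies in how clusters are chosen and laid out in memory): a precomputed partial sum replaces many individual embedding reads.

    import numpy as np

    class MemoizedReducer:
        def __init__(self, table, clusters):
            self.table = table                      # embedding table
            self.clusters = [frozenset(c) for c in clusters]
            # Memoize one partial aggregation per cluster, offline.
            self.partials = [table[list(c)].sum(axis=0)
                             for c in self.clusters]

        def reduce(self, ids):
            ids, out = set(ids), 0
            for c, partial in zip(self.clusters, self.partials):
                if c <= ids:       # whole cluster present: 1 read, not |c|
                    out = out + partial
                    ids -= c
            for i in ids:          # leftovers are read individually
                out = out + self.table[i]
            return out

    table = np.random.rand(8, 4)
    r = MemoizedReducer(table, clusters=[[0, 1, 2]])
    assert np.allclose(r.reduce([0, 1, 2, 5]),
                       table[[0, 1, 2, 5]].sum(axis=0))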

DOI: 10.1145/3445814.3446717


SherLock-v2

Authors: Li, Guangpu and Chen, Dongjie and Lu, Shan and Musuvathi, Madanlal and Nath, Suman
Keywords: Happens-before inducing, Synchronization Detection

Abstract

Synchronizations are fundamental to the correctness and performance of concurrent software. They determine which operations can execute concurrently and which cannot—the key to detecting and fixing concurrency bugs, as well as understanding and tuning performance. Unfortunately, correctly identifying all synchronizations has become extremely difficult in modern software systems due to the various forms of concurrency and various types of synchronizations.

Previous work either only infers specific types of synchronization by code analysis or relies on manual effort to annotate the synchronization. This paper proposes SherLock, a tool that automatically infers synchronizations without code analysis or annotation. SherLock leverages the fact that most synchronizations appear around conflicting operations, and encodes the inference problem into a linear system with properties and hypotheses about how synchronizations are typically used. To collect useful observations, SherLock runs the target program for a small number of runs with feedback-guided delay injection.

We have applied SherLock to 8 C# open-source applications. Without any prior knowledge, SherLock automatically inferred more than 120 unique synchronizations, with few false positives. These inferred synchronizations cover a wide variety of types, including lock operations, fork-join operations, asynchronous operations, framework synchronization, and custom synchronization.

DOI: 10.1145/3445814.3446754


SIMDRAM: a framework for bit-serial SIMD processing using DRAM

Authors: Hajinazar, Nastaran and Oliveira, Geraldo F. and Gregorio, Sven and Ferreira, João Dinis and Mansouri Ghiasi, Nika and Patel, Minesh and Alser, Mohammed and Ghose, Saugata and Gómez-Luna, Juan and Mutlu, Onur
Keywords: Processing-in-Memory, Processing-Using-Memory, Performance, Energy, DRAM, Bulk Bitwise Operations

Abstract

Processing-using-DRAM has been proposed for a limited set of basic operations (i.e., logic operations, addition). However, in order to enable full adoption of processing-using-DRAM, it is necessary to provide support for more complex operations. In this paper, we propose SIMDRAM, a flexible general-purpose processing-using-DRAM framework that (1) enables the efficient implementation of complex operations, and (2) provides a flexible mechanism to support the implementation of arbitrary user-defined operations. The SIMDRAM framework comprises three key steps. The first step builds an efficient MAJ/NOT representation of a given desired operation. The second step allocates DRAM rows that are reserved for computation to the operation’s input and output operands, and generates the required sequence of DRAM commands to perform the MAJ/NOT implementation of the desired operation in DRAM. The third step uses the SIMDRAM control unit located inside the memory controller to manage the computation of the operation from start to end, by executing the DRAM commands generated in the second step of the framework. We design the hardware and ISA support for the SIMDRAM framework to (1) address key system integration challenges, and (2) allow programmers to employ new SIMDRAM operations without hardware changes. We evaluate SIMDRAM for reliability, area overhead, throughput, and energy efficiency using a wide range of operations and seven real-world applications to demonstrate SIMDRAM’s generality. Our evaluations using a single DRAM bank show that (1) over 16 operations, SIMDRAM provides 2.0X the throughput and 2.6X the energy efficiency of Ambit, a state-of-the-art processing-using-DRAM mechanism; (2) over seven real-world applications, SIMDRAM provides 2.5X the performance of Ambit. Using 16 DRAM banks, SIMDRAM provides (1) 88X and 5.8X the throughput, and 257X and 31X the energy efficiency, of a CPU and a high-end GPU, respectively, over 16 operations; (2) 21X and 2.1X the performance of the CPU and GPU, over seven real-world applications. SIMDRAM incurs an area overhead of only 0.2% in a high-end CPU.
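
A small Python sketch shows why a majority/NOT basis is sufficient (bitwise over integer-encoded rows; in SIMDRAM the MAJ comes from triple-row activation inside DRAM, which this code only imitates):

    def MAJ(a, b, c):                 # majority of three bit vectors
        return (a & b) | (a & c) | (b & c)

    def NOT(a, width):
        return ~a & ((1 << width) - 1)

    def AND(a, b):                    # MAJ(a, b, 0)  = a AND b
        return MAJ(a, b, 0)

    def OR(a, b):                     # MAJ(a, b, 1s) = a OR b
        return MAJ(a, b, ~0)

    def full_add_bit(a, b, cin):
        carry = MAJ(a, b, cin)        # majority yields the carry directly
        s = a ^ b ^ cin               # XOR itself decomposes into MAJ/NOT
        return s, carry

    assert AND(0b1100, 0b1010) == 0b1000
    assert OR(0b1100, 0b1010) == 0b1110
    assert NOT(0b1010, 4) == 0b0101
    assert full_add_bit(1, 1, 0) == (0, 1)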

DOI: 10.1145/3445814.3446749


Clobber-NVM: Log Less, Re-execute More

Authors: Xu, Yi and Izraelevitz, Joseph and Swanson, Steven
Keywords: Clobber Logging, Compiler, Non-volatile Memory, Persistent Memory, Storage Systems, Undo Logging

Abstract

Clobber-NVM is a failure-atomicity library that ensures data consistency by reexecution. Clobber-NVM’s novel logging strategy, clobber logging, records only those transaction inputs that are overwritten during transaction execution. Then, after a failure, it recovers to a consistent state by restoring overwritten inputs and reexecuting any interrupted transactions. Clobber-NVM utilizes a clobber logging compiler pass for identifying the minimal set of writes that need to be logged.
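
In plain Python, clobber logging looks roughly like the following (a hypothetical runtime; the real system identifies clobbered inputs with a compiler pass and runs against persistent memory):

    class ClobberTx:
        def __init__(self, state):
            self.state = state   # stands in for persistent memory
            self.log = {}        # key -> pre-transaction value

        def write(self, key, value):
            # Log only inputs the transaction overwrites ("clobbers").
            if key in self.state and key not in self.log:
                self.log[key] = self.state[key]
            self.state[key] = value

        def recover(self, body):
            self.state.update(self.log)   # restore overwritten inputs
            self.log.clear()
            body(self)                    # deterministic re-execution

    def body(tx):
        tx.write("x", tx.state.get("x", 0) + 1)

    pm = {"x": 41}
    tx = ClobberTx(pm)
    body(tx)             # a crash here leaves x's old value in the log
    tx.recover(body)     # restore + re-execute reaches a consistent state
    print(pm["x"])       # 42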

This artifact includes the Clobber-NVM compiler passes, as well as the necessary runtime components. It contains the code of all seven benchmarks (four data structures and three applications) reported in the paper. The evaluation results can be reproduced by running the experiments on a machine equipped with at least 24 physical cores per socket and 32 GB of memory. In the absence of access to real NVMM (e.g., Intel Optane DC), you need to reserve 32 GB of memory to emulate NVMM. The artifact also includes a script to download and install the main software dependencies. We have evaluated Clobber-NVM on Ubuntu 18.04, with GNU 7.3.1 and LLVM 7.0.0.

DOI: 10.1145/3445814.3446730


Time-optimal qubit mapping

Authors: Zhang, Chi and Hayes, Ari B. and Qiu, Longfei and Jin, Yuwei and Chen, Yanhao and Zhang, Eddy Z.
Keywords: Qubit Mapping, Quantum Fourier Transformation, Quantum Computing, QFT, Noisy Intermediate Quantum Computers, NISQ

Abstract

Rapid progress in the physical implementation of quantum computers gave birth to multiple recent quantum machines implemented with superconducting technology. In these NISQ machines, each qubit is physically connected to a bounded number of neighbors. This limitation prevents most quantum programs from being directly executed on quantum devices. A compiler is required for converting a quantum program to a hardware-compliant circuit, in particular, making each two-qubit gate executable by mapping the two logical qubits to two physical qubits with a link between them. To solve this problem, existing studies focus on inserting SWAP gates to dynamically remap logical qubits to physical qubits. However, most schemes do not consider the time-optimality of the generated quantum circuits, or achieve time-optimality only under certain constraints. In this work, we propose a theoretically time-optimal SWAP insertion scheme for the qubit mapping problem. Our model can also be extended to practical heuristic algorithms. We present exact analysis results by using our model for quantum programs with recurring execution patterns. We have, for the first time, discovered an optimal qubit mapping pattern for quantum Fourier transformation (QFT) on a 2D nearest-neighbor architecture. We also present a scalable extension of our theoretical model that can be used to solve qubit mapping for large quantum circuits.
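
The SWAP-insertion step that the paper optimizes can be illustrated with a toy greedy router on a 1-D coupling graph (illustrative only; the paper's contribution is proving time-optimality of the inserted schedule):

    def route(mapping, coupling, q1, q2):
        """Insert SWAPs until q1 and q2 sit on adjacent physical qubits.
        mapping: logical -> physical; coupling: set of physical edges."""
        swaps = []
        def adjacent():
            a, b = mapping[q1], mapping[q2]
            return (a, b) in coupling or (b, a) in coupling
        while not adjacent():
            p = mapping[q1]
            step = p + 1 if mapping[q2] > p else p - 1   # move toward q2
            other = next((l for l, ph in mapping.items() if ph == step),
                         None)
            mapping[q1] = step                 # swap the two occupants
            if other is not None:
                mapping[other] = p
            swaps.append((p, step))
        return swaps

    # Line topology 0-1-2-3; bringing q0 next to q3 costs two SWAPs.
    coupling = {(0, 1), (1, 2), (2, 3)}
    mapping = {"q0": 0, "q1": 1, "q2": 2, "q3": 3}
    print(route(mapping, coupling, "q0", "q3"))   # [(0, 1), (1, 2)]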

DOI: 10.1145/3445814.3446706


Orchestrated trios: compiling for efficient communication in quantum programs with 3-qubit gates

Authors: Duckering, Casey and Baker, Jonathan M. and Litteken, Andrew and Chong, Frederic T.
Keywords: quantum computing, compiler, Toffoli, NISQ

Abstract

Current quantum computers are especially error prone and require high levels of optimization to reduce operation counts and maximize the probability that the compiled program will succeed. These computers only support operations decomposed into one- and two-qubit gates and only two-qubit gates between physically connected pairs of qubits. Typical compilers first decompose operations, then route data to connected qubits. We propose a new compiler structure, Orchestrated Trios, that first decomposes to the three-qubit Toffoli, routes the inputs of the higher-level Toffoli operations to groups of nearby qubits, then finishes decomposition to hardware-supported gates. This significantly reduces communication overhead by giving the routing pass access to the higher-level structure of the circuit instead of discarding it. A second benefit is the ability to select an architecture-tuned Toffoli decomposition, such as the 8-CNOT Toffoli, for the specific hardware qubits known after the routing pass. We perform real experiments on IBM Johannesburg showing an average 35% decrease in two-qubit gate count and 23% increase in success rate of a single Toffoli over Qiskit. We additionally compile many near-term benchmark algorithms, showing an average 344% (i.e., 4.44×) increase in simulated success rate on the Johannesburg architecture, and compare with other architecture types.

DOI: 10.1145/3445814.3446718


Artifacts for ‘FaasCache: Keeping Serverless Computing Alive with Greedy-Dual Caching’

Authors: Fuerst, Alexander and Sharma, Prateek
Keywords: FaaS, OpenWhisk, Serverless

Abstract

This contains two experiments: a Python discrete-event simulator, and an edited OpenWhisk that implements Greedy-Dual caching.
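
For readers unfamiliar with Greedy-Dual caching, the keep-alive variant can be sketched as follows (a toy Python model in the spirit of the artifact's simulator; the class and eviction loop are hypothetical). Each warm container's priority is the cache clock plus frequency times cost over size, and the lowest-priority container is evicted first:

    class GreedyDualKeepAlive:
        def __init__(self, capacity):
            self.capacity, self.clock = capacity, 0.0
            self.warm = {}   # func -> {"size", "cost", "freq", "prio"}

        def access(self, func, size, cost):
            if func in self.warm:        # warm start: bump frequency
                e = self.warm[func]
                e["freq"] += 1
            else:                        # cold start: admit, maybe evict
                while self.warm and size + sum(
                        e["size"] for e in self.warm.values()) > self.capacity:
                    victim = min(self.warm,
                                 key=lambda f: self.warm[f]["prio"])
                    self.clock = self.warm[victim]["prio"]  # age the clock
                    del self.warm[victim]
                e = self.warm[func] = {"size": size, "cost": cost, "freq": 1}
            e["prio"] = self.clock + e["freq"] * e["cost"] / e["size"]

    cache = GreedyDualKeepAlive(capacity=512)
    for f, sz, cost in [("a", 128, 2.0), ("b", 256, 1.0), ("a", 128, 2.0)]:
        cache.access(f, sz, cost)
    print(sorted(cache.warm))   # both fit; 'a' now has higher priority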

DOI: 10.1145/3445814.3446757


Replication Package for Article: HIPPOCRATES: Healing Persistent Memory Bugs without Doing Any Harm

Authors: Neal, Ian and Quinn, Andrew and Kasikci, Baris
Keywords: persistent memory, program repair

Abstract

This package contains the artifact for HIPPOCRATES. The artifact includes instructions for building and running HIPPOCRATES, as well as scripts and instructions used to reproduce the core results from the original article.

DOI: 10.1145/3445814.3446694


Replication Package for Article: Jaaru: Efficiently Model Checking Persistent Memory Programs

Authors: Gorjiara, Hamed and Xu, Guoqing Harry and Demsky, Brian
Keywords: Crash Consistency, Debugging, Jaaru, Persistent Memory, Testing

Abstract

This artifact contains a vagrant repository that downloads and compiles the source code for Jaaru, its companion compiler pass, and benchmarks. The artifact enables users to reproduce the bugs found by Jaaru in PMDK (i.e., Figure 11 of the paper) and RECIPE (i.e., Figure 12), as well as the performance results to compare with Yat (i.e., Figure 13).

DOI: 10.1145/3445814.3446735


Replication Package for Artifact: Corundum: Statically-Enforced Persistent Memory Safety

Authors: Hoseinzadeh, Morteza and Swanson, Steven
Keywords: debugging, formal verification, persistent memory programming

Abstract

Corundum is a persistent memory programming library in Rust which enforces safety rules statically. The artifact contains the source code of Corundum, the installation scripts for Corundum and the other libraries listed in the paper, the source code of the workloads, and scripts to run the experiments.

DOI: 10.1145/3445814.3446710


QRAFT ASPLOS 21 Code and Dataset

Authors: Patel, Tirthak and Tiwari, Devesh
Keywords: NISQ Computing, Quantum Computing, Quantum Error Mitigation

Abstract

The artifacts can be divided into three categories: (1) Raw data: circuit metadata and output generated as a direct result of running quantum circuits. (2) Processed and Trained data: the data processed to be fed as input to the machine learning model training, as well as the output data of testing samples using the trained model. (3) Tools: code and scripts used for running circuits on quantum computers, processing the output, as well as training models and generating the final output (prediction of state probabilities).

DOI: 10.1145/3445814.3446743


Noisy Variational Quantum Algorithm Simulation via Knowledge Compilation for Repeated Inference

Authors: Huang, Yipeng and Holtzen, Steven and Millstein, Todd and Van den Broeck, Guy and Martonosi, Margaret
Keywords: Bayesian networks, knowledge compilation, quantum circuit simulation, quantum computing

Abstract

This artifact demonstrates a new way to perform quantum circuit simulation. We convert quantum circuits into probabilistic graphical models, which are then compiled into a format that enables efficient repeated queries.

The artifact consists of a Docker image which includes Google Cirq, a quantum programming framework, which we have extended to use our proposed approach as a quantum circuit simulation backend. Also in the Docker image are two quantum circuit simulators based on existing approaches which we compare against as evaluation baselines.

We offer the Docker image via three routes: a hosted version on Docker Hub provides the latest version of our software and requires minimal setup; a Dockerfile is provided to show how to replicate our environment from scratch; and finally a stable archival version is available on Zenodo.

With minimal setup, you can run test cases in our Docker container showing the validity of our approach. We test our quantum circuit simulation approach using the randomized test harness that Google Cirq uses to test its quantum circuit simulation back ends. We also demonstrate correct simulation results for a benchmark suite of quantum algorithms.

The Docker image contains performance benchmarking experiments that replicate results of our paper at reduced input problem sizes. The experiment scripts generate PDFs showing graphs that plot simulation wall clock time against input quantum circuit sizes. The input problem sizes are large enough to show that our proposed approach achieves a speedup versus existing simulation tools.

DOI: 10.1145/3445814.3446750


Replication Package for Article: CutQC: Using Small Quantum Computers for Large Quantum Circuit Evaluations

Authors: Tang, Wei and Tomesh, Teague and Suchara, Martin and Larson, Jeffrey and Martonosi, Margaret
Keywords: Hybrid Computing, Quantum Circuit Cutting, Quantum Computing (QC)

Abstract

Our artifact provides the source code for the end-to-end CutQC toolflow. We also provide the benchmarking code for several sample runtime and fidelity experiments. The HPC parallel version of the code is not provided, as different HPC platforms require very different setups.

DOI: 10.1145/3445814.3446758


PMFuzz: Test Case Generation for Persistent Memory Programs

Authors: Liu, Sihang and Mahar, Suyash and Ray, Baishakhi and Khan, Samira
Keywords: Crash Consistency, Debugging, Fuzzing, Persistent Memory, Testing

Abstract

PMFuzz is a test case generator for PM programs, aiming to generate high-value test cases for PM testing tools. The generated test cases include both program inputs and initial PM images (normal images and crash images). The key idea of PMFuzz is to perform a targeted fuzzing on PM-related code regions and generate valid PM images by reusing the program logic. After generating the test cases, PMFuzz feeds them to the PM program and uses existing testing tools (XFDetector and PMemcheck) to detect crash consistency and performance bugs. The archived version of this artifact can be accessed using this DOI. We also maintain a GitHub repository at https://pmfuzz.persistentmemory.org/. For the latest version, please check our GitHub repository.

DOI: 10.1145/3445814.3446691


Artifact for PMDebugger: Fast, Flexible, and Comprehensive Bug Detection for Persistent Memory Programs

Authors: Di, Bang and Liu, Jiawen and Chen, Hao and Li, Dong
Keywords: Crash Consistency, Debugging, Persistent Memory, Testing

Abstract

This is the open-source site for PMDebugger (ASPLOS’21). For the latest version, please see our GitHub page: https://github.com/PASAUCMerced/PMDebugger.

DOI: 10.1145/3445814.3446744


PMEM-spec: persistent memory speculation (strict persistency can trump relaxed persistency)

Authors: Jeong, Jungi and Jung, Changhee
Keywords: Strict Persistency, Persistency Model, HW/SW Codesign

Abstract

Persistency models define the persist-order that controls the order in which stores update persistent memory (PM). As with memory consistency, the relaxed persistency models provide better performance than the strict ones by relaxing the ordering constraints. To support such relaxed persistency models, previous studies resort to APIs for annotating the persist-order in program and hardware implementations for enforcing the programmer-specified order. However, these approaches to supporting relaxed persistency impose costly burdens on both architects and programmers. In light of this, the goal of this study is to demonstrate that the strict persistency model can outperform the relaxed models with significantly less hardware complexity and programming difficulty. To achieve that, this paper presents PMEM-Spec that speculatively allows any PM accesses without stalling or buffering, detecting their ordering violation (e.g., misspeculation for PM loads and stores). PMEM-Spec treats misspeculation as power failure and thus leverages failure-atomic transactions to recover from misspeculation by aborting and restarting them purposely. Since the ordering violation rarely occurs, PMEM-Spec can accelerate persistent memory accesses without significant misspeculation penalty. Experimental results show that PMEM-Spec outperforms two epoch-based persistency models with Intel X86 ISA and the state-of-the-art hardware support by 27.2% and 10.6%, respectively.

DOI: 10.1145/3445814.3446698


VSync: push-button verification and optimization for synchronization primitives on weak memory models

Authors: Oberhauser, Jonas and Chehab, Rafael Lourenco de Lima and Behrens, Diogo and Fu, Ming and Paolillo, Antonio and Oberhauser, Lilith and Bhat, Koustubha and Wen, Yuzhong and Chen, Haibo and Kim, Jaeho and Vafeiadis, Viktor
Keywords: weak memory models, model checking

Abstract

Implementing highly efficient and correct synchronization primitives on modern Weak Memory Model (WMM) architectures, such as ARM and RISC-V, is very difficult even for human experts. We introduce VSync, a framework to assist in optimizing and verifying synchronization primitives on WMM architectures. VSync automatically detects missing and overly-constrained barriers, while ensuring essential safety and liveness properties. VSync relies on two novel techniques: 1) Adaptive Linear Relaxation (ALR), which utilizes barrier monotonicity and speculation to quickly find a correct maximally-relaxed barrier combination; and 2) Await Model Checking (AMC), which for the first time makes it possible to check termination of await loops on WMMs. We use VSync to automatically optimize and verify state-of-the-art synchronization primitives from systems like seL4, CertiKOS, musl libc, DPDK, Concurrency Kit, and Linux, as well as from the literature. In doing so, we found three correctness bugs on deployed systems due to missing barriers and several performance bugs due to overly-constrained barriers. Synchronization primitives optimized by VSync have similar performance to industrial libraries optimized by experts.
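
The shape of Adaptive Linear Relaxation can be conveyed with a small Python sketch (a toy barrier lattice and checker; the real tool drives a weak-memory model checker and exploits barrier monotonicity): weaken each barrier site until the checker fails, then back off one step.

    ORDER = ["seq_cst", "acq_rel", "acquire", "release", "relaxed"]

    def relax(barriers, model_check):
        for site in list(barriers):
            for mode in ORDER[ORDER.index(barriers[site]) + 1:]:
                trial = {**barriers, site: mode}
                if model_check(trial):   # still safe and live: keep it
                    barriers = trial
                else:                    # too weak for this site: stop
                    break
        return barriers

    # Toy checker: pretend this site must stay at least 'acquire';
    # monotonicity is what makes "weaken until it breaks" sound.
    def toy_check(b):
        return ORDER.index(b["lock.acquire"]) <= ORDER.index("acquire")

    print(relax({"lock.acquire": "seq_cst"}, toy_check))
    # {'lock.acquire': 'acquire'}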

DOI: 10.1145/3445814.3446748


Replication Package for Article: “CubicleOS: A Library OS with Software Componentisation for Practical Isolation”

Authors: Sartakov, Vasily A. and Vilanova, Lluís and Pietzuch, Peter
Keywords: compartments, Intel MPK, inter-process communication, isolation

Abstract

This artefact contains the library OS, two applications, the isolation monitor, and scripts to reproduce the experiments from the ASPLOS 2021 paper by V. A. Sartakov, L. Vilanova, and P. Pietzuch, “CubicleOS: A Library OS with Software Componentisation for Practical Isolation”, which isolates components of a monolithic library OS without the use of message-based IPC primitives.

DOI: 10.1145/3445814.3446731


Benchmarking, Analysis, and Optimization of Serverless Function Snapshots

Authors: Ustiugov, Dmitrii and Petrov, Plamen and Kogias, Marios and Bugnion, Edouard and Grot, Boris
Keywords: cloud computing, datacenters, serverless, snapshots, virtualization

Abstract

This artifact contains the source code of the vHive-CRI host orchestrator and includes the necessary binary files of its dependencies, namely the Firecracker-Containerd shim binaries, the Firecracker hypervisor and jailer, a default rootfs for Firecracker MicroVMs, and the MinIO object store server and client binaries. The reviewers require Ubuntu 18.04 with root access and hardware virtualization support (e.g., VT-x); a platform with the root partition mounted on an SSD is preferred. The artifact lists the instructions to reproduce Fig. 8 for the configuration that uses vanilla Firecracker snapshots and the configuration that uses REAP-based snapshots. The reviewers can run functions from the representative FunctionBench suite, using pre-built Docker images.

DOI: 10.1145/3445814.3446714


Rhythmic pixel regions: multi-resolution visual sensing system towards high-precision visual computing at low power

Authors: Kodukula, Venkatesh and Shearer, Alexander and Nguyen, Van and Lingutla, Srinivas and Liu, Yifei and LiKamWa, Robert
Keywords: visual computing, pixel discard, augmented reality

Abstract

High spatiotemporal resolution can offer high precision for vision applications, which is particularly useful to capture the nuances of visual features, such as for augmented reality. Unfortunately, capturing and processing high-spatiotemporal-resolution visual frames generates energy-expensive memory traffic. On the other hand, low-resolution frames can reduce pixel memory throughput, but also reduce the opportunities for high-precision visual sensing. However, our intuition is that not all parts of the scene need to be captured at a uniform resolution. Selectively and opportunistically reducing resolution for different regions of image frames can yield high-precision visual computing at energy-efficient memory data rates. To this end, we develop a visual sensing pipeline architecture that flexibly allows application developers to dynamically adapt the spatial resolution and update rate of different “rhythmic pixel regions” in the scene. We develop a system that ingests pixel streams from commercial image sensors with their standard raster-scan pixel read-out patterns, but only encodes relevant pixels prior to storing them in the memory. We also present streaming hardware to decode the stored rhythmic pixel region stream into traditional frame-based representations to feed into standard computer vision algorithms. We integrate our encoding and decoding hardware modules into existing video pipelines. On top of this, we develop runtime support allowing developers to flexibly specify the region labels. Evaluating our system on a Xilinx FPGA platform over three vision workloads shows a 43-64% reduction in interface traffic and memory footprint, while providing controllable task accuracy.
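
The encode/decode pair can be prototyped in a few lines of NumPy (a hypothetical frame format, loosely following the rhythmic-pixel idea): only pixels covered by a region label are stored, at that region's stride, and the decoder rebuilds a dense frame for conventional vision algorithms.

    import numpy as np

    def encode(frame, regions):
        """regions: list of (y0, y1, x0, x1, stride); stride 2 = half res."""
        packets = []
        for y0, y1, x0, x1, stride in regions:
            packets.append(((y0, x0, stride),
                            frame[y0:y1:stride, x0:x1:stride].copy()))
        return packets   # pixels outside every region are simply dropped

    def decode(packets, shape):
        out = np.zeros(shape, dtype=np.uint8)
        for (y0, x0, stride), block in packets:
            h, w = block.shape
            # Replicate samples to rebuild a dense frame representation.
            out[y0:y0 + h * stride, x0:x0 + w * stride] = np.kron(
                block, np.ones((stride, stride), dtype=np.uint8))
        return out

    frame = np.arange(64, dtype=np.uint8).reshape(8, 8)
    pkts = encode(frame, [(0, 8, 0, 4, 2)])   # left half at half resolution
    dense = decode(pkts, frame.shape)         # right half stays zero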

DOI: 10.1145/3445814.3446737


Q-VR: system-level design for future mobile collaborative virtual reality

Authors: Xie, Chenhao and Li, Xie and Hu, Yang and Peng, Huwan and Taylor, Michael and Song, Shuaiwen Leon
Keywords: Virtual Reality, System-on-Chip, Realtime Learning, Planet-Scale System Design, Mobile System

Abstract

High Quality Mobile Virtual Reality (VR) is what the incoming graphics technology era demands: users around the world, regardless of their hardware and network conditions, can all enjoy the immersive virtual experience. However, the state-of-the-art software-based mobile VR designs cannot fully satisfy the realtime performance requirements due to the highly interactive nature of user’s actions and complex environmental constraints during VR execution. Inspired by the unique human visual system effects and the strong correlation between VR motion features and realtime hardware-level information, we propose Q-VR, a novel dynamic collaborative rendering solution via software-hardware co-design for enabling future low-latency high-quality mobile VR. At software-level, Q-VR provides flexible high-level tuning interface to reduce network latency while maintaining user perception. At hardware-level, Q-VR accommodates a wide spectrum of hardware and network conditions across users by effectively leveraging the computing capability of the increasingly powerful VR hardware. Extensive evaluation on real-world games demonstrates that Q-VR can achieve an average end-to-end performance speedup of 3.4x (up to 6.7x) over the traditional local rendering design in commercial VR devices, and a 4.1x frame rate improvement over the state-of-the-art static collaborative rendering.

DOI: 10.1145/3445814.3446715


Warehouse-scale video acceleration: co-design and deployment in the wild

Authors: Ranganathan, Parthasarathy and Stodolsky, Daniel and Calow, Jeff and Dorfman, Jeremy and Guevara, Marisabel and Smullen IV, Clinton Wills and Kuusela, Aki and Balasubramanian, Raghu and Bhatia, Sandeep and Chauhan, Prakash and Cheung, Anna and Chong, In Suk and Dasharathi, Niranjani and Feng, Jia and Fosco, Brian and Foss, Samuel and Gelb, Ben and Gwin, Sara J. and Hase, Yoshiaki and He, Da-ke and Ho, C. Richard and Huffman Jr., Roy W. and Indupalli, Elisha and Jayaram, Indira and Kongetira, Poonacha and Kyaw, Cho Mon and Laursen, Aaron and Li, Yuan and Lou, Fong and Lucke, Kyle A. and Maaninen, JP and Macias, Ramon and Mahony, Maire and Munday, David Alexander and Muroor, Srikanth and Penukonda, Narayana and Perkins-Argueta, Eric and Persaud, Devin and Ramirez, Alex and Rautio, Ville-Mikko and Ripley, Yolanda and Salek, Amir and Sekar, Sathish and Sokolov, Sergey N. and Springer, Rob and Stark, Don and Tan, Mercedes and Wachsler, Mark S. and Walton, Andrew C. and Wickeraad, David A. and Wijaya, Alvin and Wu, Hon Kwan
Keywords: warehouse-scale computing, video transcoding, hardware-software codesign, domain-specific accelerators

Abstract

Video sharing (e.g., YouTube, Vimeo, Facebook, TikTok) accounts for the majority of internet traffic, and video processing is also foundational to several other key workloads (video conferencing, virtual/augmented reality, cloud gaming, video in Internet-of-Things devices, etc.). The importance of these workloads motivates larger video processing infrastructures and – with the slowing of Moore’s law – specialized hardware accelerators to deliver more computing at higher efficiencies. This paper describes the design and deployment, at scale, of a new accelerator targeted at warehouse-scale video transcoding. We present our hardware design including a new accelerator building block – the video coding unit (VCU) – and discuss key design trade-offs for balanced systems at data center scale and co-designing accelerators with large-scale distributed software systems. We evaluate these accelerators “in the wild” serving live data center jobs, demonstrating 20-33x improved efficiency over our prior well-tuned non-accelerated baseline. Our design also enables effective adaptation to changing bottlenecks and improved failure management, and new workload capabilities not otherwise possible with prior systems. To the best of our knowledge, this is the first work to discuss video acceleration at scale in large warehouse-scale environments.

DOI: 10.1145/3445814.3446723


Automatically detecting and fixing concurrency bugs in Go software systems

Authors: Liu, Ziheng and Zhu, Shuofei and Qin, Boqin and Chen, Hao and Song, Linhai
Keywords: Static Analysis, Go, Concurrency Bugs, Bug Fixing, Bug Detection

Abstract

Go is a statically typed programming language designed for efficient and reliable concurrent programming. For this purpose, Go provides lightweight goroutines and recommends passing messages using channels as a less error-prone means of thread communication. Go has become increasingly popular in recent years and has been adopted to build many important infrastructure software systems. However, a recent empirical study shows that concurrency bugs, especially those due to misuse of channels, exist widely in Go. These bugs severely hurt the reliability of Go concurrent systems. To fight Go concurrency bugs caused by misuse of channels, this paper proposes a static concurrency bug detection system, GCatch, and an automated concurrency bug fixing system, GFix. After disentangling an input Go program, GCatch models the complex channel operations in Go using a novel constraint system and applies a constraint solver to identify blocking bugs. GFix automatically patches blocking bugs detected by GCatch using Go’s channel-related language features. We apply GCatch and GFix to 21 popular Go applications, including Docker, Kubernetes, and gRPC. In total, GCatch finds 149 previously unknown blocking bugs due to misuse of channels and GFix successfully fixes 124 of them. We have reported all detected bugs and generated patches to developers. So far, developers have fixed 125 blocking misuse-of-channel bugs based on our reporting. Among them, 87 bugs are fixed by applying GFix’s patches directly.

DOI: 10.1145/3445814.3446756


C11Tester Artifact

Authors: Luo, Weiyu and Demsky, Brian
Keywords: C++11, concurrency, data races, memory models

Abstract

The artifact contains a c11tester-vagrant directory and a tsan11-tsan11rec-docker directory. The c11tester-vagrant directory is a vagrant repository that compiles the source code for C11Tester, LLVM, the companion compiler pass, and benchmarks for C11Tester. The tsan11-tsan11rec-docker directory contains benchmarks and a docker image with prebuilt LLVMs for tsan11 and tsan11rec.

DOI: 10.1145/3445814.3446711


Kard: lightweight data race detection with per-thread memory protection

Authors: Ahmad, Adil and Lee, Sangho and Fonseca, Pedro and Lee, Byoungyoung
Keywords: memory protection, lock, data race, concurrency

Abstract

Finding data race bugs in multi-threaded programs has proven challenging. A promising direction is to use dynamic detectors that monitor the program’s execution for data races. However, despite extensive work on dynamic data race detection, most proposed systems for commodity hardware incur prohibitive overheads due to expensive compiler instrumentation of memory accesses; hence, they are not efficient enough to be used in all development and testing settings. KARD is a lightweight system that dynamically detects data races caused by inconsistent lock usage—when a program concurrently accesses the same memory object using different locks or only some of the concurrent accesses are synchronized using a common lock. Unlike existing detectors, KARD does not monitor memory accesses using expensive compiler instrumentation. Instead, KARD leverages commodity per-thread memory protection, Intel Memory Protection Keys (MPK). Using MPK, KARD ensures that a shared object is only accessible to a single thread in its critical section, and captures all violating accesses from other concurrent threads. KARD overcomes various limitations of MPK by introducing key-enforced race detection, employing consolidated unique page allocation, carefully managing protection keys, and automatically pruning out non-racy or redundant violations. Our evaluation shows that KARD detects all data races caused by inconsistent lock usage and has a low geometric mean execution time overhead: 7.0% on PARSEC and SPLASH-2x benchmarks and 5.3% on a set of real-world applications (NGINX, memcached, pigz, and Aget).

DOI: 10.1145/3445814.3446727


Paper Quantifying the Design-Space Tradeoffs in Autonomous Drones artifact, including software, data, and build guide for the open-source drone

作者: Hadidi, Ramyad and Asgari, Bahar and Jijina, Sam and Amyette, Adriana and Shoghi, Nima and Kim, Hyesoon
关键词: autonomous drones, build guide, design-space analysis, open-source platform, power measurements, SLAM

Abstract

This artifact describes our open-source experimental drone framework that is customizable across its hardware-software stack. The main and first portion of the artifact focuses on building the drone, which complements the beginning sections of the paper. The build guide consists of two parts: hardware and software. Second, as an example of possible experiments, we provide sample scripts for measuring important metrics, such as Linux perf profiles and SLAM performance. Third, the artifact contains the raw data for the graphs in the paper.

DOI: 10.1145/3445814.3446721


Robomorphic computing: a design methodology for domain-specific accelerators parameterized by robot morphology

作者: Neuman, Sabrina M. and Plancher, Brian and Bourgeat, Thomas and Tambe, Thierry and Devadas, Srinivas and Reddi, Vijay Janapa
关键词: robotics, motion planning, hardware accelerators, dynamics

Abstract

Robotics applications have hard time constraints and heavy computational burdens that can greatly benefit from domain-specific hardware accelerators. For the latency-critical problem of robot motion planning and control, there exists a performance gap of at least an order of magnitude between joint actuator response rates and state-of-the-art software solutions. Hardware acceleration can close this gap, but it is essential to define automated hardware design flows to keep the design process agile as applications and robot platforms evolve. To address this challenge, we introduce robomorphic computing: a methodology to transform robot morphology into a customized hardware accelerator morphology. We (i) present this design methodology, using robot topology and structure to exploit parallelism and matrix sparsity patterns in accelerator hardware; (ii) use the methodology to generate a parameterized accelerator design for the gradient of rigid body dynamics, a key kernel in motion planning; (iii) evaluate FPGA and synthesized ASIC implementations of this accelerator for an industrial manipulator robot; and (iv) describe how the design can be automatically customized for other robot models. Our FPGA accelerator achieves speedups of 8×

DOI: 10.1145/3445814.3446746


Gamma: leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication

作者: Zhang, Guowei and Attaluri, Nithya and Emer, Joel S. and Sanchez, Daniel
关键词: sparse matrix multiplication, sparse linear algebra, high-radix merge, explicit data orchestration, data movement reduction, accelerator, Gustavson’s algorithm

Abstract

Sparse matrix-sparse matrix multiplication (spMspM) is at the heart of a wide range of scientific and machine learning applications. spMspM is inefficient on general-purpose architectures, making accelerators attractive. However, prior spMspM accelerators use inner- or outer-product dataflows that suffer poor input or output reuse, leading to high traffic and poor performance. These prior accelerators have not explored Gustavson’s algorithm, an alternative spMspM dataflow that does not suffer from these problems but features irregular memory access patterns that prior accelerators do not support. We present GAMMA, an spMspM accelerator that uses Gustavson’s algorithm to address the challenges of prior work. GAMMA performs spMspM’s computation using specialized processing elements with simple high-radix mergers, and performs many merges in parallel to achieve high throughput. GAMMA uses a novel on-chip storage structure that combines features of both caches and explicitly managed buffers. This structure captures Gustavson’s irregular reuse patterns and streams thousands of concurrent sparse fibers (i.e., lists of coordinates and values for rows or columns) with explicitly decoupled data movement. GAMMA features a new dynamic scheduling algorithm to achieve high utilization despite irregularity. We also present new preprocessing algorithms that boost GAMMA’s efficiency and versatility. As a result, GAMMA outperforms prior accelerators by gmean 2.1x, and reduces memory traffic by gmean 2.2x and by up to 13x.
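
To make Gustavson's dataflow concrete, here is a minimal CPU sketch of the row-wise spMspM computation C(i,:) = Σ_j A(i,j)·B(j,:) that GAMMA accelerates. The dense accumulator below stands in for GAMMA's high-radix hardware mergers and is purely illustrative.

    /* Sketch of Gustavson's row-wise dataflow over CSR matrices. */
    #include <stdio.h>

    /* CSR inputs: Xp = row pointers, Xj = column ids, Xv = values.
     * Emits row i of C into (Cj, Cv); acc/touched are size-n scratch. */
    static int gustavson_row(int i, const int *Ap, const int *Aj, const double *Av,
                             const int *Bp, const int *Bj, const double *Bv,
                             double *acc, int *touched, int *Cj, double *Cv) {
        int nnz = 0;
        for (int k = Ap[i]; k < Ap[i + 1]; k++) {     /* each nonzero A(i,j) */
            int j = Aj[k];
            for (int t = Bp[j]; t < Bp[j + 1]; t++) { /* scale-merge row B(j,:) */
                int c = Bj[t];
                if (!touched[c]) { touched[c] = 1; acc[c] = 0.0; Cj[nnz++] = c; }
                acc[c] += Av[k] * Bv[t];
            }
        }
        for (int t = 0; t < nnz; t++) { Cv[t] = acc[Cj[t]]; touched[Cj[t]] = 0; }
        return nnz;
    }

    int main(void) {
        /* A = [[1,2],[0,3]], B = [[4,0],[0,5]] in CSR form (n = 2). */
        int Ap[] = {0, 2, 3}, Aj[] = {0, 1, 1};  double Av[] = {1, 2, 3};
        int Bp[] = {0, 1, 2}, Bj[] = {0, 1};     double Bv[] = {4, 5};
        double acc[2] = {0}, Cv[2];  int touched[2] = {0}, Cj[2];
        int nnz = gustavson_row(0, Ap, Aj, Av, Bp, Bj, Bv, acc, touched, Cj, Cv);
        for (int t = 0; t < nnz; t++) printf("C(0,%d) = %g\n", Cj[t], Cv[t]);
        return 0;  /* prints C(0,0) = 4 and C(0,1) = 10 */
    }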

DOI: 10.1145/3445814.3446702


Reducing solid-state drive read latency by optimizing read-retry

作者: Park, Jisung and Kim, Myungsuk and Chun, Myoungjun and Orosa, Lois and Kim, Jihong and Mutlu, Onur
关键词: solid-state drives (SSDs), read-retry, latency, 3D NAND flash memory

Abstract

3D NAND flash memory with advanced multi-level cell techniques provides high storage density, but suffers from significant performance degradation due to a large number of read-retry operations. Although the read-retry mechanism is essential to ensuring the reliability of modern NAND flash memory, it can significantly increase the read latency of an SSD by introducing multiple retry steps that read the target page again with adjusted read-reference voltage values. Through a detailed analysis of the read mechanism and rigorous characterization of 160 real 3D NAND flash memory chips, we find new opportunities to reduce the read-retry latency by exploiting two advanced features widely adopted in modern NAND flash-based SSDs: 1) the CACHE READ command and 2) a strong ECC engine. First, we can reduce the read-retry latency using the advanced CACHE READ command that allows a NAND flash chip to perform consecutive reads in a pipelined manner. Second, there exists a large ECC-capability margin in the final retry step that can be used for reducing the chip-level read latency. Based on our new findings, we develop two new techniques that effectively reduce the read-retry latency: 1) Pipelined Read-Retry (PR²) and 2) Adaptive Read-Retry (AR²). PR² reduces the latency of a read-retry operation by pipelining consecutive retry steps using the CACHE READ command. AR² shortens the latency of each retry step by dynamically reducing the chip-level read latency depending on the current operating conditions that determine the ECC-capability margin. Our evaluation using twelve real-world workloads shows that our proposal improves SSD response time by up to 31.5% (17% on average) over a state-of-the-art baseline with only small changes to the SSD controller.
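
A back-of-the-envelope model helps show why pipelining retry steps pays off. Assuming (our simplification, with made-up numbers) that each retry step costs an array read tR plus a data-out transfer tDMA, and that CACHE READ lets step k+1's array read overlap step k's transfer:

    /* Illustrative latency model for PR²-style pipelining; the constants
     * are hypothetical placeholders, not measurements from the paper. */
    #include <stdio.h>

    int main(void) {
        double tR = 61.0, tDMA = 25.0;   /* hypothetical microseconds */
        for (int steps = 1; steps <= 8; steps++) {
            double baseline  = steps * (tR + tDMA);            /* serial retries */
            double pipelined = tR + (steps - 1) *              /* overlapped */
                               (tR > tDMA ? tR : tDMA) + tDMA;
            printf("%d retry step(s): baseline %.0f us, pipelined %.0f us\n",
                   steps, baseline, pipelined);
        }
        return 0;
    }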

DOI: 10.1145/3445814.3446719


Replication Package for Article – RecSSD: Near Data Processing for Solid State Drive Based Recommendation Inference

作者: Wilkening, Mark and Gupta, Udit and Hsia, Samuel and Trippel, Caroline and Wu, Carole-Jean and Brooks, David and Wei, Gu-Yeon
关键词: Caffe2, DLRM, OpenSSD, Python, UNVMe

Abstract

RecSSD is composed of a number of open-source artifacts. First, we implement a fully-functional NDP SLS operator in the open-source Cosmos+ OpenSSD system [4], provided in the RecSSD-OpenSSDFirmware repository [7]. To maintain compatibility with the NVMe protocols, the RecSSD interface is implemented within Micron’s UNVMe driver library [10], provided in the RecSSD-UNVMeDriver repository [9]. To evaluate RecSSD, we use a diverse set of eight industry-representative recommendation models provided in DeepRecInfra [18], implemented in Python using Caffe2 [1] and provided in the RecSSD-RecInfra repository [8]. In addition to the models themselves, we instrument the open-source synthetic trace generators from Facebook’s DLRM [29] with our locality analysis from production-scale recommendation systems, also included in the RecSSD-RecInfra repository.
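
The NDP operator at the heart of RecSSD is embedding-table gather-and-sum (SparseLengthsSum, SLS). A plain host-side C sketch of that operator, purely to show what the firmware computes near the data:

    /* SLS: gather embedding rows by index and sum them per lookup "bag".
     * Host-side illustration; the artifact runs this inside OpenSSD firmware. */
    #include <stdio.h>

    #define DIM 4  /* embedding dimension (illustrative) */

    /* ids: concatenated indices; lengths[b]: how many ids belong to bag b. */
    static void sls(float table[][DIM], const int *ids, const int *lengths,
                    int bags, float out[][DIM]) {
        int p = 0;
        for (int b = 0; b < bags; b++) {
            for (int d = 0; d < DIM; d++) out[b][d] = 0.0f;
            for (int k = 0; k < lengths[b]; k++, p++)    /* pooled gather */
                for (int d = 0; d < DIM; d++)
                    out[b][d] += table[ids[p]][d];
        }
    }

    int main(void) {
        float table[3][DIM] = {{1,1,1,1},{2,2,2,2},{4,4,4,4}};
        int ids[] = {0, 2, 1}, lengths[] = {2, 1};       /* two bags */
        float out[2][DIM];
        sls(table, ids, lengths, 2, out);
        printf("bag0[0]=%g bag1[0]=%g\n", out[0][0], out[1][0]); /* 5 and 2 */
        return 0;
    }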

DOI: 10.1145/3445814.3446763


Prolonging 3D NAND SSD lifetime via read latency relaxation

作者: Liu, Chun-Yi and Lee, Yunju and Jung, Myoungsoo and Kandemir, Mahmut Taylan and Choi, Wonil
关键词: read latency relaxation, SSD read characterization, 3D NAND SSD

Abstract

The adoption of 3D NAND has significantly increased the SSD density; however, 3D NAND density-increasing techniques, such as extensive stacking of cell layers, can amplify read disturbances and shorten SSD lifetime. From our lifetime-impact characterization on 8 state-of-the-art SSDs, we observe that the 3D TLC/QLC SSDs can be worn out by low read-only workloads within their warranty period, since a huge number of read-disturbance-induced rewrites are performed in the background. To understand alternative read disturbance mitigation opportunities, we also conducted read-latency characterizations on 2 other SSDs without the background rewrite mechanism. The collected results indicate that, without the background rewriting, the read latencies of the majority of data become higher as the number of reads on the data increases. Motivated by these two characterizations, in this paper, we propose to relax the short read latency constraint on the high-density 3D SSDs. Specifically, our proposal relies on hint information passed from applications to SSDs that specifies the expected read performance. By doing so, the lifetime consumption caused by the read-induced writes can be reduced, thereby prolonging the SSD lifetime. Detailed experimental evaluations show that our proposal can reduce up to 56% of the rewrite-induced spent-lifetime with only 2% lower performance, under a file-server application.

DOI: 10.1145/3445814.3446733


Replication Package for Article “PIBE: Practical Kernel Control-Flow Hardening with Profile-Guided Indirect Branch Elimination”

作者: Duta, Victor and Giuffrida, Cristiano and Bos, Herbert and van der Kouwe, Erik
关键词: kernel, LMBench, profile-guided optimizations, transient execution

Abstract

Our artifact provides x86-64 kernel binaries for most of the kernel configurations we evaluated in the paper, along with scripts to configure LMBench, run and benchmark each kernel configuration and regenerate the syscall latencies and overheads discussed in the main tables of the paper. This allows the evaluation of our results on an Intel i7-8700K (Skylake) CPU or similar micro-architectures.

We also provide source code for the tools used during the kernel build process (e.g., binutils, LLVM 10), the code of our LLVM optimization passes, and the kernel source code to regenerate the kernel binaries used in the workflow of our evaluation. We supply the user with scripts to regenerate our Apache and LMBench profiling workloads, rebuild the kernel binaries provided in the evaluation, or customize the kernels with a user-specified selection of transient mitigations and optimization strategies.

Furthermore, we also provide portable Apache and LMBench profiling workloads to speed up the customization process without the need to create your own profiling workloads.
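
The transformation named in the title can be illustrated in a few lines: with profile data, a hot indirect call (which costs a retpoline under transient-execution mitigations) is promoted to a compare-and-direct-call, with the indirect path kept only as a cold fallback. The hand-written version below only shows the shape of the rewrite; PIBE performs it automatically on kernel code in LLVM.

    /* Profile-guided indirect branch elimination, in miniature. */
    #include <stdio.h>

    static int read_op(int x)  { return x + 1; }
    static int write_op(int x) { return x - 1; }

    static int dispatch(int (*handler)(int), int x) {
        if (handler == read_op)      /* profile says most calls hit this */
            return read_op(x);       /* direct call: no retpoline needed */
        return handler(x);           /* cold fallback: stays indirect */
    }

    int main(void) {
        printf("%d %d\n", dispatch(read_op, 10), dispatch(write_op, 10));
        return 0;                    /* prints 11 9 */
    }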

DOI: 10.1145/3445814.3446740


Computing with time: microarchitectural weird machines

作者: Evtyushkin, Dmitry and Benjamin, Thomas and Elwell, Jesse and Eitel, Jeffrey A. and Sapello, Angelo and Ghosh, Abhrajit
关键词: weird machines, speculative execution, side channel, obfuscation, Microarchitecture security

Abstract

Side-channel attacks such as Spectre rely on properties of modern CPUs that permit discovery of microarchitectural state via timing of various operations. The Weird Machine concept is an increasingly popular model for characterization of emergent execution that arises from side-effects of conventional computing constructs. In this work we introduce Microarchitectural Weird Machines (µWM): code constructions that allow performing computation through the means of side effects and conflicts between microarchitectural entities such as branch predictors and caches. The results of such computations are observed as timing variations. We demonstrate how µWMs can be used as a powerful obfuscation engine where computation operates based on events unobservable to conventional anti-obfuscation tools based on emulation, debugging, static and dynamic analysis techniques. We demonstrate that µWMs can be used to reliably perform arbitrary computation by implementing a SHA-1 hash function. We then present a practical example in which we use a µWM to obfuscate malware code such that its passive operation is invisible to an observer with full power to view the architectural state of the system until the code receives a trigger. When the trigger is received the malware decrypts and executes its payload. To show the effectiveness of obfuscation we demonstrate its use in the concealment and subsequent execution of a payload that exfiltrates a shadow password file, and a payload that creates a reverse shell.

DOI: 10.1145/3445814.3446729


Artifact for ‘HerQules: Securing Programs via Hardware-Enforced Message Queues’

作者: Chen, Daming D. and Lim, Wen Shih and Bakhshalipour, Mohammad and Gibbons, Phillip B. and Hoe, James C. and Parno, Bryan
关键词: compiler, fpga, ipc, llvm, nginx, ripe, spec, zsim

Abstract

Source code, experiment data, and virtual machines with precompiled benchmarks.

DOI: 10.1145/3445814.3446736


Effective simulation and debugging for a high-level hardware language using software compilers

作者: Pit-Claudel, Clément
关键词: hardware debugging, compilation, Hardware simulation

Abstract

Rule-based hardware-design languages (RHDLs) promise to enhance developer productivity by offering convenient abstractions. Advanced compiler technology keeps the cost of these abstractions low, generating circuits with excellent area and timing properties. Unfortunately, comparatively little effort has been spent on building simulators and debuggers for these languages, so users often simulate and debug their designs at the RTL level. This is problematic because generated circuits typically suffer from poor readability, as compiler optimizations can break high-level abstractions. Worse, optimizations that operate under the assumption that concurrency is essentially free yield faster circuits but often actively hurt simulation performance on platforms with limited concurrency, like desktop computers or servers. This paper demonstrates the benefits of completely separating the simulation and synthesis pipelines. We propose a new approach, yielding the first compiler designed for effective simulation and debugging of a language in the Bluespec family. We generate cycle-accurate C++ models that are readable, compatible with a wide range of traditional software-debugging tools, and fast (often two to three times faster than circuit-level simulation). We achieve these results by optimizing for sequential performance and using static analysis to minimize redundant work. The result is a vastly improved hardware-design experience, which we demonstrate on embedded processor designs and DSP building blocks using performance benchmarks and debugging case studies.
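
To give a feel for the compiled simulators the paper argues for, here is a hand-written single-rule example in C: a rule from a Bluespec-style design becomes a plain sequential function over a state struct, with a shadow copy providing one-rule-at-a-time commit semantics. This is our illustration of the idea, not the compiler's actual output (which is C++).

    /* One rule of a rule-based design, compiled by hand to sequential C. */
    #include <stdio.h>

    struct state { unsigned pc; unsigned counter; };

    /* Rule "count": fires when counter < 10; reads old state, writes next. */
    static int rule_count(const struct state *s, struct state *next) {
        if (s->counter >= 10) return 0;     /* guard fails: rule does not fire */
        *next = *s;
        next->counter = s->counter + 1;
        next->pc = s->pc + 4;
        return 1;
    }

    int main(void) {
        struct state s = {0, 0};
        for (int cycle = 0; cycle < 12; cycle++) {
            struct state shadow;
            if (rule_count(&s, &shadow))    /* commit only if the rule fired */
                s = shadow;
            /* being ordinary C, every cycle is steppable in gdb */
        }
        printf("pc=%u counter=%u\n", s.pc, s.counter);  /* pc=40 counter=10 */
        return 0;
    }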

DOI: 10.1145/3445814.3446720


Replication Package for Article: Compiler Infrastructure for Accelerator Generators

作者: Nigam, Rachit and Thomas, Samuel and Li, Zhijing and Sampson, Adrian
关键词: Accelerator Design, Intermediate Language

Abstract

Our artifact packages an environment that can be used to reproduce the figures in the paper and perform similar evaluations. It is available at the following link: https://zenodo.org/record/4432747

It includes the following:
- futil: The Calyx compiler.
- fud: Driver for the futil compiler and hardware tools.
- Linear algebra PolyBench written in Dahlia.

DOI: 10.1145/3445814.3446712


Artifact for Paper: Compiler-Driven FPGA Virtualization with SYNERGY

作者: Landgraf, Joshua and Yang, Tiffany and Lin, Will and Rossbach, Christopher J. and Schkufza, Eric
关键词: Compilers, FPGAs, Operating Systems, Virtualization

Abstract

This artifact contains the code for all the currently-available SYNERGY (Cascade) backends, including the experimental new backend for F1. The artifact also includes the benchmarks from the paper, data files to run them with, and experiment files to replicate the experiments shown in the paper on the SW and F1 backends. Instructions are documented in README.md, ARTIFACT.md, and experiments/README.md.

DOI: 10.1145/3445814.3446755


BayesPerf: minimizing performance monitoring errors using Bayesian statistics

作者: Banerjee, Subho S. and Jha, Saurabh and Kalbarczyk, Zbigniew and Iyer, Ravishankar K.
关键词: Sampling Errors, Probabilistic Graphical Model, Performance Counter, Error Detection, Error Correction, Accelerator

Abstract

Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to non-determinism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling PCIe transfers.
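
To see where the multiplexing error comes from, recall that Linux perf reports a (value, time_enabled, time_running) triple per counter, and users linearly extrapolate as sketched below; BayesPerf replaces this point estimate with a jointly inferred distribution. The numbers here are hypothetical.

    /* The standard linear extrapolation applied to multiplexed counters. */
    #include <stdio.h>

    static double extrapolate(unsigned long long value,
                              unsigned long long time_enabled,
                              unsigned long long time_running) {
        if (time_running == 0) return 0.0;
        return (double)value * (double)time_enabled / (double)time_running;
    }

    int main(void) {
        /* Hypothetical: the counter ran for 25% of the measurement window. */
        unsigned long long v = 1000000, en = 4000000, run = 1000000;
        printf("estimated events: %.0f\n", extrapolate(v, en, run)); /* 4e6 */
        /* If the workload was bursty while the counter was off, the true
         * count can be far from 4e6; that gap is the error BayesPerf models. */
        return 0;
    }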

DOI: 10.1145/3445814.3446739


Training for multi-resolution inference using reusable quantization terms

作者: Zhang, Sai Qian and McDanel, Bradley and Kung, H. T. and Dong, Xin
关键词: systolic arrays, quantization, joint-optimization training, deep neural networks, co-design, Multi-resolution inference

Abstract

Low-resolution uniform quantization (e.g., 4-bit bitwidth) for both Deep Neural Network (DNN) weights and data has emerged as an important technique for efficient inference. Departing from conventional quantization, we describe a novel training approach to support inference at multiple resolutions by reusing a single set of quantization terms (the same set of nonzero bits in values). The proposed approach streamlines the training and supports dynamic selection of resolution levels during inference. We evaluate the method on a diverse range of applications including multiple CNNs on ImageNet, an LSTM on Wikitext-2, and YOLO-v5 on COCO. We show that models resulting from our multi-resolution training can support up to 10 resolutions with only a moderate performance reduction (e.g., ≤ 1%) compared to training them individually. Lastly, using an FPGA, we compare our multi-resolution multiplier-accumulator (mMAC) against other conventional MAC designs and evaluate the inference performance. We show that the mMAC design broadens the choices in trading off cost, efficiency, and latency across a range of computational budgets.
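
A simplified sketch of the term-reuse idea: if a quantized weight is a sum of power-of-two terms, a lower resolution can reuse just the k most significant of them, so one trained set of terms serves several bitwidths. The unsigned, untrained version below is our own illustration, not the paper's training scheme.

    /* Reusing the top-k power-of-two terms of an 8-bit magnitude. */
    #include <stdio.h>

    static unsigned top_k_terms(unsigned w, int k) {
        unsigned kept = 0;
        for (int b = 7; b >= 0 && k > 0; b--)   /* scan from the MSB down */
            if (w & (1u << b)) { kept |= 1u << b; k--; }
        return kept;
    }

    int main(void) {
        unsigned w = 0xB6;  /* 10110110: terms 128+32+16+4+2 = 182 */
        for (int k = 1; k <= 4; k++)
            printf("k=%d -> %u\n", k, top_k_terms(w, k));
        /* prints 128, 160, 176, 180: each resolution reuses the same terms */
        return 0;
    }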

DOI: 10.1145/3445814.3446741


A hierarchical neural model of data prefetching

作者: Shi, Zhan and Jain, Akanksha and Swersky, Kevin and Hashemi, Milad and Ranganathan, Parthasarathy and Lin, Calvin
关键词: Prefetching, Neural Networks, Attention Mechanism

Abstract

This paper presents Voyager, a novel neural network for data prefetching. Unlike previous neural models for prefetching, which are limited to learning delta correlations, our model can also learn address correlations, which are important for prefetching irregular sequences of memory accesses. The key to our solution is its hierarchical structure that separates addresses into pages and offsets and that introduces a mechanism for learning important relations among pages and offsets. Voyager provides significant prediction benefits over current data prefetchers. For a set of irregular programs from the SPEC 2006 and GAP benchmark suites, Voyager sees an average IPC improvement of 41.6% over a system with no prefetcher, compared with 21.7% and 28.2%, respectively, for idealized Domino and ISB prefetchers. We also find that for two commercial workloads for which current data prefetchers see very little benefit, Voyager dramatically improves both accuracy and coverage. At present, slow training and prediction preclude neural models from being practically used in hardware, but Voyager’s overheads are significantly lower—in every dimension—than those of previous neural models. For example, computation cost is reduced by 15-20×

DOI: 10.1145/3445814.3446752


Diospyros Software Artifact: Vectorization for Digital Signal Processors via Equality Saturation

作者: VanHattum, Alexa and Nigam, Rachit and Lee, Vincent T. and Bornholt, James and Sampson, Adrian
关键词: DSPs, Equality Saturation, Program Synthesis, Vectorization

Abstract

Our artifact packages an environment to reproduce the main empirical results of our paper. Specifically, we package: (1) the Diospyros compiler: a search-aided compiler for generating vectorized DSP kernels, (2) implementations of a range of benchmarks in Diospyros, (3) implementation of the Theia open-source application case study, and (4) scripts for recreating the experiments and charts in the paper.

DOI: 10.1145/3445814.3446707


Replication Package for Article: Scalable FSM Parallelization via Path Fusion and Higher-Order Speculation

作者: Qiu, Junqiao and Sun, Xiaofan and Sabet, Amir Hossein Nodehi and Zhao, Zhijia
关键词: Finite State Machine, FSM, Parallelization, Scalability, Speculation

Abstract

This artifact contains the source code of BoostFSM, including the five FSM parallelization schemes discussed in the paper and some benchmarks along with their inputs used for evaluation. In addition, this artifact provides bash scripts to compile the source code and reproduce the key experimental results reported in the paper. Considering the software dependencies, a software environment with Linux CentOS 7 or another similar Linux distribution, GCC, Bash, Pthreads, CMake, and the Boost library is needed before the evaluation. Moreover, to reproduce all results reported in the paper, especially the speedup comparison and scalability analysis, the artifact needs to run on an Intel Xeon Phi processor (Knights Landing/KNL).
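
The baseline idea behind speculative FSM parallelization is easy to state in code: a worker that does not know its chunk's true start state summarizes the chunk as a state-to-state mapping, and chunk summaries then compose sequentially. BoostFSM's path fusion and higher-order speculation refine this; the brute-force sketch below, with a toy 3-state DFA, shows only the baseline.

    /* Enumerative chunk summarization for parallel FSM execution. */
    #include <stdio.h>

    #define NSTATES 3

    /* Example DFA over {0,1}: delta[state][symbol]. */
    static const int delta[NSTATES][2] = {{0, 1}, {2, 1}, {0, 2}};

    /* Summarize a chunk as a state->state mapping; chunks are independent,
     * so real implementations run this step in parallel threads. */
    static void summarize(const char *chunk, int n, int map[NSTATES]) {
        for (int s = 0; s < NSTATES; s++) {
            int cur = s;
            for (int i = 0; i < n; i++) cur = delta[cur][chunk[i] - '0'];
            map[s] = cur;
        }
    }

    int main(void) {
        const char *input = "0110101101";
        int half = 5, m1[NSTATES], m2[NSTATES];
        summarize(input, half, m1);              /* chunk 1 (parallelizable) */
        summarize(input + half, half, m2);       /* chunk 2 (parallelizable) */
        int final = m2[m1[0]];                   /* compose from start state 0 */
        int check = 0;                           /* sequential reference run */
        for (int i = 0; i < 10; i++) check = delta[check][input[i] - '0'];
        printf("parallel=%d sequential=%d\n", final, check);
        return 0;
    }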

DOI: 10.1145/3445814.3446705


VeGen: a vectorizer generator for SIMD and beyond

作者: Chen, Yishen and Mendis, Charith and Carbin, Michael and Amarasinghe, Saman
关键词: auto-vectorization, non-SIMD, optimization

Abstract

Vector instructions are ubiquitous in modern processors. Traditional compiler auto-vectorization techniques have focused on targeting single instruction multiple data (SIMD) instructions. However, these auto-vectorization techniques are not sufficiently powerful to model non-SIMD vector instructions, which can accelerate applications in domains such as image processing, digital signal processing, and machine learning. To target non-SIMD instructions, compiler developers have resorted to complicated, ad hoc peephole optimizations, expending significant development time while still coming up short. As vector instruction sets continue to rapidly evolve, compilers cannot keep up with these new hardware capabilities. In this paper, we introduce Lane Level Parallelism (LLP), which captures the model of parallelism implemented by both SIMD and non-SIMD vector instructions. We present VeGen, a vectorizer generator that automatically generates a vectorization pass to uncover target-architecture-specific LLP in programs while using only instruction semantics as input. VeGen decouples, yet coordinates automatically generated target-specific vectorization utilities with its target-independent vectorization algorithm. This design enables us to systematically target non-SIMD vector instructions that until now require ad hoc coordination between different compiler stages. We show that VeGen can use non-SIMD vector instructions effectively, for example, getting a speedup of 3×
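
A concrete instance of the lane level parallelism VeGen models: x86's pmaddwd multiplies 16-bit lanes and then adds adjacent products across lanes, a cross-lane step that a pure per-lane SIMD model cannot express. A minimal use of the real intrinsic (requires an SSE2-capable x86 compiler):

    /* pmaddwd: per-lane multiply followed by a cross-lane pairwise add. */
    #include <emmintrin.h>
    #include <stdio.h>

    int main(void) {
        short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        short b[8] = {1, 1, 1, 1, 2, 2, 2, 2};
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        /* out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1]: not expressible as a
         * purely lane-wise SIMD operation */
        __m128i dp = _mm_madd_epi16(va, vb);
        int out[4];
        _mm_storeu_si128((__m128i *)out, dp);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); /* 3 7 22 30 */
        return 0;
    }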

DOI: 10.1145/3445814.3446692


Neural architecture search as program transformation exploration

作者: Turner, Jack and Crowley, Elliot J. and O’Boyle, Michael F. P.
关键词: Machine learning, compilers

Abstract

Improving the performance of deep neural networks (DNNs) is important to both the compiler and neural architecture search (NAS) communities. Compilers apply program transformations in order to exploit hardware parallelism and memory hierarchy. However, legality concerns mean they fail to exploit the natural robustness of neural networks. In contrast, NAS techniques mutate networks by operations such as the grouping or bottlenecking of convolutions, exploiting the resilience of DNNs. In this work, we express such neural architecture operations as program transformations whose legality depends on a notion of representational capacity. This allows them to be combined with existing transformations into a unified optimization framework. This unification allows us to express existing NAS operations as combinations of simpler transformations. Crucially, it allows us to generate and explore new tensor convolutions. We prototyped the combined framework in TVM and were able to find optimizations across different DNNs that significantly reduce inference time, over 3×
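
As a flavor of treating a NAS operation as a program transformation, the toy 1x1 convolution below turns grouping into a loop-bound rewrite: groups = 1 gives the dense convolution, and larger values shrink the reduction bounds, cutting work by a factor of the group count while reducing representational capacity. This is our own toy, not the paper's TVM prototype.

    /* Grouped 1x1 convolution expressed as a loop-bound rewrite. */
    #include <stdio.h>

    #define CI 4
    #define CO 4

    static void conv1x1(const float *in, float w[CO][CI], float *out,
                        int groups) {
        int gi = CI / groups, go = CO / groups;
        for (int oc = 0; oc < CO; oc++) {
            int g = oc / go;               /* group of this output channel */
            out[oc] = 0.0f;
            /* groups == 1: full reduction; groups > 1: restricted bounds */
            for (int ic = g * gi; ic < (g + 1) * gi; ic++)
                out[oc] += w[oc][ic] * in[ic];
        }
    }

    int main(void) {
        float in[CI] = {1, 2, 3, 4}, out[CO];
        float w[CO][CI] = {{1,1,1,1},{1,1,1,1},{1,1,1,1},{1,1,1,1}};
        conv1x1(in, w, out, 1); printf("dense:   %g\n", out[0]);  /* 10 */
        conv1x1(in, w, out, 2); printf("grouped: %g\n", out[0]);  /* 3  */
        return 0;
    }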

DOI: 10.1145/3445814.3446753


Replication Package for Article: Analytical Characterization and Design Space Exploration for Optimization of CNNs

作者: Li, Rui and Xu, Yufan and Sukumaran-Rajam, Aravind and Rountev, Atanas and Sadayappan, P.
关键词: Design space exploration, Neural networks, Performance modeling, Tile size optimization

Abstract

This artifact includes a software implementation and benchmark specifications for reproducing the experimental results of the paper “Analytical Characterization and Design Space Exploration for Optimization of CNNs”.

DOI: 10.1145/3445814.3446759


Mind mappings: enabling efficient algorithm-accelerator mapping space search

作者: Hegde, Kartik and Tsai, Po-An and Huang, Sitao and Chandra, Vikas and Parashar, Angshuman and Fletcher, Christopher W.
关键词: gradient-based search, mapping space search, programmable domain-specific accelerators

Abstract

Modern day computing increasingly relies on specialization to satiate growing performance and efficiency requirements. A core challenge in designing such specialized hardware architectures is how to perform mapping space search, i.e., search for an optimal mapping from algorithm to hardware. Prior work shows that choosing an inefficient mapping can lead to multiplicative-factor efficiency overheads. Additionally, the search space is not only large but also non-convex and non-smooth, precluding advanced search techniques. As a result, previous works are forced to implement mapping space search using expert choices or sub-optimal search heuristics. This work proposes Mind Mappings, a novel gradient-based search method for algorithm-accelerator mapping space search. The key idea is to derive a smooth, differentiable approximation to the otherwise non-smooth, non-convex search space. With a smooth, differentiable approximation, we can leverage efficient gradient-based search algorithms to find high-quality mappings. We extensively compare Mind Mappings to black-box optimization schemes used in prior work. When tasked to find mappings for two important workloads (CNN and MTTKRP), Mind Mappings finds mappings that achieve an average 1.40×

DOI: 10.1145/3445814.3446762


Statistical robustness of Markov chain Monte Carlo accelerators

作者: Zhang, Xiangyu and Bashizade, Ramin and Wang, Yicheng and Mukherjee, Sayan and Lebeck, Alvin R.
关键词: statistical robustness, statistical machine learning, probabilistic computing, markov chain monte carlo, accelerator

Abstract

Statistical machine learning often uses probabilistic models and algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Probabilistic computations, often considered too slow on conventional processors, can be accelerated with specialized hardware by exploiting parallelism and optimizing the design using various approximation techniques. Current methodologies for evaluating correctness of probabilistic accelerators are often incomplete, mostly focusing only on end-point result quality (“accuracy”). It is important for hardware designers and domain experts to look beyond end-point “accuracy” and be aware of how hardware optimizations impact statistical properties. This work takes a first step toward defining metrics and a methodology for quantitatively evaluating correctness of probabilistic accelerators. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit. We apply our framework to a representative MCMC accelerator and surface design issues that cannot be exposed using only application end-point result quality. We demonstrate the benefits of this framework to guide design space exploration in a case study showing that statistical robustness comparable to floating-point software can be achieved with limited precision, avoiding floating-point hardware overheads.
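
As one concrete instance of the convergence-diagnostic pillar, the sketch below computes the standard Gelman-Rubin statistic (R-hat) over two chains; the paper's methodology is broader than this single metric.

    /* Gelman-Rubin R-hat for two chains; values near 1.0 suggest convergence.
     * Compile with -lm. */
    #include <stdio.h>
    #include <math.h>

    static double mean(const double *x, int n) {
        double s = 0; for (int i = 0; i < n; i++) s += x[i];
        return s / n;
    }
    static double var(const double *x, int n, double m) {
        double s = 0; for (int i = 0; i < n; i++) s += (x[i] - m) * (x[i] - m);
        return s / (n - 1);
    }

    static double rhat(const double *c1, const double *c2, int n) {
        double m1 = mean(c1, n), m2 = mean(c2, n);
        double W = 0.5 * (var(c1, n, m1) + var(c2, n, m2)); /* within-chain */
        double g = 0.5 * (m1 + m2);
        double B = n * ((m1 - g) * (m1 - g) + (m2 - g) * (m2 - g)); /* between */
        double vplus = (n - 1.0) / n * W + B / n;
        return sqrt(vplus / W);
    }

    int main(void) {
        double c1[] = {1.0, 1.2, 0.9, 1.1}, c2[] = {1.1, 0.8, 1.0, 1.2};
        printf("R-hat = %.3f\n", rhat(c1, c2, 4));
        return 0;
    }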

DOI: 10.1145/3445814.3446697


NeuroEngine: a hardware-based event-driven simulation system for advanced brain-inspired computing

作者: Lee, Hunjun and Kim, Chanmyeong and Chung, Yujin and Kim, Jangwoo
关键词: neuromorphic accelerators, event-driven simulation, brain-inspired computing

Abstract

Brain-inspired computing aims to understand the cognitive mechanisms of a brain and apply them to advance various areas in computer science. Deep learning, for example, has greatly improved the field of pattern recognition and classification by utilizing artificial neural networks (ANNs). To exploit advanced mechanisms of a brain and thus make further advances, researchers need a methodology that can simulate neural networks with higher computational capabilities, such as advanced spiking neural networks (SNNs) with two-stage neurons and synaptic delays. However, existing SNN simulation methodologies are too slow and energy-inefficient due to their software-based simulation or hardware-based but time-driven execution mechanisms. In this paper, we present NeuroEngine, a fast and energy-efficient hardware-based system to efficiently simulate advanced SNNs. The key idea is to design an accelerator to enable event-driven simulations of the SNNs at a minimum cost. NeuroEngine achieves high speed and energy efficiency by carefully architecting its datapath and memory units to take the best advantage of the event-driven mechanism while satisfying all the important requirements to simulate our target SNNs. For high performance and energy efficiency, NeuroEngine applies a simpler datapath, multi-queue scheduler, and lazy update to minimize its neuron computation and event scheduling overhead. Then, we build an end-to-end simulation system by implementing a programming interface and a compilation toolchain for NeuroEngine hardware. Our evaluations show that NeuroEngine greatly improves the harmonic mean performance and energy efficiency by 4.30×
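
Two of the abstract's ingredients, event-driven simulation and lazy update, fit in a few lines: a leaky integrate-and-fire neuron is only touched when a spike event arrives, and its membrane decay since the last event is applied analytically at that point. A single-neuron toy of ours, not NeuroEngine's datapath. Compile with -lm.

    /* Event-driven LIF neuron with lazy (analytical) membrane decay. */
    #include <stdio.h>
    #include <math.h>

    struct neuron { double v, last_t, tau, vth; };

    /* Process a weighted input spike at time t; returns 1 on firing. */
    static int on_spike(struct neuron *n, double t, double w) {
        n->v *= exp(-(t - n->last_t) / n->tau);  /* decay since last event */
        n->last_t = t;
        n->v += w;
        if (n->v >= n->vth) { n->v = 0.0; return 1; }  /* fire and reset */
        return 0;
    }

    int main(void) {
        struct neuron n = {0.0, 0.0, 10.0, 1.0};
        double events[][2] = {{1.0, 0.6}, {2.0, 0.3}, {15.0, 0.8}}; /* (t, w) */
        for (int i = 0; i < 3; i++) {
            int fired = on_spike(&n, events[i][0], events[i][1]);
            printf("t=%4.1f fired=%d v=%.3f\n", events[i][0], fired, n.v);
        }
        return 0;
    }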

DOI: 10.1145/3445814.3446738


Defensive approximation: securing CNNs using approximate computing

作者: Guesmi, Amira and Alouani, Ihsen and Khasawneh, Khaled N. and Baklouti, Mouna and Frikha, Tarek and Abid, Mohamed and Abu-Ghazaleh, Nael
关键词: security, approximate computing, adversarial example, Deep neural network

Abstract

In the past few years, an increasing number of machine-learning and deep learning structures, such as Convolutional Neural Networks (CNNs), have been applied to solving a wide range of real-life problems. However, these architectures are vulnerable to adversarial attacks: inputs crafted carefully to force the system output to a wrong label. Since machine learning is being deployed in safety-critical and security-sensitive domains, such attacks may have catastrophic security and safety consequences. In this paper, we propose for the first time to use hardware-supported approximate computing to improve the robustness of machine learning classifiers. We show that our approximate computing implementation achieves robustness across a wide range of attack scenarios. Specifically, we show that successful adversarial attacks against the exact classifier have poor transferability to the approximate implementation. The transferability is even poorer for the black-box attack scenarios, where adversarial attacks are generated using a proxy model. Surprisingly, the robustness advantages also apply to white-box attacks where the attacker has unrestricted access to the approximate classifier implementation: in this case, we show that substantially higher levels of adversarial noise are needed to produce adversarial examples. Furthermore, our approximate computing model maintains the same level of classification accuracy, does not require retraining, and reduces resource utilization and energy consumption of the CNN. We conducted extensive experiments on a set of strong adversarial attacks; we empirically show that the proposed implementation increases the robustness of LeNet-5 and AlexNet CNNs by up to 99% and 87%, respectively, for strong transferability-based attacks, along with up to 50% savings in energy consumption due to the simpler nature of the approximate logic. We also show that a white-box attack requires a remarkably higher noise budget to fool the approximate classifier, causing an average of 4 dB degradation of the PSNR of the input image relative to the images that succeed in fooling the exact classifier.
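
For intuition, one generic form of approximate multiplication (operand truncation) is sketched below; this particular scheme is our illustration and not the specific approximate design the paper evaluates.

    /* Truncation-based approximate multiply: drop low bits, then rescale. */
    #include <stdio.h>

    static unsigned approx_mul(unsigned a, unsigned b, int k) {
        return ((a >> k) * (b >> k)) << (2 * k);
    }

    int main(void) {
        unsigned a = 201, b = 117;
        printf("exact=%u approx=%u\n", a * b, approx_mul(a, b, 3));
        /* The induced small, input-dependent noise perturbs activations,
         * which is the kind of effect that hurts adversarial transferability. */
        return 0;
    }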

DOI: 10.1145/3445814.3446747


Language-Parametric Compiler Validation with Application to LLVM - Artifact Evaluation for ASPLOS 2021

作者: Kasampalis, Theodoros and Park, Daejun and Lin, Zhengyao and Adve, Vikram S. and Roșu, Grigore
关键词: Compilers, Program Equivalence, Simulation, Translation Validation

Abstract

A VirtualBox VM image that is fully set up to reproduce experiments mentioned in the ASPLOS 2021 paper titled “Language-Parametric Compiler Validation with Application to LLVM”. The included README.md file contains detailed instructions on how to use the artifact both for reproduction of experiments and for general use.

DOI: 10.1145/3445814.3446751


Software Artifact for Incremental CFG Patching for Binary Rewriting

作者: Meng, Xiaozhu and Liu, Weijie
关键词: Docker, Dyninst, Firefox, Spack

Abstract

Software artifact needed for the paper “Incremental CFG Patching for Binary Rewriting”. It includes scripts for setting up environments, installing software dependencies, and running experiments, as well as template configuration files for SPEC CPU 2017.

DOI: 10.1145/3445814.3446765


Who’s debugging the debuggers? exposing debug information bugs in optimized binaries

作者: Di Luna, Giuseppe Antonio and Italiano, Davide and Massarelli, Luca and Österlund, Sebastian
关键词: Verification, Optimized Binaries, Debug Information

Abstract

Despite the advancements in software testing, bugs still plague deployed software and result in crashes in production. When debugging issues, sometimes caused by “heisenbugs”, there is the need to interpret core dumps and reproduce the issue offline on the same deployed binary. This requires the entire toolchain (compiler, linker, debugger) to correctly generate and use debug information. Little attention has been devoted to checking that such information is correctly preserved by modern toolchains’ optimization stages. This is particularly important as managing debug information in optimized production binaries is non-trivial, often leading to toolchain bugs that may hinder post-deployment debugging efforts. In this paper, we present Debug2, a framework to find debug information bugs in modern toolchains. Our framework feeds random source programs to the target toolchain and surgically compares the debugging behavior of their optimized/unoptimized binary variants. Such differential analysis allows Debug2 to check invariants at each debugging step and detect bugs from invariant violations. Our invariants are based on the (in)consistency of common debug entities, such as source lines, stack frames, and function arguments. We show that, while simple, this strategy yields powerful cross-toolchain and cross-language invariants, which can pinpoint several bugs in modern toolchains. We have used Debug2 to find 23 bugs in the LLVM toolchain (clang/lldb), 8 bugs in the GNU toolchain (GCC/gdb), and 3 in the Rust toolchain (rustc/lldb)—with 14 bugs already fixed by the developers.

DOI: 10.1145/3445814.3446695


Speculative interference attacks: breaking invisible speculation schemes

作者: Behnia, Mohammad and Sahu, Prateek and Paccagnella, Riccardo and Yu, Jiyong and Zhao, Zirui Neil and Zou, Xiang and Unterluggauer, Thomas and Torrellas, Josep and Rozas, Carlos and Morrison, Adam and Mckeen, Frank and Liu, Fangfei and Gabor, Ron and Fletcher, Christopher W. and Basak, Abhishek and Alameldeen, Alaa
关键词: invisible speculation, microarchitectural covert channels, speculative execution attacks

Abstract

Recent security vulnerabilities that target speculative execution (e.g., Spectre) present a significant challenge for processor design. These highly publicized vulnerabilities use speculative execution to learn victim secrets by changing the cache state. As a result, recent computer architecture research has focused on invisible speculation mechanisms that attempt to block changes in cache state due to speculative execution. Prior work has shown significant success in preventing Spectre and other attacks at modest performance costs. In this paper, we introduce speculative interference attacks, which show that prior invisible speculation mechanisms do not fully block speculation-based attacks that use cache state. We make two key observations. First, mis-speculated younger instructions can change the timing of older, bound-to-retire instructions, including memory operations. Second, changing the timing of a memory operation can change the order of that memory operation relative to other memory operations, resulting in persistent changes to the cache state. Using both of these observations, we demonstrate (among other attack variants) that secret information accessed by mis-speculated instructions can change the order of bound-to-retire loads. Load timing changes can therefore leave secret-dependent changes in the cache, even in the presence of invisible speculation mechanisms. We show that this problem is not easy to fix. Speculative interference converts timing changes to persistent cache-state changes, and timing is typically ignored by many cache-based defenses. We develop a framework to understand the attack and demonstrate concrete proof-of-concept attacks against invisible speculation mechanisms. We conclude with a discussion of security definitions that are sufficient to block the attacks, along with preliminary defense ideas based on those definitions.

DOI: 10.1145/3445814.3446708


Replication for article: Jamais Vu: Thwarting Microarchitectural Replay Attacks

作者: Skarlatos, Dimitrios and Zhao, Zirui Neil and Paccagnella, Riccardo and Fletcher, Christopher W. and Torrellas, Josep
关键词: Gem5, Processor design, Replay attack, Side-channel countermeasures

Abstract

Our artifact provides a complete gem5 implementation of Jamais Vu, along with scripts to evaluate the SPEC’17 benchmarks. We also provide a GitHub repository with the gem5 implementation and required scripts to reproduce our simulation results. Finally, we provide a binary analysis infrastructure based on Radare2 that allows the compilation of binaries with the proposed Epoch markings.

DOI: 10.1145/3445814.3446716


Code for Streamline Attack: A Fast, Flushless Cache Covert-Channel Attack by Enabling Asynchronous Collusion

作者: Saileshwar, Gururaj and Fletcher, Christopher W. and Qureshi, Moinuddin
关键词: Asynchronous Protocol, Cache Side-Channels, Covert-channel Attacks, Last-Level Cache, Shared Caches

Abstract

This artifact presents the code and methodology to run the Streamline cache covert-channel attack. We provide the C++ code for the sender and receiver processes engaged in covert communication. Although the attack itself is not specific to an OS, ISA, or microarchitecture, the code is written with the assumption of an x86 Linux system and an Intel CPU that is a Skylake or a newer generation model. The code may be compiled with a standard compiler and run natively to execute the covert communication. We also provide scripts to run the attack in several configurations demonstrated in Section IV of our paper (with and without ECC, varying the shared array size and the synchronization period) and provide a Jupyter notebook to visualize the results.

Please use the public GitHub repository of the project https://github.com/gururaj-s/streamline for the most updated version of the code.
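
The primitive underlying the provided sender/receiver code is timing a load to distinguish a cached line from a flushed one. Below is a C-style sketch of that measurement kernel (x86/Linux, gcc-style intrinsics); the actual attack layers Streamline's asynchronous protocol and large shared array on top of it.

    /* Time one load with rdtscp: cached vs. flushed cache line. */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    static uint64_t time_load(const volatile uint8_t *p) {
        unsigned aux;
        _mm_mfence();
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;                          /* the probed access */
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    static uint8_t line[64] __attribute__((aligned(64)));

    int main(void) {
        line[0] = 1;                       /* bring the line into the cache */
        printf("cached:  %llu cycles\n", (unsigned long long)time_load(line));
        _mm_clflush((const void *)line);   /* evict: the sender-side action */
        _mm_mfence();
        printf("flushed: %llu cycles\n", (unsigned long long)time_load(line));
        return 0;
    }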

DOI: 10.1145/3445814.3446742


