ASPLOS 2023 | 逸翎清晗🌈

Achieving Sub-second Pairwise Query over Evolving Graphs

Authors: Chen, Hongtao and Zhang, Mingxing and Yang, Ke and Chen, Kang and Zomaya, Albert and Wu, Yongwei and Qian, Xuehai
Keywords: graph processing, pairwise query, triangle inequality

Abstract

Many real-time OLAP systems have been proposed to query evolving data with sub-second latency. Although this feature is highly attractive, it is very hard to achieve for analytic graph queries that can only be answered after accessing every connected vertex. Fortunately, researchers recently observed that answering pairwise queries is enough for many real-world scenarios. These pairwise queries avoid the exhaustive nature of full-graph analytics and hence may only need to access a small portion of the graph. The crux of achieving low latency is therefore the extent to which the system can eliminate unnecessary computation. According to our investigation, existing systems usually implement this pruning by estimating certain upper bounds of the query result. However, our evaluation demonstrates that these upper-bound-only pruning techniques eliminate only about half of the vertex activations, which is still far from the sub-second latency goal on large graphs. In contrast, we found that processing can be substantially accelerated if we not only estimate upper bounds but also foresee a tighter lower bound for certain pairs of vertices in the graph. Our experiments show that less than 1% of the vertices are activated when using this novel lower-bound-based pruning technique. Based on this observation, we build SGraph, a system that answers dynamic pairwise queries over evolving graphs with sub-second latency. It can ingest millions of updates per second while answering pairwise queries with a latency that is several orders of magnitude smaller than state-of-the-art systems.
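
To make the bound-based pruning concrete, here is a minimal sketch, assuming a generic landmark scheme rather than SGraph's actual data structures: for shortest-path-style pairwise queries, precomputed distances to a few landmarks yield both bounds via the triangle inequality.

```python
# Illustrative sketch only (generic landmark bounds, not SGraph's code).
# dist_s[i] / dist_t[i]: precomputed distance from landmark i to s / t.
def pairwise_bounds(dist_s, dist_t):
    upper = min(ds + dt for ds, dt in zip(dist_s, dist_t))       # d(s,t) <= d(s,L) + d(L,t)
    lower = max(abs(ds - dt) for ds, dt in zip(dist_s, dist_t))  # d(s,t) >= |d(s,L) - d(L,t)|
    return lower, upper

lo, hi = pairwise_bounds([3, 7], [5, 2])
assert (lo, hi) == (5, 8)
```

During traversal, a vertex whose partial-path length plus such a lower bound to the target already exceeds the best upper bound seen so far never needs to be activated, which is the pruning effect the abstract describes.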

DOI: 10.1145/3575693.3576173


AfterImage: Leaking Control Flow Data and Tracking Load Operations via the Hardware Prefetcher

Authors: Chen, Yun and Pei, Lingfeng and Carlson, Trevor E.
Keywords: Hardware Security, Prefetcher, Side-channel Attacks

Abstract

AfterImage is a hardware side-channel inside specific Intel processors. In this artifact, we provide the needed information to reproduce the main results presented in the paper.

DOI: 10.1145/3575693.3575719


A Generic Service to Provide In-Network Aggregation for Key-Value Streams

Authors: He, Yongchao and Wu, Wenfei and Le, Yanfang and Liu, Ming and Lao, ChonLam
Keywords: Big Data, In-Network Aggregation, Key-Value, P4

Abstract

Key-value stream aggregation is a common operation in distributed systems that requires intensive computation and network resources. We propose ASK, a generic in-network aggregation service for key-value streams that accelerates aggregation operations in diverse distributed applications. ASK is a switch-host co-designed system in which the programmable switch provides a best-effort aggregation service and the host runs a daemon to interact with applications. ASK makes in-depth optimizations tailored to traffic characteristics, hardware restrictions, and the unreliable nature of networks: it vectorizes the aggregation of multiple key-value tuples from one packet in a single switch pipeline pass, improving per-host goodput; it develops a lightweight reliability mechanism for asynchronous key-value stream aggregation, guaranteeing computation correctness; and it designs hot-key-agnostic prioritization for key-skewed workloads, improving switch memory utilization. We prototype ASK and use it to support Spark and BytePS. The evaluation shows that ASK can accelerate pure key-value aggregation tasks by up to 155 times and big data jobs by 3-5 times, while remaining backward compatible with existing INA-empowered distributed training solutions at the same speedup.
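
As a toy model of the operation being offloaded, assuming nothing about ASK's packet format or P4 pipeline, key-value stream aggregation simply combines values per key; the switch performs this best-effort, and the host daemon completes whatever the switch could not.

```python
from collections import defaultdict

# Toy key-value stream aggregation (the operation ASK accelerates);
# the switch/host split is a sketch, not ASK's protocol.
def aggregate(tuples):
    acc = defaultdict(float)
    for key, value in tuples:
        acc[key] += value        # e.g. summing gradient shards per key
    return dict(acc)

stream = [("w1", 0.5), ("w2", 1.0), ("w1", 0.25)]
assert aggregate(stream) == {"w1": 0.75, "w2": 1.0}
```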

DOI: 10.1145/3575693.3575708


A Prediction System Service

Authors: Zhang, Zhizhou and Glova, Alvin Oliver and Sherwood, Timothy and Balkind, Jonathan
Keywords: Just-In-Time compiler, Operating System, hardware lock elision, memory management, perceptron, runtime optimization, software optimization

Abstract

To better facilitate application performance programming, we propose a software optimization strategy enabled by a novel low-latency Prediction System Service (PSS). Rather than relying on nuanced domain-specific knowledge or slapdash heuristics, a system service for prediction lets programmers spend their time uncovering new levers for optimization rather than worrying about the details of their control. The core idea is to write optimizations that improve performance in specific cases, or under specific tunings, and leave the decision of how and when to apply those optimizations for the system to learn through feedback-directed learning. Such a prediction service can be implemented in any number of ways, including as a shared library that can be easily reused by software written in different programming languages, and it opens the door to both new software optimization patterns and hardware design possibilities. As a demonstration of the utility of this approach, we show that three very different application-targeted optimization scenarios each benefit from even a very straightforward perceptron-based implementation of the PSS, as long as the service latency is held low. First, we show that the PSS can guide hardware lock elision more intelligently, with speedups over a baseline implementation of 34% on average. Second, we show that the PSS can find good configuration parameters for PyPy’s Just-In-Time (JIT) compiler, yielding a 15% speedup on average. Last, we show the PSS can guide the page reclamation task within a kernel memory management subsystem to reduce memory latency by 33% on average. In all three cases, this new optimization pattern with service support meets or beats the best-known hand-crafted methods with a fraction of the complexity.
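
A minimal sketch of what such a service could look like, assuming a hypothetical predict/feedback interface and a perceptron with a train-on-low-confidence rule (the paper's actual PSS API may differ):

```python
# Minimal perceptron behind a predict/feedback interface, sketching the
# optimization pattern described above; the API names, feature encoding,
# and threshold rule are illustrative assumptions, not the PSS API.
class PredictionService:
    def __init__(self, n_features, theta=4):
        self.w = [0] * n_features
        self.theta = theta                      # train-on-low-confidence threshold

    def predict(self, x):                       # x: list of +1/-1 features
        return sum(wi * xi for wi, xi in zip(self.w, x)) >= 0

    def feedback(self, x, good):                # feedback-directed learning
        y = sum(wi * xi for wi, xi in zip(self.w, x))
        if (y >= 0) != good or abs(y) <= self.theta:
            t = 1 if good else -1
            self.w = [wi + t * xi for wi, xi in zip(self.w, x)]

# e.g. deciding whether to attempt lock elision at a given call site:
pss = PredictionService(3)
site_features = [1, -1, 1]
attempt_elision = pss.predict(site_features)
pss.feedback(site_features, good=True)          # report how the attempt went
```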

DOI: 10.1145/3575693.3575714


AtoMig: Automatically Migrating Millions Lines of Code from TSO to WMM

Authors: Beck, Martin and Bhat, Koustubha and Stričević, Lazar, et al.
Keywords: memory consistency models, parallelism and concurrency, static analysis, sustainability

Abstract

CPUs with weak memory-consistency models (WMMs), such as Arm and RISC-V, are rapidly increasing their market share. Porting legacy x86 applications to such CPUs requires introducing extra synchronization to prevent WMM-related concurrency bugs, a task often left to human experts.

Given the rarity of such experts and the enormous size of legacy applications, we develop AtoMig, an effective, fully automated tool for porting large, real-world applications to WMM CPU architectures. AtoMig detects shared-memory access patterns with novel static analysis strategies and performs program transformations to properly protect them from WMM effects. In the absence of sufficiently scalable verification methods, AtoMig shows the practicality of focusing on the code patterns most prone to WMM faults, trading off completeness for scalability.

We validate the correctness of AtoMig’s transformations on several small concurrent benchmarks via model checking. We demonstrate the scalability and performance of our approach by applying AtoMig to popular, large real-world code bases with up to millions of lines of code, viz., MariaDB, Postgres, SQLite, LevelDB, and Memcached. As part of this work, we also found a WMM bug in MariaDB, which AtoMig fixes automatically.

DOI: 10.1145/3575693.3579849


BeeHive: Sub-second Elasticity for Web Services with Semi-FaaS Execution

Authors: Zhao, Ziming and Wu, Mingyu and Tang, Jiawei and Zang, Binyu and Wang, Zhaoguo and Chen, Haibo
Keywords: Cloud Computing, Function-as-a-Service, Java Virtual Machine

Abstract

Function-as-a-service (FaaS), an emerging cloud computing paradigm, is expected to provide strong elasticity due to its promise to auto-scale fine-grained functions rapidly. Although appealing for applications with good parallelism and dynamic workload, this paper shows that it is non-trivial to adapt existing monolithic applications (like web services) to FaaS due to their complexity. To bridge the gap between complicated web services and FaaS, this paper proposes a runtime-based Semi-FaaS execution model, which dynamically extracts time-consuming code snippets (closures) from applications and offloads them to FaaS platforms for execution. It further proposes BeeHive, an offloading framework for Semi-FaaS, which relies on the managed runtime to provide a fallback-based execution model and addresses the performance issues in traditional offloading mechanisms for FaaS. Meanwhile, the runtime system of BeeHive selects offloading candidates in a user-transparent way and supports efficient object sharing, memory management, and failure recovery in a distributed environment. The evaluation using various web applications suggests that the Semi-FaaS execution supported by BeeHive can reach sub-second resource provisioning on commercialized FaaS platforms like AWS Lambda, which is up to two orders of magnitude better than other alternative scaling approaches in cloud computing.
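
A toy sketch of the fallback-based execution model, with `invoke_remote` as a hypothetical stand-in for a FaaS invocation client rather than BeeHive's interface:

```python
# Sketch of Semi-FaaS offloading with a local fallback: try to run an
# extracted closure on a FaaS platform, but keep the in-process code path
# as a correctness fallback. `invoke_remote` is hypothetical.
def run_closure(closure, args, invoke_remote, timeout_s=1.0):
    try:
        return invoke_remote(closure, args, timeout=timeout_s)
    except Exception:           # cold start too slow, network error, ...
        return closure(*args)   # fall back to local execution
```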

DOI: 10.1145/3575693.3575752


Better Than Worst-Case Decoding for Quantum Error Correction

Authors: Ravi, Gokul Subramanian and Baker, Jonathan M. and Fayyazi, Arash and Lin, Sophia Fuhui and Javadi-Abhari, Ali and Pedram, Massoud and Chong, Frederic T.
Keywords: cryogenic systems, decoding, fault tolerant, quantum computing, quantum error correction, single flux quantum, surface codes

Abstract

The overheads of classical decoding for quantum error correction in cryogenic quantum systems grow rapidly with the number of logical qubits and their correction code distance. Decoding at room temperature is bottlenecked by refrigerator I/O bandwidth while cryogenic on-chip decoding is limited by area/power/thermal budget.

To overcome these overheads, we are motivated by the observation that in the common case (over 90% of the time), error correction ‘syndromes’ are fairly trivial, with high redundancy/sparsity, since the error correction codes are over-provisioned to correct uncommon worst-case complex scenarios (to ensure sufficiently low logical error rates). If suitably exploited, these trivial scenarios can be handled with insignificant overhead, thereby alleviating the bottlenecks in handling the worst-case scenarios by state-of-the-art means.
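
A sketch of that common-case split, where the weight test stands in for the Clique decoder's actual triviality condition:

```python
# Illustrative routing of syndromes: trivial sparse syndromes are decoded
# by a tiny on-chip routine, and only rare complex ones consume off-chip
# decoding bandwidth. The weight-1 test is a simplification.
def route_syndrome(syndrome_bits, offchip_queue):
    flipped = [i for i, b in enumerate(syndrome_bits) if b]
    if not flipped:
        return []                        # no error signature: nothing to do
    if len(flipped) <= 1:                # trivial common case (>90% of shots)
        return flipped                   # on-chip "Clique-style" correction
    offchip_queue.append(syndrome_bits)  # rare complex case: full decoder
    return None

queue = []
assert route_syndrome([0, 0, 0, 0], queue) == []
assert route_syndrome([0, 1, 0, 0], queue) == [1]
route_syndrome([1, 1, 0, 1], queue)      # deferred to the off-chip decoder
assert len(queue) == 1
```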

We propose Better Than Worst-Case Decoding for Quantum Error Correction, targeting cryogenic quantum systems and Surface Code, consisting of:

On-chip Clique Decoder: An extremely lightweight decoder for correcting trivial common-case errors, designed for the cryogenic domain. The decoder is implemented and evaluated for SFQ logic.

Statistical Off-chip Bandwidth Allocation: A statistical confidence-based technique for allocation of off-chip decoding bandwidth, to efficiently handle the rare complex decodes that are not covered by the Clique Decoder.

Decode-Overflow Execution Stalling: A method to stall circuit execution, for the worst-case scenarios in which the provisioned off-chip bandwidth is insufficient to complete all requested off-chip decodes.

In all, BTWC decoding achieves 70-99+% off-chip bandwidth elimination across a range of logical and physical error rates, without significantly sacrificing the accuracy of a state-of-the-art off-chip decoder. Further, it achieves 10-1000x bandwidth reduction over prior bandwidth reduction techniques, as well as 15-37x resource overhead reduction compared to prior on-chip decoding.

DOI: 10.1145/3575693.3575733


Betty: Enabling Large-Scale GNN Training with Batch-Level Graph Partitioning

Authors: Yang, Shuangyan and Zhang, Minjia and Dong, Wenqian and Li, Dong
Keywords: Graph neural network, Graph partition, Redundancy reduction

Abstract

The Betty directory includes dataset/, Figures/, pytorch/, README.md, and requiremnets.sh. Execute ‘bash install_requirements.sh’ to install the requirements. After downloading the benchmarks, generate the full-batch data into the folder /Betty/dataset/. The pytorch directory contains all files necessary for micro-batch and mini-batch training. In the folder micro_batch_train, graph_partitioner.py contains our implementation of redundancy-embedded graph partitioning (REG), and block_dataloader.py constructs micro-batches from the REG partitioning results. The Figures folder contains the figures used for analysis and performance evaluation.

DOI: 10.1145/3575693.3575725


Carbon Explorer: A Holistic Framework for Designing Carbon Aware Datacenters

Authors: Acun, Bilge and Lee, Benjamin and Kazhamiaka, Fiodar and Maeng, Kiwan and Gupta, Udit and Chakkaravarthy, Manoj and Brooks, David and Wu, Carole-Jean
Keywords: Datacenter carbon footprint optimization, batteries, embodied and operational carbon footprint, load shifting, renewable energy

Abstract

Technology companies reduce their datacenters’ carbon footprint by investing in renewable energy generation and receiving credits from power purchase agreements. Annually, datacenters offset their energy consumption with generation credits (Net Zero). But hourly, datacenters often consume carbon-intensive energy from the grid when carbon-free energy is scarce. Relying on intermittent renewable energy in every hour (24/7) requires a mix of renewable energy from complementary sources, energy storage, and workload scheduling. In this paper, we present the Carbon Explorer framework to analyze the solution space. We use Carbon Explorer to balance trade-offs between operational and embodied carbon, optimizing the mix of solutions for 24/7 carbon-free datacenter operation based on geographic location and workload. Carbon Explorer has been open-sourced at https://github.com/facebookresearch/CarbonExplorer.

DOI: 10.1145/3575693.3575754


CommonGraph: Graph Analytics on Evolving Data

Authors: Afarin, Mahbod and Gao, Chao and Rahman, Shafiur and Abu-Ghazaleh, Nael and Gupta, Rajiv
Keywords: evolving graphs, iterative graph algorithms, work sharing

Abstract

We consider the problem of graph analytics on evolving graphs (i.e., graphs that change over time). In this scenario, a query typically needs to be applied to different snapshots of the graph over an extended time window, for example to track the evolution of a property over time. Solving a query independently on multiple snapshots is inefficient due to repeated execution of subcomputations common to multiple snapshots. At the same time, we show that streaming, in which we start from the earliest snapshot and apply the changes incrementally, updating the query results one snapshot at a time, is also inefficient. We propose CommonGraph, an approach for efficient processing of queries on evolving graphs. We first observe that deletion operations are significantly more expensive than addition operations for many graph queries (those that are monotonic). CommonGraph converts all deletions to additions by finding a common graph that exists across all snapshots. After computing the query on this graph, reaching any snapshot simply requires adding the missing edges and incrementally updating the query results. CommonGraph also allows sharing of common additions among snapshots that require them, and it breaks the sequential dependency inherent in the traditional streaming approach, where snapshots are processed in sequence, enabling additional opportunities for parallelism. We incorporate the CommonGraph approach by extending the KickStarter streaming framework. We implement optimizations that enable efficient handling of edge additions without resorting to expensive in-place graph mutations, significantly reducing the streaming overhead and enabling direct reuse of shared edges among different snapshots. CommonGraph achieves a 1.38x-8.17x performance improvement over KickStarter across multiple benchmarks.
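
The core construction is easy to state with snapshots modeled as edge sets; the sketch below is illustrative, not CommonGraph's implementation:

```python
# The "common graph" is the intersection of edges across all snapshots,
# so every snapshot is reachable from it by additions only.
def common_graph(snapshots):
    common = set.intersection(*(set(s) for s in snapshots))
    additions = [set(s) - common for s in snapshots]  # per-snapshot deltas
    return common, additions

snaps = [{(1, 2), (2, 3), (3, 4)}, {(1, 2), (3, 4), (4, 5)}]
base, deltas = common_graph(snaps)
assert base == {(1, 2), (3, 4)}
# Evaluate the (monotonic) query once on `base`, then incrementally apply
# each snapshot's additions -- no deletion ever has to be processed.
```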

DOI: 10.1145/3575693.3575713


DFusor

Authors: Wang, Theodore Luo and Tian, Yongqiang and Dong, Yiwen and Xu, Zhenyang and Sun, Chengnian
Keywords: Compiler Testing, Debug Information

Abstract

ASPLOS 2023 Artifact for “Compilation Consistency Modulo Debug Information”

DOI: 10.1145/3575693.3575740


Compiling Distributed System Models with PGo [evaluation]

Authors: Hackett, Finn and Hosseini, Shayan and Costa, Renato and Do, Matthew and Beschastnikh, Ivan
Keywords: Compilers, Distributed systems, Formal methods, PlusCal, TLA+

Abstract

This repository aggregates all the tools and data necessary to reproduce the results in the evaluation section of our ASPLOS 2023 paper.

Our artifact has two components. We provide the PGo compiler itself, which can compile MPCal specifications, and we also provide a method for reproducing our performance results from our ASPLOS 2023 paper. These files describe how to reproduce our performance results.

Our own set of results is included in the results_paper/ folder. For how to use the included tools and how to interpret the included results, see the README.

DOI: 10.1145/3575693.3575695


Software artifacts for the paper “Copy-on-Pin: The Missing Piece for Correct Copy-on-Write”

Authors: Hildenbrand, David and Schulz, Martin and Amit, Nadav
Keywords: copy-on-write, COW, fork, memory deduplication, page pinning, page sharing, virtual memory

Abstract

Software artifacts for the paper “Copy-on-Pin: The Missing Piece for Correct Copy-on-Write”.

DOI: 10.1145/3575693.3575716


Decker

Authors: Porter, Chris and Khan, Sharjeel and Pande, Santosh
Keywords: program security, software debloating

Abstract

The Decker framework consists of a compiler pass and a runtime library. Its main objective is to debloat software at runtime. The artifact includes a Docker image that encapsulates basic dependencies, the Decker code itself, benchmarks, and the scripts to drive artifact evaluation.

DOI: 10.1145/3575693.3575734


DeepUM: Tensor Migration and Prefetching in Unified Memory

Authors: Jung, Jaehoon and Kim, Jinpyo and Lee, Jaejin
Keywords: CUDA, data prefetching, deep learning, device driver, neural networks, runtime system, unified memory

Abstract

Deep neural networks (DNNs) are continuing to get wider and deeper. As a result, they require a tremendous amount of GPU memory and computing power. In this paper, we propose a framework called DeepUM that exploits CUDA Unified Memory (UM) to allow GPU memory oversubscription for DNNs. While UM allows memory oversubscription using a page fault mechanism, page migration introduces enormous overhead. DeepUM uses a new correlation prefetching technique to hide the page migration overhead. It is fully automatic and transparent to users. We also propose two optimization techniques to minimize the GPU fault handling time. We evaluate the performance of DeepUM using nine large-scale DNNs from MLPerf, PyTorch examples, and Hugging Face, and compare its performance with six state-of-the-art GPU memory swapping approaches. The evaluation results indicate that DeepUM is very effective for GPU memory oversubscription and can handle larger models that other approaches fail to handle.
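
A toy correlation prefetcher in this spirit, with the table shape and prefetch degree as illustrative assumptions rather than DeepUM's design:

```python
from collections import defaultdict, Counter

# Remember which pages historically fault after a given page, and return
# the most correlated ones as prefetch candidates on the next fault.
class CorrelationPrefetcher:
    def __init__(self, degree=2):
        self.successors = defaultdict(Counter)   # page -> pages seen next
        self.prev = None
        self.degree = degree

    def on_fault(self, page):
        if self.prev is not None:
            self.successors[self.prev][page] += 1    # learn the correlation
        self.prev = page
        ranked = self.successors[page].most_common(self.degree)
        return [p for p, _ in ranked]                # pages to prefetch now

pf = CorrelationPrefetcher()
for p in [1, 2, 3, 1, 2, 3, 1]:
    hint = pf.on_fault(p)
print(hint)   # [2]: page 2 has repeatedly followed page 1
```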

DOI: 10.1145/3575693.3575736


Ditto: End-to-End Application Cloning for Networked Cloud Services

Authors: Liang, Mingyu and Gan, Yu and Li, Yueying and Torres, Carlos and Dhanotia, Abhishek and Ketkar, Mahesh and Delimitrou, Christina
Keywords: architecture, benchmarking and emulation, cloud computing, microservices, software reverse engineering

Abstract

The lack of representative, publicly-available cloud services has been a recurring problem in the architecture and systems communities. While open-source benchmarks exist, they do not capture the full complexity of cloud services. Application cloning is a promising way to address this, however, prior work is limited to CPU-/cache-centric, single-node services, operating at user level.

We present Ditto, an automated framework for cloning end-to-end cloud applications, both monolithic and microservices, which captures I/O and network activity, as well as kernel operations, in addition to application logic. Ditto takes a hierarchical approach to application cloning, starting with capturing the dependency graph across distributed services, to recreating each tier’s control/data flow, and finally generating system calls and assembly that mimics the individual applications. Ditto does not reveal the logic of the original application, facilitating publicly sharing clones of production services with hardware vendors, cloud providers, and the research community.

We show that across a diverse set of single- and multi-tier applications, Ditto accurately captures their CPU and memory characteristics as well as their high-level performance metrics, is portable across platforms, and facilitates a wide range of system studies.

DOI: 10.1145/3575693.3575751


DPACS: Hardware Accelerated Dynamic Neural Network Pruning through Algorithm-Architecture Co-design

Authors: Gao, Yizhao and Zhang, Baoheng and Qi, Xiaojuan and So, Hayden Kwok-Hay
Keywords: Deep Learning Applications, hardware accelerators

Abstract

DPACS is an algorithm-architecture co-design framework for dynamic neural network pruning. It utilizes a hardware-aware dynamic spatial and channel pruning mechanism in conjunction with a dynamic dataflow engine in hardware to facilitate efficient processing of the pruned network.

DOI: 10.1145/3575693.3575728


Ecovisor: A Virtual Energy System for Carbon-Efficient Applications

Authors: Souza, Abel and Bashir, Noman and Murillo, Jorge and Hanafy, Walid and Liang, Qianlin and Irwin, David and Shenoy, Prashant
Keywords: Sustainable computing, cloud computing, operating systems

Abstract

Cloud platforms’ rapid growth is raising significant concerns about their carbon emissions. To reduce carbon emissions, future cloud platforms will need to increase their reliance on renewable energy sources, such as solar and wind, which have zero emissions but are highly unreliable. Unfortunately, today’s energy systems effectively mask this unreliability in hardware, which prevents applications from optimizing their carbon-efficiency, or work done per kilogram of carbon emitted. To address the problem, we design an “ecovisor”, which virtualizes the energy system and exposes software-defined control of it to applications. An ecovisor enables each application to handle clean energy’s unreliability in software based on its own specific requirements. We implement a small-scale ecovisor prototype that virtualizes a physical energy system to enable software-based application-level i) visibility into variable grid carbon-intensity and local renewable generation and ii) control of server power usage and battery charging and discharging. We evaluate the ecovisor approach by showing how multiple applications can concurrently exercise their virtual energy system in different ways to better optimize carbon-efficiency based on their specific requirements compared to general system-wide policies.

DOI: 10.1145/3575693.3575709


ElasticFlow Artifact

Authors: Gu, Diandian and Zhao, Yihao and Zhong, Yinmin and Xiong, Yifan and Han, Zhenhua and Cheng, Peng and Yang, Fan and Huang, Gang and Jin, Xin and Liu, Xuanzhe
Keywords: Cluster Scheduling, Distributed Deep Learning, GPU Cluster, Serverless Computing

Abstract

The artifact provides source code for the prototype of the proposed system ElasticFlow, including the main implementation of ElasticFlow, testbed experiment scripts (Sections 6.2 & 6.6), and cluster simulation scripts (Sections 6.3, 6.4 & 6.5). We provide a docker image with pre-installed prerequisites to simplify the testbed experiment workflow. Users can also use a script to install all software dependencies from scratch. Please refer to the documents in our repository for more details.

DOI: 10.1145/3575693.3575721


EVStore: Storage and Caching Capabilities for Scaling Embedding Tables in Deep Recommendation Systems

Authors: Kurniawan, Daniar H. and Wang, Ruipu and Zulkifli, Kahfi S. and Wiranata, Fandi A. and Bent, John and Vigfusson, Ymir and Gunawi, Haryadi S.
Keywords: Caching systems, Deep learning, Inference systems, Performance, Recommendation Systems

Abstract

Modern recommendation systems, primarily driven by deep-learning models, depend on fast model inference to be useful. To tackle the sparsity of the input space, particularly for categorical variables, such inferences are made by storing increasingly large embedding vector (EV) tables in memory. A core challenge is that the inference operation has an all-or-nothing property: each inference requires multiple EV table lookups, but if any memory access is slow, the whole inference request is slow. In our paper, we design, implement, and evaluate EVStore, a 3-layer EV table lookup system that harnesses both the structural regularity in inference operations and domain-specific approximations to provide optimized caching, yielding up to 23% and 27% reductions in average and p90 latency, respectively, while quadrupling throughput at a 0.2% loss in accuracy. Finally, we show that at a minor cost in accuracy, EVStore can reduce Deep Recommendation System (DRS) memory usage by up to 94%, yielding potentially enormous savings for these costly, pervasive systems.
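
A sketch of a 3-layer lookup in this spirit, where the layer policies, names, and int8 scale are illustrative assumptions rather than EVStore's actual design:

```python
# Layer 1: exact cache; layer 2: quantized (lossy) cache; layer 3: a
# domain-specific proxy approximation; only then fall through to storage.
def dequantize(qvec, scale=1 / 127):
    return [q * scale for q in qvec]

def lookup(key, l1_exact, l2_quant, proxy_of, storage):
    if key in l1_exact:
        return l1_exact[key]                 # layer 1: exact hit
    if key in l2_quant:
        return dequantize(l2_quant[key])     # layer 2: small, lossy hit
    proxy = proxy_of.get(key)
    if proxy in l1_exact:
        return l1_exact[proxy]               # layer 3: similar key's vector
    vec = storage[key]                       # miss: full fetch, then cache
    l1_exact[key] = vec
    return vec
```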

DOI: 10.1145/3575693.3575718


FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

Authors: Kao, Sheng-Chun and Subramanian, Suvinay and Agrawal, Gaurav and Yazdanbakhsh, Amir and Krishna, Tushar
Keywords: Attention, DNN Accelerators, Dataflow, Transformer

Abstract

Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adoption comes at the cost of prohibitively large memory requirements and computational complexity, especially for larger numbers of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic growth in memory footprint, leading to severe memory-boundedness and limited scalability in the number of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms, without altering their functionality. This dataflow processes costly attention operations through a unique fusion mechanism, transforming the quadratic growth of the memory footprint into merely linear growth. To realize the full potential of this bespoke mechanism, we propose a tiling approach that enhances data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck and reduces the on-chip memory requirement. FLAT delivers 1.94x (1.76x) speedup and 49% (42%) energy savings compared to state-of-the-art Edge (Cloud) accelerators with no customized dataflow optimization. When on-chip resources are scarce (20 KB-200 KB), FLAT yields, on average, 1.5x end-to-end latency reduction across a diverse range of conventional attention-based models with input sequence lengths ranging from 512 tokens to 64K tokens. Our evaluations demonstrate that state-of-the-art DNN dataflows applied to attention operations reach the efficiency limit for inputs above 512 elements. In contrast, FLAT unblocks transformer models for inputs with up to 64K elements.
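
The linear-memory arithmetic behind such fusion can be sketched with a running (online) softmax over key/value tiles; FLAT's contribution is an accelerator dataflow, so this NumPy sketch (which also omits the usual 1/sqrt(d) scaling) only illustrates why the N x N score matrix never needs to be materialized:

```python
import numpy as np

# Stream over key/value tiles, keeping a running softmax so memory
# grows linearly in sequence length instead of quadratically.
def fused_attention(q, k, v, tile=512):
    out = np.zeros((q.shape[0], v.shape[1]))
    m = np.full(q.shape[0], -np.inf)          # running row maximum
    denom = np.zeros(q.shape[0])              # running softmax denominator
    for s0 in range(0, k.shape[0], tile):
        s = q @ k[s0:s0 + tile].T             # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        rescale = np.exp(m - m_new)           # fix up previous partial sums
        p = np.exp(s - m_new[:, None])
        denom = denom * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ v[s0:s0 + tile]
        m = m_new
    return out / denom[:, None]

q, k, v = (np.random.randn(8, 16) for _ in range(3))
ref = np.exp(q @ k.T); ref /= ref.sum(1, keepdims=True)
assert np.allclose(fused_attention(q, k, v, tile=3), ref @ v)
```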

DOI: 10.1145/3575693.3575747


FrozenQubits: Boosting Fidelity of QAOA by Skipping Hotspot Nodes

Authors: Ayanzadeh, Ramin and Alavisamani, Narges and Das, Poulami and Qureshi, Moinuddin
Keywords: NISQ, QAOA, Quantum Computing

Abstract

Quantum Approximate Optimization Algorithm (QAOA) is one of the leading candidates for demonstrating the quantum advantage using near-term quantum computers. Unfortunately, high device error rates limit us from reliably running QAOA circuits for problems with more than a few qubits. In QAOA, the problem graph is translated into a quantum circuit such that every edge corresponds to two 2-qubit CNOT operations in each layer of the circuit. As CNOTs are extremely error-prone, the fidelity of QAOA circuits is dictated by the number of edges in the problem graph.

We observe that the majority of graphs corresponding to real-world applications follow a “power-law” distribution, where some hotspot nodes have a significantly higher number of connections. We leverage this insight and propose “FrozenQubits”, which freezes the hotspot nodes or qubits and intelligently partitions the state-space of the given problem into several smaller sub-spaces, which are then solved independently. The corresponding QAOA sub-circuits are significantly less vulnerable to gate and decoherence errors due to the reduced number of CNOT operations in each sub-circuit. Unlike prior circuit-cutting approaches, FrozenQubits does not require an exponentially complex post-processing step. Our evaluations with 5,300 QAOA circuits on eight different quantum computers from IBM show that FrozenQubits can improve the quality of solutions by 8.73x on average (and by up to 57x), albeit utilizing 2x more quantum resources.
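
A sketch of the freezing step on a toy graph (in the full method, frozen edges also fold into local fields of their neighbors; that bookkeeping is omitted here):

```python
import itertools

# Fix the k highest-degree ("hotspot") nodes and enumerate their 2^k spin
# assignments, yielding sub-problems whose QAOA circuits drop every CNOT
# incident on a frozen node.
def freeze_hotspots(edges, k=1):
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    hot = sorted(degree, key=degree.get, reverse=True)[:k]
    kept = [e for e in edges if e[0] not in hot and e[1] not in hot]
    for assignment in itertools.product((+1, -1), repeat=k):
        yield dict(zip(hot, assignment)), kept

edges = [(0, 1), (0, 2), (0, 3), (2, 3)]       # node 0 is the hotspot
for fixed, sub in freeze_hotspots(edges, k=1):
    print(fixed, sub)   # 2 sub-problems, each keeping only the (2, 3) edge
```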

DOI: 10.1145/3575693.3575741


Reproduction Package for Article “GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture”

Authors: Qureshi, Zaid and Mailthody, Vikram Sharma and Gelado, Isaac and Min, Seungwon and Masood, Amna and Park, Jeongmin and Xiong, Jinjun and Newburn, C. J. and Vainbrand, Dmitri and Chung, I-Hsin and Garland, Michael and Dally, William and Hwu, Wen-mei
Keywords: GPUDirect, GPUs, Memory capacity, Memory hierarchy, SSDs, Storage systems

Abstract

The artifact is the source code of the BaM system, which enables efficient, on-demand access to storage from GPU threads. The artifact includes the source code for the system’s kernel module, library, micro-benchmarks, and applications. The applications and micro-benchmarks support multiple state-of-the-art implementations as well as BaM implementations for easy comparison.

DOI: 10.1145/3575693.3575748


GZKP: A GPU Accelerated Zero-Knowledge Proof System

Authors: Ma, Weiliang and Xiong, Qian and Shi, Xuanhua and Ma, Xiaosong and Jin, Hai and Kuang, Haozhao and Gao, Mingyu and Zhang, Ye and Shen, Haichen and Hu, Weifang
Keywords: GPU acceleration, zero-knowledge proof

Abstract

Zero-knowledge proof (ZKP) is a cryptographic protocol that allows one party to prove the correctness of a statement to another party without revealing any information beyond the correctness of the statement itself. It guarantees computation integrity and confidentiality, and it is therefore increasingly adopted in industry for a variety of privacy-preserving applications, such as verifiable outsourced computing and digital currency. A significant obstacle to using ZKP in online applications is the performance overhead of its proof generation. We develop GZKP, a GPU-accelerated zero-knowledge proof system that supports different levels of security requirements and brings significant speedups toward making ZKP truly usable. For polynomial computation over a large finite field, GZKP promotes a cache-friendly memory access pattern while eliminating the costly external shuffle in existing solutions. For multi-scalar multiplication, GZKP adopts a new parallelization strategy, which aggressively combines integer elliptic curve point operations and exploits fine-grained task parallelism with load balancing for sparse integer distributions. GZKP outperforms state-of-the-art ZKP systems by an order of magnitude, achieving speedups of up to 48.1x.

DOI: 10.1145/3575693.3575711


Reproduction package for article ‘Hacky Racers: Exploiting Instruction-Level Parallelism to Generate Stealthy Fine-Grained Timers’

Authors: Xiao, Haocheng and Ainsworth, Sam
Keywords: Caches, Instruction-level Parallelism, JavaScript, Microarchitectural Security, Spectre

Abstract

Our artifact provides source code and scripts for the four timing side-channel experiments mentioned in this paper, corresponding to Sections 7.3 to 7.5; each demonstrates the efficiency and portability of the Racing and/or Magnifier Gadgets. Our results should be evaluated on an Intel or AMD x86 machine (we used an i7-8750H, but systems of a similar architecture will work out of the box). Migration to systems with other ISAs or significantly different micro-architectures requires minor source-code modifications.

DOI: 10.1145/3575693.3575700


Artifact for paper “Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs”

Authors: Ding, Yaoyao and Yu, Cody Hao and Zheng, Bojian and Liu, Yizhi and Wang, Yida and Pekhimenko, Gennady
Keywords: artifact, deep learning compiler, deep neural networks, inference

Abstract

This artifact helps readers reproduce all experiments in the evaluation section of our paper. Section 6 contains six experiments (one end-to-end experiment and five case studies). These experiments compare Hidet with other DNN frameworks and compilers on representative DNN models from the perspectives of execution latency, optimization time, schedule space, input sensitivity, and different batch sizes. In the public artifact, we provide scripts to launch the six experiments automatically. With the hardware and software described in Sections A.3.2 and A.3.3, the artifact should reproduce all experimental results in the evaluation section.

DOI: 10.1145/3575693.3575702


HuffDuff: Stealing Pruned DNNs from Sparse Accelerators

Authors: Yang, Dingqing and Nair, Prashant J. and Lis, Mieszko
Keywords: Side-channel attacks, Sparse DNN accelerators

Abstract

Deep learning models are a valuable “secret sauce” that confers a significant competitive advantage. Many models are never visible to the user, and even publicly known state-of-the-art models are either completely proprietary or only accessible via access-controlled APIs. Increasingly, these models run directly on the edge, often using a low-power DNN accelerator. This makes models particularly vulnerable, as an attacker with physical access can exploit side channels like off-chip memory access volumes. Indeed, prior work has shown that this channel can be used to steal dense DNNs from edge devices by correlating data transfer volumes with layer geometry. Unfortunately, prior techniques become intractable when the model is sparse in either weights or activations, because off-chip transfers no longer correspond exactly to layer dimensions. Could it be that the many mobile-class sparse accelerators are inherently safe from this style of attack? In this paper, we show that it is feasible to steal a pruned DNN model architecture from a mobile-class sparse accelerator using the DRAM access volume channel. We describe HuffDuff, an attack scheme with two novel techniques that leverage (i) the boundary effect present in CONV layers, and (ii) the timing side channel of on-the-fly activation compression. Together, these techniques dramatically reduce the space of possible model architectures by up to 94 orders of magnitude, resulting in fewer than 100 candidate models, a number that can be feasibly tested. Finally, we sample network instances from our solution space and show that (i) our solutions reach the victim accuracy under the iso-footprint constraint, and (ii) they significantly improve black-box targeted attack success rates.

DOI: 10.1145/3575693.3575738


Junkyard Computing: Repurposing Discarded Smartphones to Minimize Carbon

Authors: Switzer, Jennifer and Marcano, Gabriel and Kastner, Ryan and Pannuto, Pat
Keywords: cloud computing, life cycle assessment, sustainability

Abstract

1.5 billion smartphones are sold annually, and most are decommissioned less than two years later. Most of these unwanted smartphones are neither discarded nor recycled but languish in junk drawers and storage units. This computational stockpile represents a substantial wasted potential: modern smartphones have increasingly high-performance and energy-efficient processors, extensive networking capabilities, and a reliable built-in power supply. This project studies the ability to reuse smartphones as “junkyard computers.” Junkyard computers grow global computing capacity by extending device lifetimes, which supplants the manufacture of new devices. We show that the capabilities of even decade-old smartphones are within those demanded by modern cloud microservices and discuss how to combine phones to perform increasingly complex tasks. We describe how current operation-focused metrics do not capture the actual carbon costs of compute. We propose Computational Carbon Intensity—a performance metric that balances the continued service of older devices with the superlinear runtime improvements of newer machines. We use this metric to redefine device service lifetime in terms of carbon efficiency. We develop a cloudlet of reused Pixel 3A phones. We analyze the carbon benefits of deploying large, end-to-end microservice-based applications on these smartphones. Finally, we describe system architectures and associated challenges to scale to cloudlets with hundreds and thousands of smartphones.

DOI: 10.1145/3575693.3575710


Khuzdul: Efficient and Scalable Distributed Graph Pattern Mining Engine

Authors: Chen, Jingji and Qian, Xuehai
Keywords: distributed graph analytics, graph pattern mining

Abstract

This paper proposes Khuzdul, a distributed execution engine with a well-defined abstraction that can be integrated with existing single-machine graph pattern mining (GPM) systems to provide efficiency and scalability at the same time. The key novelty is the extendable embedding abstraction, which can express pattern enumeration algorithms, allow fine-grained task scheduling, and enable low-cost GPM-specific data reuse to reduce communication cost. The effective BFS-DFS hybrid exploration generates sufficient concurrent tasks for communication-computation overlapping with bounded memory consumption. Two scalable distributed GPM systems are implemented by porting AutoMine and GraphPi onto Khuzdul. Our evaluation shows that Khuzdul-based systems significantly outperform state-of-the-art distributed GPM systems with partitioned graphs, by up to 75.5x.

DOI: 10.1145/3575693.3575743


ASPLOS ’23 Artifact of “KIT: Testing OS-Level Virtualization for Functional Interference Bugs”

Authors: Liu, Congyu and Gong, Sishuai and Fonseca, Pedro
Keywords: Bugs, OS-level Virtualization, Testing

Abstract

Container isolation is implemented through OS-level virtualization, such as Linux namespaces. Unfortunately, these mechanisms are extremely challenging to implement correctly and, in practice, suffer from functional interference bugs, which compromise container security. In particular, functional interference bugs allow an attacker to extract information from another container running on the same machine or impact its integrity by modifying kernel resources that are incorrectly isolated. Despite their impact, functional interference bugs in OS-level virtualization have received limited attention in part due to the challenges in detecting them. Instead of causing memory errors or crashes, many functional interference bugs involve hard-to-catch logic errors that silently produce semantically incorrect results. This paper proposes KIT, a dynamic testing framework that discovers functional interference bugs in OS-level virtualization mechanisms, such as Linux namespaces. The key idea of KIT is to detect inter-container functional interference by comparing the system call traces of a container across two executions, where it runs with and without the preceding execution of another container. To achieve high efficiency and accuracy, KIT includes two critical components: an efficient algorithm to generate test cases that exercise inter-container data flows and a system call trace analysis framework that detects functional interference bugs and clusters bug reports. KIT discovered 9 functional interference bugs in Linux kernel 5.13, of which 6 have been confirmed. All bugs are caused by logic errors, showing that this approach is able to detect hard-to-catch semantic bugs.
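
The comparison step can be sketched as a trace diff, with trace entries simplified to (syscall, result) pairs rather than KIT's actual trace format:

```python
# Run a container's workload with and without another container executing
# first, then diff the system call traces; any divergence is a candidate
# functional interference bug.
def trace_divergence(trace_alone, trace_after_other):
    diffs = [(a, b) for a, b in zip(trace_alone, trace_after_other) if a != b]
    if len(trace_alone) != len(trace_after_other):
        diffs.append(("trace length", len(trace_alone), len(trace_after_other)))
    return diffs            # non-empty => report, then cluster similar reports

alone = [("getdents64", 2), ("read", 128)]
after = [("getdents64", 3), ("read", 128)]   # an extra entry became visible
assert trace_divergence(alone, after)        # interference candidate found
```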

DOI: 10.1145/3575693.3575731


Artifact of “LeaFTL: A Learning-Based Flash Translation Layer for Solid-State Drives”

Authors: Sun, Jinghan and Li, Shaobo and Sun, Yunxin and Sun, Chao and Vucinic, Dejan and Huang, Jian
Keywords: Flash Translation Layer, Learning-Based Storage, Solid-State Drive

Abstract

This artifact is for reproducing the experiment results in the paper. The artifact includes the simulator source code with LeaFTL implementation, datasets for evaluation, and also scripts and instructions for reproducing the results. More details on the artifact can be found in the GitHub README File.

DOI: 10.1145/3575693.3575744


Lucid Artifact

Authors: Hu, Qinghao and Zhang, Meng and Sun, Peng and Wen, Yonggang and Zhang, Tianwei
Keywords: Cluster Management, Machine Learning, Workload Scheduling

Abstract

This artifact appendix describes how to reproduce the main results in our paper. In our public repository, we provide the source code, the related dataset, and instructions for performing the artifact evaluation. Please refer to the README file for more details.

DOI: 10.1145/3575693.3575705


MC Mutants Artifact

Authors: Levine, Reese and Guo, Tianhao and Cho, Mingun and Baker, Alan and Levien, Raph and Neto, David and Quinn, Andrew and Sorensen, Tyler
Keywords: memory consistency, mutation testing, parallel programming models

Abstract

This artifact contains information for both collecting and analyzing the results we present in the paper. On the collection side, we provide the means to run the exact experiments included in the paper. Using the exact devices from the paper will show very similar results to ours, but any GPU can be used to evaluate the way in which we collect and analyze data. On the analysis side, we include the results from running the experiments on the four devices in the paper, as well as the analysis tools we used to generate the main figures in the paper.

DOI: 10.1145/3575693.3575750


Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers

Authors: Feng, Yangyang and Xie, Minhui and Tian, Zijie and Wang, Shuo and Lu, Youyou and Shu, Jiwu
Keywords: Neural networks, distributed training, parallel training

Abstract

Fine-tuning on cheap commodity GPU servers lets large-scale deep learning models benefit more people. However, low inter-GPU communication bandwidth and pressing communication contention on commodity GPU servers obstruct training efficiency. In this paper, we present Mobius, a communication-efficient system for fine-tuning large-scale models on commodity GPU servers. The key idea is a novel pipeline parallelism scheme that enables heterogeneous memory for large-scale model training while introducing fewer communications than existing systems. Mobius partitions the model into stages and carefully schedules them between GPU memory and DRAM to overlap communication with computation. It formulates pipeline execution as a mixed-integer programming problem to find the optimal pipeline partition. It also features a new stage-to-GPU mapping method, termed cross mapping, to minimize communication contention. Experiments on various model scales and GPU topologies show that Mobius significantly reduces training time, by 3.8-5.1x.

DOI: 10.1145/3575693.3575703


MSCCLang: Microsoft Collective Communication Language

Authors: Cowan, Meghan and Maleki, Saeed and Musuvathi, Madanlal and Saarikivi, Olli and Xiong, Yifan
Keywords: Collective Communication, Compilers, GPU

Abstract

Machine learning models with millions or billions of parameters are increasingly trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, collective communication becomes a bottleneck. Custom collective algorithms optimized for both particular network topologies and application-specific communication patterns can alleviate this bottleneck and help these applications scale. However, implementing correct and efficient custom algorithms is challenging.

This paper introduces MSCCLang, a system for programmable GPU communication. MSCCLang provides a domain-specific language for writing collective communication algorithms and an optimizing compiler for lowering them to an executable form, which can be executed efficiently and flexibly in an interpreter-based runtime. We used MSCCLang to write novel collective algorithms for AllReduce and AllToAll that are up to 1.9x faster.

DOI: 10.1145/3575693.3575724


Navigating the Dynamic Noise Landscape of Variational Quantum Algorithms with QISMET

Authors: Ravi, Gokul Subramanian and Smith, Kaitlin and Baker, Jonathan M. and Kannan, Tejas and Earnest, Nathan and Javadi-Abhari, Ali and Hoffmann, Henry and Chong, Frederic T.
Keywords: error mitigation, noisy intermediate-scale quantum, quantum computing, superconducting qubits, transient error, variational quantum algorithms, variational quantum eigensolver

Abstract

In the Noisy Intermediate Scale Quantum (NISQ) era, the dynamic nature of quantum systems causes noise sources to constantly vary over time. Transient errors from the dynamic NISQ noise landscape are challenging to comprehend and are especially detrimental to classes of applications that are iterative and/or long-running, and therefore their timely mitigation is important for quantum advantage in real-world applications.

The most popular examples of iterative, long-running quantum applications are variational quantum algorithms (VQAs). At each iteration, a VQA’s classical optimizer evaluates circuit candidates on an objective function and picks the best circuits for achieving the application’s target. Noise fluctuations can have a significant transient impact on the objective function estimates of the tuning candidates in a VQA iteration. This can severely affect VQA tuning and, by extension, its accuracy and convergence.

This paper proposes QISMET: Quantum Iteration Skipping to Mitigate Error Transients, to navigate the dynamic noise landscape of VQAs. QISMET actively avoids instances of high fluctuating noise which are predicted to have a significant transient error impact on specific VQA iterations. To achieve this, QISMET estimates transient error in VQA iterations and designs a controller to keep the VQA tuning faithful to the transient-free scenario. By doing so, QISMET efficiently mitigates a large portion of the transient noise impact on VQAs and is able to improve the fidelity by 1.3x-3x over a traditional VQA baseline, with 1.6-2.4x improvement over alternative approaches, across different applications and machines.
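
A sketch of the iteration-skipping control loop, with the linear trend predictor and threshold as illustrative assumptions rather than QISMET's actual controller:

```python
# Accept a VQA iteration's objective measurement only if it is consistent
# with the predicted transient-free trend; otherwise treat the shot as a
# noise transient and re-measure. Assumes len(history) >= 2.
def qismet_step(history, measure, threshold=0.05, max_retries=3):
    predicted = history[-1] + (history[-1] - history[-2])  # simple trend
    for _ in range(max_retries):
        value = measure()
        if abs(value - predicted) <= threshold:
            history.append(value)      # faithful iteration: accept it
            return value
    history.append(value)              # persistent change is real, keep it
    return value
```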

DOI: 10.1145/3575693.3575739


ASPLOS2023 Artifact for “NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers”

Authors: Liu, Jiawei and Lin, Jinkun and Ruffy, Fabian and Tan, Cheng and Li, Jinyang and Panda, Aurojit and Zhang, Lingming
Keywords: Compiler, Fuzzing, Machine Learning Systems, Testing

Abstract

The artifact contains evidence of bug finding, the source code of NNSmith’s prototype, and user-friendly HTML documentation for regenerating the results. Specifically, it includes (1) links to bugs reported by the authors as real-world bug-finding evidence, and (2) scripts and code to regenerate the main results in § 5. To make artifact evaluation as simple as possible, our artifact is packaged into a pre-built Docker image, along with detailed and friendly HTML documentation. To fully evaluate the artifact, an x86-CPU platform with Docker access is needed, with approximately 21 hours of machine time and 1 hour of manual inspection time.

DOI: 10.1145/3575693.3575707


NUBA: Non-Uniform Bandwidth GPUs

Authors: Zhao, Xia and Jahre, Magnus and Tang, Yuhua and Zhang, Guangda and Eeckhout, Lieven
Keywords: GPU, Non-Uniform Bandwidth Architecture (NUBA)

Abstract

The parallel execution model of GPUs enables scaling to hundreds of thousands of threads, a key capability that many modern high-performance applications exploit. GPU vendors are hence increasing the compute and memory resources with every GPU generation, resulting in the need to efficiently stitch together a plethora of Streaming Multiprocessors (SMs), Last-Level Cache (LLC) slices, and memory controllers while maximizing bandwidth and keeping power consumption and design complexity in check. Conventional GPUs are Uniform Bandwidth Architectures (UBAs), as they provide equal bandwidth between all SMs and all LLC slices. UBA GPUs require a uniform high-bandwidth Network-on-Chip (NoC), and our key observation is that provisioning a NoC to match the LLC slice bandwidth incurs a hefty power and complexity overhead. We propose the Non-Uniform Bandwidth Architecture (NUBA), a GPU system architecture aimed at fully utilizing LLC slice bandwidth. A NUBA GPU consists of partitions, each featuring a few SMs and LLC slices as well as a memory controller, thereby exposing the complete LLC bandwidth to the SMs within a partition, since they can be connected with point-to-point links, plus a NoC between partitions to enable access to remote data.

Exploiting the potential of NUBA GPUs, however, requires carefully co-designing system software, the compiler, and architectural policies. The critical system software component is our Local-And-Balanced (LAB) page placement policy, which enables the GPU driver to place data in local partitions while avoiding load imbalance. Moreover, we propose Model-Driven Replication (MDR), which identifies read-only shared data with data-flow analysis at compile time. At run time, MDR leverages an architectural mechanism that replicates read-only shared data across LLC slices when this can be done without pressuring cache capacity. With LAB and MDR, our NUBA GPU improves average performance by 23.1% and 22.2% (and up to 183.9% and 182.4%) compared to iso-resource memory-side and SM-side UBA GPUs, respectively. When the NUBA concept is leveraged to reduce overhead while maintaining similar performance, NUBA reduces NoC power consumption by 12.1x.

DOI: 10.1145/3575693.3575745


Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

Authors: Song, Jaeyong and Yim, Jinkyu and Jung, Jaewon and Jang, Hongsun and Kim, Hyung-Jin and Kim, Youngsok and Lee, Jinho
Keywords: 3D Parallelism, Communication Optimization, Distributed Systems, Gradient Compression, Large-scale NLP Training, Pipeline Parallelism, Systems for Machine Learning

Abstract

This repository is for the AE (Artifact Evaluation) process of ASPLOS’23.

In the ASPLOS23/ folder, scripts for pretraining (TABLE 2), speedup checks (TABLE 2, Fig. 10), memory consumption checks (Fig. 12), compression/decompression throughput checks (Fig. 14), and cosine similarity checks (Fig. 11) are available. We give a detailed guideline for these evaluations in the Evaluation Reproducing section. For the zero-shot accuracy checks (TABLE 3 and TABLE 4), the process is quite complex, so please refer to the Zero-Shot Task Running section. Note that the training script for TABLE 4 is available in the TABLE 2 training script folder. Other experiments (not main evaluations) for figures can be run by changing options in the speedup check scripts.

UPDATE: GPT-335M version scripts have been added to the ASPLOS23/fig10/ directory to test functionality on a small cluster.

Dataset creation is explained in the Dataset Preprocessing section. Make the pretraining dataset based on that guideline and use the binarized dataset.

For detailed arguments and settings, please refer to the explanations below.

DOI: 10.1145/3575693.3575712


Pond: CXL-Based Memory Pooling Systems for Cloud Platforms

Authors: Li, Huaicheng and Berger, Daniel S. and Hsu, Lisa and Ernst, Daniel and Zardoshti, Pantea and Novakovic, Stanko and Shah, Monish and Rajadnya, Samir and Lee, Scott and Agarwal, Ishwar and Hill, Mark D. and Fontoura, Marcus and Bianchini, Ricardo
Keywords: CXL, Compute Express Link, cloud computing, datacenter, memory disaggregation, memory pooling

Abstract

Public cloud providers seek to meet stringent performance requirements at low hardware cost. A key driver of performance and cost is main memory. Memory pooling promises to improve DRAM utilization and thereby reduce costs. However, pooling is challenging under cloud performance requirements. This paper proposes Pond, the first memory pooling system that both meets cloud performance goals and significantly reduces DRAM cost. Pond builds on the Compute Express Link (CXL) standard for load/store access to pool memory and on two key insights. First, our analysis of cloud production traces shows that pooling across 8-16 sockets is enough to achieve most of the benefits. This enables a small-pool design with low access latency. Second, it is possible to create machine learning models that can accurately predict how much local and pool memory to allocate to a virtual machine (VM) to resemble same-NUMA-node memory performance. Our evaluation with 158 workloads shows that Pond reduces DRAM costs by 7% with performance within 1-5% of same-NUMA-node VM allocations.
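
The allocation idea can be sketched as below, where the guard band is an illustrative assumption rather than Pond's tuned policy:

```python
# A model predicts how much of a VM's memory will actually be touched;
# the untouched tail can be served from the CXL pool without resembling
# a remote-NUMA slowdown for the hot working set.
def split_allocation(vm_size_gb, predicted_touched_gb, guard_gb=2):
    local = min(vm_size_gb, predicted_touched_gb + guard_gb)  # NUMA-local DRAM
    pool = vm_size_gb - local                                 # CXL pool memory
    return local, pool

assert split_allocation(64, predicted_touched_gb=20) == (22, 42)
```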

DOI: 10.1145/3575693.3578835


Prism: Optimizing Key-Value Store for Modern Heterogeneous Storage Devices

Authors: Song, Yongju and Kim, Wook-Hee and Monga, Sumit Kumar and Min, Changwoo and Eom, Young Ik
Keywords: Key-value Stores, Non-volatile Memory

Abstract

We address the question: “How should we design a key-value store for the non-hierarchical storage landscape?” We designed and implemented a novel key-value store named Prism that utilizes modern heterogeneous storage devices. This artifact contains three major components: 1) the source code of Prism, 2) a benchmark suite, and 3) a workload generator to evaluate the system. Additionally, the enclosed scripts allow readers to configure, build, and run Prism conveniently and precisely. For reference, since Prism is built on modern heterogeneous storage devices, including non-volatile memory and NVMe SSDs, there are some hardware dependencies. On the software side, Prism needs the PMDK and liburing libraries. Although it could be challenging to set up a testbed identical to ours, both the hardware and software requirements must be satisfied in order to obtain the expected key outcomes.

DOI: 10.1145/3575693.3575722


Probabilistic Concurrency Testing for Weak Memory Programs

Authors: Gao, Mingyu and Chakraborty, Soham and Ozkan, Burcu Kulahcioglu
Keywords: Concurrency, Randomized algorithms, Testing, Weak memory

Abstract

The upload is a VagrantBox package containing the artifact for the paper Probabilistic Concurrency Testing for Weak Memory Programs. This Vagrant package offers the experimental environment, which contains all code, benchmarks, and scripts needed to reproduce the experimental results in the paper.

DOI: 10.1145/3575693.3575729


Artifacts for “Propeller: A Profile Guided, Relinking Optimizer for Warehouse Scale Applications”

Authors: Shen, Han and Pszeniczny, Krzysztof and Lavaee, Rahman and Kumar, Snehasish and Tallam, Sriraman and Li, Xinliang David
Keywords: Binary Optimization, Datacenters, Distributed Build System, Post-Link Optimization, Profile Guided Optimization, Warehouse-Scale Applications

Abstract

The disassembly-driven, monolithic design of post-link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To reconcile the two and enable post-link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. Propeller uses basic block sections (a novel linker abstraction) to enable a new approach to post-link optimization without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and by leveraging distributed compiler actions during code generation.

In this artifact, we present the means to replicate the results on a standalone machine. We provide a cloud-hosted bare-metal machine that has been provisioned with the tooling and dependencies needed to run Propeller and Lightning BOLT. We use this environment to demonstrate Propeller’s approach to post-link optimization on a bootstrapped build of clang. We show that Propeller can achieve performance comparable to Lightning BOLT with lower peak memory consumption. A key aspect of Propeller-enabled optimizations is the integration with a distributed build system that provides caching. To demonstrate the effect of caching, we provide scripting that emulates the effect on a single machine. The scripts used to replicate the results are also publicly available at https://github.com/google/llvm-propeller.

DOI: 10.1145/3575693.3575727


Reproduction Package for Paper Protecting Data Integrity of Web Applications with Database Constraints Inferred from Application Code

Authors: Huang, Haochen and Shen, Bingyu and Zhong, Li and Zhou, Yuanyuan
Keywords: Data integrity, Database constraints, Static analysis, Web applications

Abstract

This repo is the code release for our ASPLOS 2023 paper, Protecting Data Integrity of Web Applications with Database Constraints Inferred from Application Code.

In the paper, we developed a static analysis tool to infer missing database constraints from the application source code.

DOI: 10.1145/3575693.3575699


Qompress: Efficient Compilation for Ququarts Exploiting Partial and Mixed Radix Operations for Communication Reduction

作者: Litteken, Andrew and Seifert, Lennart Maximilian and Chadwick, Jason and Nottingham, Natalia and Chong, Frederic T. and Baker, Jonathan M.
关键词: compilation, quantum computing, qudit

Abstract

Quantum computing is in an era of limited resources. Current hardware lacks high fidelity gates, long coherence times, and the number of computational units required to perform meaningful computation. Contemporary quantum devices typically use a binary system, where each qubit exists in a superposition of the 0 and 1 states. However, it is often possible to access the 2 or even 3 states in the same physical unit by manipulating the system in different ways. In this work, we consider automatically encoding two qubits into one four-state ququart via a compression scheme. We use quantum optimal control to design efficient proof-of-concept gates that fully replicate standard qubit computation on these encoded qubits. We extend qubit compilation schemes to efficiently route qubits on an arbitrary mixed-radix system consisting of both qubits and ququarts, reducing communication and minimizing excess circuit execution time introduced by longer-duration ququart gates. In conjunction with these compilation strategies, we introduce several methods to find beneficial compressions, reducing circuit error due to computation and communication by up to 50%.
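
The encoding at the heart of this scheme is easy to see concretely: the four basis states of two qubits map one-to-one onto the four levels of a ququart, and a two-qubit gate becomes a single-ququart operation. Below is a minimal sketch of that mapping (our illustration only; the paper designs its actual gates with quantum optimal control, and the CNOT-as-permutation matrix here is just an example).

```python
# Hypothetical sketch: two qubits |q1 q0> encoded as ququart levels 0..3.
import numpy as np

# Basis mapping: |00> -> level 0, |01> -> 1, |10> -> 2, |11> -> 3.
# A CNOT (control = q1, target = q0) then becomes a permutation of
# ququart levels: 0 -> 0, 1 -> 1, 2 -> 3, 3 -> 2.
CNOT_AS_QUQUART_GATE = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

state = np.full(4, 0.5)  # uniform superposition of |00>..|11>, now one ququart
print(CNOT_AS_QUQUART_GATE @ state)  # same amplitudes: CNOT just permutes them
```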

DOI: 10.1145/3575693.3575726


RAIZN: Redundant Array of Independent Zoned Namespaces

作者: Kim, Thomas and Jeon, Jekyeom and Arora, Nikhil and Li, Huaicheng and Kaminsky, Michael and Andersen, David G. and Ganger, Gregory R. and Amvrosiadis, George and Bjørling, Matias
关键词: RAID, Reliability, Storage, ZNS, Zoned Namespaces

Abstract

Source code for RAIZN: Redundant Array of Independent Zoned Namespaces (ASPLOS 23)

DOI: 10.1145/3575693.3575746


Reproduction Package for Article ‘Revisiting Log-structured Merging for KV Stores in Hybrid Memory Systems’

作者: Duan, Zhuohui and Yao, Jiabo and Liu, Haikun and Liao, Xiaofei and Jin, Hai and Zhang, Yu
关键词: Key-Value Store, Log-Structured Merge, LSM-tree Compaction, Non-Volatile Memory, Skip List

Abstract

All experimental results in Figures 6, 7, 8, 9, 10, 11, and 12 can be reproduced. These results reflect the performance of MioDB.

DOI: 10.1145/3575693.3575715


Scoped Buffered Release Persistency Model for GPUs

作者: Pandey, Shweta and Kamath, Aditya K and Basu, Arkaprava
关键词: Graphics Processing Unit, Persistent Memory

Abstract

We provide the source code and setup for our GPU persistency model, Scoped Buffered Release Persistency (SBRP). SBRP is a scope-aware, buffered persistency model that provides high performance to GPU applications that wish to persist data on Non-Volatile Memory (NVM). SBRP modifies the GPU hardware and has been implemented using GPGPU-Sim, a GPU simulator. For more details on the simulator requirements, check the README in the simulator folder.

This repository consists of the source code of the simulator, benchmarks used for evaluation and all scripts needed to replicate the figures in the paper.

DOI: 10.1145/3575693.3575749


Artifact for “ShakeFlow: Functional Hardware Description with Latency-Insensitive Interface Combinators”

作者: Han, Sungsoo and Jang, Minseong and Kang, Jeehoon
关键词: combinator, functional programming, hardware description language, latency insensitive interface

Abstract

This artifact contains our port of the Corundum 100Gbps NIC and BaseJump STL’s dataflow and network-on-chip modules to the ShakeFlow hardware description language, and scripts to reproduce the results presented in the paper.

For a full reproduction, the following hardware equipment is necessary:

  • Xilinx Alveo U200

  • 100Gbps NIC (e.g., Mellanox MCX556A-EDAT)

  • QSFP28 DAC cable

  • Two machines with a PCIe 3.0+ x16 slot

For the full details, refer to the README.md of https://github.com/kaist-cp/shakeflow.

DOI: 10.1145/3575693.3575701


Sigma: Compiling Einstein Summations to Locality-Aware Dataflow

作者: Zhao, Tian and Rucker, Alexander and Olukotun, Kunle
关键词: compiler, domain-specific language, hardware acceleration, index notation, machine learning, neural networks, reconfigurable dataflow accelerator

Abstract

Most dataflow accelerator compilers achieve high performance by mapping each node in a dataflow program to a dedicated hardware element on a dataflow accelerator. However, this approach misses critical data reuse optimizations required to exploit the data bandwidth from fine-grained memory elements, e.g., FIFOs and pipeline registers. Moreover, writing performant dataflow programs requires users to have domain expertise in the underlying dataflow accelerators.

To address these issues, we designed Sigma, a novel compiler that supports high-level programming constructs such as Einstein summations, index notations, and tensors, finds opportunities for data reuse from high-level dataflow graphs, and exploits on-chip data bandwidth from fine-grained memory elements. Sigma, targeting a research dataflow accelerator, demonstrates a 5.4x speedup and 44.6x area-normalized speedup over Nvidia’s V100 accelerator, and a 7.1x speedup over hand-written dataflow programs.

DOI: 10.1145/3575693.3575694


SMAPPIC: Scalable Multi-FPGA Architecture Prototype Platform in the Cloud

作者: Chirkov, Grigory and Wentzlaff, David
关键词: FPGA, Modeling, cloud, heterogeneity, interconnect, multi-die, multicore

Abstract

Traditionally, architecture prototypes are built on top of FPGA infrastructure, with two associated problems. First, very large FPGAs are prohibitively expensive for most people and institutions. Second, the burden of FPGA development weighs on already-stretched researchers, especially those who focus on software. Large designs that do not fit into a single FPGA exacerbate these issues even more. This work presents SMAPPIC, the first open-source prototype platform for shared-memory multi-die architectures on cloud FPGAs. SMAPPIC leverages the OpenPiton/BYOC infrastructure and AWS F1 instances to make FPGA-based prototypes of System-on-Chips, processor cores, accelerators, cache subsystems, etc., cheap, scalable, and straightforward. SMAPPIC enables many use cases that are not possible or significantly more complicated in existing software and FPGA tools. This work has the potential to accelerate the rate of innovation in computer engineering fields in the near future.

DOI: 10.1145/3575693.3575753


Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow

作者: Li, Zhiyao and Li, Jiaxiang and Chen, Taijie and Niu, Dimin and Zheng, Hongzhong and Xie, Yuan and Gao, Mingyu
关键词: dataflow, hardware acceleration, sparse matrix multiplication

Abstract

Sparse matrix-matrix multiplication (SpGEMM) is widely used in many scientific and deep learning applications. The highly irregular structures of SpGEMM limit its performance and efficiency on conventional computation platforms, and thus motivate a large body of specialized hardware designs. Existing SpGEMM accelerators only support specific types of rigid execution dataflow such as inner/outer-product or row-based schemes. Each dataflow is only optimized for certain sparse patterns and fails to generalize with robust performance to the widely diverse SpGEMM workloads across various domains. We propose Spada, a combination of three novel techniques for SpGEMM accelerators to efficiently adapt to various sparse patterns. First, we describe a window-based adaptive dataflow that can be flexibly adapted to different modes to best match the data distributions and realize different reuse benefits. Then, our hardware architecture efficiently supports this dataflow template, with flexible, fast, and low-cost reconfigurability and effective load balancing features. Finally, we use a profiling-guided approach to detect the sparse pattern and determine the optimized dataflow mode to use, based on the key observations of sparse pattern similarity in nearby matrix regions. Our evaluation results demonstrate that Spada is able to match or exceed the best among three state-of-the-art SpGEMM accelerators, and avoid the performance degradation of the others if data distribution and dataflow mismatch. It achieves an average 1.44× speedup.
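
For readers unfamiliar with the dataflow terminology, the sketch below shows one candidate dataflow Spada can adapt to, Gustavson's row-based scheme, in plain Python (an illustration of the dataflow only; Spada implements such schemes in hardware, and the matrices here are made up).

```python
# Gustavson's (row-based) SpGEMM: C[i,:] = sum_k A[i,k] * B[k,:].
def spgemm_gustavson(A, B):
    """A, B: sparse matrices as {row: {col: value}} dictionaries."""
    C = {}
    for i, a_row in A.items():
        acc = {}                          # accumulator for row i of C
        for k, a_ik in a_row.items():     # scale-and-accumulate rows of B
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 2: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}, 2: {2: 6}}
print(spgemm_gustavson(A, B))             # {0: {1: 4, 2: 12}, 1: {0: 15}}
```

Inner- and outer-product dataflows reorder these same loops, which is why different sparsity patterns favor different modes.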

DOI: 10.1145/3575693.3575706


SpecPMT: Speculative Logging for Resolving Crash Consistency Overhead of Persistent Memory

作者: Ye, Chencheng and Xu, Yuanchao and Shen, Xipeng and Sha, Yan and Liao, Xiaofei and Jin, Hai and Solihin, Yan
关键词: logging, microarchitecture, persistent memory, transaction

Abstract

Crash consistency overhead is a long-standing barrier to the adoption of byte-addressable persistent memory in practice. Despite continuous progress, persistent transactions for crash consistency still incur a 5.6X slowdown, making persistent memory prohibitively costly in practical settings. This paper introduces speculative logging, a new method that forgoes most memory fences and reduces data persistence overhead by logging data values early. This technique enables a novel persistent transaction model, speculatively persistent memory transactions (SpecPMT). Our evaluation shows that SpecPMT reduces the execution time overheads of persistent transactions substantially to just 10%.
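
To make the logging idea concrete, here is a toy software analogue (SpecPMT itself is a hardware/transaction design; this sketch only illustrates why capturing values early lets fences be batched rather than issued per write, and all names in it are ours).

```python
# Toy undo-log transaction over a dict standing in for persistent memory.
class UndoLogTx:
    def __init__(self, store):
        self.store = store   # stand-in for persistent memory
        self.log = []        # stand-in for the persistent undo log

    def write(self, key, value):
        # Conventional schemes persist the old value and fence *before*
        # every data write; speculative logging captures values early so
        # log persistence proceeds in the background with batched fences.
        self.log.append((key, self.store.get(key)))
        self.store[key] = value

    def abort(self):
        for key, old in reversed(self.log):   # roll back, newest first
            if old is None:
                self.store.pop(key, None)
            else:
                self.store[key] = old
        self.log.clear()

store = {"x": 1}
tx = UndoLogTx(store)
tx.write("x", 2)
tx.write("y", 3)
tx.abort()
print(store)   # {'x': 1}: the log restored the pre-transaction state
```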

DOI: 10.1145/3575693.3575696


Evaluation for “Stepwise Debugging for Hardware Accelerators”

作者: Berlstein, Griffin and Nigam, Rachit and Gyurgyik, Christophe and Sampson, Adrian
关键词: Accelerator Design, Accelerator Simulation, Debugging, Intermediate Language

Abstract

This artifact consists of one piece of software, the Cider Interpreter and Debugger for Calyx, alongside data and helper scripts. Cider is a simulator and debugger for hardware accelerators written in the Calyx IR. Because Cider is also a simulator, it can be used to interpret and debug hardware designs without lowering them from the IR to RTL.

This artifact seeks to reproduce the benchmark results discussed in our performance evaluation as well as the debugging process shown in section 3 of our paper. This supports our paper by showing the usability of Cider and how it compares to related tools, alongside demonstrating the debugging interface.

DOI: 10.1145/3575693.3575717


STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining

作者: Guo, Liwei and Choe, Wonkyo and Lin, Felix Xiaozhu
关键词: Edge computing, Machine Learning Systems, NLP inference

Abstract

Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases the app’s memory footprint several times over, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO delays as long as a few seconds, far exceeding the delays users will tolerate; pipelining layerwise model loading and execution does not hide IO either, due to the high skewness between IO and computation delays.

To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency vs. memory tension via two novel techniques. First, model sharding. STI manages model parameters as independently tunable shards, and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer. STI instantiates an IO/compute pipeline and uses a small buffer of preloaded shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy.
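
The elastic pipeline is essentially a bounded producer/consumer structure. The sketch below is a minimal, software-only rendition of the idea under our own simplifications (fixed shard order, made-up delays, no importance-driven tuning), not STI's implementation.

```python
# Overlap shard IO with compute via a background loader thread and a
# small preload buffer; the bounded queue plays the "preload buffer".
import queue
import threading
import time

NUM_SHARDS, PRELOAD = 8, 2

def load_shard(i):                      # stand-in for storage IO
    time.sleep(0.010)
    return f"shard{i}"

def compute(shard):                     # stand-in for executing a layer
    time.sleep(0.005)

buf = queue.Queue(maxsize=PRELOAD)      # bounds memory held by preloads

def loader():
    for i in range(NUM_SHARDS):
        buf.put(load_shard(i))          # blocks when the buffer is full

threading.Thread(target=loader, daemon=True).start()
for _ in range(NUM_SHARDS):
    compute(buf.get())                  # compute overlaps the next loads
print("pipeline drained")
```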

Atop two commodity SoCs, we build STI and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1–2 orders of magnitude lower memory, outperforming competitive baselines.

DOI: 10.1145/3575693.3575698


TensorIR: An Abstraction for Automatic Tensorized Program Optimization

作者: Feng, Siyuan and Hou, Bohan and Jin, Hongyi and Lin, Wuwei and Shao, Junru and Lai, Ruihang and Ye, Zihao and Zheng, Lianmin and Yu, Cody Hao and Yu, Yong and Chen, Tianqi
关键词: Deep Neural Network, Machine Learning Compiler, Tensor Computation

Abstract

Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration primitives, along with the emerging machine learning models, bring tremendous engineering challenges. In this paper, we present TensorIR, a compiler abstraction for optimizing programs with these tensor computation primitives. TensorIR generalizes the loop nest representation used in existing machine learning compilers to make tensor computation a first-class citizen. Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives. Experimental results show that TensorIR compilation automatically uses the tensor computation primitives for given hardware backends and delivers performance that is competitive with state-of-the-art hand-optimized systems across platforms.

DOI: 10.1145/3575693.3576933


Reproduction package for the article ‘TiLT: A Time-Centric Approach for Stream Query Optimization and Parallelization’

作者: Jayarajan, Anand and Zhao, Wei and Sun, Yudi and Pekhimenko, Gennady
关键词: compiler, intermediate representation, stream data analytics, temporal query processing

Abstract

This artifact appendix includes the source code and scripts to reproduce the scalability results and the real-world application performance in the article ‘TiLT: A Time-Centric Approach for Stream Query Optimization and Parallelization’. We include Docker containers to set up the runtime environment for all the experiments in order to support portability. Therefore, the artifact can be executed on any multi-core machine with the Docker engine installed. We also use the Linux gnuplot utility to generate figures from the collected performance numbers. We use the Ubuntu 20.04 operating system for running the scripts provided in the artifact.

DOI: 10.1145/3575693.3575704


TLP: A Deep Learning-Based Cost Model for Tensor Program Tuning

作者: Zhai, Yi and Zhang, Yu and Liu, Shuo and Chu, Xiaomeng and Peng, Jie and Ji, Jianmin and Zhang, Yanyong
关键词: compiler optimization, cost model, deep Learning, tensor program

Abstract

Tensor program tuning is a non-convex objective optimization problem, to which search-based approaches have proven to be effective. At the core of the search-based approaches lies the design of the cost model. Though deep learning-based cost models perform significantly better than other methods, they still fall short and suffer from the following problems. First, their feature extraction heavily relies on expert-level domain knowledge in hardware architectures. Even so, the extracted features are often unsatisfactory and require separate considerations for CPUs and GPUs. Second, a cost model trained on one hardware platform usually performs poorly on another, a problem we call cross-hardware unavailability. In order to address these problems, we propose TLP and MTL-TLP. TLP is a deep learning-based cost model that facilitates tensor program tuning. Instead of extracting features from the tensor program itself, TLP extracts features from the schedule primitives. We treat schedule primitives as tensor languages. TLP is thus a Tensor Language Processing task. In this way, the task of predicting the tensor program latency through the cost model is transformed into a natural language processing (NLP) regression task. MTL-TLP combines Multi-Task Learning and TLP to cope with the cross-hardware unavailability problem. We incorporate these techniques into the Ansor framework and conduct detailed experiments. Results show that TLP can speed up the average search time by 9.1×.
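
The "schedule primitives as language" idea can be pictured with a tiny featurizer: turn the primitive sequence into tokens and feed a regressor that predicts measured latency. The sketch below uses feature hashing purely for illustration; TLP's actual features and model are learned, and the primitive strings here are made up.

```python
# Hash a sequence of schedule primitives into a fixed-size feature vector.
def featurize(schedule_primitives, dim=64):
    vec = [0.0] * dim
    for pos, tok in enumerate(schedule_primitives):
        vec[hash((pos, tok)) % dim] += 1.0   # position-aware token bucket
    return vec

schedule = ["split(i, 4)", "reorder(i0, j, i1)", "vectorize(i1)"]
x = featurize(schedule)
# x can now be paired with measured latencies to train any regressor.
print(len(x), sum(x))   # 64 buckets, 3 token occurrences
```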

DOI: 10.1145/3575693.3575737


Artifacts of LAKE

作者: Fingler, Henrique and Tarte, Isha and Yu, Hangchen and Szekely, Ariel and Hu, Bodun and Akella, Aditya and Rossbach, Christopher J.
关键词: api remoting, kernel, ml for systems, systems for ml

Abstract

This artifact contains the kernel API remoting system, kernel drivers for the workloads, and benchmark scripts.

DOI: 10.1145/3575693.3575697


Artifacts to reproduce all experiments in uBFT: Microsecond-Scale BFT using Disaggregated Memory

作者: Aguilera, Marcos K. and Ben-David, Naama and Guerraoui, Rachid and Murat, Antoine and Xygkis, Athanasios and Zablotchi, Igor
关键词: Byzantine fault tolerance, disaggregated memory, fast path, finite memory, microsecond scale, RDMA, replication, signatureless

Abstract

This artifact contains all the necessary source code to compile, execute and generate the data of all the figures in uBFT: Microsecond-Scale BFT using Disaggregated Memory.

DOI: 10.1145/3575693.3575732


uGrapher: High-Performance Graph Operator Computation via Unified Abstraction for Graph Neural Networks

作者: Zhou, Yangjie and Leng, Jingwen and Song, Yaoxu and Lu, Shuwen and Wang, Mian and Li, Chao and Guo, Minyi and Shen, Wenting and Li, Yong and Lin, Wei and Liu, Xiangwen and Wu, Hanqing
关键词: AI Frameworks, Graph Neural Networks, Graphics Processing Unit

Abstract

As graph neural networks (GNNs) have achieved great success in many graph learning problems, it is of paramount importance to support their efficient execution. Different graphs and different operators present different patterns during execution. However, there is still a gap in the existing GNN acceleration research to explore adaptive parallelism. We show that existing GNN frameworks rely on handwritten static kernels, which fail to achieve the best performance across different graph operators and input graph structures. In this work, we propose uGrapher, a unified interface that achieves general high performance for different graph operators and datasets. The existing GNN frameworks can easily integrate our design thanks to its simple and unified API. To achieve this, we take a principled approach that decouples a graph operator’s computation from its schedule. We first build a GNN-specific operator abstraction that incorporates the semantics of graph tensors and graph loops. We explore various schedule strategies based on the abstraction that can balance the well-established trade-off relationship between parallelism, locality, and efficiency. Our evaluation shows that uGrapher can bring up to 29.1× performance improvement.

DOI: 10.1145/3575693.3575723


Reproduction Package for Article “VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers”

作者: You, Xin and Yang, Hailong and Lei, Kelun and Luan, Zhongzhi and Qian, Depei
关键词: Dynamic Binary Instrumentation, Performance Analysis, Value Profiler

Abstract

The provided Docker images contain pre-built VClinic and the compared value profilers. For X86 platforms, the Docker image “vclinic_artifact_x86.tar” should be used; for ARM platforms, the Docker image “vclinic_artifact_arm.tar” should be used. As Pin-based value profilers only support X86 platforms, we only include the built Pin-based value profilers in the “vclinic_artifact_x86.tar” Docker image. The detailed instructions for evaluating the artifacts, as well as the estimated evaluation time for each step on both platforms, are listed in “/home/vclinic_artifact/README.md”. A summary of how to set up the experimental environment is also listed in the README. Note that both “vclinic_artifact_x86.tar” and “vclinic_artifact_arm.tar” are pre-built Docker images, and one can directly follow the instructions in the README file to reproduce the evaluation results.

DOI: 10.1145/3575693.3576934


Artifact of Article “VDom: Fast and Unlimited Virtual Domains on Multiple Architectures”

作者: Yuan, Ziqi and Hong, Siyu and Chang, Rui and Zhou, Yajin and Shen, Wenbo and Ren, Kui
关键词: computer system security, operating system, operating systems, software security, virtualization

Abstract

The artifact of the paper “VDom: Fast and Unlimited Virtual Domains on Multiple Architectures” consists of the source code of the VDom-modified Linux kernel, user-space libraries, all evaluation benchmarks, and the scripts necessary to reproduce the paper’s evaluation results.

DOI: 10.1145/3575693.3575735


Artifact for paper “WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program”

作者: Won, Jaeyeon and Mendis, Charith and Emer, Joel S. and Amarasinghe, Saman
关键词: Approximate Nearest Neighbor Search, Auto-scheduling, Auto-tuning, Sparse Matrix, Sparse Tensor

Abstract

Our artifact includes programs for 1) training a cost model, 2) searching with a nearest neighbor search, and 3) evaluating the performance of the SuperSchedule. Our artifact is available at https://github.com/nullplay/Workload-Aware-Co-Optimization. Please follow the README.md instructions.

DOI: 10.1145/3575693.3575742


Where Did My Variable Go? Poking Holes in Incomplete Debug Information

作者: Assaiante, Cristian and D’Elia, Daniele Cono and Di Luna, Giuseppe Antonio and Querzoni, Leonardo
关键词: Debuggers, compiler bugs, compiler optimizations

Abstract

The availability of debug information for optimized executables can largely ease crucial tasks such as crash analysis. Source-level debuggers use this information to display program state in terms of source code, allowing users to reason on it even when optimizations alter program structure extensively. A few recent endeavors have proposed effective methodologies for identifying incorrect instances of debug information, which can mislead users by presenting them with an inconsistent program state.

In this work, we identify and study a related important problem: the completeness of debug information. Unlike correctness issues, for which an unoptimized executable can serve as a reference, we find there is no analogous oracle to determine whether the cause behind an unreported part of program state is an unavoidable effect of optimization or a compiler implementation defect. In this scenario, we argue that empirically derived conjectures on the expected availability of debug information can serve as an effective means to expose classes of these defects.

We propose three conjectures involving variable values and study how often synthetic programs compiled with different configurations of the popular gcc and LLVM compilers deviate from them. We then discuss techniques to pinpoint the optimizations behind such violations and minimize bug reports accordingly. Our experiments revealed, among others, 24 bugs already confirmed by the developers of the gcc-gdb and clang-lldb ecosystems.

DOI: 10.1145/3575693.3575720


Direct Mind-Machine Teaming (Keynote)

作者: Bhattacharjee, Abhishek
关键词: No keywords

Abstract

Direct mind-machine teaming will help us treat brain disorders, augment the healthy brain, and shed light on how the brain as an organ gives rise to the mind. Delivering on this promise requires the design of computer systems that delicately balance the tight power, latency, and bandwidth trade-offs needed to decode brain activity, stimulate biological neurons, and control assistive devices most effectively.

This talk presents my group’s design of a standardized and general computer architecture for future brain interfacing. Our design enables the treatment of several neurological disorders (most notably, epilepsy and movement disorders) and lays the groundwork for brain interfacing techniques that can help augment cognitive control and decision-making in the healthy brain. Central to our design is end-to-end hardware acceleration, from the microarchitectural to the distributed system level. Key insights are undergirded via detailed physical synthesis models and chip tape-outs in a 12nm CMOS process.

DOI: 10.1145/3582016.3587050


Language Models: The Most Important Compute Challenge of Our Time (Keynote)

作者: Catanzaro, Bryan
关键词: No keywords

Abstract

ChatGPT recently became one of the fastest growing new applications in history, thanks to its intriguing text generation capabilities that are able to answer questions, write poetry, and even problem solve. Large Language Models are now being integrated in fundamental ways into products around the tech industry. The possibilities are extraordinary, but much research remains to make these systems reliable and trustworthy, as well as integrate them into applications seamlessly. Additionally, the computational challenges behind large language modeling are also quite important. Systems for training and deploying these models must be highly scalable and run at extreme efficiency, because the amount of work necessary to converge a model can be extraordinarily large. The cost of deploying these models is a barrier to their deployment and must be lowered significantly. In this talk, I’ll discuss the work we have been doing at NVIDIA to optimize systems for Large Language Model training and inference, and highlight some of the challenges that remain for future work.

DOI: 10.1145/3582016.3587051


ABNDP: Co-optimizing Data Access and Load Balance in Near-Data Processing

作者: Tian, Boyu and Chen, Qihang and Gao, Mingyu
关键词: DRAM caches, load balance, near-data processing, task scheduling

Abstract

Near-Data Processing (NDP) has been a promising architectural paradigm to address the memory wall challenge for data-intensive applications. Typical NDP systems based on 3D-stacked memories contain massively parallel processing units, each of which can access its local memory as well as other remote memory regions in the system. In such an architecture, minimizing remote data accesses and achieving computation load balance exhibit a fundamental tradeoff, where existing solutions can only improve one but sacrifice the other. We propose ABNDP, which leverages novel hardware-software co-optimizations to simultaneously alleviate these two issues without making tradeoffs. ABNDP uses a novel and efficient distributed DRAM cache to allow additional data caching locations in the system, where the computation load at the original data hotspots can be distributed and balanced, without significantly increasing remote accesses. ABNDP also adopts a hybrid task scheduling policy that considers both the remote access cost and the load imbalance impact, and exploits the flexibility of the multiple data caching locations to decide the best computation place. Our evaluation shows that ABNDP successfully achieves the two goals of minimizing remote access cost and maintaining load balance, and significantly outperforms the baseline systems in terms of performance (by 1.7×).
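
The scheduling tradeoff can be boiled down to a scoring rule: weigh the cost of reaching a data copy against the load already queued there. The snippet below is our toy rendition of such a policy (weights, costs, and names are invented; ABNDP's real policy also accounts for its distributed DRAM cache).

```python
# Score candidate placements by remote-access cost plus a load penalty.
def pick_place(hop_cost, load, alpha=1.0, beta=0.5):
    """hop_cost: {place: hops to the data}; load: {place: queued tasks}."""
    return min(hop_cost, key=lambda p: alpha * hop_cost[p] + beta * load[p])

hop_cost = {"home_vault": 0, "replica_a": 2, "replica_b": 3}
load = {"home_vault": 10, "replica_a": 1, "replica_b": 0}
print(pick_place(hop_cost, load))  # replica_a: slightly remote, lightly loaded
```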

DOI: 10.1145/3582016.3582026


Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling

作者: Odemuyiwa, Toluwanimi O. and Asghari-Moghaddam, Hadi and Pellauer, Michael and Hegde, Kartik and Tsai, Po-An and Crago, Neal C. and Jaleel, Aamer and Owens, John D. and Solomonik, Edgar and Emer, Joel S. and Fletcher, Christopher W.
关键词: Hardware Acceleration, Sparse Computation, Tensor Algebra

Abstract

Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques—such as tiling—for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT’s key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3×).
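
The contrast with uniform coordinate-space tiling is easy to demonstrate: instead of fixed-size tiles, grow each tile until its occupancy (nonzero count) fills the buffer. This sketch captures only that one ingredient of DRT, in one dimension, with invented numbers.

```python
# Occupancy-driven tiling: pack rows into tiles by nonzero count, not row count.
def reflexive_tiles(nnz_per_row, buffer_capacity):
    tiles, cur, cur_nnz = [], [], 0
    for row, nnz in enumerate(nnz_per_row):
        if cur and cur_nnz + nnz > buffer_capacity:
            tiles.append(cur)
            cur, cur_nnz = [], 0
        cur.append(row)
        cur_nnz += nnz
    if cur:
        tiles.append(cur)
    return tiles

# Irregular sparsity: fixed 2-row tiles would leave most of an 8-entry
# buffer empty; occupancy-driven tiles keep it near full.
print(reflexive_tiles([1, 1, 1, 6, 2, 0, 0, 5], buffer_capacity=8))
# [[0, 1, 2], [3, 4, 5, 6], [7]]
```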

DOI: 10.1145/3582016.3582064


APEX: A Framework for Automated Processing Element Design Space Exploration using Frequent Subgraph Analysis

作者: Melchert, Jackson and Feng, Kathleen and Donovick, Caleb and Daly, Ross and Sharma, Ritvik and Barrett, Clark and Horowitz, Mark A. and Hanrahan, Pat and Raina, Priyanka
关键词: CGRA, design space exploration, domain-specific accelerators, graph analysis, hardware-software co-design, processing elements, reconfigurable accelerators, subgraph

Abstract

The architecture of a coarse-grained reconfigurable array (CGRA) processing element (PE) has a significant effect on the performance and energy-efficiency of an application running on the CGRA. This paper presents APEX, an automated approach for generating specialized PE architectures for an application or an application domain. APEX first analyzes application domain benchmarks using frequent subgraph mining to extract commonly occurring computational subgraphs. APEX then generates specialized PEs by merging subgraphs using a datapath graph merging algorithm. The merged datapath graphs are translated into a PE specification from which we automatically generate the PE hardware description in Verilog along with a compiler that maps applications to the PE. The PE hardware and compiler are inserted into a flexible CGRA generation and compilation toolchain that allows for agile evaluation of CGRAs. We evaluate APEX for two domains, machine learning and image processing. For image processing applications, our automatically generated CGRAs with specialized PEs achieve from 5% to 30% less area and from 22% to 46% less energy compared to a general-purpose CGRA. For machine learning applications, our automatically generated CGRAs consume 16% to 59% less energy and 22% to 39% less area than a general-purpose CGRA. This work paves the way for the creation of application domain-driven design-space exploration frameworks that automatically generate efficient programmable accelerators, with a much lower design effort for both hardware and compiler generation.

DOI: 10.1145/3582016.3582070


Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories

作者: Cheng, Lin and Ruttenberg, Max and Jung, Dai Cheol and Richmond, Dustin and Taylor, Michael and Oskin, Mark and Batten, Christopher
关键词: Manycore architecture, fine-grained threading, load-balancing, parallel programming, scratchpad memory

Abstract

Manycore architectures integrate hundreds of cores on a single chip by using simple cores and simple memory systems usually based on software-managed scratchpad memories (SPMs). However, such architectures are notoriously challenging to program, since the programmers need to manually manage all aspects of data movement and synchronization for both correctness and performance. We argue that this manycore programmability challenge is one of the key barriers to achieving the promise of manycore architectures. At the same time, the dynamic task parallel programming model is enjoying considerable success in addressing the programmability challenge of multi-core processors with tens of complex cores and hardware cache coherence. Conventional wisdom suggests a work-stealing runtime, which forms the core of most dynamic task parallel programming models, is ill-suited for manycore architectures. In this work, we demonstrate that a work-stealing runtime is not just feasible on manycore architectures with SPMs, but such a runtime can also significantly improve the performance of irregular workloads when executing on these architectures. We also explore three optimizations that allow the runtime to leverage unused SPM space for further performance benefit. Our dynamic task parallel programming framework achieves 1.2–28.5× speedup.
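
For readers who have not seen one, the core of a work-stealing runtime fits in a few lines: each worker pops its own deque LIFO and steals FIFO from victims when idle. The single-threaded simulation below shows only that discipline; it does not model SPMs or the paper's optimizations.

```python
# Toy work-stealing discipline: own tasks from the LIFO end, steals FIFO.
import collections
import random

def run(num_workers, tasks):
    deques = [collections.deque() for _ in range(num_workers)]
    deques[0].extend(tasks)                  # all work starts on worker 0
    done = 0
    while done < len(tasks):
        for w, dq in enumerate(deques):
            if dq:
                dq.pop()()                   # execute own task (LIFO end)
                done += 1
            else:
                victim = random.randrange(num_workers)
                if victim != w and deques[victim]:
                    dq.append(deques[victim].popleft())   # steal (FIFO end)
    return done

print(run(4, [lambda: None] * 16), "tasks executed")
```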

DOI: 10.1145/3582016.3582020


CaQR: A Compiler-Assisted Approach for Qubit Reuse through Dynamic Circuit

作者: Hua, Fei and Jin, Yuwei and Chen, Yanhao and Vittal, Suhas and Krsulich, Kevin and Bishop, Lev S. and Lapeyre, John and Javadi-Abhari, Ali and Zhang, Eddy Z.
关键词: circuit fidelity, mid-circuit measurement, qubit reuse, qubit usage

Abstract

Quantum measurement is important to quantum computing as it extracts the outcome of the circuit at the end of the computation. Previously, all measurements had to be done at the end of the circuit; otherwise, significant errors would be incurred. This is no longer the case: IBM recently started supporting dynamic circuits in hardware (instead of in software via a simulator). With mid-circuit hardware measurement, we can improve circuit efficacy and fidelity in three ways: (a) reduced qubit usage, (b) reduced swap insertion, and (c) improved fidelity. We demonstrate this using the real-world application Bernstein-Vazirani on real hardware and show that circuit resource usage can be improved by 60% and circuit fidelity by 15%. We design a compiler-assisted tool that can find and exploit the tradeoff between qubit reuse, fidelity, gate count, and circuit duration. We also develop a method for identifying whether qubit reuse will be beneficial for a given application. We evaluated our method on a representative set of important applications; we can reduce resource usage by up to 80% and improve circuit fidelity by up to 20%.
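
Why mid-circuit measurement saves qubits is a pure lifetime argument: once a logical qubit's last gate finishes, it can be measured and its physical qubit reset for the next logical qubit. The greedy count below illustrates that argument on made-up lifetimes; CaQR's compiler additionally weighs fidelity, gate count, and duration.

```python
# Count physical qubits needed when measure-and-reset reuse is allowed.
import heapq

def physical_qubits_needed(lifetimes):
    """lifetimes: (first_gate_time, last_gate_time) per logical qubit."""
    free = []        # min-heap of times at which a physical qubit frees up
    used = 0
    for start, end in sorted(lifetimes):
        if free and free[0] < start:
            heapq.heappop(free)   # reuse: measure, reset, reassign
        else:
            used += 1             # allocate a fresh physical qubit
        heapq.heappush(free, end)
    return used

# Four logical qubits whose lifetimes barely overlap: two physical suffice.
print(physical_qubits_needed([(0, 3), (1, 2), (4, 6), (5, 7)]))   # 2
```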

DOI: 10.1145/3582016.3582030


Reproduction Package for Article CaT: A Solver-Aided Compiler for Packet-Processing Pipelines

作者: Gao, Xiangyu and Raghunathan, Divya and Fang, Ruijie and Wang, Tao and Zhu, Xiaotong and Sivaraman, Anirudh and Narayana, Srinivas and Gupta, Aarti
关键词: code generation, integer linear programming, packet processing pipelines, program synthesis, Programmable switches

Abstract

This is the artifact evaluation package for the CaT compiler proposed in our paper. The instructions help reproduce the results from all tables and figures in the Implementation and Evaluation section.

DOI: 10.1145/3582016.3582036


Source Code for “Characterizing and Optimizing End-to-End Systems for Private Inference”

作者: Garimella, Karthik and Ghodsi, Zahra and Jha, Nandan Kumar and Garg, Siddharth and Reagen, Brandon
关键词: cryptography, machine learning, private inference protocols, systems for machine learning

Abstract

We open source our private inference simulator at the following GitHub repo: https://github.com/kvgarimella/characterizing-private-inference. We construct a model of a system for private inference and a simulator using SimPy to explore and evaluate tradeoffs under different system conditions. We model a single-client, single-server setting where inferences are queued in a FIFO manner and arrivals are generated by sampling from a Poisson distribution.
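
A stripped-down version of that queueing model fits in a dozen lines. The simulation below mirrors the setup described above (Poisson arrivals, FIFO service) without the SimPy dependency; the rate and service-time constants are placeholders, not the paper's measurements.

```python
# Single-server FIFO queue with Poisson arrivals and fixed service time.
import random

def mean_queueing_delay(n=100_000, arrival_rate=1.0, service_time=0.8, seed=0):
    random.seed(seed)
    clock = server_free = total_wait = 0.0
    for _ in range(n):
        clock += random.expovariate(arrival_rate)  # next inference arrives
        start = max(clock, server_free)            # FIFO: wait if server busy
        total_wait += start - clock
        server_free = start + service_time
    return total_wait / n

print(f"mean queueing delay: {mean_queueing_delay():.2f} time units")
```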

The repository itself contains four high-level directories. The directory garbled_circuits contains the raw data for benchmarking ReLU Garbling and Evaluation on an Intel Atom Z8350 embedded device (1.92GHz, 4 cores, 2GB RAM) and an AMD EPYC 7502 server (2.5GHz, 32 cores, 256GB RAM). These two devices represent our client and server, respectively. Next, the directory layer_parallel_HE contains our code and the raw data for applying layer-parallelism to linear layer homomorphic evaluations. The directory simulator contains our private inference simulator. Finally, artifact contains scripts to replicate key figures in our paper.

DOI: 10.1145/3582016.3582065


Cohort: Software-Oriented Acceleration for Heterogeneous SoCs

作者: Wei, Tianrui and Turtayeva, Nazerke and Orenes-Vera, Marcelo and Lonkar, Omkar and Balkind, Jonathan
关键词: accelerators, heterogeneous systems, programming models, shared memory

Abstract

Philosophically, our approaches to acceleration focus on the extreme. We must optimise accelerators to the maximum, leaving software to fix any hardware-software mismatches. Today’s software abstractions for programming accelerators leak hardware details, requiring changes to data formats and manual memory and coherence management, among other issues. This harms generality and requires deep hardware knowledge to efficiently program accelerators, a state which we consider hardware-oriented. This paper proposes Software-Oriented Acceleration (SOA), where software uses existing abstractions, like software shared-memory queues, to interact with accelerators. We introduce the Cohort engine which exploits these queues’ standard semantics to efficiently connect producers and consumers in software with accelerators with minimal application changes. Accelerators are even usable in chains, which software can reconfigure at runtime. Cohort significantly reduces the burden to add new accelerators while maintaining system-level guarantees. We implement a Cohort FPGA prototype which supports SOA applications running on multicore Linux. Our evaluation shows speedups for Cohort over traditional approaches of at least 1.83×.
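
The interface contract is the interesting part: producers and consumers keep ordinary shared-memory queue semantics, and an engine drains one queue through an accelerator into another. The thread below stands in for Cohort's hardware engine; the function names and the doubling "accelerator" are ours.

```python
# Software-Oriented Acceleration, emulated: unchanged queue API on both sides.
import queue
import threading

def engine(in_q, out_q, accel, n):
    for _ in range(n):                 # stand-in for the Cohort engine
        out_q.put(accel(in_q.get()))

in_q, out_q = queue.Queue(), queue.Queue()
t = threading.Thread(target=engine, args=(in_q, out_q, lambda x: 2 * x, 4))
t.start()
for i in range(4):
    in_q.put(i)                        # producer: plain shared-memory queue
t.join()
print([out_q.get() for _ in range(4)])   # [0, 2, 4, 6]
```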

DOI: 10.1145/3582016.3582059


Coyote: A Compiler for Vectorizing Encrypted Arithmetic Circuits (artifact)

作者: Malik, Raghav and Sheth, Kabir and Kulkarni, Milind
关键词: arithmetic circuits, Homomorphic encryption, vectorization

Abstract

This artifact contains everything necessary to replicate the results of the paper, including:

  • An implementation of the compiler described in the paper
  • A backend test harness for profiling vectorized code
  • Implementations of all benchmarks used in the evaluation
  • Scripts to automate compiling, running, and collecting data from the benchmarks

DOI: 10.1145/3582016.3582057


DefT: Boosting Scalability of Deformable Convolution Operations on GPUs

作者: Hanson, Edward and Horton, Mark and Li, Hai (Helen) and Chen, Yiran
关键词: AI acceleration, GPU systems, compiler techniques and optimizations, dynamic models, neural networks

Abstract

Deformable Convolutional Networks (DCN) have been proposed as a powerful tool to boost the representation power of Convolutional Neural Networks (CNN) in computer vision tasks via adaptive sampling of the input feature map. Much like vision transformers, DCNs utilize a more flexible inductive bias than standard CNNs and have also been shown to improve the performance of particular models. For example, drop-in DCN layers were shown to increase the AP score of Mask RCNN by 10.6 points while introducing only 1% additional parameters and FLOPs, improving the state-of-the-art model at the time of publication. However, despite evidence that more DCN layers placed earlier in the network can further improve performance, we have not seen this trend continue with further scaling of deformations in CNNs, unlike for vision transformers. Benchmarking experiments show that a realistically sized DCN layer scales poorly on GPUs, which motivates DefT.

DOI: 10.1145/3582016.3582017


Reproduction Package for Article ‘Disaggregated RAID Storage in Modern Datacenters’

作者: Shu, Junyi and Zhu, Ruidong and Ma, Yun and Huang, Gang and Mei, Hong and Liu, Xuanzhe and Jin, Xin
关键词: Disaggregated Storage, NVMe-oF, RAID, RDMA

Abstract

We provide the artifact for the ASPLOS 2023 paper “Disaggregated RAID Storage in Modern Datacenters”, including:

  • The main implementation of dRAID.
  • CloudLab testbed setup scripts.
  • FIO experiment scripts (Sec 9.2-9.5), which produce the main results of the paper.
  • YCSB experiment scripts (Sec 9.6).

DOI: 10.1145/3582016.3582027


Reproduction Package for Article ‘DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications’

作者: Lin, Mao and Zhou, Keren and Su, Pengfei
关键词: CUDA, GPU profilers, GPUs, Memory management

Abstract

The artifact includes DrGPUM and benchmarks, along with instructions to reproduce the results shown in the paper.

DOI: 10.1145/3582016.3582044


ASPLOS 2023 Artifact for “Efficient Compactions Between Storage Tiers with PrismDB”

作者: Raina, Ashwini and Lu, Jianan and Cidon, Asaf and Freedman, Michael J.
关键词: compaction, key-value store, PrismDB, storage, tiered

Abstract

This artifact consists of the source code of PrismDB and the necessary scripts to reproduce the evaluation of the paper “Efficient Compactions Between Storage Tiers with PrismDB”, ASPLOS 23.

NOTE: Source code for the baselines rocksdb and mutant is not provided here. For the rocksdb baseline, please refer to its official documentation on GitHub. For the mutant baseline, please follow the mutant SoCC paper.

DOI: 10.1145/3582016.3582052


Reproduction Package for Paper “Efficient Scheduler Live Update for Linux Kernel with Modularization”

作者: Ma, Teng and Chen, Shanpei and Wu, Yihao and Deng, Erwei and Song, Zhuo and Chen, Quan and Guo, Minyi
关键词: kernel, Linux scheduler, live update

Abstract

Plugsched is an SDK that enables live updating of the Linux kernel scheduler. It can dynamically replace the scheduler subsystem without rebooting the system or restarting applications, with only milliseconds of downtime. Plugsched helps developers dynamically add, delete, and modify kernel scheduling features in production environments, allowing the scheduler to be customized for specific scenarios.

DOI: 10.1145/3582016.3582054


eHDL: Turning eBPF/XDP Programs into Hardware Designs for the NIC

作者: Rivitti, Alessandro and Bifulco, Roberto and Tulumello, Angelo and Bonola, Marco and Pontarelli, Salvatore
关键词: FPGA, HLS, Hardware Offloading, Network Programming, eBPF

Abstract

Scaling network packet processing performance to meet the increasing speed of network ports requires software programs to carefully leverage the network devices’ hardware features. This is a complex task for network programmers, who need to learn and deal with the heterogeneity of device architectures and re-think their software to leverage them. In this paper we take the first steps to reverse this design process, enabling the automatic generation of tailored hardware designs starting from a network packet processing program. We introduce eHDL, a high-level synthesis tool that automatically generates hardware pipelines from unmodified Linux eBPF/XDP programs. eHDL is designed to enable software developers to directly define and implement the hardware functions they need in the NIC. We prototype eHDL targeting a Xilinx Alveo U50 FPGA NIC and evaluate it with a set of 5 eBPF/XDP programs. Our results show that the generated pipelines are efficient in terms of required hardware resources, using only 6.5%-13.3% of the FPGA, and always achieve the line-rate forwarding throughput with about 1 microsecond of per-packet forwarding latency. Compared to other network-specific high-level synthesis tools, eHDL enables software programmers with no hardware expertise to describe stateful functions that operate on the entire packet data. Compared to alternative processor-based solutions that perform eBPF/XDP offloading to a NIC, eHDL provides 10-100x higher throughput.

DOI: 10.1145/3582016.3582035


Exit-Less, Isolated, and Shared Access for Virtual Machines

作者: Yasukata, Kenichi and Tazaki, Hajime and Aublin, Pierre-Louis
关键词: Isolation, Shared Memory, Virtualization

Abstract

This paper explores Exit-Less, Isolated, and Shared Access (ELISA), a novel in-memory object sharing scheme for Virtual Machines (VMs).
ELISA has the isolation advantage over the shared memory directly exposed to guest VMs while its overhead is smaller than that of host-interposition relying on the costly exit from the VM context.
In a nutshell, ELISA isolates shared in-memory objects by Extended Page Table (EPT) separation, and a guest VM accesses them by switching the EPT context using VMFUNC, a low-overhead CPU instruction of Intel CPUs.
Our experiment shows that the overhead of ELISA is 3.5 times smaller than that of VMCALL-oriented host-interposition.
We demonstrate the benefits of ELISA through two use cases: by replacing VMCALL with ELISA, a VM networking system and an in-memory key-value store exhibit 163% and 64% higher performance, respectively.

DOI: 10.1145/3582016.3582042


Artifact for “Finding Unstable Code via Compiler-Driven Differential Testing”

作者: Li, Shaohua and Su, Zhendong
关键词: compiler, fuzzing, undefined behavior, Unstable code

Abstract

The artifact contains the code and datasets we used for our experiments, as well as scripts to generate the numbers, tables, and figures of our evaluation. Specifically, it includes (a) the Juliet testsuite used for evaluation; (b) scripts for running CompDiff, sanitizers, Coverity, CppCheck, and Infer on the Juliet testsuite; (c) scripts for reporting detection results of these tools; (d) scripts for generating bug statistics on 23 real-world programs; and (e) scripts for fuzzing a target with CompDiff-AFL++. Everything is packaged and pre-built as a docker image. A standard X86 Linux machine running docker is necessary to evaluate this artifact.

DOI: 10.1145/3582016.3582053


Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing

作者: Muñoz-Martínez, Francisco and Garg, Raveesh and Pellauer, Michael and Abellán, José L. and Acacio, Manuel E. and Krishna, Tushar
关键词: Dataflow, Deep Neural Network Accelerators, Memory Hierarchy, Merger-Reduction Network, Sparse-Sparse Matrix Multiplication

Abstract

Sparsity is a growing trend in modern DNN models.

Existing Sparse-Sparse Matrix Multiplication (SpMSpM) accelerators are tailored to a particular SpMSpM dataflow (i.e., Inner Product, Outer Product or Gustavson’s), which determines their overall efficiency. We demonstrate that this static decision inherently results in a suboptimal dynamic solution. This is because different SpMSpM kernels show varying features (i.e., dimensions, sparsity pattern, sparsity degree), which makes each dataflow better suited to different data sets.

In this work we present Flexagon, the first SpMSpM reconfigurable accelerator that is capable of performing SpMSpM computation by using the particular dataflow that best matches each case. The Flexagon accelerator is based on a novel Merger-Reduction Network (MRN) that unifies the concepts of reducing and merging in the same substrate, increasing efficiency. Additionally, Flexagon also includes a new L1 on-chip memory organization, specifically tailored to the different access characteristics of the input and output compressed matrices. Using detailed cycle-level simulation of contemporary DNN models from a variety of application domains, we show that Flexagon achieves average performance benefits of 4.59x, 1.71x, and 1.35x with respect to the state-of-the-art SIGMA-like, SpArch-like and GAMMA-like accelerators (265%, 67%, and 18%, respectively, in terms of average performance/area efficiency).

DOI: 10.1145/3582016.3582069


Going beyond the Limits of SFI: Flexible and Secure Hardware-Assisted In-Process Isolation with HFI

作者: Narayan, Shravan and Garfinkel, Tal and Taram, Mohammadkazem and Rudek, Joey and Moghimi, Daniel and Johnson, Evan and Fallin, Chris and Vahldiek-Oberwagner, Anjo and LeMay, Michael and Sahita, Ravi and Tullsen, Dean and Stefan, Deian
关键词: SFI, Wasm, hardware-based isolation, sandboxing

Abstract

We introduce Hardware-assisted Fault Isolation (HFI), a simple extension to
existing processors to support secure, flexible, and efficient in-process
isolation. HFI addresses the limitations of existing software-based isolation
(SFI) systems including: runtime overheads, limited scalability, vulnerability
to Spectre attacks, and limited compatibility with existing code. HFI can
seamlessly integrate with current SFI systems (e.g., WebAssembly), or directly
sandbox unmodified native binaries. To ease adoption, HFI relies only on
incremental changes to the data and control path of existing high-performance
processors. We evaluate HFI for x86-64 using the gem5 simulator and
compiler-based emulation on a mix of real and synthetic workloads.

DOI: 10.1145/3582016.3582023


Reproduction Package for Article “GRACE: A Scalable Graph-Based Approach To Accelerating Recommendation Model Inference”

作者: Ye, Haojie and Vedula, Sanketh and Chen, Yuhan and Yang, Yichen and Bronstein, Alex and Dreslinski, Ronald and Mudge, Trevor and Talati, Nishil
关键词: Algorithm-System Co-Design, DLRM, Embedding Reduction

Abstract

Our paper “GRACE: A Scalable Graph-Based Approach To Accelerating Recommendation Model Inference” presents an algorithm-system co-design for improving the performance of the embedding layer in Deep Learning Recommendation Models (DLRMs). This artifact reproduces some of the main results of our paper. The performance results shown in the paper are machine-dependent. For example, Fig. 8, Fig. 13, and Fig. 14 show results on a CPU-GPU system, HBM-only system, and DIMM-HBM system with Processing-In-Memory (PIM) capability, respectively. To enable reproducing results in a timely fashion on different machines, we reproduce the main result of our paper that is machine-independent (Fig. 10). Specifically, our instructions include 1) how to download the input datasets, 2) how to pre-process these datasets, 3) how to reproduce the memory traffic reduction results for each baseline, and 4) how to generate a plot similar to Fig. 10. Expected result: compared to a no-reduction baseline, GRACE reduces the memory traffic by 1.7x.

DOI: 10.1145/3582016.3582029


Graphene: An IR for Optimized Tensor Computations on GPUs

作者: Hagedorn, Bastian and Fan, Bin and Chen, Hanfeng and Cecka, Cris and Garland, Michael and Grover, Vinod
关键词: Code Generation, Compiler, Deep Learning, GPU, Intermediate Representation, Optimization, Tensor Computations

Abstract

Modern GPUs accelerate computations and data movements of multi-dimensional tensors in hardware. However, expressing optimized tensor computations in software is extremely challenging even for experts. Languages like CUDA C++ are centered around flat buffers in one-dimensional memory and lack reasonable abstractions for multi-dimensional data and threads. Existing tensor IRs are not expressive enough to represent the complex data-to-thread mappings required by the GPU tensor instructions.

In this paper, we introduce Graphene, an intermediate representation (IR) for optimized tensor computations on GPUs. Graphene is a low-level target language for tensor compilers and performance experts while being closer to the domain of tensor computations than languages offering the same level of control, such as CUDA C++ and PTX. In Graphene, multi-dimensional data and threads are represented as first-class tensors. Graphene’s tensors are hierarchically decomposable into tiles, allowing optimized tensor computations to be represented as mappings between data and thread tiles.
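
The decomposition idea can be shown without any GPU: tile a tensor's index space hierarchically, then read the outer level as "thread blocks" and the inner level as "threads". This is only our cartoon of the concept; Graphene's tiles also carry layouts and map onto real tensor instructions.

```python
# Hierarchical tiling of a 2-D index space.
def tile_2d(shape, tile):
    """Yield the origin (i, j) of every tile covering the given shape."""
    (rows, cols), (tr, tc) = shape, tile
    for i in range(0, rows, tr):
        for j in range(0, cols, tc):
            yield (i, j)

blocks = list(tile_2d((64, 64), (32, 32)))   # outer tiles -> "thread blocks"
threads = list(tile_2d((32, 32), (4, 4)))    # inner tiles -> "threads"
print(len(blocks), "blocks,", len(threads), "threads per block")  # 4, 64
```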

We evaluate Graphene using some of the most important tensor computations in deep learning today, including GEMM, Multi-Layer Perceptron (MLP), Layernorm, LSTM, and Fused Multi-Head Attention (FMHA). We show that Graphene is capable of expressing all optimizations required to achieve the same practical peak performance as existing library implementations. Fused kernels beyond library routines expressed in Graphene significantly improve the end-to-end inference performance of Transformer networks and match or outperform the performance of cuBLAS(Lt), cuDNN, and custom handwritten kernels.

DOI: 10.1145/3582016.3582018


Reproduction pakage for Heron

作者: Bi, Jun and Guo, Qi and Li, Xiaqing and Zhao, Yongwei and Wen, Yuanbo and Guo, Yuxuan and Zhou, Enshuai and Hu, Xing and Du, Zidong and Li, Ling and Chen, Huaping and Chen, Tianshi
关键词: code generation, compiler optimization, tensor computation

Abstract

This artifact describes how to set up the compilation infrastructure of HERON and how to run the workloads described in Section 6.2. Concretely, this guide provides instructions to:

  • Set up the experimental environment of HERON.
  • Run experiments to demonstrate the optimization ability of HERON, as shown in Figures 6, 7, 8, and 10.
  • Visualize the search spaces, as shown in Figure 11.
  • Run experiments to demonstrate the effectiveness of CGA, as shown in Figures 12 and 13.

DOI: 10.1145/3582016.3582061


Artifact for “Homunculus: Auto-Generating Efficient Data-Plane ML Pipelines for Datacenter Networks” - ASPLOS 2023

作者: Swamy, Tushar and Zulfiqar, Annus and Nardi, Luigi and Shahbaz, Muhammad and Olukotun, Kunle
关键词: ML Compilers, Per-packet ML, Self-driving Networks

Abstract

The artifact contains the source code for the titular Homunculus compiler, a backend for the Taurus ASIC switch architecture, as well as three representative applications. We used these applications to demonstrate the core results of our paper, i.e., how Homunculus-generated models outperform or match the hand-tuned baseline versions. We include applications for anomaly detection, traffic classification, and botnet detection. Homunculus also generates the appropriate hardware code for each of these applications to run on a Taurus switch architecture.

DOI: 10.1145/3582016.3582022


Hyperscale Hardware Optimized Neural Architecture Search

作者: Li, Sheng and Andersen, Garrett and Chen, Tao and Cheng, Liqun and Grady, Julian and Huang, Da and Le, Quoc V. and Li, Andrew and Li, Xin and Li, Yang and Liang, Chen and Lu, Yifeng and Ni, Yun and Pang, Ruoming and Tan, Mingxing and Wicke, Martin and Wu, Gang and Zhu, Shengqi and Ranganathan, Parthasarathy and Jouppi, Norman P.
关键词: Accelerator, Deep Learning, GPU, Hyperscale Hardware, Machine Learning, Neural Architecture Search, Pareto Optimization, TPU

Abstract

Recent advances in machine learning have leveraged dramatic increases in computational power, a trend expected to continue in the future. This paper introduces the first Hyperscale Hardware Optimized Neural Architecture Search (H2O-NAS) to automatically design accurate and performant machine learning models tailored to the underlying hardware architecture. H2O-NAS consists of three key components: a new massively parallel “one-shot” search algorithm with intelligent weight sharing, which can scale to search spaces of O(10^280) and handle large volumes of production traffic; hardware-optimized search spaces for diverse ML models on heterogeneous hardware; and a novel two-phase hybrid performance model and a multi-objective reward function optimized for large-scale deployments. H2O-NAS has been implemented around state-of-the-art machine learning models (e.g., convolutional models, vision transformers, and deep learning recommendation models) and deployed at zettaflop scale in production. Our results demonstrate significant improvements in performance (22%∼56%) and energy efficiency (17%∼25%) at the same or better quality. Our solution is designed for large-scale deployment, streamlining privacy and security processes and reducing manual overhead. This facilitates a smooth and automated transition from research to production.

DOI: 10.1145/3582016.3582049


Infinity Stream: Portable and Programmer-Friendly In-/Near-Memory Fusion

作者: Wang, Zhengrong and Liu, Christopher and Arora, Aman and John, Lizy and Nowatzki, Tony
关键词: In-Memory Computing, Near-Memory Computing, Programmer-Transparent Acceleration, Stream-Based ISAs

Abstract

In-memory computing with large last-level caches is promising to dramatically alleviate data movement bottlenecks and expose massive bitline-level parallelization opportunities. However, key challenges from its unique execution model remain unsolved: automated parallelization, transparently orchestrating data transposition/alignment/broadcast for bit-serial logic, and mixing in-/near-memory computing. Most importantly, the solution should be programmer friendly and portable across platforms. Our key innovation is an execution model and intermediate representation (IR) that enables hybrid CPU-core, in-memory, and near-memory processing. Our IR is the tensor dataflow graph (tDFG), which is a unified representation of in-memory and near-memory computation. The tDFG exposes tensor-data structure information so that the hardware and runtime can automatically orchestrate data management for bit-serial execution, including runtime data layout transformations. To enable microarchitecture portability, we use a two-phase, JIT-based compilation approach to dynamically lower the tDFG to in-memory commands. Our design, Infinity Stream, is evaluated on a cycle-accurate simulator. Across data-processing workloads with fp32, it achieves 2.6× speedup.

DOI: 10.1145/3582016.3582032


In-Network Aggregation with Transport Transparency for Distributed Training

作者: Liu, Shuo and Wang, Qiaoling and Zhang, Junyi and Wu, Wenfei and Lin, Qinliang and Liu, Yao and Xu, Meng and Canini, Marco and Cheung, Ray C. C. and He, Jianfei
关键词: Distributed Training, FPGA, In-Network Aggregation, RDMA

Abstract

Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks to replace the transport layer. Such an INA-oriented network stack cannot take advantage of state-of-the-art performant transport layer implementations, and it also adds complexity to system development and operation. We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer. The switch performs aggregation operations but preserves data transmission connections. The host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce achieves performance gains from both INA and RoCE: linear scalability, traffic reduction, and bandwidth freeing-up from INA, plus high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise parallel all-reduce based on NetReduce to use intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch. We compare NetReduce with existing programmable switch-based solutions and justify the FPGA-based design choice. We evaluate NetReduce's performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce interoperates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., 70% for AlexNet), reduces CPU overheads (e.g., only one core for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption for 12.3-57.9% more performance).
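
The switch-side aggregation beneath the transport layer can be pictured with the minimal sketch below: gradient fragments are keyed by sequence number, summed across workers, and released once every worker has contributed, while the RoCE connections themselves are untouched. The data structures are illustrative assumptions, not NetReduce's actual design.

```python
# Toy switch-side best-effort aggregation: payloads are summed per
# sequence number; connection handling is deliberately absent because
# the switch only touches payloads, not the transport state.
from collections import defaultdict

NUM_WORKERS = 4
pending = defaultdict(lambda: {"sum": None, "seen": set()})

def on_fragment(seq: int, worker: int, grad: list):
    slot = pending[seq]
    slot["sum"] = grad if slot["sum"] is None else [a + b for a, b in zip(slot["sum"], grad)]
    slot["seen"].add(worker)
    if len(slot["seen"]) == NUM_WORKERS:   # all ranks have contributed
        return pending.pop(seq)["sum"]     # aggregated result to forward
    return None                            # keep buffering
```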

DOI: 10.1145/3582016.3582037


Kodan: Addressing the Computational Bottleneck in Space

作者: Denby, Bradley and Chintalapudi, Krishna and Chandra, Ranveer and Lucia, Brandon and Noghabi, Shadi
关键词: IoT, edge computing, embedded systems

Abstract

Decreasing costs of deploying space vehicles to low-Earth orbit have fostered an emergence of large constellations of satellites. However, high satellite velocities, large image data quantities, and brief ground station contacts create a data downlink challenge. Orbital edge computing (OEC), which filters data at the space edge, addresses this downlink bottleneck but shifts the challenge to the inelastic computational capabilities onboard satellites. In this work, we present Kodan: an OEC system that maximizes the utility of saturated satellite downlinks while mitigating the computational bottleneck. Kodan consists of two phases. A one-time transformation step uses a reference implementation of a satellite data analysis application, along with a representative dataset, to produce specialized ML models targeted for deployment to the space edge. After deployment to a target satellite, a runtime system dynamically selects the best specialized models for each data sample to maximize the valuable data downlinked within the constraints of the computational bottleneck. By intelligently filtering low-value data and prioritizing high-value data for transmission via the saturated downlink, Kodan increases data value density by 89 to 97 percent.
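
A minimal sketch of the runtime's per-sample model selection, assuming a greedy value-density policy under a fixed per-contact compute budget; the policy, threshold, and numbers are illustrative, not Kodan's exact algorithm.

```python
# Hypothetical greedy selection: per sample, pick the specialized model
# with the best value-per-cost, skip low-value samples, and stop when
# the compute budget is exhausted.
def select_models(samples, models, compute_budget, min_value=0.2):
    """samples: list of dicts; models: list of (name, cost, value_fn),
    where value_fn estimates the downlink value of a sample."""
    plan, spent = [], 0.0
    for s in samples:
        # Rank candidate models by estimated value density for this sample.
        name, cost, value_fn = max(models, key=lambda m: m[2](s) / m[1])
        if spent + cost <= compute_budget and value_fn(s) >= min_value:
            plan.append((s["id"], name))
            spent += cost
    return plan

models = [("full",  3.0, lambda s: s["cloud_free"] * 1.0),
          ("small", 1.0, lambda s: s["cloud_free"] * 0.6)]
samples = [{"id": i, "cloud_free": f} for i, f in enumerate([0.9, 0.1, 0.7])]
print(select_models(samples, models, compute_budget=4.0))
# -> [(0, 'small'), (2, 'small')]: the cloudy (low-value) sample is filtered
```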

DOI: 10.1145/3582016.3582043


LEGO: Empowering Chip-Level Functionality Plug-and-Play for Next-Generation IoT Devices

作者: Zhang, Chong and Li, Songfan and Song, Yihang and Meng, Qianhe and Chen, Minghua and Bai, YanXu and Lu, Li and Zhu, Hongzi
关键词: IoT architecture, description language, function plug-and-play

Abstract

Versatile Internet of Things (IoT) applications call for re-configurable IoT devices that can easily extend new functionality on demand. However, the heterogeneity of functional chips brings difficulties in device customization, leading to inadequate flexibility. In this paper, we propose LEGO, a novel architecture for chip-level re-configurable IoT devices that supports plug-and-play with Commercial Off-The-Shelf (COTS) chips. To combat the heterogeneity of functional chips, we first design a novel Unified Chip Description Language (UCDL) with meta-operations and chip specifications to access various types of functional chips uniformly. Then, to achieve chip plug-and-play, we build a novel platform and shift all chip control logic to the gateway, which makes IoT devices entirely decoupled from specific applications and requires no changes when new functional chips are plugged in. Finally, to handle communication overhead, we build a novel orchestration architecture for gateway instructions, which minimizes instruction transmission frequency in remote chip control. We implement the prototype and conduct extensive evaluations with 100+ types of COTS functional chips. The results show that new functional chips can be automatically accessed by the system within 0.13 seconds after being plugged in, while bringing only 0.53 kb of communication load on average, demonstrating the efficacy of the LEGO design.
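
To give a feel for what a chip description plus meta-operations might look like, here is a hypothetical UCDL-style entry and a gateway-side interpreter. The schema, register addresses, and driver interface are assumptions for illustration, not LEGO's actual UCDL grammar.

```python
# A made-up UCDL-like specification for an I2C temperature sensor:
# meta-operations are sequences of primitive bus transactions that the
# gateway replays, so the device itself stays application-agnostic.
TEMP_SENSOR_SPEC = {
    "chip": "generic-temp-sensor",
    "bus": "i2c",
    "address": 0x48,
    "meta_ops": {
        "read_temp": [("write", [0x00]),   # point at temperature register
                      ("read", 2)],        # read two bytes back
    },
    "decode": {"read_temp": lambda raw: (raw[0] << 8 | raw[1]) / 256.0},
}

def gateway_invoke(spec, op, bus_driver):
    """Gateway-side interpreter: replay a meta-op on the real bus.
    bus_driver(kind, address, arg) is a hypothetical driver hook that
    performs one primitive transaction and returns bytes for reads."""
    raw = b""
    for kind, arg in spec["meta_ops"][op]:
        result = bus_driver(kind, spec["address"], arg)
        if kind == "read":
            raw += result
    return spec["decode"][op](raw)

# Example with a fake bus driver returning a canned reading (0x1A80 -> 26.5)
fake = lambda kind, addr, arg: b"\x1a\x80" if kind == "read" else None
print(gateway_invoke(TEMP_SENSOR_SPEC, "read_temp", fake))  # 26.5
```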

DOI: 10.1145/3582016.3582050


Mapping Very Large Scale Spiking Neuron Network to Neuromorphic Hardware

作者: Jin, Ouwen and Xing, Qinghui and Li, Ying and Deng, Shuiguang and He, Shuibing and Pan, Gang
关键词: Network on chip (NOC), Neuromorphic computing, Spiking Neural Networks (SNN), mapping

Abstract

Neuromorphic hardware is a multi-core computer system specifically designed to run Spiking Neural Network (SNN) applications. As the scale of neuromorphic hardware increases, it becomes very challenging to efficiently map a large SNN to hardware. In this paper, we propose an efficient approach to map very large scale SNN applications to neuromorphic hardware, aiming to reduce energy consumption, spike latency, and on-chip network communication congestion. The approach consists of two steps. First, it computes an initial placement using the Hilbert curve, a space-filling curve with unique properties that are particularly suitable for mapping SNNs. Second, a Force Directed (FD) algorithm is developed to optimize the initial placement. The FD algorithm formulates the connections between clusters as tension forces, thus converting local placement optimization into a force analysis problem. The proposed approach is evaluated at a scale of 4 billion neurons, more than 200 times larger than previous research. The results show that our approach achieves state-of-the-art performance, significantly exceeding existing approaches.
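
The Hilbert-curve initial placement step can be illustrated with the classic index-to-coordinate construction: consecutive neuron clusters land on physically adjacent NoC cores, preserving locality before the force-directed refinement. The mesh size and cluster ordering below are assumptions; only the d2xy mapping itself is the standard textbook algorithm.

```python
def hilbert_d2xy(order: int, d: int):
    """Map a 1-D Hilbert index d to (x, y) on a 2^order x 2^order grid
    (classic iterative construction)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:              # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Initial placement: cluster i goes to core hilbert_d2xy(order, i), so
# consecutive (densely connected) clusters sit on nearby cores.
order = 5                         # hypothetical 32 x 32 core mesh
placement = {c: hilbert_d2xy(order, c) for c in range(1 << (2 * order))}
```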

DOI: 10.1145/3582016.3582038


Mosaic Pages: Big TLB Reach with Small Pages

作者: Gosakan, Krishnan and Han, Jaehyun and Kuszmaul, William and Mubarek, Ibrahim N. and Mukherjee, Nirjhar and Sriram, Karthik and Tagliavini, Guido and West, Evan and Bender, Michael A. and Bhattacharjee, Abhishek and Conway, Alex and Farach-Colton, Martin and Gandhi, Jayneel and Johnson, Rob and Kannan, Sudarsun and Porter, Donald E.
关键词: address translation, gem5, hashing, linux, paging, TLB, verilog, virtual memory

Abstract

There are three artifacts for this paper: a Gem5 model to reproduce Figure 4, a modified Linux kernel to reproduce Tables 3 and 4, and Verilog code to reproduce Table 5. The Linux artifact includes scripts to set up a KVM environment with the Mosaic and vanilla Linux kernels. The artifact also includes scripts to run the Linux workloads in a VM and a script to generate the tables.

DOI: 10.1145/3582016.3582021


MP-Rec: Hardware-Software Co-design to Enable Multi-path Recommendation

作者: Hsia, Samuel and Gupta, Udit and Acun, Bilge and Ardalani, Newsha and Zhong, Pan and Wei, Gu-Yeon and Brooks, David and Wu, Carole-Jean
关键词: Deep Learning, Hardware-Software Co-Design, Recommender Systems

Abstract

Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of contents. The reliance on a fixed embedding representation not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec, a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms. On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to 16.65×…
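
A minimal sketch of the multi-path idea, assuming a per-batch choice of (embedding representation, hardware platform) that maximizes quality within a latency SLO; the latency and quality numbers are invented for illustration and are not from the paper.

```python
# Hypothetical serving paths: each pairs an embedding representation
# with a hardware platform. Per request batch, pick the highest-quality
# path that still meets the tail-latency target.
PATHS = [
    # (representation, platform, est_latency_ms_per_query, quality)
    ("table",  "gpu", 0.8, 0.90),
    ("hybrid", "tpu", 1.2, 0.93),
    ("hybrid", "ipu", 1.0, 0.93),
]

def pick_path(batch_size: int, latency_slo_ms: float):
    feasible = [p for p in PATHS if p[2] * batch_size <= latency_slo_ms]
    if not feasible:
        return min(PATHS, key=lambda p: p[2])   # degrade to the fastest path
    return max(feasible, key=lambda p: p[3])    # best quality within SLO

print(pick_path(batch_size=4, latency_slo_ms=5.0))   # hybrid path fits
print(pick_path(batch_size=16, latency_slo_ms=5.0))  # falls back to fastest
```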

DOI: 10.1145/3582016.3582068


NosWalker: A Decoupled Architecture for Out-of-Core Random Walk Processing

作者: Wang, Shuke and Zhang, Mingxing and Yang, Ke and Chen, Kang and Ma, Shaonan and Jiang, Jinlei and Wu, Yongwei
关键词: graph processing, out-of-core, random walk

Abstract

Out-of-core random walk systems have recently attracted a lot of attention as an economical way to run billions of walkers over large graphs. However, existing out-of-core random walk systems are all built upon general out-of-core graph processing frameworks, and hence do not take advantage of the unique properties of random walk applications. Different from traditional graph analysis algorithms, the sampling process of random walk can be decoupled from the processing of the walkers. This enables the system to keep only pre-sample results in memory, which are typically much smaller than the entire edge set. Moreover, in random walk, it is not the number of walkers but the number of steps moved per second that dominates the overall performance. Thus, with independent walkers, there is no need to process all the walkers simultaneously. In this paper, we present NosWalker, an out-of-core random walk system that replaces graph-oriented scheduling with a decoupled system architecture that provides walker-oriented scheduling. NosWalker is able to adaptively generate walkers and flexibly adjust the distribution of reserved pre-sample results in memory. Instead of processing all the walkers at once, NosWalker strives only to keep a few walkers continuously moving forward. Experimental results show that NosWalker can achieve up to two orders of magnitude speedup compared to state-of-the-art out-of-core random walk systems. In particular, NosWalker demonstrates superior performance when the memory capacity can only hold about 10%-50% of the graph data, which is a common case when users need to run billions of walkers over large graphs.
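
Walker-oriented scheduling over pre-sampled neighbors might look like the toy sketch below, where a walker advances only while the pre-samples for its current vertex are resident in memory. The reservoir layout and the synchronous refill are simplifying assumptions; the real system refills asynchronously and rebalances which vertices stay resident.

```python
import random

# vertex -> pre-sampled out-neighbors; only a memory-sized slice is resident
disk_graph = {0: [1, 2], 1: [2], 2: [0, 1]}   # toy on-"disk" graph
presamples = {0: disk_graph[0]}               # initially resident slice

def refill(vertex):
    """Stand-in for the I/O path that pre-samples a vertex from disk."""
    presamples[vertex] = disk_graph[vertex]

def try_step(walker):
    """Advance one walker by one step if its vertex is resident."""
    pool = presamples.get(walker["at"])
    if pool is None:
        refill(walker["at"])                  # in NosWalker this is async
        return False                          # walker yields, others proceed
    walker["at"] = random.choice(pool)
    walker["steps"] += 1
    return True

walkers = [{"at": 0, "steps": 0, "length": 8} for _ in range(4)]
while any(w["steps"] < w["length"] for w in walkers):
    for w in walkers:
        if w["steps"] < w["length"]:
            try_step(w)
```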

DOI: 10.1145/3582016.3582025


Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores

作者: Zhang, Zhongcheng and Ou, Yan and Liu, Ying and Wang, Chenxi and Zhou, Yongbin and Wang, Xiaoyu and Zhang, Yuyang and Ouyang, Yucheng and Shan, Jiahao and Wang, Ying and Xue, Jingling and Cui, Huimin and Feng, Xiaobing
关键词: Architecture, Auto Vectorization, Simd

Abstract

SIMD extensions are widely adopted in multi-core processors to exploit data-level parallelism. However, when co-running workloads on different cores, compute-intensive workloads cannot take advantage of the underutilized SIMD lanes allocated to memory-intensive workloads, reducing the overall performance. This paper proposes Occamy, a SIMD co-processor that can be shared by multiple CPU cores, so that their co-running workloads can spatially share its SIMD lanes. The key idea is to enable elastic spatial sharing by dynamically partitioning all the SIMD lanes across different workloads based on their phase behaviors, so that each workload may execute in variable-length SIMD mode. We also introduce an Occamy compiler to support such variable-length vectorization by analyzing these phase behaviors and generating vectorized code that works with varying vector lengths. We demonstrate that Occamy can improve SIMD utilization, and consequently, performance over three representative SIMD architectures, with negligible chip area cost.
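
As a toy illustration of elastic lane sharing, the sketch below redivides SIMD lanes in proportion to each workload's measured vector-phase intensity, so a memory-bound phase cedes lanes to a compute-bound one. The proportional policy and the per-workload lane floor are assumptions for illustration, not Occamy's hardware algorithm.

```python
TOTAL_LANES = 64
MIN_LANES = 8   # floor so every workload keeps some vector width

def partition_lanes(phase_intensity):
    """phase_intensity: workload -> fraction of cycles spent in vector phases."""
    total = sum(phase_intensity.values()) or 1.0
    alloc = {w: max(MIN_LANES, int(TOTAL_LANES * v / total))
             for w, v in phase_intensity.items()}
    # Trim any overshoot introduced by the per-workload floor.
    while sum(alloc.values()) > TOTAL_LANES:
        alloc[max(alloc, key=alloc.get)] -= 1
    return alloc

print(partition_lanes({"compute_heavy": 0.9, "memory_bound": 0.1}))
# -> {'compute_heavy': 56, 'memory_bound': 8}
```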

DOI: 10.1145/3582016.3582046


Persistent Memory Disaggregation for Cloud-Native Relational Databases

作者: Ruan, Chaoyi and Zhang, Yingqiang and Bi, Chao and Ma, Xiaosong and Chen, Hao and Li, Feifei and Yang, Xinjun and Li, Cheng and Aboulnaga, Ashraf and Xu, Yinlong
关键词: cloud-native database, memory disaggregation, persistent memory

Abstract

The recent emergence of commodity persistent memory (PM) hardware has altered the landscape of the storage hierarchy. It brings multi-fold benefits to database systems, with its large capacity, low latency, byte addressability, and persistence. However, PM has not been incorporated into the popular disaggregated architecture of cloud-native databases.

In this paper, we present PilotDB, a cloud-native relational database designed to fully utilize disaggregated PM resources. PilotDB possesses a new disaggregated DB architecture that allows compute nodes to be computation-heavy yet data-light, enabled by the large buffer pools and fast data persistence offered by remote PMs. We then propose a suite of novel mechanisms to facilitate RDMA-friendly remote PM accesses and minimize operations involving CPUs on the computation-light PM nodes. In particular, PilotDB adopts a novel compute-node-driven log organization that reduces network/PM bandwidth consumption, and a log-pull design that enables fast, optimistic remote PM reads that aggressively bypass the remote PM node CPUs. Evaluation with both standard SQL benchmarks and a real-world production workload demonstrates that PilotDB (1) achieves excellent performance compared to the best-performing baseline using local, high-end resources, (2) significantly outperforms a state-of-the-art DRAM-disaggregation system and the PM-disaggregation solution adapted from it, (3) enables faster failure recovery and cache buffer warm-up, and (4) offers superior cost-effectiveness.

DOI: 10.1145/3582016.3582055


Code for Article ‘PipeSynth: Automated Synthesis of Microarchitectural Axioms for Memory Consistency’

作者: Norman, Chase and Godbole, Adwait and Manerkar, Yatin A.
关键词: formal methods, memory consistency, microarchitecture, synthesis

Abstract

This is the code for the ASPLOS 2023 paper ‘PipeSynth: Automated Synthesis of Microarchitectural Axioms for Memory Consistency’.

DOI: 10.1145/3582016.3582056


Protect the System Call, Protect (Most of) the World with BASTION

作者: Jelesnianski, Christopher and Ismail, Mohannad and Jang, Yeongjin and Williams, Dan and Min, Changwoo
关键词: Argument Integrity, Code Re-use Attacks, Exploit Mitigation, System Call Protection, System Call Specialization, System Calls

Abstract

System calls are a critical building block in many serious security attacks, such as control-flow hijacking and privilege escalation attacks. Security-sensitive system calls (e.g., execve, mprotect) play an especially important role in completing attacks. Yet, few defense efforts focus on ensuring their legitimate usage, allowing attackers to maliciously leverage system calls in attacks. In this paper, we propose a novel security property, System Call Integrity, which enforces the correct use of system calls throughout runtime. We propose three new contexts enforcing (1) which system call is called and how it is invoked (Call Type), (2) how a system call is reached (Control Flow), and (3) that arguments are not corrupted (Argument Integrity). Our defense mechanism thwarts attacks by breaking the critical building block in their attack chains. We implement BASTION as a compiler and runtime monitor system to demonstrate the efficacy of the three system call contexts. Our security case study shows that BASTION can effectively stop all the attacks examined, including real-world exploits and recent advanced attack strategies. Deploying BASTION on three popular system call-intensive programs, NGINX, SQLite, and vsFTPd, we show BASTION is secure and practical, with overheads of 0.60%, 2.01%, and 1.65%, respectively.
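
The three contexts can be pictured as a monitor-side check like the sketch below. The allow-list tables and the hash-based argument check are illustrative assumptions; the real system establishes these contexts through compiler instrumentation plus a runtime monitor, not a Python lookup.

```python
# Schematic System Call Integrity check: a syscall is admitted only if
# (1) its call site and invocation style match the allow-list (Call Type),
# (2) the path that reached it was legitimate (Control Flow), and
# (3) its arguments were not corrupted in flight (Argument Integrity).
ALLOWED = {
    # syscall -> set of (call_site, expected_call_type); hypothetical values
    "execve":   {("spawn_worker+0x1c", "direct")},
    "mprotect": {("jit_commit+0x40", "direct")},
}

def check_syscall(name, call_site, call_type, path_ok, args, arg_hash):
    if (call_site, call_type) not in ALLOWED.get(name, set()):
        return False      # Call Type context violated
    if not path_ok:
        return False      # Control Flow context violated
    if hash(tuple(args)) != arg_hash:
        return False      # Argument Integrity context violated
    return True
```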

DOI: 10.1145/3582016.3582066


Re-architecting I/O Caches for Emerging Fast Storage Devices

作者: Ajdari, Mohammadamin and Peykani Sani, Pouria and Moradi, Amirhossein and Khanalizadeh Imani, Masoud and Bazkhanei, Amir Hossein and Asadi, Hossein
关键词: All-Flash Storage System, I/O Cache, Performance, Persistent Memory, RAM Disk, Solid-State Drive, Storage Area Network

Abstract

I/O caching has been widely used in enterprise storage systems to enhance system performance at minimal cost. Using Solid-State Drives (SSDs) as an I/O caching layer on top of arrays of Hard Disk Drives (HDDs) has been examined in numerous studies. With the emergence of ultra-fast storage devices, recent studies suggest using them as an I/O cache layer on top of mainstream SSDs in I/O-intensive applications. Our detailed analysis shows that, despite the significant potential of ultra-fast storage devices, existing I/O cache architectures may act as a major performance bottleneck in enterprise storage systems, preventing systems from realizing the devices' full performance potential. In this paper, using an enterprise-grade all-flash storage system, we first present a thorough analysis of the performance of I/O cache modules when ultra-fast memories are used as a caching layer on top of mainstream SSDs. Unlike traditional SSD-based caching on HDD arrays, we show that using ultra-fast memory as an I/O cache device on SSD arrays exhibits completely unexpected performance behavior. As an example, we show two popular cache architectures exhibit similar throughput due to a performance bottleneck on traditional SSD/HDD devices, but with ultra-fast memory on SSD arrays, their true potential is released and they show 5×…

DOI: 10.1145/3582016.3582041


File System for Reconfigurable Fabrics Codebase

作者: Landgraf, Joshua and Giordano, Matthew and Yoon, Esther and Rossbach, Christopher J.
关键词: FPGAs, Operating Systems, Virtual Memory, Virtualization

Abstract

This artifact contains the software and hardware code for FSRF, File System for Reconfigurable Fabrics, along with corresponding scripts and documentation for preparing test data, setting up the system, and running workloads and experiments.

DOI: 10.1145/3582016.3582048


RepCut: Superlinear Parallel RTL Simulation with Replication-Aided Partitioning

作者: Wang, Haoyuan and Beamer, Scott
关键词: Parallel RTL Simulation, RepCut

Abstract

This package contains the artifact for RepCut: Superlinear Parallel RTL Simulation with Replication-Aided Partitioning, DOI 10.1145/3582016.3582034.

This artifact contains the source code for RepCut, as well as other open source projects that are required to reproduce the results in the paper. We include Verilator 4.226 as a baseline. In addition, this artifact also contains scripts and a Makefile to compile and run the generated simulators, as well as to reproduce every figure and table from experiment data.

Please find more details in README.md

DOI: 10.1145/3582016.3582034


Code for Rosebud, Making FPGA-Accelerated Middlebox Development More Pleasant

作者: Khazraee, Moein and Forencich, Alex and Papen, George C. and Snoeren, Alex C. and Schulman, Aaron
关键词: 200G, FPGA, Hardware-Software Co-design, Middlebox

Abstract

This artifact contains the code to generate the Rosebud image, alongside code for simulation and runtime development. It also describes how new applications can be accelerated using this framework.

DOI: 10.1145/3582016.3582067


Simulator Independent Coverage for RTL Hardware Languages - Software Artifact

作者: Laeufer, Kevin and Iyer, Vighnesh and Biancolin, David and Bachrach, Jonathan and Nikolić, Borivoje
关键词: Chisel, ChiselTest, FireSim, FIRRTL, FPGA, FSM Coverage, Hardware Compiler, Line Coverage, RTL, Toggle Coverage

Abstract

The code to reproduce results from our ASPLOS’23 paper on “Simulator Independent Coverage for RTL Hardware Languages”. Most results can be reproduced on a standard x86 Linux computer; however, the FireSim performance and area/frequency results require a more complicated setup on AWS cloud FPGAs. Please consult the Readme.md in simulator-independent-coverage.tar.gz for more instructions.

DOI: 10.1145/3582016.3582019


Skybox: Open-Source Graphic Rendering on Programmable RISC-V GPUs

作者: Tine, Blaise and Saxena, Varun and Srivatsan, Santosh and Simpson, Joshua R. and Alzammar, Fadi and Cooper, Liam and Kim, Hyesoon
关键词: graphics, hardware accelerator, microarchitecture

Abstract

Graphics rendering remains one of the most compute-intensive and memory-bound applications of GPUs and has been driving their push for performance and energy efficiency since their inception. Early GPU architectures focused only on accelerating graphics rendering and implemented dedicated fixed-function rasterizer hardware to speed up their rendering pipeline. As GPUs have become more programmable and ubiquitous in other application domains such as scientific computing, machine learning, graph analytics, and cryptocurrency, generalizing GPU microarchitectures for area and power efficiency becomes necessary, especially for mobile and IoT devices. In this work, we present Skybox, a full-stack open-source GPU architecture with integrated software, compiler, hardware, and simulation environment that enables end-to-end GPU research. Using Skybox, we explore the design space of software versus hardware graphics rendering and propose a hybrid microarchitecture that accelerates the state-of-the-art Vulkan graphics API. Skybox also introduces novel compiler and system optimizations to support its unique RISC-V ISA baseline. We evaluated Skybox on high-end Altera and Xilinx FPGAs. We were able to generate and execute a 32-core (512-thread) Skybox graphics processor on an Altera Stratix 10 FPGA, delivering a peak fill rate of 3.7 GPixels/s at 230 MHz. Skybox is the first open-source full-stack GPU software and hardware implementation that supports the Vulkan API.

DOI: 10.1145/3582016.3582024


Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs

作者: Yang, Fangkai and Wang, Lu and Xu, Zhenyu and Zhang, Jue and Li, Liqun and Qiao, Bo and Couturier, Camille and Bansal, Chetan and Ram, Soumya and Qin, Si and Ma, Zhen and Goiri, Íñigo
关键词: Spot virtual machine, dynamic mixture, eviction prediction

Abstract

Cloud providers often have resources that are not being fully utilized, and they may offer them at a lower cost to make up for the reduced availability of these resources. However, customers may be hesitant to use such offerings (such as spot VMs) as making trade-offs between cost and resource availability is not always straightforward. In this work, we propose Snape (Spot On-demand Perfect Mixture), an intelligent framework to optimize the cost and resource availability by dynamically mixing on-demand VMs with spot VMs. Through a detailed characterization based on real production traces, we verify that the eviction of spot VMs is predictable to some extent. Snape also leverages constrained reinforcement learning to adjust the mixture policy online. Experiments across different configurations show that Snape achieves 44% savings compared to using only on-demand VMs while maintaining 99.96% availability, which is 2.77% higher than using only spot VMs.
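
A closed-form caricature of the mixture decision, assuming on-demand capacity is sized to cover the predicted spot eviction rate plus a safety margin; Snape itself learns this policy online with constrained reinforcement learning, so the rule below is only a simplification for illustration.

```python
# Hypothetical static mixture rule: keep enough on-demand VMs to absorb
# predicted spot evictions so availability holds, fill the rest with
# cheaper spot VMs. The margin value is an assumption.
import math

def plan_mixture(demand_vms: int, predicted_eviction_rate: float,
                 safety_margin: float = 0.05) -> dict:
    covered = predicted_eviction_rate + safety_margin
    on_demand = math.ceil(demand_vms * covered)
    return {"on_demand": on_demand, "spot": demand_vms - on_demand}

print(plan_mixture(demand_vms=100, predicted_eviction_rate=0.12))
# -> roughly 17 on-demand VMs, the rest spot
```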

DOI: 10.1145/3582016.3582028


Artifact for Article “Space Efficient TREC for Enabling Deep Learning on Microcontrollers”

作者: Liu, Jiesong and Zhang, Feng and Guan, Jiawei and Sung, Hsin-Hsuan and Guo, Xiaoguang and Du, Xiaoyong and Shen, Xipeng
关键词: Compiler Optimization, Real-time Machine Learning

Abstract

This directory contains the artifact for Space Efficient TREC for Enabling Deep Learning on Microcontrollers published in ASPLOS 2023. For detailed information, please see the readme.md in TREC-Artifact.zip.

DOI: 10.1145/3582016.3582062


SparseTIR Artifact

作者: Ye, Zihao and Lai, Ruihang and Shao, Junru and Chen, Tianqi and Ceze, Luis
关键词: Deep-Learning-Compilers, Sparse-Computation, Tensor-Compilers

Abstract

This repository contains scripts for setting up environments and reproducing the results presented in the ASPLOS 2023 paper entitled SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning.

Please read the README.md file or visit https://github.com/uwsampl/sparsetir-artifact for instructions on how to run and install this artifact.

DOI: 10.1145/3582016.3582047


SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development

作者: Tan, Zujun and Chon, Yebin and Kruse, Michael and Doerfert, Johannes and Xu, Ziyang and Homerding, Brian and Campanoni, Simone and August, David I.
关键词: ASPLOS’23, Automatic Parallelization, Decompiler

Abstract

The artifact for this paper contains tools and data to reproduce, with minimal effort, the entire testing flow and corroborate its claims. All results can be generated from scratch (from source code) and run across different platforms with the provided Docker image. The pre-built Docker image supports runs across different platforms with software dependencies taken care of, including a pre-compiled copy of the proposed decompiler, its variants, state-of-the-art decompilers used for comparison, and miscellaneous software such as vim and python. We provide an easy top-level interface to simplify the testing process.

DOI: 10.1145/3582016.3582058


TeraHeap: Reducing Memory Pressure in Managed Big Data Frameworks

作者: Kolokasis, Iacovos G. and Evdorou, Giannos and Akram, Shoaib and Kozanitis, Christos and Papagiannis, Anastasios and Zakkak, Foivos S. and Pratikakis, Polyvios and Bilas, Angelos
关键词: fast storage devices, garbage collection, Java Virtual Machine (JVM), large analytics datasets, large managed heaps, memory hierarchy, memory management, serialization

Abstract

How to Access. All scripts are available in the GitHub repository https://github.com/CARV-ICS-FORTH/asplos2023_ae. All sources, including JVM, frameworks, and benchmarks, are included as public git submodules. Also, the artifact is available at https://doi.org/10.5281/zenodo.7590151.

Hardware Dependencies. We recommend a dual-socket server equipped with two Intel(R) Xeon(R) E5-2630 v3 CPUs running at 2.4 GHz, each with eight physical cores and 16 hyper-threads, for a total of 32 hyper-threads. The server should have at least 128 GB of DRAM. We recommend using two 1 TB Samsung PM983 PCI Express NVMe SSDs and an HDD (larger than 1.5 TB) to hold the datasets. For the evaluation with NVM, we use a dual-socket server with two Intel Xeon Platinum 8260M CPUs at 2.4 GHz, each with 24 cores (96 hyper-threads in total), and 192 GB of DDR4 DRAM. We use Intel Optane DC Persistent Memory with a total capacity of 3 TB, of which 1 TB is in Memory mode and 2 TB are in AppDirect mode.

Software Dependencies. The compilation environment and the provided scripts assume CentOS 7 with Linux kernel v3.10 and v4.14.

Data Sets. The required datasets for Spark workloads (except BC) are automatically generated using the SparkBench suite dataset generator. The dataset will be generated when executing the specific scripts to run Spark workloads. The datasets for Spark-BC and Giraph workloads are downloaded automatically before each workload execution.

DOI: 10.1145/3582016.3582045


Reproduction Package for ‘The Sparse Abstract Machine’

作者: Hsu, Olivia and Strange, Maxwell and Sharma, Ritvik and Won, Jaeyeon and Olukotun, Kunle and Emer, Joel S. and Horowitz, Mark A. and Kjølstad, Fredrik
关键词: abstract machine, compilers, cycle-approximate modeling, sparse tensor algebra, streaming dataflow

Abstract

This artifact describes how to set up and run our Sparse Abstract Machine (SAM) Python simulator and the C++ CUSTARD compiler, which compiles from concrete index notation (CIN) to SAM graphs (represented and stored in the DOT file format). The artifact also describes how to reproduce the quantitative experimental results in this paper. The artifact can be executed on any x86-64 or M-series Apple machine with Docker support and Python 3, at least 32 GB of RAM, and more than 20 GB of disk space.

Additionally, all instructions and dependencies for using the artifact are contained in the artifact appendix of the paper.

DOI: 10.1145/3582016.3582051


Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale

作者: Duraisamy, Padmapriya and Xu, Wei and Hare, Scott and Rajwar, Ravi and Culler, David and Xu, Zhiyi and Fan, Jianing and Kennelly, Christopher and McCloskey, Bill and Mijailovic, Danijela and Morris, Brian and Mukherjee, Chiranjit and Ren, Jingliang and Thelen, Greg and Turner, Paul and Villavieja, Carlos and Ranganathan, Parthasarathy and Vahdat, Amin
关键词: Memory Management, Memory Tiering, Warehouse-Scale Computing

Abstract

Fast DRAM increasingly dominates infrastructure spend in large-scale computing environments, and this trend will likely worsen without an architectural shift. The cost of deployed memory can be reduced by replacing part of the conventional DRAM with lower-cost albeit slower memory media, thus creating a tiered memory system where both tiers are directly addressable and cached. But this poses numerous challenges in a highly multi-tenant warehouse-scale computing setting. The diversity and scale of its applications motivate an application-transparent solution in the general case, adaptable to specific workload demands.

This paper presents TMTS (Transparent Memory Tiering System), an application-transparent memory tiering management system that implements an adaptive, hardware-guided architecture to dynamically optimize access to the various directly-addressed memory tiers without faults. TMTS has been deployed at scale for two years, serving thousands of production services and successfully meeting service level objectives (SLOs) across diverse application classes in the fleet. The solution is developed in terms of the system-level metrics it seeks to optimize and evaluated across a diverse workload mix to guide the advanced policies embodied in a user-level agent. It sustains less than 5% overall performance degradation while replacing 25% of DRAM with a much slower medium.

DOI: 10.1145/3582016.3582031


TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory

作者: Maruf, Hasan Al and Wang, Hao and Dhanotia, Abhishek and Weiner, Johannes and Agarwal, Niket and Bhattacharya, Pallab and Petersen, Chris and Chowdhury, Mosharaf and Kanaujia, Shobhit and Chauhan, Prakash
关键词: CXL-Memory, Datacenters, Heterogeneous System, Memory Management, Operating Systems, Tiered-Memory

Abstract

The increasing demand for memory in hyperscale applications has led to memory becoming a large portion of the overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, the main memory can comprise different memory technologies with varied characteristics. In this paper, we characterize the memory usage patterns of a wide range of datacenter applications across the server fleet of Meta, and thereby demonstrate the opportunities to offload colder pages to slower memory tiers for these applications. Without efficient memory management, however, such systems can significantly degrade performance.

We propose a novel OS-level, application-transparent page placement mechanism (TPP) for CXL-enabled memory. TPP employs a lightweight mechanism to identify and place hot/cold pages in the appropriate memory tiers. It enables proactive page demotion from local memory to CXL-Memory. This technique ensures memory headroom for new page allocations, which are often related to request processing and tend to be short-lived and hot. At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow CXL-Memory to the fast local memory, while minimizing both sampling overhead and unnecessary migrations. TPP works transparently without any application-specific knowledge and can be deployed globally as a kernel release.

We evaluate TPP with diverse memory-sensitive workloads in the production server fleet with early samples of new x86 CPUs with CXL 1.1 support. TPP makes a tiered memory system nearly as performant as an ideal baseline (<1% gap) that has all the memory in the local tier. It is 18% better than today’s Linux and 5-17% better than existing solutions, including NUMA Balancing and AutoTiering. Most of the TPP patches have been merged into the Linux v5.18 release, while the remaining ones are pending further discussion.
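
A user-space caricature of TPP's two mechanisms: proactive demotion to maintain local-tier headroom for new allocations, and prompt promotion of hot pages trapped in CXL-Memory. The capacities, headroom value, and LRU stand-in are assumptions for illustration; the kernel implementation is driven by reclaim and NUMA hinting faults rather than explicit calls.

```python
from collections import OrderedDict

LOCAL_CAPACITY = 8
HEADROOM = 2                 # free local pages to maintain for new allocations
local = OrderedDict()        # page -> access count, kept in LRU order
cxl = {}                     # demoted pages living in the slow tier

def admit_local(page):
    local[page] = local.get(page, 0) + 1
    local.move_to_end(page)  # most recently used at the back
    # Proactive demotion: keep headroom for new, likely-hot allocations.
    while len(local) > LOCAL_CAPACITY - HEADROOM:
        cold, _ = local.popitem(last=False)   # evict the LRU page
        cxl[cold] = True

def touch(page):
    """Record an access; promote promptly if the page sits in the CXL tier."""
    if page in cxl:
        cxl.pop(page)
        admit_local(page)    # hot page escapes the slow tier
    elif page in local:
        local[page] += 1
        local.move_to_end(page)
    else:
        admit_local(page)    # fresh allocation lands in the fast tier

for p in range(1, 8):
    touch(p)                 # pages 1..7 admitted; coldest demoted for headroom
touch(1)                     # if 1 was demoted, it is promoted back promptly
```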

DOI: 10.1145/3582016.3582063


Reproduction Package for Article ‘Transparent Runtime Change Handling for Android Apps’

作者: Chen, Zizhan and Shao, Zili
关键词: embedded systems, mobile systems, operating systems

Abstract

The artifact provides the source code of RCHDroid, along with the instructions to generate the results. The artifact allows users to reproduce key results from the paper, including Figure 7, Figure 8, Figure 9, Figure 10, and Figure 14. The hardware must contain the ROC-RK3399-PC-PLUS development board connected to a screen. We provide compiled images to simplify the experiment workflow. Users can also build images from the source code.

DOI: 10.1145/3582016.3582060


Untangle: A Principled Framework to Design Low-Leakage, High-Performance Dynamic Partitioning Schemes

作者: Zhao, Zirui Neil and Morrison, Adam and Fletcher, Christopher W. and Torrellas, Josep
关键词: Microarchitectural side-channel defense, information leakage, resource partitioning

Abstract

Partitioning a hardware structure dynamically among multiple security domains leaks some information but can deliver high performance. To understand the performance-security tradeoff of dynamic partitioning, it would be useful to formally quantify the leakage of these schemes. Unfortunately, this is hard, as what partition resizing decisions are made and when they are made are entangled. In this paper, we present Untangle, a novel framework for constructing low-leakage and high-performance dynamic partitioning schemes. Untangle formally splits the leakage into leakage from deciding what resizing action to perform (action leakage) and leakage from deciding when the resizing action occurs (scheduling leakage). Based on this breakdown, Untangle introduces a set of principles that decouple program timing from the action leakage. Moreover, Untangle introduces a new way to model the scheduling leakage without analyzing program timing. With these techniques, Untangle quantifies the leakage in a dynamic resizing scheme more tightly than prior work. To demonstrate Untangle, we apply it to dynamically partition the last-level cache. On average, workloads leak 78% less under Untangle than under a conventional dynamic partitioning approach, for the same workload performance.

DOI: 10.1145/3582016.3582033


NQPV

作者: Feng, Yuan and Xu, Yingte
关键词: nondeterminism, program verification, quantum programming

Abstract

NQPV is a prototype verification assistant for nondeterministic quantum programs. It implements a verification logic of partial correctness in numerical form, with soundness guaranteed by the underlying theory and validated by experiments.

DOI: 10.1145/3582016.3582039


Reproduction Package for Article “Vidi: Record Replay for Reconfigurable Hardware”

作者: Zuo, Gefei and Ma, Jiacheng and Quinn, Andrew and Kasikci, Baris
关键词: Debugging, FPGA, Record Replay

Abstract

Source code is available at https://github.com/efeslab/aws-fpga

This artifact is prepared for ASPLOS 2023 artifact evaluation. Please refer to the artifact-eval/README.md for usage instructions.

DOI: 10.1145/3582016.3582040


