NvMR: non-volatile memory renaming for intermittent computing
Authors: Bhattacharyya, Abhishek and Somashekhar, Abhijith and Miguel, Joshua San
Keywords: energy-harvesting, idempotency, intermittent computing
Abstract
Intermittent systems on energy-harvesting devices have to frequently back up data to make forward progress under an unreliable energy supply. These devices come with non-volatile memories like Flash/FRAM on board that are used to back up the system state. However, quite paradoxically, writing to a non-volatile memory consumes so much energy that backups are expensive. Idempotency violations inherent to intermittent programs are major contributors to the problem, as they render system state inconsistent and force backups to occur even when plenty of energy is available. In this work, we first characterize the complex persist dependencies that are unique to intermittent computing. Based on these insights, we propose NvMR, an intermittent architecture that eliminates idempotency violations in the program by renaming non-volatile memory addresses. This can reduce the number of backups to their theoretical minimum and decouple the decision of when to perform backups from the memory access constraints imposed by the program. Our evaluations show that, compared to a state-of-the-art intermittent architecture, NvMR can save about 20% energy on average when running common embedded applications.
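The core renaming idea can be illustrated in a few lines. The following is a minimal, hypothetical Python sketch (names and structure are invented here, not NvMR's hardware design): a store to an address that was read since the last backup would create an idempotency violation on re-execution, so the write is redirected to a fresh physical slot instead of forcing a backup.

```python
# Minimal sketch of non-volatile memory renaming (illustrative only; the
# names and structure are hypothetical, not NvMR's actual hardware design).

class RenamedNVM:
    def __init__(self, num_slots):
        self.slots = [0] * num_slots      # physical non-volatile slots
        self.map = {}                     # logical address -> physical slot
        self.free = list(range(num_slots))
        self.read_since_backup = set()    # logical addresses read this epoch
        self.backups = 0

    def load(self, addr):
        self.read_since_backup.add(addr)
        return self.slots[self.map.setdefault(addr, self.free.pop())]

    def store(self, addr, value):
        if addr in self.read_since_backup:
            # Write-after-read on NVM: an idempotency violation if replayed.
            # Rename to a fresh slot so the old value survives re-execution,
            # instead of forcing an immediate backup. (A real design would
            # reclaim the old slot only after the next backup.)
            old = self.map.get(addr)
            self.map[addr] = self.free.pop()
            if old is not None:
                self.free.append(old)
        self.slots[self.map.setdefault(addr, self.free.pop())] = value

    def backup(self):                     # checkpoint; renaming made this optional
        self.backups += 1
        self.read_since_backup.clear()

nvm = RenamedNVM(16)
nvm.store(0, 5)
x = nvm.load(0)
nvm.store(0, x + 1)   # WAR hazard handled by renaming, no forced backup
print(nvm.load(0))    # -> 6
```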
Free atomics: hardware atomic operations without fences
Authors: Asgharzadeh, Ashkan and Cebrian, Juan M. and Perais, Arthur and Kaxiras, Stefanos and Ros, Alberto
Keywords: atomic read-modify-write instructions, microarchitecture, multi-core architectures, store-to-load forwarding, total-store-order (TSO)
Abstract
Atomic Read-Modify-Write (RMW) instructions are primitive synchronization operations implemented in hardware that provide the building blocks for higher-abstraction synchronization mechanisms to programmers. According to publicly available documentation, current x86 implementations serialize atomic RMW operations: the store buffer is drained before issuing an atomic RMW, and subsequent memory operations are stalled until the atomic RMW commits. This serialization, carried out by memory fences, incurs a performance cost which is expected to increase with deeper pipelines. This work proposes Free atomics, a lightweight, speculative, deadlock-free implementation of atomic operations that removes the need for memory fences, thus improving performance, while preserving atomicity and consistency. Free atomics is, to the best of our knowledge, the first proposal to enable store-to-load forwarding for atomic RMWs. Free atomics only requires simple modifications and incurs a small area overhead (15 bytes). Our evaluation using gem5-20 shows that, for a 32-core configuration, Free atomics improves performance by 12.5%, on average, for a large range of parallel workloads and by 25.2%, on average, for atomic-intensive parallel workloads, over a fenced atomic RMW implementation.
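To see why draining matters, consider a toy store-buffer model. This is an illustrative sketch under invented simplifications, not the paper's speculative hardware mechanism: the fenced RMW pays a drain cost, while the "free" RMW forwards from and retires through the store buffer.

```python
# Toy store-buffer model contrasting a fenced atomic RMW (drain + stall) with
# a "free" atomic RMW that forwards from / buffers into the store buffer.
# Purely illustrative; the real proposal is speculative, deadlock-free hardware.

class Core:
    def __init__(self):
        self.mem = {}                 # committed memory state
        self.sb = []                  # store buffer: list of (addr, value)
        self.stall_cycles = 0

    def load(self, addr):
        for a, v in reversed(self.sb):     # store-to-load forwarding
            if a == addr:
                return v
        return self.mem.get(addr, 0)

    def drain(self):
        self.stall_cycles += len(self.sb) # modeled cost of the fence
        for a, v in self.sb:
            self.mem[a] = v
        self.sb.clear()

    def fenced_fetch_add(self, addr, n):   # current x86-style implementation
        self.drain()
        old = self.mem.get(addr, 0)
        self.mem[addr] = old + n
        return old

    def free_fetch_add(self, addr, n):     # Free-atomics-style: no drain
        old = self.load(addr)              # may forward from the store buffer
        self.sb.append((addr, old + n))    # retires through the buffer
        return old

c = Core()
c.sb.append(("x", 10))
print(c.free_fetch_add("x", 1), c.load("x"))  # -> 10 11
print(c.stall_cycles)                          # -> 0: no fence was paid
```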
Securing GPU via region-based bounds checking
Authors: Lee, Jaewon and Kim, Yonghae and Cao, Jiashen and Kim, Euna and Lee, Jaekyu and Kim, Hyesoon
Keywords: GPU, memory safety
Abstract
Graphics processing units (GPUs) have become essential general-purpose computing platforms to accelerate a wide range of workloads, such as deep learning, scientific, and high-performance computing (HPC) applications. However, recent memory corruption attacks, such as buffer overflows, have exposed security vulnerabilities in GPUs. We demonstrate that out-of-bounds writes are reproducible on an Nvidia GPU, which can enable other security attacks. We propose GPUShield, a hardware-software cooperative region-based bounds-checking mechanism, to improve GPU memory safety for global, local, and heap memory buffers. To achieve effective protection, we update the GPU driver to assign a random but unique ID to each buffer and local variable and to store individual bounds information in a bounds table allocated in global memory. The proposed hardware performs efficient bounds checking by indexing the bounds table with these unique IDs. We further reduce the bounds-checking overhead by utilizing compile-time bounds analysis, workgroup/warp-level bounds checking, and a GPU-specific address mode. Our performance evaluations show that GPUShield incurs little performance degradation across 88 CUDA benchmarks on the Nvidia GPU architecture and 17 OpenCL benchmarks on the Intel GPU architecture with marginal hardware overhead.
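The ID-indexed bounds-table idea can be sketched in software. The following Python sketch uses invented names and does the check in software, whereas GPUShield performs it in hardware with driver support:

```python
# Illustrative sketch of region-based bounds checking with unique buffer IDs.
# All names are hypothetical; GPUShield implements the check in hardware.
import random

bounds_table = {}          # unique ID -> (base, size), kept in global memory

def alloc_buffer(base, size):
    buf_id = random.getrandbits(32)
    while buf_id in bounds_table:          # IDs must be unique
        buf_id = random.getrandbits(32)
    bounds_table[buf_id] = (base, size)
    return buf_id

def checked_access(buf_id, addr):
    base, size = bounds_table[buf_id]      # one table lookup per check
    if not (base <= addr < base + size):
        raise MemoryError(f"out-of-bounds access at {addr:#x}")
    return addr

bid = alloc_buffer(base=0x1000, size=256)
checked_access(bid, 0x1010)                # in bounds: fine
try:
    checked_access(bid, 0x1100)            # 0x1000 + 256 == 0x1100: OOB
except MemoryError as e:
    print(e)
```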
täkō: a polymorphic cache hierarchy for general-purpose optimization of data movement
Authors: Schwedock, Brian C. and Yoovidhya, Piratach and Seibert, Jennifer and Beckmann, Nathan
Keywords: cache hierarchy, data movement, data-centric computing
Abstract
Current systems hide data movement from software behind the load-store interface. Software’s inability to observe and respond to data movement is the root cause of many inefficiencies, including the growing fraction of execution time and energy devoted to data movement itself. Recent specialized memory-hierarchy designs prove that large data-movement savings are possible. However, these designs require custom hardware, raising a large barrier to their practical adoption. This paper argues that the hardware-software interface is the problem, and that custom hardware is often unnecessary with an expanded interface. The täkō design proposed here is such an interface: a polymorphic cache hierarchy that lets software observe and respond to data movement.
EQC: ensembled quantum computing for variational quantum algorithms
Authors: Stein, Samuel and Wiebe, Nathan and Ding, Yufei and Bo, Peng and Kowalski, Karol and Baker, Nathan and Ang, James and Li, Ang
Keywords: distributed computing, quantum computing, variational quantum algorithms
Abstract
The variational quantum algorithm (VQA), which comprises a classical optimizer and a parameterized quantum circuit, has emerged as one of the most promising approaches for harvesting the power of quantum computers in the noisy intermediate-scale quantum (NISQ) era. However, the deployment of VQAs on contemporary NISQ devices often faces considerable system and time-dependent noise and prohibitively slow training speeds. On the other hand, the expensive supporting resources and infrastructure make high utilization essential for quantum computers. In this paper, we propose a virtualized way of building up a quantum backend for variational quantum algorithms: rather than relying on a single physical device, which tends to introduce ever-changing device-specific noise and less reliable performance as time since calibration grows, we propose to constitute a quantum ensemble, which dynamically distributes quantum tasks asynchronously across a set of physical devices and adjusts the ensemble configuration with respect to machine status. In addition to reducing machine-dependent noise, the ensemble can provide significant speedups for VQA training. With this idea, we build a novel VQA training framework called EQC - a distributed gradient-based processor-performance-aware optimization system - that comprises: (i) a system architecture for asynchronous parallel VQA cooperative training; (ii) an analytical model for assessing the quality of a circuit output with respect to its architecture, transpilation, and runtime conditions; (iii) a weighting mechanism to adjust the quantum ensemble’s computational contribution according to the systems’ current performance. Evaluations comprising 500K circuit evaluations across 10 IBMQ NISQ devices using VQE and QAOA applications demonstrate that EQC can attain error rates very close to those of the most performant device of the ensemble, while boosting the training speed by 10.5X on average (up to 86X and at least 5.2X). EQC is available at https://github.com/pnnl/eqc.
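The weighting mechanism can be illustrated with a toy aggregation step. In this hypothetical sketch, the quality model (fidelity decayed by time since calibration) and all constants are invented for illustration; EQC's actual analytical model is richer.

```python
# Toy sketch of EQC's weighting idea (hypothetical formulas): gradients from
# noisier or staler devices contribute less to the parameter update.
import numpy as np

def device_weight(est_fidelity, hours_since_calibration):
    # Invented quality model: favor high fidelity, decay with staleness.
    return est_fidelity / (1.0 + 0.1 * hours_since_calibration)

def ensemble_update(theta, grads, fidelities, staleness, lr=0.1):
    w = np.array([device_weight(f, s) for f, s in zip(fidelities, staleness)])
    w /= w.sum()                             # normalize contributions
    g = sum(wi * gi for wi, gi in zip(w, grads))
    return theta - lr * g

theta = np.array([0.3, 0.7])
grads = [np.array([0.5, -0.2]), np.array([0.6, -0.1]), np.array([2.0, 1.0])]
fidelities = [0.95, 0.90, 0.60]              # the noisy device is down-weighted
staleness  = [1.0, 4.0, 20.0]
print(ensemble_update(theta, grads, fidelities, staleness))
```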
Axiomatic hardware-software contracts for security
Authors: Mosier, Nicholas and Lachnitt, Hanna and Nemati, Hamed and Trippel, Caroline
Keywords: hardware security, hardware-software contracts, memory consistency models, side-channel attacks, spectre
Abstract
We propose leakage containment models (LCMs)—novel axiomatic security contracts which support formally reasoning about the security guarantees of programs when they run on particular microarchitectures. Our core contribution is an axiomatic vocabulary for formalizing LCMs, derived from the established axiomatic vocabulary for formalizing processor memory consistency models. Using this vocabulary, we formalize microarchitectural leakage—focusing on leakage through hardware memory systems—so that it can be automatically detected in programs, and we provide a taxonomy for classifying said leakage by severity. To illustrate the efficacy of LCMs, we first demonstrate that our leakage definition faithfully captures a sampling of (transient and non-transient) microarchitectural attacks from the literature. Second, we develop a static analysis tool based on LCMs which automatically identifies Spectre vulnerabilities in programs and scales to analyze real-world crypto libraries.
PPMLAC: high performance chipset architecture for secure multi-party computation
Authors: Zhou, Xing and Xu, Zhilei and Wang, Cong and Gao, Mingyu
Keywords: MPC, hardware accelerator, privacy, privacy-preserving machine learning, secret sharing, security, side-channel protection
Abstract
Privacy concerns are a main factor restricting data sharing and cross-organization collaborations. While privacy-preserving machine learning techniques such as multi-party computation (MPC), homomorphic encryption, and federated learning have been proposed to solve this problem, no existing solution combines strong security with performance high enough to run large-scale, complex machine learning models. This paper presents PPMLAC, a novel chipset architecture that accelerates MPC, combining MPC's strong security with hardware's high performance. It eliminates the communication bottleneck from MPC and achieves several orders of magnitude speedup over software-based MPC. It is carefully designed to rely on only a minimal set of simple hardware components in the trusted domain, and is thus robust against side-channel attacks and malicious adversaries. Our FPGA prototype can run mainstream large-scale ML models like ResNet in near real-time under a practical network environment with non-negligible latency, which is impossible for existing MPC solutions.
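For readers unfamiliar with the underlying primitive, the following minimal sketch shows 2-party additive secret sharing, the flavor of MPC that PPMLAC accelerates; it is purely illustrative and unrelated to the chipset's internals.

```python
# Minimal 2-party additive secret sharing over a prime field -- the style of
# MPC that PPMLAC accelerates in hardware (this sketch is purely illustrative).
import random

P = 2**61 - 1                          # prime modulus for the shares

def share(x):
    r = random.randrange(P)            # party 0 holds r, party 1 holds x - r
    return r, (x - r) % P

def reconstruct(s0, s1):
    return (s0 + s1) % P

# Linear operations work share-wise without communication:
a0, a1 = share(12)
b0, b1 = share(30)
c0, c1 = (a0 + b0) % P, (a1 + b1) % P  # each party adds its own shares locally
print(reconstruct(c0, c1))             # -> 42; neither party ever saw 12 or 30
```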
INSPIRE: in-storage private information retrieval via protocol and architecture co-design
Authors: Lin, Jilan and Liang, Ling and Qu, Zheng and Ahmad, Ishtiyaque and Liu, Liu and Tu, Fengbin and Gupta, Trinabh and Ding, Yufei and Xie, Yuan
Keywords: in-storage computing, private information retrieval (PIR)
Abstract
Private Information Retrieval (PIR) plays a vital role in secure, database-centric applications. However, existing PIR protocols explore a massive working space containing hundreds of GiBs of query and database data. As a consequence, PIR performance is severely bounded by storage communication, making it far from practical for real-world deployment. In this work, we describe INSPIRE, an accelerator for IN-Storage Private Information REtrieval. INSPIRE follows a protocol and architecture co-design approach. We first design the INSPIRE protocol with a multi-stage filtering mechanism, which achieves a constant PIR query size. For a 1-billion-entry database of size 288GiB, INSPIRE's protocol reduces the query size from 27GiB to 3.6MiB. Further, we propose the INSPIRE hardware, a heterogeneous in-storage architecture, which integrates our protocol across the SSD hierarchy. Together with the INSPIRE protocol, the INSPIRE hardware reduces the query time from 28.4min to 36s, relative to the state-of-the-art FastPIR scheme.
TDGraph: a topology-driven accelerator for high-performance streaming graph processing
Authors: Zhao, Jin and Yang, Yun and Zhang, Yu and Liao, Xiaofei and Gu, Lin and He, Ligang and He, Bingsheng and Jin, Hai and Liu, Haikun and Jiang, Xinyu and Yu, Hui
Keywords: accelerator, incremental computation, many-core processor, state propagation, streaming graphs
Abstract
Many solutions have recently been proposed to support the processing of streaming graphs. However, when processing each graph snapshot of a streaming graph, the new states of the vertices affected by graph updates are propagated irregularly along the graph topology. Despite years of research efforts, existing approaches still suffer from serious redundant computation overhead and irregular memory accesses, which severely underutilize a many-core processor. To address these issues, this paper proposes TDGraph, a topology-driven programmable accelerator that is the first to augment many-core processors for high-performance streaming graph processing. Specifically, we propose an efficient topology-driven incremental execution approach in the accelerator design for more regular state propagation and better data locality. TDGraph takes the vertices affected by graph updates as roots to prefetch other vertices along the graph topology and synchronizes their incremental computations on the fly. In this way, most state propagations originating from the vertices affected by different graph updates can be conducted together along the graph topology, which helps reduce redundant computations and data access costs. Besides, through efficient coalescing of accesses to vertex states, TDGraph further improves the utilization of the cache and memory bandwidth. We have evaluated TDGraph on a simulated 64-core processor. The results show that the state-of-the-art software system achieves a speedup of 7.1~21.4 times after integrating TDGraph, while incurring only 0.73% area cost. Compared with four cutting-edge accelerators, i.e., HATS, Minnow, PHI, and DepGraph, TDGraph gains speedups of 4.6~12.7, 3.2~8.6, 3.8~9.7, and 2.3~6.1 times, respectively.
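The batched, topology-driven propagation can be sketched as follows. This toy Python model (not TDGraph's hardware) treats all update-affected vertices as roots of one shared traversal, so propagation work common to several updates is done once.

```python
# Toy sketch of topology-driven incremental propagation (illustrative; not
# TDGraph's hardware): vertices affected by graph updates become roots, and
# their state changes propagate together along the topology.
from collections import deque

def incremental_min_propagate(adj, dist, affected):
    frontier = deque(affected)              # all update roots start together
    while frontier:
        u = frontier.popleft()
        for v, w in adj[u]:
            if dist[u] + w < dist[v]:       # relax; only improved vertices spread
                dist[v] = dist[u] + w
                frontier.append(v)
    return dist

# Toy graph with stale shortest-path state after two edge insertions:
adj = {0: [(1, 1)], 1: [(2, 1)], 2: [], 3: [(1, 1)]}
dist = {0: 0, 1: 5, 2: 6, 3: 2}
print(incremental_min_propagate(adj, dist, affected=[0, 3]))
# -> {0: 0, 1: 1, 2: 2, 3: 2}; the propagation from root 3 is subsumed by
# root 0's, so the shared path through vertex 1 is traversed only once.
```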
DIMMining: pruning-efficient and parallel graph mining on near-memory-computing
Authors: Dai, Guohao and Zhu, Zhenhua and Fu, Tianyu and Wei, Chiyue and Wang, Bangyan and Li, Xiangyu and Xie, Yuan and Yang, Huazhong and Wang, Yu
Keywords: graph mining, near-memory-computing, systolic merge array
Abstract
Graph mining, which finds specific patterns in a graph, is becoming increasingly important in various domains. We point out that accelerating graph mining suffers from the following challenges: (1) Heavy comparison for pruning: pruning is widely used to reduce the search space in graph mining. It applies constraints on vertex indices and involves massive index comparisons. (2) Low parallelism of set operations: typical graph mining algorithms can be expressed as a series of set operations between neighbors of vertices, which suffer from low parallelism if vertices are streamed to the computation units. (3) Heavy data transfer: graph mining needs to transfer intermediate data two orders of magnitude larger than the original data volume between CPU and memory. To tackle these challenges, we propose DIMMining with four techniques spanning the algorithm and architecture perspectives. The Index Pre-comparison scheme is proposed for efficient pruning. We introduce the self anchor and neighbor partition to enable pre-comparison of vertex indices, thus reducing comparisons during runtime. We propose a Flexible BCSR (Bitmap with Compressed Sparse Row) format to enable parallel set operations from the data structure perspective, which works on continuous vertices without memory space overheads. The Systolic Merge Array is designed to further exploit parallelism on discontinuous vertices from the architecture perspective. Then, we propose a DIMM-based Near-Memory-Computing architecture, which eliminates the large-volume data transfer between computation and memory. Extensive experimental results on real-world graphs show that DIMMining achieves 222.23X and 139.51X speedup compared with FPGAs and CPUs, respectively, and 3.61X speedup over the state-of-the-art graph mining architecture.
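The bitmap-based set operations and index pre-comparison can be illustrated with a toy triangle-counting kernel. Names and structure here are invented; DIMMining realizes this in DIMM-side hardware.

```python
# Illustrative sketch of bitmap-based neighbor intersection with index
# pruning (names invented; DIMMining does this in near-memory hardware).

def to_bitmap(neighbors):
    bm = 0
    for v in neighbors:
        bm |= 1 << v
    return bm

def pruned_intersection(nbrs_u, nbrs_v, anchor):
    # Pruning constraint: only count common neighbors with index > anchor,
    # so each triangle is enumerated once instead of multiple times.
    mask = ~((1 << (anchor + 1)) - 1)        # pre-comparison as one bit mask
    common = to_bitmap(nbrs_u) & to_bitmap(nbrs_v) & mask
    return bin(common).count("1")

# Count triangles (u, v, w) with u < v < w on a toy graph:
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1], 3: [0, 1]}
tris = sum(pruned_intersection(adj[u], adj[v], anchor=v)
           for u in adj for v in adj[u] if u < v)
print(tris)  # -> 2 triangles: (0,1,2) and (0,1,3)
```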
NDMiner: accelerating graph pattern mining using near data processing
Authors: Talati, Nishil and Ye, Haojie and Yang, Yichen and Belayneh, Leul and Chen, Kuan-Yu and Blaauw, David and Mudge, Trevor and Dreslinski, Ronald
Keywords: graph pattern mining, hardware-software co-design, near data processing
Abstract
Graph Pattern Mining (GPM) algorithms mine structural patterns in graphs. The performance of GPM workloads is bottlenecked by control flow and memory stalls, because the data-dependent branches used in set intersection and difference operations dominate the execution time. This paper first conducts a systematic GPM workload analysis and uncovers four new observations that inform the optimization effort. First, GPM workloads mostly fetch the inputs of costly set operations from different memory banks. Second, to avoid redundant computation, modern GPM workloads employ symmetry breaking that discards several data reads, resulting in cache pollution and wasted DRAM bandwidth. Third, sparse pattern mining algorithms perform redundant memory reads and computations. Fourth, GPM workloads do not fully utilize in-DRAM data parallelism. Based on these observations, this paper presents NDMiner, a Near Data Processing (NDP) architecture that improves the performance of GPM workloads. To reduce the in-memory data transfer of fetching data from different memory banks, NDMiner integrates compute units that offload set operations into the buffer chip of DRAM. To alleviate the memory bandwidth wasted by symmetry breaking, NDMiner integrates a load elision unit in hardware that detects the satisfiability of symmetry-breaking constraints and terminates unnecessary loads. To optimize the performance of sparse pattern mining, NDMiner employs compiler optimizations and maps the reduced reads and composite computation to NDP hardware, improving the algorithmic efficiency of sparse GPM. Finally, NDMiner proposes a new graph remapping scheme in memory and a hardware-based set operation reordering technique to best optimize bank-, rank-, and channel-level parallelism in DRAM. To orchestrate NDP computation, this paper presents design modifications at the host ISA, compiler, and memory controller. We compare the performance of NDMiner with state-of-the-art software and hardware baselines using a mix of dense and sparse GPM algorithms. Our evaluation shows that NDMiner significantly outperforms software and hardware baselines by 6.4X and 2.5X, on average, while incurring a negligible area overhead on CPU and DRAM.
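The load elision idea can be sketched in a few lines: because symmetry-breaking constraints are decidable from vertex indices alone, a candidate's neighbor list need never be fetched when the constraint fails. Below is a hypothetical software rendering (NDMiner performs this check in a hardware unit).

```python
# Toy sketch of symmetry-breaking load elision (illustrative only).

def extend_embedding(emb, adj, fetched):
    # Try to extend a partial embedding by one vertex (triangle counting here).
    for cand in sorted(adj[emb[0]]):
        # Symmetry-breaking constraint: the candidate must exceed every vertex
        # already in the embedding. This is decidable from indices alone, so
        # when it fails, the candidate's neighbor list is never loaded.
        if cand <= max(emb):
            continue                        # load elided
        fetched.append(cand)                # only now fetch adj[cand] from DRAM
        if all(cand in adj[v] for v in emb):
            yield emb + [cand]

adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}
fetched = []
print(list(extend_embedding([0, 1], adj, fetched)))  # [[0, 1, 2], [0, 1, 3]]
print(fetched)   # [2, 3]: the fetch for candidate 1 was elided by its index
```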
SoftVN: efficient memory protection via software-provided version numbers
Authors: Umar, Muhammad and Hua, Weizhe and Zhang, Zhiru and Suh, G. Edward
Keywords: memory protection, trusted execution environment (TEE)
Abstract
Trusted execution environments (TEEs) in processors protect off-chip memory (DRAM) and ensure its confidentiality and integrity using memory encryption and integrity verification. However, such memory protection can incur significant performance overhead, as it requires additional memory accesses for protection metadata such as version numbers (VNs) and MACs. This paper proposes SoftVN, an extension to current memory protection schemes, which significantly reduces the overhead of today's state of the art by allowing software to provide VNs for memory accesses. For memory-intensive applications with simple memory access patterns for large data structures, VNs only need to be maintained for the data structures instead of individual cache blocks and can be tracked in software with low effort. Off-chip VN accesses for memory reads can be removed if VNs are tracked and provided by software. We evaluate SoftVN by simulating a diverse set of memory-intensive applications, including deep learning, graph processing, and bioinformatics algorithms. The experimental results show that SoftVN reduces the memory protection overhead by 82% compared to an SGX-like baseline and improves performance by 33% on average. The maximum performance improvement can be as high as 65%.
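A minimal sketch of the software-provided-VN idea follows, assuming a simple whole-structure write-phase access pattern; key handling and integrity checking are heavily simplified, and this is not SoftVN's actual design.

```python
# Sketch of counter-mode protection with a software-tracked version number.
import hashlib

KEY = b"on-chip-secret"  # stands in for the TEE's on-chip key (illustrative)

def pad(addr, vn):
    # Counter-mode pad derived from (key, address, version number); a given
    # (addr, vn) pair must never be reused for two different values.
    return hashlib.sha256(KEY + addr.to_bytes(8, "little")
                              + vn.to_bytes(8, "little")).digest()[:8]

class ProtectedArray:
    """One VN for the whole data structure, bumped once per write phase.
    Assumes each write phase rewrites all live blocks of the structure."""
    def __init__(self):
        self.vn = 0
        self.mem = {}                       # "off-chip" ciphertexts

    def write_phase(self, updates):
        self.vn += 1                        # software knows the access pattern
        for addr, val in updates.items():
            pt = val.to_bytes(8, "little")
            self.mem[addr] = bytes(a ^ b for a, b in zip(pt, pad(addr, self.vn)))

    def read(self, addr):
        # The VN is supplied by software: no off-chip VN fetch is needed.
        ct = self.mem[addr]
        pt = bytes(a ^ b for a, b in zip(ct, pad(addr, self.vn)))
        return int.from_bytes(pt, "little")

arr = ProtectedArray()
arr.write_phase({0: 111, 1: 222})
print(arr.read(1))                          # -> 222
```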
CraterLake: a hardware accelerator for efficient unbounded computation on encrypted data
Authors: Samardzic, Nikola and Feldmann, Axel and Krastev, Aleksandar and Manohar, Nathan and Genise, Nicholas and Devadas, Srinivas and Eldefrawy, Karim and Peikert, Chris and Sanchez, Daniel
Keywords: fully homomorphic encryption, hardware acceleration
Abstract
Fully Homomorphic Encryption (FHE) enables offloading computation to untrusted servers with cryptographic privacy. Despite its attractive security, FHE is not yet widely adopted due to its prohibitive overheads, about 10,000X over unencrypted computation. Recent FHE accelerators have made strides to bridge this performance gap. Unfortunately, prior accelerators only work well for simple programs, but become inefficient for complex programs, which bring additional costs and challenges. We present CraterLake, the first FHE accelerator that enables FHE programs of unbounded size (i.e., unbounded multiplicative depth). Such computations require very large ciphertexts (tens of MBs each) and different algorithms that prior work does not support well. To tackle this challenge, CraterLake introduces a new hardware architecture that efficiently scales to very large ciphertexts, novel functional units to accelerate key kernels, and new algorithms and compiler techniques to reduce data movement. We evaluate CraterLake on deep FHE programs, including deep neural networks like ResNet and LSTMs, where prior work takes minutes to hours per inference on a CPU. CraterLake outperforms a CPU by gmean 4,600X and the best prior FHE accelerator by 11.2X under similar area and power budgets. These speeds enable real-time performance on unbounded FHE programs for the first time.
PS-ORAM: efficient crash consistency support for oblivious RAM on NVM
Authors: Liu, Gang and Li, Kenli and Xiao, Zheng and Wang, Rujia
Keywords: NVM, ORAM, crash consistency, persistence, security
Abstract
Oblivious RAM (ORAM) is a provably secure primitive that prevents access pattern leakage on the memory bus. By randomly remapping data blocks and accessing redundant blocks, ORAM prevents access pattern leakage through obfuscation. Byte-addressable non-volatile memory (NVM) is considered a candidate for main memory due to its better scalability, competitive performance, and persistent data store. While much prior work focuses on improving ORAM's performance on conventional DRAM-based memory systems, when the memory technology shifts to NVM, an efficient crash-consistent ORAM is needed for security, correctness, and performance. Directly using traditional software-based crash consistency support for an ORAM system is not only expensive but also insecure. In this work, we study how to persist an ORAM construction with an NVM-based memory system. To support crash consistency without damaging ORAM system security or compromising performance, we propose PS-ORAM. PS-ORAM consists of a novel ORAM controller design and a set of ORAM access protocols that support crash consistency. Compared with a system without crash consistency support, non-recursive and recursive PS-ORAM incur only 4.29% and 3.65% additional performance overhead, respectively. The results show that PS-ORAM not only supports effective crash consistency with minimal performance and hardware overhead but is also friendly to NVM lifetime.
There’s always a bigger fish: a clarifying analysis of a machine-learning-assisted side-channel attack
Authors: Cook, Jack and Drean, Jules and Behrens, Jonathan and Yan, Mengjia
Keywords: deep learning, microarchitecture, security, side channels, website fingerprinting
Abstract
Machine learning has made it possible to mount powerful attacks through side channels that have traditionally been seen as challenging to exploit. However, due to the black-box nature of machine learning models, these attacks are often difficult to interpret correctly. Models that detect correlations cannot be used to prove causality or understand an attack’s various sources of information leakage. In this paper, we show that a state-of-the-art website-fingerprinting attack powered by machine learning was only partially analyzed. In this attack, an attacker collects cache-sweeping traces, which measure the frequency at which the entire last-level cache can be accessed over time, while a victim loads a website. A neural network is then trained on these traces to predict the websites accessed by the victim. The attack’s usage of the cache led to a consensus that it exploited a cache-based side channel. However, we provide additional analysis contradicting this assumption and clarifying the mechanisms behind this powerful attack. We first replicate the website-fingerprinting attack without making any cache accesses, demonstrating that memory accesses are not crucial to the attack’s success and may even inhibit its performance. We then search for the primary source of information leakage in our new attack by analyzing the effects of various isolation mechanisms and by instrumenting the Linux kernel. We ultimately find that this attack’s success can be attributed primarily to system interrupts. Finally, we use this analysis to craft highly practical and effective defense mechanisms against our attack.
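The cache-free replication can be approximated by a simple counting loop: each time slice records how many iterations completed, and slices in which the kernel handled interrupts show lower counts. A simplified sketch of that idea:

```python
# Sketch of cache-free trace collection (simplified): count how many loop
# iterations fit in each fixed time slice. Interrupt handling steals CPU
# time, so slices with more interrupts show lower counts -- no memory
# accesses (cache sweeps) are required.
import time

def collect_trace(num_slices, slice_sec=0.005):
    trace = []
    for _ in range(num_slices):
        deadline = time.perf_counter() + slice_sec
        count = 0
        while time.perf_counter() < deadline:
            count += 1                      # pure compute, no big memory walks
        trace.append(count)
    return trace

trace = collect_trace(20)
print(min(trace), max(trace))   # dips correspond to interrupt activity
```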
Gearbox: a case for supporting accumulation dispatching and hybrid partitioning in PIM-based accelerators
Authors: Lenjani, Marzieh and Ahmed, Alif and Stan, Mircea and Skadron, Kevin
Keywords: PIM, SpMSpV, SpMV, graph, processing in memory, sparse
Abstract
Processing-in-memory (PIM) minimizes data movement overheads by placing processing units near each memory segment. Recent PIMs employ processing units with a SIMD architecture. However, kernels with random accesses, such as sparse-matrix-dense-vector (SpMV) and sparse-matrix-sparse-vector (SpMSpV) multiplication, cannot effectively exploit the parallelism of SIMD units because the SIMD ALUs remain idle until all operands are collected from local memory segments (the memory segment attached to the processing unit) or remote memory segments (other segments of the memory). For SpMV and SpMSpV, properly partitioning the matrix and the vector among the memory segments is also very important. Partitioning determines (i) how much processing load is assigned to each processing unit and (ii) how much communication is required among the processing units. In this paper, first, we propose a highly parallel architecture that can exploit the available parallelism even in the presence of random accesses. Second, we observe that, in SpMV and SpMSpV, most remote accesses become remote accumulations with the right choice of algorithm and partitioning. These remote accumulations can be offloaded to the processing units next to the destination memory segments, eliminating idle time due to remote accesses. Accordingly, we introduce a dispatching mechanism for remote accumulation offloading. Third, we propose hybrid partitioning and associated hardware support. Our partitioning technique enables (i) replacing remote read accesses with broadcasting (for only the small portion of data that is read by all processing units), (ii) reducing the number of remote accumulations, and (iii) balancing the load. Our proposed method, Gearbox, with just one memory stack, delivers an average speedup of 15.73X (up to 52X) over a server-class GPU, NVIDIA P100, with three stacks of HBM2 memory.
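Accumulation dispatching can be sketched with a toy SpMV. In this illustrative model (the segment layout and message format are invented here), each segment multiplies its local nonzeros and sends (row, partial) messages to the output row's owner, so no remote reads occur.

```python
# Toy model of accumulation dispatching for SpMV (illustrative; Gearbox does
# this with hardware units next to each memory segment). Matrix nonzeros are
# partitioned across segments; instead of reading remote outputs, a segment
# dispatches (row, partial) messages to the owner of the output row.
from collections import defaultdict

NUM_SEGS = 2

def owner(row):
    return row % NUM_SEGS                   # which segment owns output row

def spmv_gearbox(nnz_by_seg, x):
    outbox = defaultdict(list)              # messages: owner seg -> (row, val)
    for seg, entries in nnz_by_seg.items():
        for row, col, val in entries:       # local multiply
            outbox[owner(row)].append((row, val * x[col]))
    y = defaultdict(float)
    for seg, msgs in outbox.items():        # remote *accumulation*, not read
        for row, partial in msgs:
            y[row] += partial
    return dict(y)

# 2x2 matrix [[1, 2], [3, 4]], columns split across two segments:
nnz_by_seg = {0: [(0, 0, 1.0), (1, 0, 3.0)], 1: [(0, 1, 2.0), (1, 1, 4.0)]}
print(spmv_gearbox(nnz_by_seg, x=[1.0, 1.0]))  # -> {0: 3.0, 1: 7.0}
```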
To PIM or not for emerging general purpose processing in DDR memory systems
Authors: Devic, Alexandar and Rai, Siddhartha Balakrishna and Sivasubramaniam, Anand and Akel, Ameen and Eilert, Sean and Eno, Justin
Keywords: DRAM, compilers, general purpose processing, parallel processing, processing-in-memory, vector processing
Abstract
As Processing-In-Memory (PIM) hardware matures and starts making its way into normal compute platforms, software has an important role to play in determining what to perform where, and when, on such heterogeneous systems. Taking an emerging class of PIM hardware that provisions a general-purpose (RISC-V) processor at each memory bank, this paper takes on this challenging problem by developing a software compilation framework. This framework analyzes several application characteristics - parallelizability, vectorizability, data set sizes, and offload costs - to determine what, whether, when, and how to offload computations to the PIM engines. In the process, it also proposes a vector engine extension to the bank-level RISC-V cores. Using several off-the-shelf C/C++ applications, we demonstrate that PIM is not always a panacea, and a framework such as ours is essential in carefully selecting what needs to be performed where, when, and how. The choice of hardware platform - the number of memory banks, and the relative speeds and capabilities of the host CPU and PIM cores - can further impact the “to PIM or not” question.
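The kind of decision the framework automates can be caricatured with a tiny cost model. Every formula and threshold below is invented for illustration; the paper's framework derives its decisions from real application analysis.

```python
# Hypothetical offload cost model in the spirit of the paper's compiler
# framework (the formulas and thresholds are invented for illustration).

def should_offload_to_pim(bytes_touched, flops, parallel_fraction,
                          vectorizable, offload_cost_us,
                          host_gbps=50.0, pim_gbps=400.0):
    if parallel_fraction < 0.5:
        return False                        # PIM cores are weak; need parallelism
    arithmetic_intensity = flops / max(bytes_touched, 1)
    if arithmetic_intensity > 8.0:
        return False                        # compute-bound: host CPU wins
    # 1 GB/s == 1e3 bytes/us, so bytes / (gbps * 1e3) yields microseconds.
    host_us = bytes_touched / (host_gbps * 1e3)
    pim_us = bytes_touched / (pim_gbps * 1e3) + offload_cost_us
    if vectorizable:
        pim_us *= 0.5                       # bank-level vector engine extension
    return pim_us < host_us

print(should_offload_to_pim(bytes_touched=64_000_000, flops=16_000_000,
                            parallel_fraction=0.9, vectorizable=True,
                            offload_cost_us=100.0))   # -> True
```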
MeNDA: a near-memory multi-way merge solution for sparse transposition and dataflows
Authors: Feng, Siying and He, Xin and Chen, Kuan-Yu and Ke, Liu and Zhang, Xuan and Blaauw, David and Mudge, Trevor and Dreslinski, Ronald
Keywords: hardware accelerator, hardware merge tree, multi-way merge accelerator, near-memory processing, sparse linear algebra, sparse matrix transposition, sparse matrix-vector multiplication
Abstract
Near-memory processing has been extensively studied to optimize memory-intensive workloads. However, none of the proposed designs address sparse matrix transposition, an important building block in sparse linear algebra applications. Prior work shows that sparse matrix transposition does not scale as well as other sparse primitives, such as sparse matrix-vector multiplication (SpMV), and hence has become a growing bottleneck in common applications. Sparse matrix transposition is highly memory-intensive but low in computational intensity, making it a promising candidate for near-memory processing. In this work, we propose MeNDA, a scalable near-DRAM multi-way merge accelerator that eliminates the off-chip memory interface bottleneck and exposes the high internal memory bandwidth to improve performance and reduce energy consumption for sparse matrix transposition. MeNDA adopts a merge-sort-based algorithm exploiting spatial locality, and proposes a near-memory processing unit (PU) featuring a high-performance hardware merge tree. Because of the wide application of merge sort in sparse linear algebra, MeNDA is an extensible solution that can be easily adapted to support other sparse primitives such as SpMV. Techniques including seamless back-to-back merge sort, stall-reducing prefetching, and request coalescing are further explored to take full advantage of the increased system memory bandwidth. Compared to two state-of-the-art implementations of sparse matrix transposition on a CPU and a sparse library on a GPU, MeNDA achieves speedups of 19.1X, 12.0X, and 7.7X, respectively. MeNDA also shows an efficiency gain of 3.8X over a recent SpMV accelerator integrated with HBM. Incurring a power consumption of only 78.6 mW, a MeNDA PU can be easily accommodated by commodity DIMMs.
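The merge-based transposition algorithm is easy to show in software. This sketch uses Python's heapq.merge where MeNDA would use its hardware merge tree; the point is that the whole computation is sequential streaming.

```python
# Software sketch of merge-based sparse transposition (MeNDA builds the merge
# tree in hardware near DRAM; this only shows the algorithm's shape).
import heapq

def transpose_csr(rows):
    # rows[i] is the sorted list of (col, val) for row i. Tagging each entry
    # as (col, row, val) and k-way merging yields the entries of the
    # transposed matrix in row-major order -- pure sequential streaming.
    streams = ([(col, r, val) for col, val in row] for r, row in enumerate(rows))
    transposed = {}
    for col, r, val in heapq.merge(*streams):
        transposed.setdefault(col, []).append((r, val))
    return transposed

rows = [[(0, 1.0), (2, 2.0)],      # row 0
        [(1, 3.0)],                # row 1
        [(0, 4.0), (1, 5.0)]]      # row 2
print(transpose_csr(rows))
# {0: [(0, 1.0), (2, 4.0)], 1: [(1, 3.0), (2, 5.0)], 2: [(0, 2.0)]}
```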
CaSMap: agile mapper for reconfigurable spatial architectures by automatically clustering intermediate representations and scattering mapping process
Authors: Man, Xingchen and Zhu, Jianfeng and Song, Guihuan and Yin, Shouyi and Wei, Shaojun and Liu, Leibo
Keywords: coarse-grained reconfigurable architecture, compiler, integer linear programming, reconfigurable spatial architecture
Abstract
Today, reconfigurable spatial architectures (RSAs) have sprung up as accelerators for compute- and data-intensive domains because they deliver energy and area efficiency close to ASICs while retaining sufficient programmability to keep development costs low. The mapper, which is responsible for mapping algorithms onto RSAs, favors a systematic backtracking methodology because of its high portability across evolving RSA designs. However, exponentially scaling compilation time has become the major obstacle. The key observation of this paper is that the main limiting factor of systematic backtracking mappers is the waterfall mapping model, which resolves all mapping variables and constraints at the same time using single-level intermediate representations (IRs). This work proposes CaSMap, an agile mapper framework independent of the software and hardware of RSAs. By clustering the lowest-level software and hardware IRs into multi-level IRs, the original mapping process can be scattered into multi-stage decomposed ones, mitigating the exponential complexity of the mapping problem. This paper introduces (a) strategies for clustering low-level hardware and software IRs with static connectivity and critical path analysis, and (b) a multi-level scattered mapping model in which the higher-level model carries out the heuristics from IR clustering, endeavors to promote the mapping success rate, and reduces the scale of the lower-level model. Our evaluation shows that CaSMap is able to reduce the problem scale (nonzeros) by 80.5% (23.1%-94.9%) and achieve a mapping time speedup of 83X over the state-of-the-art waterfall mapper across four different RSA topologies: MorphoSys, HReA, HyCUBE, and REVEL.
FFCCD: fence-free crash-consistent concurrent defragmentation for persistent memory
Authors: Xu, Yuanchao and Ye, Chencheng and Solihin, Yan and Shen, Xipeng
Keywords: defragmentation, garbage collection, memory management, non-volatile memory, persistent memory
Abstract
Persistent Memory (PM) is increasingly supplementing or substituting for DRAM as main memory. Prior work has focused on the reusability and memory leaks of persistent memory but has not addressed a problem amplified by persistence: persistent memory fragmentation, the continuous worsening of fragmentation of persistent memory throughout its usage. This paper reveals the challenges and proposes the first systematic crash-consistent solution, Fence-Free Crash-consistent Concurrent Defragmentation (FFCCD). FFCCD reuses the persistent pointer format, root nodes, and typed allocation provided by the persistent memory programming model to enable concurrent defragmentation on PM. FFCCD introduces architectural support for concurrent defragmentation that enables a fence-free design and a fast read barrier, reducing two major overheads of defragmenting persistent memory. The techniques are effective (28–73% fragmentation reduction) and fast (4.1% execution time overhead).
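The read-barrier mechanism can be sketched with forwarding pointers. This toy software rendering only shows the logic; FFCCD's contribution is making the barrier fast and fence-free with architecture support.

```python
# Illustrative read-barrier sketch for concurrent defragmentation (FFCCD's
# barrier is in hardware and fence-free; this only shows the logic).

class Obj:
    def __init__(self, payload):
        self.payload = payload
        self.forward = None          # set when the defragmenter moves the object

def read_barrier(ref):
    # Fast path: object not moved. Slow path: chase the forwarding pointer
    # installed by the concurrent defragmentation thread.
    while ref.forward is not None:
        ref = ref.forward
    return ref

def defragment(old):
    new = Obj(old.payload)           # copy into a compacted region
    old.forward = new                # mutators discover the move via the barrier
    return new

a = Obj("data")
compacted = defragment(a)            # runs concurrently with mutators
print(read_barrier(a).payload)       # a mutator still holding 'a' sees "data"
```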
LightPC: hardware and software co-design for energy-efficient full system persistence
Authors: Lee, Sangwon and Kwon, Miryeong and Park, Gyuyoung and Jung, Myoungsoo
Keywords: No keywords
Abstract
We propose LightPC, a lightweight persistence-centric platform that makes the system robust against power loss. LightPC consists of hardware and software subsystems, referred to as open-channel PMEM (OC-PMEM) and persistence-centric OS (PecOS), respectively. OC-PMEM removes the physical and logical boundaries between volatile and non-volatile data structures by unshackling new memory media from the conventional PMEM complex. PecOS provides a single execution persistence cut to quickly convert execution states to persistent information in case of a power failure, which can eliminate persistence control overhead. We prototype LightPC's computing complex and OC-PMEM using our custom system board. PecOS is implemented based on Linux 4.19 and the Berkeley bootloader on the hardware prototype. Our evaluation results show that OC-PMEM can make user-level performance comparable with a DRAM-only non-persistent system, while consuming 73% lower power and 69% less energy. LightPC also shortens the execution time of diverse HPC, SPEC, and in-memory DB workloads, compared to traditional persistent systems, by 4.3X on average.
ASAP: architecture support for asynchronous persistence
Authors: Abulila, Ahmed and Hajj, Izzat El and Jung, Myoungsoo and Kim, Nam Sung
Keywords: hardware logging, memory persistency, non-volatile memory
Abstract
Supporting atomic durability of updates for persistent memories is typically achieved with Write-Ahead Logging (WAL). WAL flushes log entries to persistent memory before making the actual data persistent to ensure that a consistent state can be recovered if a crash occurs. Performing WAL in hardware is attractive because it makes most aspects of log management transparent to software and completes log persist operations (LPOs) and data persist operations (DPOs) in the background, overlapping them with the execution of other instructions. Prior hardware logging solutions commit atomic regions synchronously: once the end of a region is reached, all outstanding persist operations required for the region to commit must complete before instruction execution may proceed. For undo logging, LPOs and DPOs are both performed synchronously to ensure that the region commits synchronously. For redo logging, DPOs can be performed asynchronously, but LPOs are performed synchronously to ensure that the region commits synchronously. In both cases, waiting for synchronous persist operations (LPOs or DPOs) at the end of an atomic region causes atomic regions to incur high latency. To tackle this limitation, we propose ASAP, a hardware logging solution that allows atomic regions to commit asynchronously: once the end of an atomic region is reached, instruction execution may proceed without waiting for outstanding persist operations to complete. As such, both LPOs and DPOs can be performed asynchronously. The challenge with allowing atomic regions to commit asynchronously is that it can lead to control and data dependence violations in the commit order of the atomic regions, leaving data in an unrecoverable state in case of a crash. To address this issue, ASAP tracks and enforces control and data dependencies between atomic regions in hardware to ensure that the regions commit in the proper order. Our evaluation shows that ASAP outperforms the state-of-the-art hardware undo and redo logging techniques by 1.41X and 1.53X, respectively, while achieving 0.96X the ideal performance when no persistence is enforced, at a small hardware cost (<3%). ASAP also reduces memory traffic to persistent memory by 38% and 48%, compared with the state-of-the-art hardware undo and redo logging techniques, respectively. ASAP is robust against increasing persistent memory latency, making it suitable for both fast and slow persistent memory technologies.
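Dependence-ordered asynchronous commit can be modeled in a few lines. In this invented software sketch, execution never stalls at region end; a background step commits a region once its persists have drained and its predecessor regions have committed.

```python
# Toy model of asynchronous region commit with dependence ordering
# (illustrative; ASAP tracks this in hardware, not software).

class Region:
    def __init__(self, rid, deps):
        self.rid, self.deps = rid, set(deps)   # must commit after these regions
        self.persists_done = False

committed = set()

def try_commit(regions):
    # Execution never stalls at region end; commit happens in the background
    # once (a) all persist operations drained and (b) all dependencies committed.
    progress = True
    while progress:
        progress = False
        for r in regions:
            if (r.rid not in committed and r.persists_done
                    and r.deps <= committed):
                committed.add(r.rid)           # safe: recovery order preserved
                progress = True

r1, r2, r3 = Region(1, []), Region(2, [1]), Region(3, [1])
r2.persists_done = True                        # r2's persists finish first...
try_commit([r1, r2, r3])
print(committed)                               # -> set(): r2 must wait for r1
r1.persists_done = r3.persists_done = True
try_commit([r1, r2, r3])
print(committed)                               # -> {1, 2, 3}, in proper order
```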
Sibyl: adaptive and extensible data placement in hybrid storage systems using online reinforcement learning
Authors: Singh, Gagandeep and Nadig, Rakesh and Park, Jisung and Bera, Rahul and Hajinazar, Nastaran and Novo, David and Gómez-Luna, Juan and Stuijk, Sander and Corporaal, Henk and Mutlu, Onur
Keywords: data placement, hybrid storage systems, hybrid systems, machine learning, reinforcement learning, solid-state drives (SSDs)
Abstract
Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Data placement across the different devices is critical to maximize the benefits of such a hybrid system. Recent research proposes various techniques that aim to accurately identify performance-critical data and place it in a “best-fit” storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range of workloads and storage device configurations, and (2) makes it difficult for designers to extend these techniques to different storage system configurations (e.g., with a different number or different types of storage devices) than the configuration they were designed for. Our goal is to design a new data placement technique for hybrid storage systems that overcomes these issues and provides (1) adaptivity, by continuously learning from and adapting to the workload and storage device characteristics, and (2) easy extensibility to a wide range of workloads and HSS configurations. We introduce Sibyl, the first technique that uses reinforcement learning for data placement in hybrid storage systems. Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online. We implement Sibyl on real systems with various HSS configurations, including dual- and tri-hybrid storage systems, and extensively compare it against four previously proposed data placement techniques (both heuristic- and machine-learning-based) over a wide range of workloads. Our results show that Sibyl provides 21.6%/19.9% performance improvement in a performance-oriented/cost-oriented HSS configuration compared to the best previous data placement technique. Our evaluation using an HSS configuration with three different storage devices shows that Sibyl outperforms the state-of-the-art data placement policy by 23.9%-48.2%, while significantly reducing the system architect's burden in designing a data placement mechanism that can simultaneously incorporate three storage devices. We show that Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns while incurring a very modest storage overhead of only 124.4 KiB.
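A minimal epsilon-greedy sketch conveys the flavor of RL-driven placement; Sibyl's state features, reward, and learning algorithm are far richer, and everything below is a toy.

```python
# Minimal epsilon-greedy sketch of RL-driven data placement (illustrative;
# Sibyl uses richer workload/device features and a full RL formulation).
import random

class PlacementAgent:
    def __init__(self, devices, eps=0.1, lr=0.2):
        self.q = {d: 0.0 for d in devices}     # learned value per device (toy)
        self.eps, self.lr = eps, lr

    def place(self):
        if random.random() < self.eps:         # explore
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)     # exploit best-known device

    def feedback(self, device, reward):
        # Reward from the system, e.g. negative request latency.
        self.q[device] += self.lr * (reward - self.q[device])

agent = PlacementAgent(["fast_ssd", "slow_ssd"])
latency = {"fast_ssd": 0.1, "slow_ssd": 1.0}
for _ in range(200):
    d = agent.place()
    agent.feedback(d, reward=-latency[d])
print(max(agent.q, key=agent.q.get))           # -> "fast_ssd"
```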
A synthesis framework for stitching surface code with superconducting quantum devices
Authors: Wu, Anbang and Li, Gushu and Zhang, Hezi and Guerreschi, Gian Giacomo and Ding, Yufei and Xie, Yuan
Keywords: compiler, quantum computing, quantum error correction
Abstract
Quantum error correction (QEC) is the central building block of fault-tolerant quantum computation, but the design of QEC codes may not always match the underlying hardware. To tackle the discrepancy between quantum hardware and QEC codes, we propose a synthesis framework that can implement and optimize the surface code for superconducting quantum architectures. In particular, we divide surface code synthesis into three key subroutines. The first two optimize the mapping of data qubits and ancillary qubits, including syndrome qubits, on the connectivity-constrained superconducting architecture, while the last subroutine optimizes the surface code execution by rescheduling syndrome measurements. Our experiments on mainstream superconducting architectures demonstrate the effectiveness of the proposed synthesis framework. Notably, the surface codes synthesized by the proposed automatic framework can achieve comparable or even better error correction capability than manually designed QEC codes.
2QAN: a quantum compiler for 2-local qubit hamiltonian simulation algorithms
Authors: Lao, Lingling and Browne, Dan E.
Keywords: quantum compilation, quantum computing, quantum simulation
Abstract
Simulating quantum systems is one of the most important potential applications of quantum computers. The high-level circuit defining the simulation needs to be compiled into one that complies with hardware limitations such as the qubit architecture (connectivity) and instruction (gate) set. General-purpose quantum compilers work at the gate level and have little knowledge of the mathematical properties of quantum applications, missing further optimization opportunities. Existing application-specific compilers only apply advanced optimizations in the scheduling procedure and are restricted to the CNOT or CZ gate set. In this work, we develop a compiler, named 2QAN, to optimize quantum circuits for 2-local qubit Hamiltonian simulation problems, a framework which includes the important quantum approximate optimization algorithm (QAOA). In particular, we exploit the flexibility of permuting different operators in the Hamiltonian (whether or not they commute) and propose permutation-aware techniques for qubit routing, gate optimization, and scheduling to minimize compilation overhead. 2QAN can target different architectures and different instruction sets. Compilation results on four applications (up to 50 qubits) and three quantum computers (namely, Google Sycamore, IBMQ Montreal, and Rigetti Aspen) show that 2QAN outperforms state-of-the-art general-purpose compilers and application-specific compilers. Specifically, 2QAN can reduce the number of inserted SWAP gates by 11.5X, reduce the overhead in hardware gate count by 68.5X, and reduce the overhead in circuit depth by 21X. Experimental results on the Montreal device demonstrate that benchmarks compiled by 2QAN achieve the highest fidelity.
XQsim: modeling cross-technology control processors for 10+K qubit quantum computers
Authors: Byun, Ilkwon and Kim, Junpyo and Min, Dongmoon and Nagaoka, Ikki and Fukumitsu, Kosuke and Ishikawa, Iori and Tanimoto, Teruo and Tanaka, Masamitsu and Inoue, Koji and Kim, Jangwoo
Keywords: cryogenic computing, modeling, quantum computing, simulation, single flux quantum (SFQ)
Abstract
A 10+K qubit quantum computer is essential to achieve quantum supremacy in a true sense. With the recent efforts toward large-scale quantum computers, architects have revealed various scalability issues, including the constraints in the quantum control processor, which should be holistically analyzed to design a future scalable control processor. However, it has been impossible to identify and resolve the processor's scalability bottleneck due to the absence of a reliable tool to explore an extensive design space including microarchitecture, device technology, and operating temperature. In this paper, we present XQsim, an open-source cross-technology quantum control processor simulator. XQsim can accurately analyze the scalability bottlenecks of target control processors for various device technology and operating temperature candidates. To achieve this goal, we first fully implement a convincing control processor microarchitecture for fault-tolerant quantum computer (FTQC) systems. Next, on top of the microarchitecture, we develop an architecture-level control processor simulator (XQsim) and thoroughly validate it with post-layout analysis, timing-accurate RTL simulation, and noisy quantum simulation. Lastly, driven by XQsim, we provide future directions for designing a 10+K qubit quantum control processor, with several design guidelines and architecture optimizations. Our case study shows that the final control processor architecture can successfully support ~59K qubits with our operating temperature and technology choices.
Geyser: a compilation framework for quantum computing with neutral atoms
Authors: Patel, Tirthak and Silver, Daniel and Tiwari, Devesh
Keywords: NISQ computing, neutral atoms, quantum compiling, quantum computing, quantum software, rydberg atoms
Abstract
Compared to widely used superconducting qubits, neutral-atom quantum computing technology promises potentially better scalability and flexible arrangement of qubits, allowing higher operation parallelism and more relaxed cooling requirements. The high-performance computing (HPC) and architecture community is beginning to design new solutions to take advantage of neutral-atom quantum architectures and overcome their unique challenges. We propose Geyser, the first work to take advantage of the multi-qubit gates natively supported by neutral-atom quantum computers by appropriately mapping quantum circuits to a three-qubit-friendly physical arrangement of qubits. Geyser then creates multiple logical blocks in the quantum circuit to exploit quantum parallelism and reduce the number of pulses needed to realize physical gates. These circuit blocks elegantly enable Geyser to compose equivalent circuits with three-qubit gates, even when the original program does not have any multi-qubit gates. Our evaluation results show that Geyser reduces the number of operation pulses by 25%-90% and improves the algorithm's output fidelity by 25-60 percentage points across different algorithms.
X-cache: a modular architecture for domain-specific caches
Authors: Sedaghati, Ali and Hakimi, Milad and Hojabr, Reza and Shriraman, Arrvindh
Keywords: caches, coroutines, dataflow architectures, domain-specific architectures
Abstract
With Dennard scaling ending, architects are turning to domain-specific accelerators (DSAs). State-of-the-art DSAs work with sparse data [37] and indirectly indexed data structures [18, 30]. They introduce non-affine and dynamic memory accesses [7, 35] and require domain-specific caches. Unfortunately, cache controllers are notorious for being difficult to architect, and domain specialization compounds the problem. DSA caches need to support custom tags, data-structure walks, multiple refills, and preloading. Prior DSAs include ad-hoc cache structures and do not implement the cache controller. We propose X-Cache, a reusable caching idiom for DSAs, and will be open-sourcing a toolchain for both generating the RTL and programming X-Cache. There are three key ideas: i) DSA-specific tags (Meta-tag): the designer can use any combination of fields from the DSA metadata as the tag. Meta-tags eliminate the overhead of walking and translating metadata to global addresses, which saves energy and improves load-to-use latency. ii) DSA-programmable walkers (X-Actions): we find that a common set of microcode actions can be used to implement the DSA-specific walking, data block, and tag management, and we develop a programmable microcode engine that efficiently realizes the data orchestration. iii) DSA-portable controller (X-Routines): we use a portable abstraction, coroutines, to let the designer express walking and orchestration. Coroutines capture block-level parallelism, remain lightweight, and minimize controller occupancy. We create caches for four different DSA families: Sparse GEMM [35, 37], GraphPulse [30], DASX [22], and Widx [18]. X-Cache outperforms address-based caches by 1.7×.
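The Meta-tag idea can be shown with a toy cache keyed by DSA metadata. Names here are invented, and the walker is a stand-in for an X-Routine; X-Cache generates RTL and microcode rather than software.

```python
# Toy sketch of the Meta-tag idea (illustrative only): the cache is tagged by
# DSA metadata fields, e.g. a CSR coordinate, so a hit needs no walk of the
# index structures to form a global address.

class MetaTagCache:
    def __init__(self, backing_walk):
        self.lines = {}                  # meta-tag -> data block
        self.walk = backing_walk         # DSA-specific walker (X-Routine-like)
        self.hits = self.misses = 0

    def load(self, meta_tag):
        if meta_tag in self.lines:       # tag match on metadata, no translation
            self.hits += 1
        else:
            self.misses += 1
            self.lines[meta_tag] = self.walk(meta_tag)   # walk + refill
        return self.lines[meta_tag]

# Hypothetical walker for a CSR matrix: (row, col) -> value via index walk.
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
def csr_walk(tag):
    row, col = tag
    for i in range(indptr[row], indptr[row + 1]):
        if indices[i] == col:
            return data[i]
    return 0.0

cache = MetaTagCache(csr_walk)
print(cache.load((0, 2)), cache.load((0, 2)))   # miss then hit -> 2.0 2.0
print(cache.hits, cache.misses)                 # -> 1 1
```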
Register file prefetching
Authors: Shukla, Sudhanshu and Bandishte, Sumeet and Gaur, Jayesh and Subramoney, Sreenivas
Keywords: address prediction, load value prefetching, microarchitecture, pipeline prefetching, value prediction
Abstract
The memory wall continues to limit the performance of modern out-of-order (OOO) processors, despite the expensive provisioning of large multi-level caches and advancements in memory prefetching. In this paper, we put forth an important observation: the memory wall is not monolithic but is constituted of many latency walls arising from the latency of each tier of cache/memory. Our results show that even though level-1 (L1) data cache latency is nearly 40X lower than main memory latency, mitigating this latency offers a very similar performance opportunity as the more widely studied main memory latency. This motivates our proposal, Register File Prefetch (RFP), which intelligently utilizes the existing OOO scheduling pipeline and available L1 data cache/register file bandwidth to successfully prefetch 43.4% of load requests from the L1 cache to the register file. Simulation results on 65 diverse workloads show that this translates to a 3.1% performance gain over a baseline with parameters similar to the Intel Tiger Lake processor, which further increases to 5.7% for a futuristic up-scaled core. We also contrast and differentiate register file prefetching from techniques like load value and address prediction that enhance performance by speculatively breaking data dependencies. Our analysis shows that RFP is synergistic with value prediction, with both features together delivering a 4.1% average performance improvement, significantly higher than the 2.2% performance gain obtained from value prediction alone.
GCoM: a detailed GPU core model for accurate analytical modeling of modern GPUs
Authors: Lee, Jounghoo and Ha, Yeonan and Lee, Suhyun and Woo, Jinyoung and Lee, Jinho and Jang, Hanhwi and Kim, Youngsok
Keywords: graphics processing units, interval analysis, performance modeling
Abstract
Analytical models can help computer architects perform early-stage design space exploration orders of magnitude faster than cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models which capture first-order stall events causing performance degradation; however, the existing analytical models cannot accurately model modern GPUs due to their outdated and highly abstract GPU core microarchitecture assumptions. Therefore, to accurately evaluate the performance of modern GPUs, we need a new GPU analytical model which accurately captures the stall events incurred by the significant changes in the core microarchitectures of modern GPUs. We propose GCoM, an accurate GPU analytical model which faithfully captures the key core-side stall events of modern GPUs. Through detailed microarchitecture-driven GPU core modeling, GCoM accurately models modern GPUs by revealing the following key core-side stalls overlooked by existing GPU analytical models. First, GCoM identifies the compute structural stall events caused by the limited per-sub-core functional units. Second, GCoM exposes the memory structural stalls due to the limited banks and shared nature of per-core L1 data caches. Third, GCoM correctly predicts the memory data stalls induced by the sectored L1 data caches, which split a cache line into a set of sectors sharing the same tag. Fourth, GCoM captures the idle stalls incurred by inter- and intra-core load imbalances. Our experiments using an NVIDIA RTX 2060 configuration show that GCoM greatly improves modeling accuracy, achieving a mean absolute error of 10.0% against the Accel-Sim cycle-level simulator, whereas the state-of-the-art GPU analytical model achieves a mean absolute error of 44.9%.
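The overall shape of such an analytical model is simple cycle accounting: a base term plus the stall cycles of each first-order event. The component names below mirror the stall classes GCoM models; the numbers are made up for illustration.

```python
# Sketch of interval-style analytical cycle accounting (illustrative; the
# component names mirror the stall classes GCoM models, the numbers do not).

def predict_cycles(base, stalls):
    # Total = base (ideal issue) + the stall cycles of each first-order event.
    return base + sum(stalls.values())

stalls = {
    "compute_structural": 1.2e5,   # limited per-sub-core functional units
    "memory_structural":  2.5e5,   # L1 bank conflicts / shared L1 pressure
    "memory_data":        4.0e5,   # sectored-L1 misses waiting on data
    "idle":               0.8e5,   # inter-/intra-core load imbalance
}
cycles = predict_cycles(base=3.0e5, stalls=stalls)
print(f"{cycles:.3e} cycles")      # -> 1.150e+06 in this made-up example
```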