ISCA 2023 | 逸翎清晗🌈

QIsim: Architecting 10+K Qubit QC Interfaces Toward Quantum Supremacy

Authors: Min, Dongmoon and Kim, Junpyo and Choi, Junhyuk and Byun, Ilkwon and Tanaka, Masamitsu and Inoue, Koji and Kim, Jangwoo
Keywords: quantum-classical interface, simulation, modeling, single flux quantum (SFQ), cryogenic computing, quantum computing

Abstract

A 10+K qubit Quantum-Classical Interface (QCI) is essential to realize quantum supremacy. However, it is extremely challenging to architect scalable QCIs due to the complex scalability trade-offs regarding operating temperatures, device and wire technologies, and microarchitecture designs. Therefore, architects need a modeling tool to evaluate various QCI design choices and arrive at an optimal scalable QCI architecture. In this paper, we propose (1) QIsim, an open-source QCI simulation framework, and (2) novel architectural optimizations for designing 10+K qubit QCIs toward quantum supremacy. To achieve this goal, we first implement detailed QCI microarchitectures to model all the existing temperature and technology candidates. Next, based on the microarchitectures, we develop our scalability-analysis tool (QIsim) and thoroughly validate it using previous works, post-layout analyses, and real quantum-machine experiments. Finally, we successfully develop our 60,000+ qubit-scale QCI designs by applying eight architectural optimizations driven by QIsim.

DOI: 10.1145/3579371.3589036


Astrea: Accurate Quantum Error-Decoding via Practical Minimum-Weight Perfect-Matching

Authors: Vittal, Suhas and Das, Poulami and Qureshi, Moinuddin
Keywords: real-time decoding, error decoding, quantum error correction

Abstract

Quantum devices suffer from high error rates, which makes them ineffective for running practical applications. Quantum computers can be made fault tolerant using Quantum Error Correction (QEC), which protects quantum information by encoding logical qubits using data qubits and parity qubits. The data qubits collectively store the quantum information and the parity qubits are measured periodically to produce a syndrome, which is decoded by a classical decoder to identify the location and type of errors. To prevent errors from accumulating and causing a logical error, decoders must accurately identify errors in real-time, necessitating the use of hardware solutions because software decoders are slow. Ideally, a real-time decoder must match the performance of the Minimum-Weight Perfect Matching (MWPM) decoder. However, due to the complexity of the underlying Blossom algorithm, state-of-the-art real-time decoders either use lookup tables, which are not scalable, or use approximate decoding, which significantly increases logical error rates. In this paper, we leverage two key insights to enable practical real-time MWPM decoding. First, for near-term implementations (with redundancies up to distance d = 7) of surface codes, the Hamming weight of the syndromes tends to be quite small (less than or equal to 10). For this regime, we propose Astrea, which simply performs a brute-force search over the few hundred possible options to perform accurate decoding within a few nanoseconds (1ns average, 456ns worst-case), thus representing the first decoder able to do MWPM in real-time up to distance 7. Second, even for codes that produce syndromes with higher Hamming weights (e.g., d = 9) the search for optimal pairings can be made more efficient by simply discarding the weights that denote significantly lower probability than the logical error-rate of the code. For this regime, we propose a greedy design called Astrea-G, which filters high-cost weights and reorders the search from high-likelihood pairings to low-likelihood pairings to produce the most likely decoding within 1μs (average 450ns). Our evaluations show that Astrea-G provides similar logical error-rates as the software-based MWPM for up to d = 9 codes while meeting the real-time decoding latency constraints.
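The low-Hamming-weight observation above is easy to illustrate in software: with at most ten defects there are at most 9!! = 945 perfect matchings, so an exhaustive search is tractable. The sketch below is a minimal, hypothetical brute-force matcher over a toy defect set with Manhattan-distance weights; it illustrates the search space Astrea enumerates, not the paper's hardware design.

def min_weight_perfect_matching(defects, weight):
    # Exhaustively enumerate perfect matchings of an even-sized defect set and
    # return (best_cost, best_pairing). For <= 10 defects there are at most
    # 9!! = 945 matchings, so brute force finishes quickly.
    if not defects:
        return 0.0, []
    first, rest = defects[0], defects[1:]
    best_cost, best_pairs = float("inf"), None
    for i, partner in enumerate(rest):
        # Pair the first defect with each candidate partner, recurse on the rest.
        sub_cost, sub_pairs = min_weight_perfect_matching(rest[:i] + rest[i + 1:], weight)
        cost = weight[first][partner] + sub_cost
        if cost < best_cost:
            best_cost, best_pairs = cost, [(first, partner)] + sub_pairs
    return best_cost, best_pairs

# Toy syndrome: four flipped parity checks, weights = Manhattan distance.
defects = [(0, 0), (0, 3), (2, 1), (3, 3)]
weight = {a: {b: abs(a[0] - b[0]) + abs(a[1] - b[1]) for b in defects} for a in defects}
print(min_weight_perfect_matching(defects, weight))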

DOI: 10.1145/3579371.3589037


OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization

Authors: Guo, Cong and Tang, Jiaming and Hu, Weiming and Leng, Jingwen and Zhang, Chen and Yang, Fan and Liu, Yunxin and Guo, Minyi and Zhu, Yuhao
Keywords: quantization, outlier-victim pair, large language model

Abstract

Transformer-based large language models (LLMs) have achieved great success with the growing model size. LLMs’ size grows by 240×…

DOI: 10.1145/3579371.3589038


R2D2: Removing ReDunDancy Utilizing Linearity of Address Generation in GPUs

Authors: Ha, Dongho and Oh, Yunho and Ro, Won Woo
Keywords: redundant instruction, single instruction multiple thread, GPU

Abstract

A commonly used GPU programming methodology is that adjacent threads access data at neighboring or fixed-stride memory addresses and perform computations with the fetched data. This paper demonstrates that these memory addresses often exhibit a simple linear value pattern across GPU threads, as each thread uses built-in variables and constant values to compute its memory addresses. However, since the threads compute their context data individually, GPUs incur a heavy instruction overhead to calculate the memory addresses, even though the addresses exhibit a simple pattern. We propose a GPU architecture called Removing ReDunDancy Utilizing Linearity of Address Generation (R2D2), which reduces the dynamic instruction count substantially by detecting such linear patterns in the memory addresses and exploiting them for kernel computations. R2D2 detects linearity in the memory addresses with software support and pre-computes them before the threads execute the instructions. With the proposed scheme, each thread is able to compute its memory addresses with fewer dynamic instructions than on conventional GPUs. In our evaluation, R2D2 reduces the dynamic instruction count by 28%, achieves a 1.25× speedup, and reduces energy consumption by 17% over a baseline GPU.
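The linear pattern the abstract refers to can be made concrete with a small sketch: per-thread addresses of the form addr(tid) = base + stride × tid. Everything below (function name, warp size, addresses) is illustrative; the actual R2D2 mechanism detects linearity with compiler support and pre-computes addresses in hardware.

def detect_linear_pattern(addresses):
    # Return (base, stride) if the per-thread addresses form an affine sequence
    # in the thread index, otherwise None.
    if len(addresses) < 2:
        return None
    base, stride = addresses[0], addresses[1] - addresses[0]
    for tid, addr in enumerate(addresses):
        if addr != base + stride * tid:
            return None
    return base, stride

# A kernel typically spends several ALU instructions per thread computing
# addr = base + tid * 4; once (base, stride) is known, one multiply-add suffices.
per_thread_addrs = [0x1000 + 4 * tid for tid in range(32)]   # one warp
print(detect_linear_pattern(per_thread_addrs))               # (4096, 4)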

DOI: 10.1145/3579371.3589039


TaskFusion: An Efficient Transfer Learning Architecture with Dual Delta Sparsity for Multi-Task Natural Language Processing

Authors: Fan, Zichen and Zhang, Qirui and Abillama, Pierre and Shoouri, Sara and Lee, Changwoo and Blaauw, David and Kim, Hun-Seok and Sylvester, Dennis
Keywords: deep learning, accelerator, natural language processing, heterogeneous architecture, transformer, sparsity, multi-task, transfer learning

Abstract

The combination of pre-trained models and task-specific fine-tuning schemes, such as BERT, has achieved great success in various natural language processing (NLP) tasks. However, the large memory and computation costs of such models make it challenging to deploy them on edge devices. Moreover, in real-world applications like chatbots, multiple NLP tasks need to be processed together to achieve higher response credibility. Running multiple NLP tasks with specialized models for each task increases latency and memory cost linearly with the number of tasks. Though there have been recent works on parameter-shared tuning that aim to reduce the total parameter size by partially sharing weights among multiple tasks, computation remains intensive and redundant despite different tasks using the same input. In this work, we identify that a significant portion of activations and weights can be reused among different tasks to reduce cost and latency for efficient multi-task NLP. Specifically, we propose TaskFusion, an efficient transfer learning software-hardware co-design that exploits delta sparsity in both weights and activations to boost data sharing among tasks. For training, TaskFusion uses ℓ1 regularization on delta activations to learn inter-task data redundancies. A novel hardware-aware sub-task inference algorithm is proposed to exploit the dual delta sparsity. We then design a dedicated heterogeneous architecture to accelerate multi-task inference with optimized scheduling to increase hardware utilization and reduce off-chip memory access. Extensive experiments demonstrate that TaskFusion can reduce the number of floating point operations (FLOPs) by over 73% in multi-task NLP with negligible accuracy loss, while adding a new task at the cost of only < 2% parameter size increase. With the proposed architecture and optimized scheduling, TaskFusion can achieve 1.48–2.43×…

DOI: 10.1145/3579371.3589040
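A toy NumPy sketch of the delta-sparsity idea from the abstract above: run the shared backbone once, then apply only a sparse per-task weight delta (the kind of delta that ℓ1 regularization drives toward zero during fine-tuning). Shapes, the threshold, and the sparsity level are all made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 128))            # input activation shared by all tasks
w_shared = rng.standard_normal((128, 128))   # backbone weights reused by every task

# Task-specific weights differ from the backbone by a delta that regularized
# fine-tuning has driven mostly to (near) zero; here we zero it out explicitly.
delta_w = rng.standard_normal((128, 128))
delta_w[np.abs(delta_w) < 2.0] = 0.0         # roughly 95% of entries become zero

a_shared = x @ w_shared                      # computed once, shared across tasks
a_task = a_shared + x @ delta_w              # cheap sparse correction per task

dense_flops = 2 * x.shape[0] * w_shared.shape[0] * w_shared.shape[1]
delta_flops = 2 * x.shape[0] * np.count_nonzero(delta_w)
print(f"delta-weight sparsity: {1 - np.count_nonzero(delta_w) / delta_w.size:.1%}")
print(f"per-task FLOPs vs. dense: {delta_flops} / {dense_flops}")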


Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction

Authors: Li, Sixu and Li, Chaojian and Zhu, Wenbo and Yu, Boyang (Tony) and Zhao, Yang (Katie) and Wan, Cheng and You, Haoran and Shi, Huihong and Lin, Yingyan (Celine)
Keywords: hardware accelerator, neural radiance field (NeRF)

Abstract

Neural Radiance Field (NeRF) based 3D reconstruction is highly desirable for immersive Augmented and Virtual Reality (AR/VR) applications, but achieving instant (i.e., < 5 seconds) on-device NeRF training remains a challenge. In this work, we first identify the inefficiency bottleneck: the need to interpolate NeRF embeddings up to 200,000 times from a 3D embedding grid during each training iteration. To alleviate this, we propose Instant-3D, an algorithm-hardware co-design acceleration framework that achieves instant on-device NeRF training. Our algorithm decomposes the embedding grid representation in terms of color and density, enabling computational redundancy to be squeezed out by adopting different (1) grid sizes and (2) update frequencies for the color and density branches. Our hardware accelerator further reduces the dominant memory accesses for embedding grid interpolation by (1) mapping multiple nearby points’ memory read requests into one during the feed-forward process, (2) merging embedding grid updates from the same sliding time window during back-propagation, and (3) fusing different computation cores to support the different grid sizes needed by the color and density branches of the Instant-3D algorithm. Extensive experiments validate the effectiveness of Instant-3D, achieving a large training time reduction of 41×…

DOI: 10.1145/3579371.3589115
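The bottleneck operation identified in the abstract above, interpolating a feature vector for a 3D sample point from an embedding grid, boils down to a trilinear blend of eight grid reads. The sketch below uses a made-up grid size and feature width; Instant-3D's contribution lies in how these reads are decomposed, merged, and scheduled, which is not shown here.

import numpy as np

def interpolate_embedding(grid, p):
    # grid: (R, R, R, F) per-vertex features; p: a point inside the grid.
    # Returns the trilinearly interpolated F-dimensional feature (8 grid reads).
    i = np.floor(p).astype(int)               # lower corner of the enclosing cell
    t = p - i                                 # fractional offsets in [0, 1)
    feat = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0]) *
                     (t[1] if dy else 1 - t[1]) *
                     (t[2] if dz else 1 - t[2]))
                feat += w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return feat

grid = np.random.default_rng(0).standard_normal((16, 16, 16, 2))
print(interpolate_embedding(grid, np.array([3.25, 7.5, 1.75])))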


Scaling Qubit Readout with Hardware Efficient Machine Learning Architectures

Authors: Maurya, Satvik and Mude, Chaithanya Naik and Oliver, William D. and Lienhard, Benjamin and Tannu, Swamit
Keywords: quantum control hardware, quantum computer architecture, qubit readout

Abstract

Reading a qubit is a fundamental operation in quantum computing. It translates quantum information into classical information, enabling subsequent classification to assign the qubit states ‘0’ or ‘1’. Unfortunately, qubit readout is one of the most error-prone and slowest operations on a superconducting quantum processor. On state-of-the-art superconducting quantum processors, readout errors can range from 1–10%. These errors occur for various reasons: crosstalk, spontaneous state transitions, and excitation caused by the readout pulse. The error-prone nature of readout has resulted in significant research to design better discriminators to achieve higher qubit-readout accuracies. High readout accuracy is essential for enabling high fidelity for near-term noisy quantum computers and error-corrected quantum computers of the future. Prior works have used machine-learning-assisted single-shot qubit-state classification, where a deep neural network was used for more robust discrimination by compensating for crosstalk errors. However, the neural network size can limit the scalability of systems, especially if fast hardware discrimination is required. This state-of-the-art baseline design cannot be implemented on off-the-shelf FPGAs used for the control and readout of superconducting qubits in most systems, which increases the overall readout latency as discrimination has to be performed in software. In this work, we propose herqles, a scalable approach to improve qubit-state discrimination by using a hierarchy of matched filters in conjunction with a significantly smaller and scalable neural network for qubit-state discrimination. We achieve substantially higher readout accuracies (16.4% relative improvement) than the baseline with a scalable design that can be readily implemented on off-the-shelf FPGAs. We also show that herqles is more versatile and can support shorter readout durations than the baseline design without additional training overheads.
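For readers unfamiliar with matched-filter readout, the sketch below shows the basic single-shot discrimination step that the hierarchical design above builds on: project each readout trace onto the difference of the class-mean traces and threshold the score. The synthetic traces, noise level, and threshold rule are illustrative only.

import numpy as np

rng = np.random.default_rng(1)
T = 200                                                  # samples per readout trace
mean0 = np.zeros(T)                                      # mean response for |0>
mean1 = np.concatenate([np.zeros(50), np.ones(T - 50)])  # mean response for |1>

def shots(mean_trace, n):                                # noisy single-shot traces
    return mean_trace + 0.8 * rng.standard_normal((n, T))

train0, train1 = shots(mean0, 500), shots(mean1, 500)

# Matched filter: the difference of class means; score = inner product with it.
kernel = train1.mean(axis=0) - train0.mean(axis=0)
threshold = 0.5 * ((train0 @ kernel).mean() + (train1 @ kernel).mean())

test0, test1 = shots(mean0, 500), shots(mean1, 500)
acc = 0.5 * (((test0 @ kernel) <= threshold).mean() + ((test1 @ kernel) > threshold).mean())
print(f"assignment fidelity on synthetic traces: {acc:.3f}")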

DOI: 10.1145/3579371.3589042


Q-BEEP: Quantum Bayesian Error Mitigation Employing Poisson Modeling over the Hamming Spectrum

Authors: Stein, Samuel and Wiebe, Nathan and Ding, Yufei and Ang, James and Li, Ang
Keywords: quantum algorithms, noisy intermediate scale quantum computing, state graphs, bayesian error mitigation, quantum error mitigation, quantum computing

Abstract

Quantum computing technology has grown rapidly in recent years, with new technologies being explored, error rates being reduced, and quantum processors’ qubit capacity growing. However, near-term quantum algorithms still cannot be induced without compounding consequential levels of noise, leading to non-trivial erroneous results. Quantum Error Correction (in-situ error mitigation) and Quantum Error Mitigation (post-induction error mitigation) are promising fields of research within the quantum algorithm scene, aiming to alleviate quantum errors. IBM recently published an article stating that Quantum Error Mitigation is the path to quantum computing usefulness. A recent work, namely HAMMER, demonstrated the existence of a latent structure regarding post-circuit-induction errors when mapping to the Hamming spectrum. However, it assumed that errors occur solely in local clusters, whereas we observe that at higher average Hamming distances this structure falls away. In this work, we show that such a correlated structure is not only local but extends to certain non-local clustering patterns which can be precisely described by a Poisson distribution model taking the input circuit, the device runtime status (i.e., calibration statistics), and qubit topology into consideration. Using this quantum error characterizing model, we developed an iterative algorithm over the generated Bayesian network state-graph for post-induction error mitigation. Thanks to more precise modeling of the error distribution's latent structure and the proposed iterative method, our Q-Beep approach provides state-of-the-art performance and can boost circuit execution fidelity by up to 234.6% on Bernstein-Vazirani circuits and on average 71.0% on QAOA solution quality, using 16 practical IBMQ quantum processors. For other benchmarks such as those in QASMBench, a fidelity improvement of up to 17.8% is attained. Q-Beep is a lightweight post-processing technique that can be performed offline and remotely, making it a useful tool for quantum vendors to adopt and provide more reliable circuit induction results. Q-Beep is maintained at github.com/pnnl/qbeep
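A toy sketch of the "Hamming spectrum" view referenced above: bucket noisy measurement outcomes by their Hamming distance from the ideal bitstring and compare the bucket mass against a Poisson model fitted to the mean distance. The counts are fabricated for illustration, and this is not the Q-Beep Bayesian iteration itself.

from collections import Counter
from math import exp, factorial

ideal = "10110"
# Hypothetical measured counts from a noisy device (bitstring -> shots).
counts = {"10110": 620, "10111": 90, "00110": 85, "10100": 70,
          "11110": 60, "00111": 30, "10001": 25, "01001": 20}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

shots = sum(counts.values())
spectrum = Counter()
for bitstring, n in counts.items():
    spectrum[hamming(bitstring, ideal)] += n

lam = sum(d * n for d, n in spectrum.items()) / shots    # mean Hamming distance
for d in sorted(spectrum):
    empirical = spectrum[d] / shots
    poisson = exp(-lam) * lam ** d / factorial(d)
    print(f"distance {d}: empirical {empirical:.3f}  poisson {poisson:.3f}")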

DOI: 10.1145/3579371.3589043


Enabling High Performance Debugging for Variational Quantum Algorithms using Compressed Sensing

Authors: Hao, Tianyi and Liu, Kun and Tannu, Swamit
Keywords: debugging, variational quantum algorithms, quantum computing

Abstract

Variational quantum algorithms (VQAs) can potentially solve practical problems using contemporary Noisy Intermediate Scale Quantum (NISQ) computers. VQAs find near-optimal solutions in the presence of qubit errors by classically optimizing a loss function computed by parameterized quantum circuits. However, developing and testing VQAs is challenging due to the limited availability of quantum hardware, their high error rates, and the significant overhead of classical simulations. Furthermore, VQA researchers must pick the right initialization for circuit parameters, utilize suitable classical optimizer configurations, and deploy appropriate error mitigation methods. Unfortunately, these tasks are done in an ad-hoc manner today, as there are no software tools to configure and tune the VQA hyperparameters. In this paper, we present OSCAR (cOmpressed Sensing based Cost lAndscape Reconstruction) to help configure: 1) correct initialization, 2) noise mitigation techniques, and 3) classical optimizers to maximize the quality of the solution on NISQ hardware. OSCAR enables efficient debugging and performance tuning by providing users with the loss function landscape without running thousands of quantum circuits as required by the grid search. Using OSCAR, we can accurately reconstruct the complete cost landscape with up to 100X speedup. Furthermore, OSCAR can compute an optimizer function query in an instant by interpolating a computed landscape, thus enabling the trial run of a VQA configuration with considerably reduced overhead.
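A hedged sketch of the core reconstruction idea: sample a small fraction of a 2D cost-landscape grid, assume the landscape is sparse in a DCT basis, and recover it by ℓ1-regularized least squares (here a plain ISTA loop). The landscape, the 5% sampling rate, and the hyperparameters are invented for illustration and may differ from OSCAR's actual formulation.

import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
n = 64
g1, g2 = np.meshgrid(np.linspace(0, 2 * np.pi, n), np.linspace(0, 2 * np.pi, n))
landscape = np.cos(g1) + 0.5 * np.cos(2 * g2) + 0.3 * np.cos(g1 + g2)  # smooth, DCT-compressible

mask = rng.random((n, n)) < 0.05          # evaluate only ~5% of the grid points
y = np.where(mask, landscape, 0.0)        # observed (masked) cost values

def soft(x, t):                           # soft-thresholding (proximal step for L1)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# ISTA on 0.5 * ||mask * idctn(C) - y||^2 + lam * ||C||_1 with an orthonormal DCT.
C = np.zeros((n, n))
lam, step = 1e-3, 1.0
for _ in range(500):
    residual = mask * idctn(C, norm="ortho") - y      # already zero off the mask
    C = soft(C - step * dctn(residual, norm="ortho"), lam * step)

recon = idctn(C, norm="ortho")
err = np.linalg.norm(recon - landscape) / np.linalg.norm(landscape)
print(f"relative reconstruction error from ~5% samples: {err:.3f}")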

DOI: 10.1145/3579371.3589044


HAAC: A Hardware-Software Co-Design to Accelerate Garbled Circuits

Authors: Mo, Jianqiao and Gopinath, Jayanth and Reagen, Brandon
Keywords: hardware acceleration, cryptography

Abstract

Privacy and security have rapidly emerged as priorities in system design. One powerful solution for providing both is privacy-preserving computation (PPC), where functions are computed directly on encrypted data and control can be provided over how data is used. Garbled circuits (GCs) are a PPC technology that provides both confidential computing and control over how data is used. The challenge is that they incur significant performance overheads compared to plaintext. This paper proposes a novel garbled circuits accelerator and compiler, named HAAC, to mitigate performance overheads and make privacy-preserving computation more practical. HAAC is a hardware-software co-design. GCs are exemplars of co-design as programs are completely known at compile time, i.e., all dependences, memory accesses, and control flow are fixed. The design philosophy of HAAC is to keep hardware simple and efficient, maximizing area devoted to our proposed custom execution units and other circuits essential for high performance (e.g., on-chip storage). The compiler can leverage its program understanding to realize hardware’s performance potential by generating effective instruction schedules, data layouts, and orchestrating off-chip events. In taking this approach we can achieve ASIC performance/efficiency without sacrificing generality. Insights of our approach include how co-design enables expressing arbitrary GC programs as streams, which simplifies hardware and enables complete memory-compute decoupling, and the development of a scratchpad that captures data reuse by tracking program execution, eliminating the need for costly hardware-managed caches and tagging logic. We evaluate HAAC with VIP-Bench and achieve an average speedup of 589×…

DOI: 10.1145/3579371.3589045


Orinoco: Ordered Issue and Unordered Commit with Non-Collapsible Queues

Authors: Chen, Dibei and Zhang, Tairan and Huang, Yi and Zhu, Jianfeng and Liu, Yang and Gou, Pengfei and Feng, Chunyang and Li, Binghua and Wei, Shaojun and Liu, Leibo
Keywords: microarchitecture, out-of-order execution, instruction scheduling, out-of-order commit, processing-in-memory

Abstract

Modern out-of-order processors call for more aggressive scheduling techniques such as priority scheduling and out-of-order commit to make use of increasing core resources. Since these approaches prioritize the issue or commit of certain instructions, they face the conundrum of providing the capacity efficiency of scheduling structures while preserving the ideal ordering of instructions. Traditional collapsible queues are too expensive for today’s processors, while state-of-the-art queue designs compromise with the pseudo-ordering of instructions, leading to performance degradation as well as other limitations. In this paper, we present Orinoco, a microarchitecture/circuit co-design that supports ordered issue and unordered commit with non-collapsible queues. We decouple the temporal ordering of instructions from their queue positions by introducing an age matrix with the bit count encoding, along with a commit dependency matrix and a memory disambiguation matrix to determine instructions to prioritize issue or commit. We leverage the Processing-in-Memory (PIM) approach and efficiently implement the matrix schedulers as 8T SRAM arrays. Orinoco achieves an average IPC improvement of 14.8% over the baseline in-order commit core with the state-of-the-art scheduler while incurring overhead equivalent to a few kilobytes of SRAM.

DOI: 10.1145/3579371.3589046


OneQ: A Compilation Framework for Photonic One-Way Quantum Computation

Authors: Zhang, Hezi and Wu, Anbang and Wang, Yuke and Li, Gushu and Shapourian, Hassan and Shabani, Alireza and Ding, Yufei
Keywords: compiler, photonics, measurement-based quantum computing (MBQC), one-way quantum computing

Abstract

In this paper, we propose OneQ, the first optimizing compilation framework for one-way quantum computation towards realistic photonic quantum architectures. Unlike previous compilation efforts for solid-state qubit technologies, our innovative framework addresses a unique set of challenges in photonic quantum computing. Specifically, this includes the dynamic generation of qubits over time, the need to perform all computation through measurements instead of relying on 1-qubit and 2-qubit gates, and the fact that photons are instantaneously destroyed after measurements. As pioneers in this field, we demonstrate the vast optimization potential of photonic one-way quantum computing, showcasing the remarkable ability of OneQ to reduce computing resource requirements by orders of magnitude.

DOI: 10.1145/3579371.3589047


Inter-layer Scheduling Space Definition and Exploration for Tiled Accelerators

Authors: Cai, Jingwei and Wei, Yuchen and Wu, Zuotong and Peng, Sen and Ma, Kaisheng
Keywords: tiled accelerators, neural networks, inter-layer scheduling, scheduling

Abstract

With the continuous expansion of the DNN accelerator scale, inter-layer scheduling, which studies the allocation of computing resources to each layer and the computing order of all layers in a DNN, plays an increasingly important role in maintaining a high utilization rate and energy efficiency of DNN inference accelerators. However, current inter-layer scheduling is mainly conducted based on some heuristic patterns. The space of inter-layer scheduling has not been clearly defined, resulting in significantly limited optimization opportunities and a lack of understanding of different inter-layer scheduling choices and their consequences. To bridge these gaps, we first propose a uniform and systematic notation, the Resource Allocation Tree (RA Tree), to represent different inter-layer scheduling schemes and depict the overall space of inter-layer scheduling. Based on the notation, we then thoroughly analyze how different inter-layer scheduling choices influence the performance and energy efficiency of an accelerator step by step. Moreover, we show how to represent existing patterns in our notation and analyze their features. To thoroughly explore the space of inter-layer scheduling for diverse tiled accelerators and workloads, we develop an end-to-end and highly-portable scheduling framework, SET. Compared with the state-of-the-art (SOTA) open-source Tangram framework, SET can, on average, achieve 1.78×…

DOI: 10.1145/3579371.3589048


ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design

Authors: Krishnan, Srivatsan and Yazdanbakhsh, Amir and Prakash, Shvetank and Jabbour, Jason and Uchendu, Ikechukwu and Ghosh, Susobhan and Boroujerdian, Behzad and Richins, Daniel and Tripathy, Devashree and Faust, Aleksandra and Janapa Reddi, Vijay
Keywords: reproducibility, baselines, open source, bayesian optimization, reinforcement learning, machine learning for system, machine learning for computer architecture, machine learning

Abstract

Machine learning (ML) has become a prevalent approach to tame the complexity of design space exploration for domain-specific architectures. While appealing, using ML for design space exploration poses several challenges. First, it is not straightforward to identify the most suitable algorithm from an ever-increasing pool of ML methods. Second, assessing the trade-offs between performance and sample efficiency across these methods is inconclusive. Finally, the lack of a holistic framework for fair, reproducible, and objective comparison across these methods hinders the progress of adopting ML-aided architecture design space exploration and impedes creating repeatable artifacts. To mitigate these challenges, we introduce ArchGym, an open-source gymnasium and easy-to-extend framework that connects a diverse range of search algorithms to architecture simulators. To demonstrate its utility, we evaluate ArchGym across multiple vanilla and domain-specific search algorithms in the design of a custom memory controller, deep neural network accelerators, and a custom SoC for AR/VR workloads, collectively encompassing over 21K experiments. The results suggest that with an unlimited number of samples, ML algorithms are equally favorable to meet the user-defined target specification if their hyperparameters are tuned thoroughly; no one solution is necessarily better than another (e.g., reinforcement learning vs. Bayesian methods). We coin the term “hyperparameter lottery” to describe the relatively probable chance for a search algorithm to find an optimal design provided its hyperparameters are meticulously selected. Additionally, the ease of data collection and aggregation in ArchGym facilitates research in ML-aided architecture design space exploration. As a case study, we show this advantage by developing a proxy cost model with an RMSE of 0.61% that offers a 2,000-fold reduction in simulation time. Code and data for ArchGym are available at https://bit.ly/ArchGym.
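The gymnasium pattern described above can be sketched in a few lines: a common environment interface (reset/step) lets any search algorithm drive any architecture simulator. The class, parameters, and cost model below are hypothetical stand-ins, not the actual ArchGym API.

import random

class ToyMemoryControllerEnv:
    # Stand-in for a simulator-backed environment: a design point goes in,
    # an observation and a reward come out.
    PARAMS = {"queue_depth": [8, 16, 32, 64], "page_policy": ["open", "closed"]}

    def reset(self):
        return {}

    def step(self, design):
        # A made-up analytical cost model standing in for a real simulator run.
        latency = 100 / design["queue_depth"] + (5 if design["page_policy"] == "closed" else 0)
        energy = 0.2 * design["queue_depth"]
        reward = -(latency + energy)                 # the agent maximizes reward
        return {"latency": latency, "energy": energy}, reward

def random_search(env, budget=50):
    # Any agent that speaks the reset/step interface can be swapped in here
    # (random search, Bayesian optimization, RL, ...).
    env.reset()
    best_design, best_reward = None, float("-inf")
    for _ in range(budget):
        design = {k: random.choice(v) for k, v in env.PARAMS.items()}
        _, reward = env.step(design)
        if reward > best_reward:
            best_design, best_reward = design, reward
    return best_design, best_reward

print(random_search(ToyMemoryControllerEnv()))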

DOI: 10.1145/3579371.3589049


ISA-Grid: Architecture of Fine-grained Privilege Control for Instructions and Registers

Authors: Fan, Shulin and Hua, Zhichao and Xia, Yubin and Chen, Haibo and Zang, Binyu
Keywords: software isolation, instruction set architecture, privilege control

Abstract

Isolation is a critical mechanism for enhancing the security of computer systems. By controlling the access privileges of software and hardware resources, isolation mechanisms can decouple software into multiple isolated components and enforce the principle of least privilege. While existing isolation systems primarily focus on memory isolation, they overlook the isolation of instruction and register resources, which we refer to as ISA (Instruction Set Architecture) resources. However, previous works have shown that exploiting ISA resources can lead to serious security problems, such as breaking the system’s memory isolation property by abusing x86’s CR3 register. Furthermore, existing hardware only provides privilege-level-based access control for ISA resources, which is too coarse-grained for software decoupling. For example, ARM Cortex A53 has several hundred system instructions/registers, but only four exception levels (EL0 to EL3) are provided. Additionally, more than 100 instructions/registers for system control are available in only EL1 (the kernel mode). To address this problem, this paper proposes ISA-Grid, an architecture of fine-grained privilege control for instructions and registers. ISA-Grid is a hardware extension that enables the creation of multiple ISA domains, with each domain having different privileges to access instructions and registers. The ISA domain can provide bit-level fine-grained privilege control for registers. We implemented prototypes of ISA-Grid based on two different CPU cores: 1) a RISC-V CPU core on an FPGA board and 2) an x86 CPU core on a simulator. We applied ISA-Grid to different cases, including Linux kernel decomposition and enhancing existing security systems, to demonstrate how ISA-Grid can isolate ISA resources and mitigate attacks based on abusing them. The performance evaluation results on both x86 and RISC-V platforms with real-world applications showed that ISA-Grid has negligible runtime overhead (less than 1%).

DOI: 10.1145/3579371.3589050


DRAM Translation Layer: Software-Transparent DRAM Power Savings for Disaggregated Memory

Authors: Jin, Wenjing and Jang, Wonsuk and Park, Haneul and Lee, Jongsung and Kim, Soosung and Lee, Jae W.
Keywords: address translation, pooled memory, CXL, disaggregated memory, datacenters, power management, DRAM

Abstract

Memory disaggregation is a promising solution to scale memory capacity and bandwidth shared by multiple server nodes in a flexible and cost-effective manner. DRAM power consumption, which is reported to be around 40% of the total system power in datacenter servers, will become an even more serious concern in this high-capacity environment. Exploiting the low average utilization of DRAM capacity in today’s datacenters, it is appealing to put unallocated/cold DRAM ranks into a power-saving mode. However, the conventional DRAM address mapping with fine-grained interleaving to maximize rank-level parallelism is incompatible with such rank-level DRAM power management techniques. Furthermore, existing DRAM power-saving techniques often require intrusive changes to the system stack, including the OS, memory controller (MC), or even DRAM devices, posing additional challenges for deployment. Thus, we propose the DRAM Translation Layer (DTL) for host software/MC-transparent DRAM power management with commodity DRAM devices. Inspired by the Flash Translation Layer (FTL) in modern SSDs, DTL is placed in the CXL memory controller to provide (i) flexible address mappings between host physical addresses and DRAM device physical addresses and (ii) host-transparent memory page migration. Leveraging DTL, we propose two DRAM power-saving techniques with different temporal granularities to maximize the number of DRAM ranks that can enter low-power states while provisioning sufficient DRAM bandwidth: rank-level power-down and hotness-aware self-refresh. The first technique consolidates unallocated memory pages into a subset of ranks at deallocation of a virtual machine (VM) and turns them off transparently to both the OS and the host MC. Our evaluation with CloudSuite benchmarks demonstrates that this technique saves DRAM power by 31.6% on average at a 1.6% performance cost. The hotness-aware self-refresh scheme further reduces DRAM energy consumption by up to 14.9% with negligible performance loss by opportunistically migrating cold pages into a rank and making it enter self-refresh mode.
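The FTL-style indirection described above can be sketched as a small mapping layer: the CXL-side controller remaps host physical pages to device pages, so cold or unallocated pages can be consolidated onto a few ranks and the emptied ranks powered down, all without changing the addresses the host sees. Page size, rank count, and method names below are illustrative.

PAGE = 4096

class DramTranslationLayer:
    def __init__(self, num_ranks=4, pages_per_rank=1024):
        self.num_ranks = num_ranks
        self.map = {}                          # host page number -> (rank, device page)
        # Conventional fine-grained interleaving: consecutive pages spread across ranks.
        self.free = [(p % num_ranks, p // num_ranks)
                     for p in range(num_ranks * pages_per_rank)]

    def allocate(self, host_page):
        self.map[host_page] = self.free.pop(0)

    def translate(self, host_addr):
        rank, dev_page = self.map[host_addr // PAGE]
        return rank, dev_page * PAGE + host_addr % PAGE

    def migrate(self, host_page, target_rank):
        # Host-transparent move: copy the page's data, then update only the mapping.
        rank, dev_page = self.map[host_page]
        if rank == target_rank:
            return
        dst = next(f for f in self.free if f[0] == target_rank)
        self.free.remove(dst)
        self.free.append((rank, dev_page))
        self.map[host_page] = dst              # the host physical address is unchanged

    def idle_ranks(self):
        used = {rank for rank, _ in self.map.values()}
        return [r for r in range(self.num_ranks) if r not in used]

dtl = DramTranslationLayer()
for hp in range(8):
    dtl.allocate(hp)                           # pages land on ranks 0,1,2,3,0,1,2,3
for hp in range(8):
    dtl.migrate(hp, target_rank=0)             # consolidate everything onto rank 0
print("ranks eligible for power-down:", dtl.idle_ranks())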

DOI: 10.1145/3579371.3589051


Supply Chain Aware Computer Architecture

Authors: Ning, August and Tziantzioulis, Georgios and Wentzlaff, David
Keywords: economics, modeling, chip shortage, semiconductor supply chain

Abstract

Our society has become progressively more dependent on semiconductors and semiconductor-enabled products and services. The importance of chips and their supply chains has been highlighted during the 2020-present chip shortage caused by manufacturing disruptions and increased demand due to the COVID-19 pandemic. However, semiconductor supply chains are inherently vulnerable to disruptions, and chip crises can easily recur in the future. We present the first work that elevates supply chain conditions to be a first-class design constraint for future computer architectures. We characterize and model the chip creation process from standard tapeout to packaging to provide a framework for architects to quickly assess the time-to-market of their chips depending on their architecture and the current market conditions. In addition, we propose a novel metric, the Chip Agility Score (CAS), a way to quantify a chip architecture’s resilience against production-side supply changes. We utilize our proposed time-to-market model, CAS, and chip design/manufacturing economic models to evaluate prominent architectures in the context of current and speculative supply chain changes. We find that using an older process node to re-release chips can decrease time-to-market by 73%-116% compared to using the most advanced processes. Also, mixed-process chiplet architectures can be 24%-51% more agile compared to equivalent single-process chiplet and monolithic designs respectively. Guided by our framework, we present an architectural design methodology that minimizes time-to-market and chip creation costs while maximizing agility for mass-produced legacy node chips. Our modeling framework and data sets are open-sourced to advance supply chain aware computer architecture research. https://github.com/PrincetonUniversity/ttm-cas

DOI: 10.1145/3579371.3589052


SHARP: A Short-Word Hierarchical Accelerator for Robust and Practical Fully Homomorphic Encryption

Authors: Kim, Jongmin and Kim, Sangpyo and Choi, Jaewan and Park, Jaiyoung and Kim, Donghwan and Ahn, Jung Ho
Keywords: hierarchical architecture, word length, accelerator, fully homomorphic encryption

Abstract

Fully homomorphic encryption (FHE) is an emerging cryptographic technology that guarantees the privacy of sensitive user data by enabling direct computations on encrypted data. Despite the security benefits of this approach, FHE is associated with prohibitively high levels of computational and memory overhead, preventing its widespread use in real-world services. Numerous domain-specific hardware designs have been proposed to address this issue, but most of them use excessive amounts of chip area and power, leaving room for further improvements in terms of practicality. We propose SHARP, a robust and practical accelerator for FHE. We analyze the implications of various hardware design choices on the functionality, performance, and efficiency of FHE. We conduct a multifaceted analysis of the impacts of the machine word length choice on the FHE acceleration, which, despite its importance with regard to hardware efficiency, has yet to be explored due to its complex correlation with various FHE parameters. A relatively short word length of 36 bits is discovered to be a robust and efficient solution for FHE accelerators. We devise an efficient hierarchical SHARP microarchitecture with a novel data organization and specialized functional units and substantially reduce the on-chip memory capacity requirement through architectural and software enhancements. This study demonstrates that SHARP delivers superior performance over prior FHE accelerators with a distinctly smaller chip area and lower power budget.

DOI: 10.1145/3579371.3589053


SPADE: A Flexible and Scalable Accelerator for SpMM and SDDMM

Authors: Gerogiannis, Gerasimos and Yesil, Serif and Lenadora, Damitha and Cao, Dingyuan and Mendis, Charith and Torrellas, Josep
Keywords: SDDMM, SpMM, sparse computations, hardware accelerator

Abstract

The widespread use of Sparse Matrix Dense Matrix Multiplication (SpMM) and Sampled Dense Matrix Dense Matrix Multiplication (SDDMM) kernels makes them candidates for hardware acceleration. However, accelerator design for these kernels faces two main challenges: (1) the overhead of moving data between the CPU and the accelerator (often including an address space conversion from the CPU’s virtual addresses) and (2) marginal flexibility to leverage the fact that different sparse input matrices benefit from different variations of the SpMM and SDDMM algorithms. To address these challenges, this paper proposes SPADE, a new SpMM and SDDMM hardware accelerator. SPADE avoids data transfers by tightly coupling accelerator processing elements (PEs) with the cores of a multicore, as if the accelerator PEs were advanced functional units, allowing the accelerator to reuse the CPU memory system and its virtual addresses. SPADE attains flexibility and programmability by supporting a tile-based ISA that is high level enough to eliminate the overhead of fetching and decoding fine-grained instructions. To prove the SPADE concept, we have taped out a simplified SPADE chip. Further, simulations of a SPADE system with 224–1792 PEs show its high performance and scalability. A 224-PE SPADE system is on average 2.3×, 1.3×, and 2.5× faster than a 56-core CPU, a server-class GPU, and an SpMM accelerator, respectively, without accounting for the host-accelerator data transfer overhead. If such overhead is taken into account, the 224-PE SPADE system is on average 43.4× and 52.4× faster than the GPU and the accelerator, respectively. Further, SPADE has a small area and power footprint.
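For reference, the two kernels SPADE targets have simple functional definitions, shown below with scipy.sparse so the role of the sparsity pattern is explicit (SpMM: sparse times dense; SDDMM: a dense-dense product sampled only at the sparse matrix's nonzeros). Shapes and density are arbitrary.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(256, 256, density=0.02, format="csr", random_state=0)   # sparse operand
B = rng.standard_normal((256, 64))                                    # dense operand
C = rng.standard_normal((256, 64))                                    # dense operand

# SpMM: sparse matrix x dense matrix -> dense matrix.
spmm_out = A @ B                                       # shape (256, 64)

# SDDMM: compute (B @ C.T) only at the nonzero positions of A, scaled by A.
rows, cols = A.nonzero()
sampled = np.einsum("ij,ij->i", B[rows], C[cols])      # one dot product per nonzero
sddmm_out = sp.csr_matrix((A.data * sampled, (rows, cols)), shape=A.shape)

print(spmm_out.shape, sddmm_out.nnz)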

DOI: 10.1145/3579371.3589054


K-D Bonsai: ISA-Extensions to Compress K-D Trees for Autonomous Driving Tasks

Authors: E. Becker, Pedro H. and Arnau, José …
Keywords: ISA-extension, compression, radius search, K-D tree, point cloud, autonomous driving hardware

Abstract

Autonomous Driving (AD) systems extensively manipulate 3D point clouds for object detection and vehicle localization. Thereby, efficient processing of 3D point clouds is crucial in these systems. In this work we propose K-D Bonsai, a technique to cut down memory usage during radius search, a critical building block of point cloud processing. K-D Bonsai exploits value similarity in the data structure that holds the point cloud (a k-d tree) to compress the data in memory. K-D Bonsai further compresses the data using a reduced floating-point representation, exploiting the physically limited range of point cloud values. For easy integration into today's systems, we implement K-D Bonsai through Bonsai-extensions, a small set of new CPU instructions to compress, decompress, and operate on points. To maintain baseline safety levels, we carefully craft the Bonsai-extensions to detect precision loss due to compression, allowing re-computation in full precision to take place if necessary. Therefore, K-D Bonsai reduces data movement, improving performance and energy efficiency, while guaranteeing baseline accuracy and programmability. We evaluate K-D Bonsai over the Euclidean cluster task of Autoware.ai, a state-of-the-art software stack for AD. We achieve an average of 9.26% improvement in end-to-end latency, 12.19% in tail latency, and a reduction of 10.84% in energy consumption. Unlike the expensive accelerators proposed in related work, K-D Bonsai improves radius search with minimal area increase (0.36%).
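A small sketch of the two ideas above: radius search over a k-d tree, plus a reduced-precision pass with a guard band that falls back to full precision for points near the search boundary. The float16 encoding and the margin are illustrative stand-ins for K-D Bonsai's own fixed-point format and ISA extensions.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
cloud = rng.uniform(-100, 100, size=(50_000, 3)).astype(np.float32)   # LiDAR-like points
query = np.zeros(3, dtype=np.float32)
radius = 5.0

exact = set(cKDTree(cloud).query_ball_point(query, radius))           # baseline radius search

# Compressed pass: distances computed on a reduced-precision copy of the cloud.
d16 = np.linalg.norm(cloud.astype(np.float16).astype(np.float32) - query, axis=1)

margin = 0.05 * radius                                      # guard band for precision loss
inside = set(np.flatnonzero(d16 <= radius - margin))        # safely inside, accept as-is
boundary = np.flatnonzero(np.abs(d16 - radius) <= margin)   # ambiguous: re-check
d32 = np.linalg.norm(cloud[boundary] - query, axis=1)
inside |= set(boundary[d32 <= radius])                      # full-precision re-computation

print(len(exact), len(inside), inside == exact)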

DOI: 10.1145/3579371.3589055


NeuRex: A Case for Neural Rendering Acceleration

Authors: Lee, Junseo and Choi, Kwanseok and Lee, Jungi and Lee, Seokwon and Whangbo, Joonho and Sim, Jaewoong
Keywords: accelerators, machine learning, neural networks, NeRF, neural rendering

Abstract

This paper presents NeuRex, an accelerator architecture that efficiently performs the modern neural rendering pipeline with an algorithmic enhancement and supporting hardware. NeuRex leverages the insights from an in-depth analysis of the state-of-the-art neural scene representation to make the multi-resolution hash encoding, which is the key operational primitive in modern neural renderings, more hardware-friendly and features a specialized hash encoding engine that enables us to effectively perform the primitive and the overall rendering pipeline. We implement and synthesize NeuRex using a commercial 28nm process technology and evaluate two versions of NeuRex (NeuRex-Edge, NeuRex-Server) on a range of scenes with different image resolutions for mobile and high-end computing platforms. Our evaluation shows that NeuRex achieves up to 9.88×…

DOI: 10.1145/3579371.3589056


FACT: FFN-Attention Co-optimized Transformer Architecture with Eager Correlation Prediction

Authors: Qin, Yubin and Wang, Yang and Deng, Dazheng and Zhao, Zhiren and Yang, Xiaolong and Liu, Leibo and Wei, Shaojun and Hu, Yang and Yin, Shouyi
Keywords: neural network, algorithm-hardware co-design, efficient computing, hardware accelerator, transformer

Abstract

The Transformer model is becoming prevalent in various AI applications thanks to its outstanding performance. However, the high cost of computation and memory footprint make its inference inefficient. We discover that among the three main computation modules in a Transformer model (QKV generation, attention computation, FFN), it is the QKV generation and the FFN that contribute the most to power cost. The attention computation, which most previous works focus on, accounts for a significant share of power only when dealing with extremely long inputs. Therefore, in this paper, we propose FACT, an efficient algorithm-hardware co-design optimizing all three modules of the Transformer. We first propose an eager prediction algorithm which predicts the attention matrix before QKV generation. It further detects unnecessary computation in QKV generation and assigns mixed-precision FFN computation based on the predicted attention, which helps improve throughput. Further, we propose the FACT accelerator to efficiently support eager prediction with three designs. It avoids the large overhead of prediction by using log-based add-only operations for prediction. It eliminates the latency of prediction through an out-of-order scheduler that lets eager prediction and computation work in a full pipeline. It additionally avoids memory-access conflicts in the mixed-precision FFN with a novel diagonal storage pattern. Experiments on 22 benchmarks show that FACT improves the throughput of the whole Transformer by 3.59×…

DOI: 10.1145/3579371.3589057


TEA: Time-Proportional Event Analysis

Authors: Gottschall, Björn …
Keywords: performance events, time proportionality, performance analysis

Abstract

As computer architectures become increasingly complex and heterogeneous, it becomes progressively more difficult to write applications that make good use of hardware resources. Performance analysis tools are hence critically important, as they are the only way through which developers can gain insight into the reasons why their application performs as it does. State-of-the-art performance analysis tools capture a plethora of performance events and are practically non-intrusive, but performance optimization is still extremely challenging. We believe that the fundamental reason is that current state-of-the-art tools in general cannot explain why executing the application’s performance-critical instructions takes time. We hence propose Time-Proportional Event Analysis (TEA), which explains why the architecture spends time executing the application’s performance-critical instructions by creating time-proportional Per-Instruction Cycle Stacks (PICS). PICS unify performance profiling and performance event analysis, and thereby (i) report the contribution of each static instruction to overall execution time, and (ii) break down per-instruction execution time across the (combinations of) performance events that a static instruction was subjected to across its dynamic executions. Creating time-proportional PICS requires tracking performance events across all in-flight instructions, but TEA only increases per-core power consumption by ~3.2 mW (~0.1%) because we carefully select events to balance insight and overhead. TEA leverages statistical sampling to keep performance overhead at 1.1% on average while incurring an average error of 2.1% compared to a non-sampling golden reference; a significant improvement upon the 55.6%, 55.5%, and 56.0% average errors of AMD IBS, Arm SPE, and IBM RIS. We demonstrate that TEA’s accuracy matters by using TEA to identify performance issues in the SPEC CPU2017 benchmarks lbm and nab that, once addressed, yield speedups of 1.28×…

DOI: 10.1145/3579371.3589058


V10: Hardware-Assisted NPU Multi-tenancy for Improved Resource Utilization and Fairness

Authors: Xue, Yuqi and Liu, Yiqi and Nai, Lifeng and Huang, Jian
Keywords: ML accelerator, multi-tenancy, neural processing unit

Abstract

Modern cloud platforms have deployed neural processing units (NPUs) like Google Cloud TPUs to accelerate online machine learning (ML) inference services. To improve the resource utilization of NPUs, these platforms allow multiple ML applications to share the same NPU and have developed both time-multiplexed and preemption-based sharing mechanisms. However, our study with real-world NPUs discloses that these approaches suffer from surprisingly low utilization, due to the lack of support for fine-grained hardware resource sharing in the NPU. Specifically, its separate systolic array and vector unit cannot be fully utilized at the same time, which requires fundamental hardware assistance for supporting multi-tenancy. In this paper, we present V10, a hardware-assisted NPU multi-tenancy framework for improving resource utilization while ensuring fairness for different ML services. We rethink the NPU architecture for supporting multi-tenancy. V10 employs an operator scheduler for enabling concurrent operator executions on the systolic array and the vector unit and offers flexibility for enforcing different priority-based resource-sharing mechanisms. V10 also enables fine-grained operator preemption and lightweight context switching in the NPU. To further improve NPU utilization, V10 also develops a clustering-based workload collocation mechanism for identifying the best-matching ML services on a shared NPU. We implement V10 with an NPU simulator. Our experiments with various ML workloads from the MLPerf AI Benchmarks demonstrate that V10 can improve the overall NPU utilization by 1.64×…

DOI: 10.1145/3579371.3589059


GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis

Authors: Gu, Yufeng and Subramaniyan, Arun and Dunn, Tim and Khadem, Alireza and Chen, Kuan-Yu and Paul, Somnath and Vasimuddin, Md and Misra, Sanchit and Blaauw, David and Narayanasamy, Satish and Das, Reetuparna
Keywords: bioinformatics, genomics, reconfigurable architectures, hardware accelerators, computer architecture

Abstract

Genomics is playing an important role in transforming healthcare. Genetic data, however, is being produced at a rate that far outpaces Moore’s Law. Many efforts have been made to accelerate genomics kernels on modern commodity hardware such as CPUs and GPUs, as well as custom accelerators (ASICs) for specific genomics kernels. While ASICs provide higher performance and energy efficiency than general-purpose hardware, they incur a high hardware design cost. Moreover, in order to extract the best performance, ASICs tend to have significantly different architectures for different kernels. The divergence of ASIC designs makes it difficult to run commonly used modern sequencing analysis pipelines due to software integration and programming challenges. With the observation that many genomics kernels are dominated by dynamic programming (DP) algorithms, this paper presents GenDP, a framework of dynamic programming acceleration including DPAx, a DP accelerator, and DPMap, a graph partitioning algorithm that maps DP objective functions to the accelerator. DPAx supports DP kernels with various dependency patterns, such as 1D and 2D DP tables and long-range dependencies in the graph structure. DPAx also supports different DP objective functions and precisions required for genomics applications. GenDP is evaluated on genomics kernels in both short-read and long-read analysis pipelines, achieving 157.8×…

DOI: 10.1145/3579371.3589060
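The kind of 2D dynamic-programming kernel GenDP targets is exemplified by Smith-Waterman local alignment, where each cell depends on its upper, left, and upper-left neighbours; the objective function (the maximum over the three transitions and zero) is the sort of thing DPMap would map onto the accelerator. The scoring constants below are ordinary textbook values chosen for illustration.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    # H[i][j] = best local-alignment score ending at a[:i], b[:j]; first row/column stay 0.
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCT"))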


Programmable Olfactory Computing

Authors: Bleier, Nathaniel and Wezelis, Abigail and Varshney, Lav and Kumar, Rakesh
Keywords: No keywords

Abstract

While smell is arguably the most visceral of the senses, olfactory computing has barely been explored in the mainstream. We argue that this is a good time to explore olfactory computing since a) a large number of driver applications are emerging, b) odor sensors are now dramatically better, and c) non-traditional form factors such as sensor, wearable, and xR devices that would be required to support olfactory computing are already gaining widespread acceptance. Through a comprehensive review of the literature, we identify the key algorithms needed to support a wide variety of olfactory computing tasks. We profiled these algorithms on existing hardware and identified several characteristics, including the preponderance of fixed-point computation, linear operations, and real arithmetic; a variety of data memory requirements; and opportunities for data-level parallelism. We propose Ahromaa, a heterogeneous architecture for olfactory computing targeting extremely power- and energy-constrained olfactory computing workloads, and evaluate it against baseline architectures of an MCU, a state-of-the-art CGRA, and an MCU with packed SIMD. Across our algorithms, Ahromaa’s operating modes outperform the baseline architectures by 1.36×, 1.22×, and 1.1×…

DOI: 10.1145/3579371.3589061


RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!

Authors: Andrulis, Tanner and Emer, Joel S. and Sze, Vivienne
Keywords: ReRAM, ADC, slicing, architecture, accelerator, neural networks, analog, compute in memory, processing in memory

Abstract

Processing-In-Memory (PIM) accelerators have the potential to efficiently run Deep Neural Network (DNN) inference by reducing costly data movement and by using resistive RAM (ReRAM) for efficient analog compute. Unfortunately, overall PIM accelerator efficiency is limited by energy-intensive analog-to-digital converters (ADCs). Furthermore, existing accelerators that reduce ADC cost do so by changing DNN weights or by using low-resolution ADCs that reduce output fidelity. These strategies harm DNN accuracy and/or require costly DNN retraining to compensate. To address these issues, we propose the RAELLA architecture. RAELLA adapts the architecture to each DNN; it lowers the resolution of computed analog values by encoding weights to produce near-zero analog values, adaptively slicing weights for each DNN layer, and dynamically slicing inputs through speculation and recovery. Low-resolution analog values allow RAELLA to both use efficient low-resolution ADCs and maintain accuracy without retraining, all while computing with fewer ADC converts. Compared to other low-accuracy-loss PIM accelerators, RAELLA increases energy efficiency by up to 4.9×…

DOI: 10.1145/3579371.3589062


RowPress: Amplifying Read Disturbance in Modern DRAM Chips

Authors: Luo, Haocong and Olgun, Ataberk and Yağlıkçı, A. Giray …
Keywords: testing, safety, security, reliability, RowHammer, rowpress, read disturbance, DRAM

Abstract

Memory isolation is critical for system reliability, security, and safety. Unfortunately, read disturbance can break memory isolation in modern DRAM chips. For example, RowHammer is a well-studied read-disturb phenomenon where repeatedly opening and closing (i.e., hammering) a DRAM row many times causes bitflips in physically nearby rows. This paper experimentally demonstrates and analyzes another widespread read-disturb phenomenon, RowPress, in real DDR4 DRAM chips. RowPress breaks memory isolation by keeping a DRAM row open for a long period of time, which disturbs physically nearby rows enough to cause bitflips. We show that RowPress amplifies DRAM’s vulnerability to read-disturb attacks by significantly reducing the number of row activations needed to induce a bitflip by one to two orders of magnitude under realistic conditions. In extreme cases, RowPress induces bitflips in a DRAM row when an adjacent row is activated only once. Our detailed characterization of 164 real DDR4 DRAM chips shows that RowPress 1) affects chips from all three major DRAM manufacturers, 2) gets worse as DRAM technology scales down to smaller node sizes, and 3) affects a different set of DRAM cells from RowHammer and behaves differently from RowHammer as temperature and access pattern change. We also show that cells vulnerable to RowPress are very different from cells vulnerable to retention failures. We demonstrate in a real DDR4-based system with RowHammer protection that 1) a user-level program induces bitflips by leveraging RowPress while conventional RowHammer cannot do so, and 2) a memory controller that adaptively keeps the DRAM row open for a longer period of time based on the access pattern can facilitate RowPress-based attacks. To prevent bitflips due to RowPress, we describe and analyze four potential mitigation techniques, including a new methodology that adapts existing RowHammer mitigation techniques to also mitigate RowPress with low additional performance overhead. We evaluate this methodology and demonstrate that it is effective on a variety of workloads. We open-source all our code and data to facilitate future research on RowPress.

DOI: 10.1145/3579371.3589063


CAMJ: Enabling System-Level Energy Modeling and Architectural Exploration for In-Sensor Visual Computing

Authors: Ma, Tianrui and Feng, Yu and Zhang, Xuan and Zhu, Yuhao
Keywords: analog modeling, energy modeling, in-sensor computing

Abstract

CMOS Image Sensors (CIS) are fundamental to emerging visual computing applications. While conventional CIS are purely imaging devices for capturing images, CIS increasingly integrate processing capabilities such as Deep Neural Networks (DNNs). Computational CIS expand the architecture design space, but to date no comprehensive energy model exists. This paper proposes CamJ, a detailed energy modeling framework that provides a component-level energy breakdown for computational CIS and is validated against nine recent CIS chips. We use CamJ to demonstrate three use-cases that explore architectural trade-offs including computing in vs. off CIS, 2D vs. 3D-stacked CIS design, and analog vs. digital processing inside CIS. The code of CamJ is available at: https://github.com/horizon-research/CamJ.
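In spirit, a component-level energy model like the one described above rolls up, for each pipeline stage, an activation count multiplied by a per-operation energy. The numbers below are placeholders, not CamJ-validated figures, and the real framework models analog components in far more detail.

# (operations per frame, energy per operation in picojoules) -- placeholder values
pipeline = {
    "pixel_readout":     (640 * 480, 2.0),
    "column_adc":        (640 * 480, 8.0),
    "in_sensor_dnn_mac": (5_000_000, 0.5),
    "offchip_io_bit":    (640 * 480 * 8, 4.0),
}

total_pj = sum(count * e_op for count, e_op in pipeline.values())
print(f"energy per frame: {total_pj / 1e6:.2f} uJ")
for stage, (count, e_op) in pipeline.items():
    print(f"  {stage:18s} {count * e_op / total_pj:6.1%}")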

DOI: 10.1145/3579371.3589064


DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations

Authors: Soria-Pardos, Víctor …
Keywords: data placement, atomic memory operations, microarchitecture, multi-core architectures

Abstract

With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates currently available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs’ execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09×…

DOI: 10.1145/3579371.3589065


