MICRO 2023 | 逸翎清晗🌈

Session details: Best Paper Session

Authors: Bartolini, Davide Basilio
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637179


Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors

Authors: Koizumi, Toru and Shioya, Ryota and Sugita, Shu and Amano, Taichi and Degawa, Yuya and Kadomoto, Junichiro and Irie, Hidetsugu and Sakai, Shuichi
Keywords: Superscalar processor, Register renaming, Register lifetime, Power efficiency, Out-of-order execution, Instruction set architecture, Compiler

Abstract

Out-of-order superscalar processors are currently the only architecture that speeds up irregular programs, but they suffer from poor power efficiency. To tackle this issue, we focused on how to specify register operands. Specifying operands by register names, as conventional RISC does, requires register renaming, resulting in poor power efficiency and preventing an increase in the front-end width. In contrast, a recently proposed architecture called STRAIGHT specifies operands by inter-instruction distance, thereby eliminating register renaming. However, STRAIGHT has strong constraints on instruction placement, which generally results in a large increase in the number of instructions. We propose Clockhands, a novel instruction set architecture that has multiple register groups and specifies a value as “the value written in this register group k times before.” Like STRAIGHT, Clockhands does not require register renaming. In contrast to STRAIGHT, however, it has much looser constraints on instruction placement, allowing programs to be written with almost the same number of instructions as conventional RISC. We implemented a cycle-accurate simulator, an FPGA implementation, and a first-step compiler for Clockhands and evaluated benchmarks including SPEC CPU. On a machine with an eight-fetch width, the evaluation results showed that Clockhands consumes 7.4% less energy than RISC while having performance comparable to RISC. This energy reduction increases significantly to 24.4% when simulating a futuristic up-scaled processor with a 16-fetch width, which shows that Clockhands enables a wider front-end.
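
To make the operand-specification idea concrete, here is a toy Python sketch (illustrative only; the instruction format, the k = 0 convention, and the three-group setup are assumptions, not the paper's ISA definition) in which each source operand names a register group and how many writes back the value lies:

# Toy interpreter for distance-based operand references (illustrative only).
# Each register group is a growing list of written values; a source operand
# (g, k) reads the value written to group g k writes before the newest one.

groups = {0: [], 1: [], 2: []}          # three hypothetical register groups

def read(g, k):
    return groups[g][-1 - k]             # k = 0 is the most recent write in this sketch

def write(g, value):
    groups[g].append(value)              # every result appends; no renaming needed

# Toy program computing (3 + 4) * 2 with distance-based sources.
write(0, 3)                              # li  g0, 3
write(0, 4)                              # li  g0, 4
write(1, read(0, 0) + read(0, 1))        # add g1, g0[0 back], g0[1 back]  -> 7
write(2, 2)                              # li  g2, 2
write(1, read(1, 0) * read(2, 0))        # mul g1, g1[0 back], g2[0 back]  -> 14
print(groups[1][-1])                     # 14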

DOI: 10.1145/3613424.3614272


Decoupled Vector Runahead

Authors: Naithani, Ajeya and Roelandts, Jaime and Ainsworth, Sam and Jones, Timothy M. and Eeckhout, Lieven
Keywords: speculative vectorization, runahead, prefetching, graph processing, CPU microarchitecture

Abstract

We present Decoupled Vector Runahead (DVR), an in-core prefetching technique that executes separately from the main application thread and exploits massive amounts of memory-level parallelism to improve the performance of applications featuring indirect memory accesses. DVR dynamically infers loop bounds at run-time, recognizes striding loads, and vectorizes subsequent instructions that are part of an indirect chain. It proactively issues memory accesses for the resulting loads far into the future, even when the out-of-order core has not yet stalled, bringing their data into the L1 cache and thus providing timely prefetches for the main thread. DVR can adjust the degree of vectorization at run-time, vectorize the same chain of indirect memory accesses across multiple invocations of an inner loop, and efficiently handle branch divergence along the vectorized chain. DVR runs as an on-demand, speculative, in-order, lightweight hardware subthread alongside the main thread within the core and incurs a minimal hardware overhead of only 1139 bytes. Relative to a large superscalar 5-wide out-of-order baseline and Vector Runahead — a recent microarchitectural technique to accelerate indirect memory accesses on out-of-order processors — DVR delivers 2.4×…
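
The loop pattern DVR targets, and a software stand-in for its runahead behavior, can be sketched as follows (purely illustrative Python; the array names and the runahead helper are hypothetical and only mimic what the hardware subthread does with prefetches):

import numpy as np

A = np.arange(1_000_000, dtype=np.int64)             # data reached through the indirection
B = np.random.randint(0, len(A), size=100_000)       # striding load that feeds A[B[i]]

def main_loop():
    total = 0
    for i in range(len(B)):
        total += A[B[i]]          # the dependent load whose misses DVR hides
    return total

def runahead(start, depth=64):
    """Software stand-in for the vectorized runahead subthread: gather the next
    `depth` indices at once and touch A at those indices so the corresponding
    cache lines are warm before the main loop reaches them."""
    idx = B[start:start + depth]
    return A[idx].sum()           # in hardware these would be prefetches, not loads

_ = runahead(0)
print(main_loop() == int(A[B].sum()))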

DOI: 10.1145/3613424.3614255


CryptoMMU: Enabling Scalable and Secure Access Control of Third-Party Accelerators

Authors: Alam, Faiz and Lee, Hyokeun and Bhattacharjee, Abhishek and Awad, Amro
Keywords: cryptography, access control, accelerator-rich architecture, IOMMU

Abstract

Due to increasing energy and performance gaps between general-purpose processors and hardware accelerators (e.g., FPGA or ASIC), clear trends for leveraging accelerators arise in various fields or workloads, such as edge devices, cloud systems, and data centers. Moreover, system integrators desire higher flexibility to deploy custom accelerators based on their performance, power, and cost constraints, where such integration can be as early as (1) at the design time when third-party intellectual properties (IPs) are used, (2) at integration/upgrade time when third-party discrete chip accelerators are used, or (3) during runtime as in reconfigurable logic. A malicious third-party accelerator can compromise the entire system by accessing other processes’ data, overwriting OS data structures, etc. To eliminate these security ramifications, a unit similar to a memory management unit (MMU), namely IOMMU, is typically used to scrutinize memory accesses from I/O devices, including accelerators. Still, IOMMU incurs significant performance overhead because it resides on the critical path of each I/O memory access. In this paper, we propose a novel scheme, CryptoMMU, to delegate the translation processes to accelerators, whereas the authentication of the targeted address is elegantly performed using a cryptography-based approach. As a result, CryptoMMU facilitates the private caching of translation in each accelerator, providing better scalability. Our evaluation results show that CryptoMMU improves system throughput by an average of 2.97×…
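
A minimal sketch of the authentication idea, assuming, hypothetically, that the trusted IOMMU tags each translation it hands out with a keyed MAC over (device ID, virtual page, physical page) and later only verifies the tag on the access path instead of re-walking its tables:

import hmac, hashlib

KEY = b"secret key held only by the trusted IOMMU"      # hypothetical key

def issue_translation(device_id: int, vpn: int, ppn: int) -> bytes:
    msg = f"{device_id}:{vpn}:{ppn}".encode()
    return hmac.new(KEY, msg, hashlib.sha256).digest()   # tag cached privately by the accelerator

def verify_access(device_id: int, vpn: int, ppn: int, tag: bytes) -> bool:
    expected = hmac.new(KEY, f"{device_id}:{vpn}:{ppn}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)            # cheap check on each I/O access

tag = issue_translation(device_id=7, vpn=0x1234, ppn=0xBEEF)
assert verify_access(7, 0x1234, 0xBEEF, tag)
assert not verify_access(7, 0x1234, 0xDEAD, tag)         # forged mapping is rejected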

DOI: 10.1145/3613424.3614311


Phantom: Exploiting Decoder-detectable Mispredictions

Authors: Wikner, Johannes and Trujillo, Daniël and Razavi, Kaveh
Keywords: Speculative execution, Spectre, Side-channel attack, Branch target injection

Abstract

Violating the Von Neumann sequential processing principle at the microarchitectural level is commonplace to reach high performing CPU hardware — violations are safe as long as software executes correctly at the architectural interface. Speculative execution attacks exploit these violations and queue up secret-dependent memory accesses allowed by long speculation windows due to the late detection of these violations in the pipeline. In this paper, we show that recent AMD and Intel CPUs speculate very early in their pipeline, even before they decode the current instruction. This mechanism enables new sources of speculation to be triggered from almost any instruction, enabling a new class of attacks that we refer to as Phantom. Unlike Spectre, Phantom speculation windows are short since the violations are detected early. Nonetheless, Phantom allows for transient fetch and transient decode on all recent x86-based microarchitectures, and transient execution on AMD Zen 1 and 2. We build a number of exploits using these new Phantom primitives and discuss why mitigating them is difficult in practice.

DOI: 10.1145/3613424.3614275


Session details: Session 1A: Accelerators Based on HW/SW Co-Design / Accelerators for Matrix Processing

Authors: Pellauer, Michael
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637180


AuRORA: Virtualized Accelerator Orchestration for Multi-Tenant Workloads

Authors: Kim, Seah and Zhao, Jerry and Asanovic, Krste and Nikolic, Borivoje and Shao, Yakun Sophia
Keywords: SoC Integration, Resource Management, Multi-tenant system, Multi-core, Microarchitecture, Machine Learning, Accelerators

Abstract

With the widespread adoption of deep neural networks (DNNs) across applications, there is a growing demand for DNN deployment solutions that can seamlessly support multi-tenant execution. This involves simultaneously running multiple DNN workloads on heterogeneous architectures with domain-specific accelerators. However, existing accelerator interfaces directly bind the accelerator’s physical resources to user threads, without an efficient mechanism to adaptively re-partition available resources. This leads to high programming complexities and performance overheads due to sub-optimal resource allocation, making scalable many-accelerator deployment impractical. To address this challenge, we propose AuRORA, a novel accelerator integration methodology that enables scalable accelerator deployment for multi-tenant workloads. In particular, AuRORA supports virtualized accelerator orchestration via co-designing the hardware-software stack of accelerators to allow adaptively binding current workloads onto available accelerators. We demonstrate that AuRORA achieves 2.02× higher overall SLA satisfaction, 1.33× overall system throughput, and 1.34× overall fairness compared to existing accelerator integration solutions with less than 2.7% area overhead.

DOI: 10.1145/3613424.3614280


UNICO: Unified Hardware Software Co-Optimization for Robust Neural Network Acceleration

Authors: Rashidi, Bahador and Gao, Chao and Lu, Shan and Wang, Zhisheng and Zhou, Chunhua and Niu, Di and Sun, Fengyu
Keywords: Neural Network Accelerator, Multi-Level Optimization, HW-SW Co-Design, HW Robustness

Abstract

Specialized hardware has become an indispensable component to deep neural network (DNN) acceleration. To keep up with the rapid evolution of neural networks, holistic and automated solutions for jointly optimizing both hardware (HW) architectures and software (SW) mapping have been studied. These studies face two major challenges. First, the combined HW-SW design space is vast, which hinders the finding of optimal or near-optimal designs. This issue is exacerbated for industrial cases when cycle-accurate models are used for design evaluation in the joint optimization. Second, HW design is prone to overfitting to the input DNNs used in the HW-SW co-optimization. To address these issues, in this paper, we propose UNICO, an efficient Unified Co-Optimization framework with a novel Robustness metric for better HW generalization. Guided by a high-fidelity surrogate model, UNICO employs multi-objective Bayesian optimization to effectively explore the HW design space, and conducts adaptive, parallel and scalable software mapping search based on successive halving. To reduce HW overfitting, we propose a HW robustness metric by relating a HW configuration’s quality to its sensitivity in software mapping search, and quantitatively incorporate this metric to search for more robust HW design(s). We implement UNICO on an open-source accelerator platform, and compare it with the state-of-the-art solution HASCO. Experiments show that UNICO significantly outperforms HASCO; it finds design(s) with similar quality to HASCO up to 4×…
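
The mapping-search side builds on successive halving; a generic sketch of that primitive (the candidate mappings and cost function below are placeholders, not UNICO's model) looks like this:

import random

def successive_halving(candidates, evaluate, budget=1, eta=2):
    """Keep the best 1/eta of the candidates each round, spending more
    evaluation budget per survivor as the pool shrinks (generic sketch)."""
    pool = list(candidates)
    while len(pool) > 1:
        scores = [(evaluate(c, budget), c) for c in pool]
        scores.sort(key=lambda s: s[0])              # lower cost is better
        pool = [c for _, c in scores[: max(1, len(pool) // eta)]]
        budget *= eta                                # refine surviving mappings with more budget
    return pool[0]

# Placeholder "mappings" and a noisy cost model for demonstration.
mappings = [{"tile": t, "unroll": u} for t in (8, 16, 32, 64) for u in (1, 2, 4)]
cost = lambda m, b: (64 / m["tile"]) + 1 / m["unroll"] + random.random() / b
print(successive_halving(mappings, cost))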

DOI: 10.1145/3613424.3614282


Spatula: A Hardware Accelerator for Sparse Matrix Factorization

Authors: Feldmann, Axel and Sanchez, Daniel
Keywords: sparse linear algebra, matrix factorization, LU, Hardware accelerators, Cholesky

Abstract

Solving sparse systems of linear equations is a crucial component in many science and engineering problems, like simulating physical systems. Sparse matrix factorization dominates a large class of these solvers. Efficient factorization algorithms have two key properties that make them challenging for existing architectures: they consist of small tasks that are structured and compute-intensive, and sparsity induces long chains of data dependences among these tasks. Data dependences make GPUs struggle, while CPUs and prior sparse linear algebra accelerators also suffer from low compute throughput. We present Spatula, an architecture for accelerating sparse matrix factorization algorithms. Spatula hardware combines systolic processing elements that execute structured tasks at high throughput with a flexible scheduler that handles challenging data dependences. Spatula enables a novel scheduling algorithm that avoids stalls and load imbalance while reducing data movement, achieving high compute utilization. As a result, Spatula outperforms a GPU running the state-of-the-art sparse Cholesky and LU factorization implementations by gmean 47×…

DOI: 10.1145/3613424.3623783


Session details: Session 1B: Architectural Support / Programming Languages, Case Study

Authors: Ghose, Saugata
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637181


Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices

Authors: Sun, Yan and Yuan, Yifan and Yu, Zeduo and Kuper, Reese and Song, Chihun and Huang, Jinghan and Ji, Houxiang and Agarwal, Siddharth and Lou, Jiaqi and Jeong, Ipoom and Wang, Ren and Ahn, Jung Ho and Xu, Tianyin and Kim, Nam Sung
Keywords: tiered-memory management, measurement, Compute eXpress Link

Abstract

The ever-growing demands for memory with larger capacity and higher bandwidth have driven recent innovations on memory expansion and disaggregation technologies based on Compute eXpress Link (CXL). In particular, CXL-based memory expansion technology has recently gained notable attention for its ability not only to economically expand memory capacity and bandwidth but also to decouple memory technologies from a specific memory interface of the CPU. However, since CXL memory devices have not been widely available, they have been emulated using DDR memory in a remote NUMA node. In this paper, for the first time, we comprehensively evaluate a true CXL-ready system based on the latest 4th-generation Intel Xeon CPU with three CXL memory devices from different manufacturers. Specifically, we run a set of microbenchmarks not only to compare the performance of true CXL memory with that of emulated CXL memory but also to analyze the complex interplay between the CPU and CXL memory in depth. This reveals important differences between emulated CXL memory and true CXL memory, some of which will compel researchers to revisit the analyses and proposals from recent work. Next, we identify opportunities for memory-bandwidth-intensive applications to benefit from the use of CXL memory. Lastly, we propose a CXL-memory-aware dynamic page allocation policy, Caption, to more efficiently use CXL memory as a bandwidth expander. We demonstrate that Caption can automatically converge to an empirically favorable percentage of pages allocated to CXL memory, which improves the performance of memory-bandwidth-intensive applications by up to 24% when compared to the default page allocation policy designed for traditional NUMA systems.
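
A toy model of such a bandwidth-expansion policy (names and the fixed split are hypothetical; Caption adjusts the split dynamically) simply steers a target fraction of new page allocations to the CXL node:

def make_allocator(cxl_fraction=0.25):
    """Return an allocate() function that sends roughly `cxl_fraction` of newly
    allocated pages to CXL memory and the rest to local DRAM (toy model only)."""
    state = {"total": 0, "cxl": 0}

    def allocate():
        state["total"] += 1
        if state["cxl"] / state["total"] < cxl_fraction:
            state["cxl"] += 1
            return "CXL"
        return "DRAM"

    return allocate

alloc = make_allocator(cxl_fraction=0.25)
pages = [alloc() for _ in range(1000)]
print(pages.count("CXL") / len(pages))   # ~0.25 of pages land on the CXL expander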

DOI: 10.1145/3613424.3614256


Memento: Architectural Support for Ephemeral Memory Management in Serverless Environments

Authors: Wang, Ziqi and Zhao, Kaiyang and Li, Pei and Jacob, Andrew and Kozuch, Michael and Mowry, Todd and Skarlatos, Dimitrios
Keywords: Serverless, Memory Management, Function-as-a-Service, Cloud computing

Abstract

Serverless computing is an increasingly attractive paradigm in the cloud due to its ease of use and fine-grained pay-for-what-you-use billing. However, serverless computing poses new challenges to system design due to its short-lived function execution model. Our detailed analysis reveals that memory management is responsible for a major amount of function execution cycles. This is because functions pay the full critical-path costs of memory management in both userspace and the operating system without the opportunity to amortize these costs over their short lifetimes. To address this problem, we propose Memento, a new hardware-centric memory management design based upon our insights that memory allocations in serverless functions are typically small, and either quickly freed after allocation or freed when the function exits. Memento alleviates the overheads of serverless memory management by introducing two key mechanisms: (i) a hardware object allocator that performs in-cache memory allocation and free operations based on arenas, and (ii) a hardware page allocator that manages a small pool of physical pages used to replenish arenas of the object allocator. Together these mechanisms alleviate memory management overheads and bypass costly userspace and kernel operations. Memento naturally integrates with existing software stacks through a set of ISA extensions that enable seamless integration with multiple language runtimes. Finally, Memento leverages the newly exposed memory allocation semantics in hardware to introduce a main memory bypass mechanism and avoid unnecessary DRAM accesses for newly allocated objects. We evaluate Memento with full-system simulations across a diverse set of containerized serverless workloads and language runtimes. The results show that Memento achieves function execution speedups ranging between 8% and 28%, with a 16% average. Furthermore, Memento hardware allocators and main memory bypass mechanisms drastically reduce main memory traffic by 30% on average. The combined effects of Memento reduce the pricing cost of function execution by 29%. Finally, we demonstrate the applicability of Memento beyond functions, to major serverless platform operations and long-running data processing applications.
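
A software caricature of the arena idea (purely illustrative; Memento performs allocation in hardware, in-cache): objects are bump-allocated from a per-function arena and the whole arena is dropped when the function exits, so individual frees never reach the kernel.

class Arena:
    """Bump allocator over a fixed-size buffer; freeing is O(1): drop the arena."""
    def __init__(self, size=4096):
        self.buf = bytearray(size)
        self.top = 0

    def alloc(self, nbytes):
        if self.top + nbytes > len(self.buf):
            raise MemoryError("arena exhausted; a real allocator would replenish it")
        offset = self.top
        self.top += nbytes
        return memoryview(self.buf)[offset:offset + nbytes]

def run_function():
    arena = Arena()                  # replenished from a small physical-page pool
    a = arena.alloc(64)              # typical serverless allocations are small...
    b = arena.alloc(128)             # ...and die with the function
    a[:5] = b"hello"
    return bytes(a[:5])              # arena (and every object in it) released on exit

print(run_function())                # b'hello'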

DOI: 10.1145/3613424.3623795


Simultaneous and Heterogenous Multithreading

Authors: Hsu, Kuan-Chieh and Tseng, Hung-Wei
Keywords: No keywords

Abstract

The landscape of modern computers is undoubtedly heterogeneous, as all computing platforms integrate multiple types of processing units and hardware accelerators. However, the entrenched programming models focus on using only the most efficient processing units for each code region, underutilizing the processing power within heterogeneous computers. This paper presents simultaneous and heterogenous multithreading (SHMT), a programming and execution model that enables opportunities for “real” parallel processing using heterogeneous processing units. In contrast to conventional models, SHMT can utilize heterogeneous types of processing units concurrently for the same code region. Furthermore, SHMT presents an abstraction and a runtime system to facilitate parallel execution. More importantly, SHMT needs to additionally address the heterogeneity in data precision that various processing units support to ensure the quality of the result. This paper implements and evaluates SHMT on an embedded system platform with a GPU and an Edge TPU. SHMT achieves up to 1.95×…

DOI: 10.1145/3613424.3614285


Session details: Session 1C: Design Automation, Synthesis, Hardware Generation

Authors: Jeffrey, Mark
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637182


Accelerating RTL Simulation with Hardware-Software Co-Design

Authors: Elsabbagh, Fares and Sheikhha, Shabnam and Ying, Victor A. and Nguyen, Quan M. and Emer, Joel S and Sanchez, Daniel
Keywords: speculative execution, simulation, register-transfer-level, hardware acceleration, domain-specific architectures, dataflow execution

Abstract

Fast simulation of digital circuits is crucial to build modern chips. But RTL (Register-Transfer-Level) simulators are slow, as they cannot exploit multicores well. Slow simulation lengthens chip design time and makes bugs more frequent. We present ASH, a parallel architecture tailored to simulation workloads. ASH consists of a tightly codesigned hardware architecture and compiler for RTL simulation. ASH exploits two key opportunities. First, it performs dataflow execution of small tasks to leverage the fine-grained parallelism in simulation workloads. Second, it performs selective event-driven execution to run only the fraction of the design exercised each cycle, skipping ineffectual tasks. ASH hardware provides a novel combination of dataflow and speculative execution, and ASH’s compiler features several novel techniques to automatically leverage this hardware. We evaluate ASH in simulation using large Verilog designs. An ASH chip with 256 simple cores is gmean 1,485×…
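
A tiny software sketch of the selective, event-driven part (not ASH's hardware or compiler): each cycle, only tasks whose inputs actually changed are re-evaluated, and only real value changes propagate further.

from collections import deque

# Toy netlist: each task recomputes one signal from its fan-in signals.
signals = {"a": 0, "b": 0, "x": 0, "y": 0}
tasks = {
    "x": (lambda s: s["a"] & s["b"], ["a", "b"]),   # x = a AND b
    "y": (lambda s: s["x"] ^ 1,      ["x"]),        # y = NOT x
}
fanout = {"a": ["x"], "b": ["x"], "x": ["y"]}

def step(changed_inputs):
    """Evaluate only the cone of logic reached by signals that changed this cycle."""
    worklist, evaluated = deque(changed_inputs), 0
    while worklist:
        sig = worklist.popleft()
        for t in fanout.get(sig, []):
            fn, _ = tasks[t]
            new = fn(signals)
            evaluated += 1
            if new != signals[t]:
                signals[t] = new
                worklist.append(t)                  # propagate only real changes
    return evaluated

signals["a"] = 1
print(step({"a"}), signals)    # a alone cannot flip x (b = 0), so y is never touched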

DOI: 10.1145/3613424.3614257


Fast, Robust and Transferable Prediction for Hardware Logic Synthesis

Authors: Xu, Ceyu and Sharma, Pragya and Wang, Tianshu and Wills, Lisa Wu
Keywords: RTL-level Synthesis, Neural Networks, Logic Synthesis Prediction, Integrated Circuits

Abstract

The increasing complexity of computer chips and the slow logic synthesis process have become major bottlenecks in the hardware design process, also hindering the ability of hardware generators to make informed design decisions while considering hardware costs. While various models have been proposed to predict physical characteristics of hardware designs, they often suffer from limited domain adaptability and open-source hardware design data scarcity. In this paper, we present SNS v2, a fast, robust, and transferable hardware synthesis predictor based on deep learning models. Inspired by modern natural language processing models, SNS v2 adopts a three-phase training approach encompassing pre-training, fine-tuning, and domain adaptation, enabling it to leverage more abundant unlabeled and off-domain training data. Additionally, we propose a novel contrastive learning approach based on circuit equivalence to enhance model robustness. Our experiments demonstrate that SNS v2 achieves two to three orders of magnitude faster speed compared to conventional EDA tools, while maintaining state-of-the-art prediction accuracy. We also show that SNS v2 can be seamlessly integrated into hardware generator frameworks for real-time cost estimation, resulting in higher quality design recommendations in a significantly reduced time frame.

DOI: 10.1145/3613424.3623794


Khronos: Fusing Memory Access for Improved Hardware RTL Simulation

Authors: Zhou, Kexing and Liang, Yun and Lin, Yibo and Wang, Runsheng and Huang, Ru
Keywords: Register transfer level simulation, Memory access optimization, Hardware simulation and emulation

Abstract

The use of register transfer level (RTL) simulation is critical for hardware design in various aspects including verification, debugging, and design space exploration. Among various RTL simulation techniques, cycle-accurate software RTL simulation is the most prevalent approach due to its easy accessibility and high flexibility. The current state-of-the-art cycle-accurate simulators mainly use full-cycle RTL simulation that models RTL as a directed acyclic computational graph and traverses the graph in each simulation cycle. However, the adoption of full-cycle simulation makes them mainly focus on optimizing the logic evaluation within one simulation cycle, neglecting temporal optimization opportunities. In this work, we propose Khronos, a cycle-accurate software RTL simulation tool that optimizes memory accesses to improve simulation speed. RTL simulation often involves a large number of register buffers, making memory access one of the performance bottlenecks. The key insight of Khronos is that a large number of memory accesses across consecutive clock cycles exhibit temporal locality; by fusing those accesses, we can reduce the memory traffic and thus improve the overall performance. To do this, we first propose a queue-connected operation graph to capture temporal data dependencies. After that, we reschedule the operations and fuse the state accesses across cycles, reducing the pressure on the host memory hierarchy. To minimize the number of memory accesses, we formulate a linear-constraint non-linear objective integer programming problem and solve it by linearizing it to a minimum-cost flow problem iteratively. Experiments show that Khronos can save up to 88% of cache accesses and achieve an average acceleration of 2.0x (up to 4.3x) for various hardware designs compared to state-of-the-art simulators.

DOI: 10.1145/3613424.3614301


Session details: Session 2A: ML Design Space Exploration, Generation

Authors: Krishna, Tushar
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637183


SecureLoop: Design Space Exploration of Secure DNN Accelerators

Authors: Lee, Kyungmi and Yan, Mengjia and Emer, Joel and Chandrakasan, Anantha
Keywords: neural networks, accelerator scheduling, Trusted execution environment

Abstract

Deep neural networks (DNNs) are gaining popularity in a wide range of domains, ranging from speech and video recognition to healthcare. With this increased adoption comes the pressing need for securing DNN execution environments on CPUs, GPUs, and ASICs. While there are active research efforts in supporting a trusted execution environment (TEE) on CPUs, the exploration in supporting TEEs on accelerators is limited, with only a few solutions available [18, 19, 27]. A key limitation along this line of work is that these secure DNN accelerators narrowly consider a few specific architectures. The design choices and the associated cost for securing these architectures do not transfer to other diverse architectures. This paper strives to address this limitation by developing a design space exploration tool for supporting TEEs on diverse DNN accelerators. We target secure DNN accelerators equipped with cryptographic engines where the cryptographic operations are closely coupled with the data movement in the accelerators. These operations significantly complicate the scheduling for DNN accelerators, as the scheduling needs to account for the extra on-chip computation and off-chip memory accesses introduced by these cryptographic operations, and even needs to account for potential interactions across DNN layers. We tackle these challenges in our tool, called SecureLoop, by introducing a scheduling search engine with the following attributes: 1) considers the cryptographic overhead associated with every off-chip data access, 2) uses an efficient modular arithmetic technique to compute the optimal authentication block assignment for each individual layer, and 3) uses a simulated annealing algorithm to perform cross-layer optimizations. Compared to the conventional schedulers, our tool finds the schedule for secure DNN designs with up to 33.2% speedup and 50.2% improvement of energy-delay-product.

DOI: 10.1145/3613424.3614273


DOSA: Differentiable Model-Based One-Loop Search for DNN Accelerators

Authors: Hong, Charles and Huang, Qijing and Dinh, Grace and Subedar, Mahesh and Shao, Yakun Sophia
Keywords: Machine learning accelerators, Design space exploration

Abstract

In the hardware design space exploration process, it is critical to optimize both hardware parameters and algorithm-to-hardware mappings. Previous work has largely approached this simultaneous optimization problem by exploring the hardware design space and the mapspace—both individually large and highly nonconvex—independently. The resulting combinatorial explosion has created significant difficulties for optimizers. In this paper, we introduce DOSA, which consists of differentiable performance models and a gradient descent-based optimization technique to simultaneously explore both spaces and identify high-performing design points. Experimental results demonstrate that DOSA outperforms random search and Bayesian optimization by 2.80×…
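
The core idea, descending a continuous differentiable stand-in for the performance model by gradient, can be illustrated with a deliberately simple analytic cost (this toy model and its constants are made up for the sketch and are not DOSA's):

import math

OPS, BYTES, AREA_W = 1e6, 1e5, 0.01      # made-up workload size and area weight

def cost(log_pe, log_bw):
    """Made-up differentiable proxy: runtime (compute + memory terms) plus an
    area/energy penalty that grows with the hardware resources."""
    pe, bw = math.exp(log_pe), math.exp(log_bw)
    return OPS / pe + BYTES / bw + AREA_W * (pe + bw)

def grad(f, x, y, eps=1e-4):
    return ((f(x + eps, y) - f(x - eps, y)) / (2 * eps),
            (f(x, y + eps) - f(x, y - eps)) / (2 * eps))

x, y, step = 0.0, 0.0, 0.02              # optimize in log space so sizes stay positive
for _ in range(2000):
    gx, gy = grad(cost, x, y)
    norm = math.hypot(gx, gy) or 1.0
    x, y = x - step * gx / norm, y - step * gy / norm   # normalized step keeps it stable

# Analytic optimum: pe = sqrt(OPS/AREA_W) = 1e4, bw = sqrt(BYTES/AREA_W) ≈ 3162.
print(round(math.exp(x)), round(math.exp(y)))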

DOI: 10.1145/3613424.3623797


TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs

Authors: Tang, Haotian and Yang, Shang and Liu, Zhijian and Hong, Ke and Yu, Zhongming and Li, Xiuyu and Dai, Guohao and Wang, Yu and Han, Song
Keywords: sparse convolution, point cloud, neural network, high-performance computing, graph, GPU

Abstract

Sparse convolution plays a pivotal role in emerging workloads, including point cloud processing in AR/VR, autonomous driving, and graph understanding in recommendation systems. Since the computation pattern is sparse and irregular, specialized high-performance kernels are required. Existing GPU libraries offer two dataflow types for sparse convolution. The gather-GEMM-scatter dataflow is easy to implement but not optimal in performance, while the dataflows with overlapped computation and memory access (e.g. implicit GEMM) are highly performant but have very high engineering costs. In this paper, we introduce TorchSparse++, a new GPU library that achieves the best of both worlds. We create a highly efficient Sparse Kernel Generator that generates performant sparse convolution kernels at less than one-tenth of the engineering cost of the current state-of-the-art system. On top of this, we design the Sparse Autotuner, which extends the design space of existing sparse convolution libraries and searches for the best dataflow configurations for training and inference workloads. Consequently, TorchSparse++ achieves 2.9×…
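
The gather-GEMM-scatter dataflow that the library improves upon fits in a few lines of NumPy (a sketch of the dataflow only, with made-up sizes and a random kernel map, not the library's kernels):

import numpy as np

N_in, N_out, C_in, C_out, K = 100, 80, 16, 32, 8     # toy sparse-conv sizes
feats = np.random.randn(N_in, C_in).astype(np.float32)
weights = np.random.randn(K, C_in, C_out).astype(np.float32)
# One (input_idx, output_idx) map per kernel offset, standing in for the kernel map.
maps = [(np.random.randint(0, N_in, 20), np.random.randint(0, N_out, 20)) for _ in range(K)]

out = np.zeros((N_out, C_out), dtype=np.float32)
for k, (in_idx, out_idx) in enumerate(maps):
    gathered = feats[in_idx]                 # gather: pack scattered inputs densely
    partial = gathered @ weights[k]          # GEMM: dense matrix multiply per offset
    np.add.at(out, out_idx, partial)         # scatter: accumulate into sparse outputs
print(out.shape)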

DOI: 10.1145/3613424.3614303


Session details: Session 2B: Microarchitecture

Authors: Sorin, Daniel
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637184


Branch Target Buffer Organizations

Authors: Perais, Arthur and Sheikh, Rami
Keywords: Instruction fetch, Branch Target Buffers, BTB

Abstract

To accommodate very large instruction footprints, modern high-performance processors rely on fetch directed instruction prefetching through huge branch predictors and a hierarchy of Branch Target Buffers (BTBs). Recently, significant effort has been undertaken to reduce the footprint of each branch in the BTB, in order to either minimize the storage occupied by the BTB on die, or to increase the number of tracked branches at iso-storage. However, designing for branch density, while necessary, is only one dimension of BTB efficacy. In particular, BTB entry organization plays a significant role in improving instruction fetch throughput, which is a necessary step towards increased performance. In this paper, we first revisit the advantages and drawbacks of three classical BTB organizations in the context of multi-level BTB hierarchies. We then consider three possible improvements to increase the fetch PC throughput of the Region BTB and Block BTB organizations, bridging most of the performance gap with the impractical but highly storage-efficient Instruction BTB organization, thus paving the way for future very high fetch throughput machines.

DOI: 10.1145/3613424.3623774


Warming Up a Cold Front-End with Ignite

Authors: Schall, David and Sandberg, Andreas and Grot, Boris
Keywords: instruction delivery, front-end prefetching, serverless, Microarchitecture

Abstract

Serverless computing is a popular software deployment model for the cloud, in which applications are designed as a collection of stateless tasks. Developers are charged for the CPU time and memory footprint during the execution of each serverless function, which incentivizes them to reduce both runtime and memory usage. As a result, functions tend to be short (often on the order of a few milliseconds) and compact (128–256 MB). Cloud providers can pack thousands of such functions on a server, resulting in frequent context switches and a tremendous degree of interleaving. As a result, when a given memory-resident function is re-invoked, it commonly finds its on-chip microarchitectural state completely cold due to thrashing by other functions — a phenomenon termed lukewarm invocation. Our analysis shows that the cold microarchitectural state due to lukewarm invocations is highly detrimental to performance, which corroborates prior work. The main source of performance degradation is the front-end, composed of instruction delivery, branch identification via the BTB and the conditional branch prediction. State-of-the-art front-end prefetchers show only limited effectiveness on lukewarm invocations, falling considerably short of an ideal front-end. We demonstrate that the reason for this is the cold microarchitectural state of the branch identification and prediction units. In response, we introduce Ignite, a comprehensive restoration mechanism for front-end microarchitectural state targeting instructions, BTB and branch predictor via unified metadata. Ignite records an invocation’s control flow graph in compressed format and uses that to restore the front-end structures the next time the function is invoked. Ignite outperforms state-of-the-art front-end prefetchers, improving performance by an average of 43% by significantly reducing instruction, BTB and branch predictor MPKI.

DOI: 10.1145/3613424.3614258


ArchExplorer: Microarchitecture Exploration Via Bottleneck Analysis

Authors: Bai, Chen and Huang, Jiayi and Wei, Xuechao and Ma, Yuzhe and Li, Sicheng and Zheng, Hongzhong and Yu, Bei and Xie, Yuan
Keywords: Microprocessor, Microarchitecture, Design Space Exploration

Abstract

Design space exploration (DSE) for microarchitecture parameters is an essential stage in microprocessor design to explore the trade-offs among performance, power, and area (PPA). Prior work either employs excessive expert efforts to guide microarchitecture parameter tuning or demands high computing resources to prepare datasets and train black-box prediction models for DSE. In this work, we aim to circumvent the domain knowledge requirements through automated bottleneck analysis and propose ArchExplorer, which reveals microarchitecture bottlenecks to guide DSE with much fewer simulations. ArchExplorer consists of a new graph formulation of microexecution, an optimal critical path construction algorithm, and hardware resource reassignment strategies. Specifically, the critical path is constructed from the microexecution to uncover the performance-critical microarchitecture bottlenecks, which allows ArchExplorer to reclaim the hardware budgets of performance-insensitive structures that consume unnecessary power and area. These budgets are then reassigned to the microarchitecture bottlenecks for performance boost while maintaining the power and area constraints under the total budget envelope. Experiments show that ArchExplorer can find better PPA Pareto-optimal designs, achieving, on average, a higher Pareto hypervolume while using fewer simulations than the state-of-the-art approaches.
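
The bottleneck analysis rests on finding a critical (longest) path through a dependence graph of the microexecution; a generic sketch of that step on a toy event DAG (edge weights standing in for stall cycles; not ArchExplorer's exact graph formulation) is:

from collections import defaultdict

# Toy microexecution DAG: an edge (u, v, cycles) means event v waits `cycles` after u.
edges = [("fetch0", "decode0", 1), ("decode0", "issue0", 1),
         ("issue0", "exec0", 3),   ("fetch0", "fetch1", 1),
         ("fetch1", "decode1", 1), ("decode1", "issue1", 4),  # e.g. an issue-queue stall
         ("issue1", "exec1", 1),   ("exec0", "exec1", 2)]

succ, indeg, dist, pred = defaultdict(list), defaultdict(int), defaultdict(int), {}
for u, v, w in edges:
    succ[u].append((v, w))
    indeg[v] += 1

order = [n for n in {u for u, _, _ in edges} if indeg[n] == 0]
for n in order:                                   # Kahn topological sweep
    for v, w in succ[n]:
        if dist[n] + w > dist[v]:
            dist[v], pred[v] = dist[n] + w, n     # longest-path relaxation
        indeg[v] -= 1
        if indeg[v] == 0:
            order.append(v)

end = max(dist, key=dist.get)
path = [end]
while path[-1] in pred:
    path.append(pred[path[-1]])
print(dist[end], list(reversed(path)))            # critical events to attack first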

DOI: 10.1145/3613424.3614289


Session details: Session 2C: Accelerators for Graphs, Robotics

Authors: Neuman, Sabrina
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637185


DF-GAS: A Distributed FPGA-as-a-Service Architecture towards Billion-Scale Graph-based Approximate Nearest Neighbor Search

Authors: Zeng, Shulin and Zhu, Zhenhua and Liu, Jun and Zhang, Haoyu and Dai, Guohao and Zhou, Zixuan and Li, Shuangchen and Ning, Xuefei and Xie, Yuan and Yang, Huazhong and Wang, Yu
Keywords: Graph, FPGA, Embedding Retrieval, Distributed Architecture, Approximate Nearest Neighbor Search

Abstract

Embedding retrieval is a crucial task for recommendation systems. Graph-based approximate nearest neighbor search (GANNS) is the most commonly used method for retrieval, and achieves the best performance on billion-scale datasets. Unfortunately, existing CPU- and GPU-based GANNS systems struggle to optimize throughput under latency constraints on billion-scale datasets, due to the underutilized local memory bandwidth (5-45%) and the expensive remote data access overhead (~85% of the total latency). In this paper, we first introduce a practically ideal GANNS architecture for billion-scale datasets, which facilitates a detailed analysis of the challenges and characteristics of distributed GANNS systems. Then, at the architecture level, we propose DF-GAS, a Distributed FPGA-as-a-Service (FPaaS) architecture for accelerating billion-scale Graph-based Approximate nearest neighbor Search. DF-GAS uses a feature-packing memory access engine and a data prefetching and delayed processing scheme to increase local memory bandwidth by 36-42% and reduce remote data access overhead by 76.2%, respectively. At the system level, we exploit a “full-graph + sub-graph” hybrid parallel search scheme on the distributed FPaaS system. It achieves million-level queries-per-second with sub-millisecond latency on billion-scale GANNS for the first time. Extensive evaluations on million-scale and billion-scale datasets show that DF-GAS achieves an average of 55.4×…
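
For readers unfamiliar with GANNS, the search primitive being accelerated is a greedy best-first walk over a proximity graph; a minimal sketch on a toy graph (Euclidean distance, made-up sizes) is:

import heapq
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((200, 16)).astype(np.float32)        # base vectors
# Toy proximity graph: connect each node to its 8 nearest neighbors.
d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
graph = {i: list(np.argsort(d2[i])[1:9]) for i in range(len(points))}

def greedy_search(query, entry=0, ef=16):
    """Best-first graph walk: repeatedly expand the closest unvisited node."""
    dist = lambda i: float(((points[i] - query) ** 2).sum())
    visited, cand, best = {entry}, [(dist(entry), entry)], [(-dist(entry), entry)]
    while cand:
        d, node = heapq.heappop(cand)
        if d > -best[0][0]:                                # no closer frontier left
            break
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                dn = dist(nb)
                heapq.heappush(cand, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)                    # keep the ef best results
    return sorted((-d, i) for d, i in best)

print(greedy_search(rng.random(16).astype(np.float32))[:3])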

DOI: 10.1145/3613424.3614292


Dadu-RBD: Robot Rigid Body Dynamics Accelerator with Multifunctional Pipelines

Authors: Yang, Yuxin and Chen, Xiaoming and Han, Yinhe
Keywords: Robotics, Rigid Body Dynamics, Pipeline, Multifunctional, Dataflow, Accelerator

Abstract

Rigid body dynamics is a core technology in the robotics field. In trajectory optimization and model predictive control algorithms, there are usually a large number of rigid body dynamics computing tasks. Using CPUs to process these tasks consumes a lot of time, which will affect the real-time performance of robots. To this end, we propose a multifunctional robot rigid body dynamics accelerator, named Dadu-RBD, to address the performance bottleneck. By analyzing different functions commonly used in robot dynamics calculations, we summarize their relationships and characteristics, then optimize them according to the hardware. Based on this, Dadu-RBD can fully reuse common hardware modules when processing different computing tasks. By dynamically switching the dataflow path, Dadu-RBD can accelerate various dynamics functions without reconfiguring the hardware. We design the Round-Trip Pipeline and Structure-Adaptive Pipelines for Dadu-RBD, which can greatly improve the throughput of the accelerator. Robots with different structures and parameters can be optimized specifically. Compared with the state-of-the-art CPU, GPU dynamics libraries and FPGA accelerator, Dadu-RBD can significantly improve the performance.

DOI: 10.1145/3613424.3614298


MEGA Evolving Graph Accelerator

Authors: Gao, Chao and Afarin, Mahbod and Rahman, Shafiur and Abu-Ghazaleh, Nael and Gupta, Rajiv
Keywords: temporal locality, redundancy removal, iterative graph algorithms, evolving graphs, common graph, batch-oriented execution

Abstract

Graph processing is an emerging workload for applications working with unstructured data, such as social network analysis, transportation networks, bioinformatics and operations research. We examine the problem of graph analytics over evolving graphs, which are graphs that change over time. The problem is challenging because it requires evaluation of a graph query on a sequence of graph snapshots over a time window, typically to track the progression of a property over time. In this paper, we introduce MEGA, a hardware accelerator designed for efficiently evaluating queries over evolving graphs. MEGA leverages CommonGraph, a recently proposed software approach for incrementally processing evolving graphs that gains efficiency by converting expensive deletions into additions, avoiding the need to process deletions directly. MEGA supports incremental event-based streaming of edge additions as well as execution of multiple snapshots concurrently to support evolving graphs. We propose Batch-Oriented-Execution (BOE), a novel batch-update scheduling technique that activates snapshots that share batches simultaneously to achieve both computation and data reuse. We introduce optimizations that pack compatible batches together, and pipeline batch processing. To the best of our knowledge, MEGA is the first graph accelerator for evolving graphs that evaluates graph queries over multiple snapshots simultaneously. MEGA achieves 24×…
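
The CommonGraph idea MEGA builds on can be stated in a few lines of set algebra (a toy sketch over edge sets, not the accelerator's data structures): intersect the snapshots once, then express every snapshot as additions on top of the common graph, so deletions never need to be processed.

# Three toy snapshots of an evolving graph, each an edge set.
snapshots = [
    {(1, 2), (2, 3), (3, 4), (4, 5)},
    {(1, 2), (2, 3), (4, 5), (5, 6)},          # (3,4) deleted, (5,6) added
    {(1, 2), (2, 3), (5, 6), (6, 7)},          # (4,5) deleted, (6,7) added
]

common = set.intersection(*snapshots)           # edges present in every snapshot
batches = [snap - common for snap in snapshots] # per-snapshot work is additions only

print("common graph:", sorted(common))
for i, b in enumerate(batches):
    print(f"snapshot {i}: common + {sorted(b)}")

# A query evaluated incrementally starts from the result on `common` and then streams
# each snapshot's addition batch; batches shared by several snapshots can be scheduled
# together, which is the batch-oriented execution the abstract mentions.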

DOI: 10.1145/3613424.3614260


Session details: Session 3A: ML Sparsity

Authors: Panda, Biswabandan
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637186


Eureka: Efficient Tensor Cores for One-sided Unstructured Sparsity in DNN Inference

Authors: Gondimalla, Ashish and Thottethodi, Mithuna and Vijaykumar, T. N.
Keywords: unstructured sparsity, deep neural network (DNN) inference, one-sided sparsity, Tensor cores

Abstract

Deep neural networks (DNNs), while enormously popular, continue to place ever higher compute demand for which GPUs provide specialized matrix multipliers called tensor cores. To reduce the compute demand via sparsity, Nvidia Ampere’s tensor cores support 2:4 structured sparsity in the filters (i.e., two non-zeros out of four values) which provides uniform 50% sparsity without any load imbalance issues. Consequently, the sparse tensor cores maintain (input or output) operand stationarity, which is fundamental for avoiding high-overhead hardware, requiring only one extra 4-1 multiplexer per multiply-accumulate unit (MAC). However, 2:4 sparsity is limited to 2x improvements in performance and energy without loss of accuracy, whereas unstructured sparsity provides 5-6x opportunity albeit while causing load imbalance. Previous papers on unstructured sparsity incur high hardware overhead (e.g., buffering, crossbars, scatter-gather networks, and address calculators) mainly due to sacrificing operand stationarity in favor of load balance. To avoid adding high overheads to the highly-efficient tensor cores, we propose Eureka, an efficient tensor core for unstructured sparsity. Eureka addresses load imbalance via three contributions: (1) Our key insight is that a slight weakening of output stationarity achieves load balance most of the time while incurring only a modest hardware overhead. Accordingly, we propose single-step uni-directional displacement (SUDS), where a filter element’s multiplication can either occur in its original position or be displaced to a vacant MAC in the adjacent row below while the accumulation occurs in the original row to restore output stationarity. SUDS is an offline technique for inference. (2) We provide an optimal algorithm for work assignment for SUDS. (3) To achieve fewer bubbles in the tensor core’s systolic pipeline due to the irregularity of unstructured sparsity, we propose offline systolic scheduling to group together the sparse filters with similar, statically-known execution times (based on the number of non-zeros). Our evaluation shows that Eureka achieves 4.8x and 2.4x speedups, and 3.1x and 1.8x energy reductions over dense and 2:4 sparse (Ampere) implementations, respectively, and incurs area and power overheads of 6% and 11.5%, respectively, over Ampere.

DOI: 10.1145/3613424.3614312


RM-STC: Row-Merge Dataflow Inspired GPU Sparse Tensor Core for Energy-Efficient Sparse Acceleration

Authors: Huang, Guyue and Wang, Zhengyang and Tsai, Po-An and Zhang, Chen and Ding, Yufei and Xie, Yuan
Keywords: Sparse Tensor Core, GPU, DNN Acceleration

Abstract

This paper proposes RM-STC, a novel GPU tensor core architecture designed for sparse Deep Neural Networks (DNNs) with two key innovations: (1) native support for both training and inference and (2) high efficiency for all sparsity degrees. To achieve the first goal, RM-STC employs a uniform sparse encoding scheme that natively supports all operations holistically in forward and backward passes, thereby eliminating the need for costly sparse encoding transformation in between. For the second goal, RM-STC takes inspiration from the row-merge dataflow and combines the input-gathering and output-scattering hardware features to minimize the energy overhead. Experiments show that RM-STC achieves significant speedups and energy efficiency improvements over dense tensor cores and previous sparse tensor cores.

DOI: 10.1145/3613424.3623775


Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads

Authors: Fan, Hongxiang and Venieris, Stylianos I. and Kouris, Alexandros and Lane, Nicholas
Keywords: Sparse Multi-DNN Scheduling, Dynamic and Static Approach, Algorithm and Hardware Co-Design

Abstract

Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices, such as mobile phones where multiple tasks serve a single user for daily activities, and data centers, where various requests are raised from millions of users, as seen with large language models. To reduce the costly computational and memory requirements of these workloads, various efficient sparsification approaches have been introduced, resulting in widespread sparsity across different types of DNN models. In this context, there is an emerging need for scheduling sparse multi-DNN workloads, a problem that is largely unexplored in previous literature. This paper systematically analyses the use-cases of multiple sparse DNNs and investigates the opportunities for optimizations. Based on these findings, we propose Dysta, a novel bi-level dynamic and static scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling. Both static and dynamic components of Dysta are jointly designed at the software and hardware levels, respectively, to improve and refine the scheduling approach. To facilitate future progress in the study of this class of workloads, we construct a public benchmark that contains sparse multi-DNN workloads across different deployment scenarios, spanning from mobile phones and AR/VR wearables to data centers. A comprehensive evaluation on the sparse multi-DNN benchmark demonstrates that our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4×…

DOI: 10.1145/3613424.3614263


Session details: Session 3B: GPUs

Authors: Vijaykumar, Nandita and Neuman, Sabrina
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637187


MAD MAcce: Supporting Multiply-Add Operations for Democratizing Matrix-Multiplication Accelerators

Authors: Sung, Seunghwan and Hur, Sujin and Kim, Sungwoo and Ha, Dongho and Oh, Yunho and Ro, Won Woo
Keywords: Tensor Cores, High Performance Computing, GPU

Abstract

Modern GPUs commonly employ specialized matrix multiplication units (MXUs) to accelerate matrix multiplication, the core computation of deep learning workloads. However, it is challenging to exploit the MXUs for GPGPU applications whose fundamental algorithms do not rely on matrix multiplication. Furthermore, an additional programming effort is necessary to tailor existing code or algorithms using dedicated APIs or libraries to utilize MXUs. Therefore, MXUs are often underutilized even when GPUs hunger for higher throughput. We observe that intensive multiply-and-add (MAD) instructions often become bottlenecks in compute-intensive applications. Furthermore, such MAD instructions create computations similar to the dot-product operations of MXUs when they have data dependency. By leveraging these observations, we propose a novel MXU architecture called MAD MAcce that can handle both matrix multiplication and MAD operations. In our design, the GPU compiler detects target MAD instructions by analyzing the instruction stream and generates new instructions for MAD MAcce in a programmer-transparent manner. Then, MAD MAcce executes the newly generated instructions. By offloading MAD operations to the MXUs, GPUs can exploit the high throughput of MXUs for various domains without significant hardware modification or additional programming efforts. In our evaluation, MAD MAcce achieves up to 2.13×…
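
The observation about dependent MAD instructions is easy to see in scalar form: a chain acc = a[i] * b[i] + acc is exactly the dot product an MXU computes natively. A tiny sketch (illustrative only, not the compiler pass):

a = [1.5, 2.0, -0.5, 4.0]
b = [2.0, 0.5, 8.0, 0.25]

# What the scalar code does: a chain of dependent multiply-add (MAD) instructions.
acc = 0.0
for x, y in zip(a, b):
    acc = x * y + acc            # each MAD consumes the previous MAD's result

# What an MXU natively computes: a dot product. The dependent MAD chain above
# is the same computation, which is why it can be offloaded to the tensor unit.
dot = sum(x * y for x, y in zip(a, b))
assert acc == dot
print(acc)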

DOI: 10.1145/3613424.3614247


Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads

Authors: Li, Ying and Sun, Yifan and Jog, Adwait
Keywords: Performance Model, Graphics Processing Units, Deep Neural Networks

Abstract

Today, DNNs’ high computational complexity and sub-optimal device utilization present a major roadblock to democratizing DNNs. To reduce the execution time and improve device utilization, researchers have been proposing new system design solutions, which require performance models (especially GPU models) to help them with pre-product concept validation. Currently, researchers have been utilizing simulators to predict execution time, which provides high flexibility and acceptable accuracy, but at the cost of a long simulation time. Simulators are becoming increasingly impractical for modeling today’s large-scale systems and DNNs, urging us to find alternative lightweight solutions. To solve this problem, we propose using a data-driven method for modeling DNN system performance. We first build a dataset that includes the execution time of numerous networks/layers/kernels. After identifying the relationships of directly known information (e.g., network structure, hardware theoretical computing capabilities), we discuss how to build a simple, yet accurate, performance model for DNN execution time. Our observations on the dataset demonstrate prevalent linear relationships between the GPU kernel execution times, operation counts, and input/output parameters of DNN layers. Guided by our observations, we develop a fast, linear-regression-based DNN execution time predictor. Our evaluation using various image classification models suggests our method can predict the performance of new DNNs with a 7% error and of new GPUs with a 15.2% error. Our case studies also demonstrate how the performance model can facilitate future DNN system research.
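
A minimal version of such a linear predictor (synthetic data; the features and roofline constants below are assumptions, not the paper's exact feature set) fits kernel execution time against operation count and bytes moved with ordinary least squares:

import numpy as np

rng = np.random.default_rng(1)
n = 200
flops = rng.uniform(1e9, 1e12, n)                 # per-kernel operation counts
bytes_moved = rng.uniform(1e7, 1e10, n)           # per-kernel memory traffic
peak_flops, peak_bw = 1.5e13, 9.0e11              # assumed GPU capability parameters

# Synthetic "measured" time: linear in ops and bytes plus launch overhead and noise.
time = flops / peak_flops + bytes_moved / peak_bw + 5e-6 + rng.normal(0, 2e-6, n)

X = np.column_stack([flops, bytes_moved, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, time, rcond=None)   # ordinary least squares fit

pred = X @ coef
err = np.mean(np.abs(pred - time) / time)
print(coef, f"mean relative error ≈ {err:.1%}")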

DOI: 10.1145/3613424.3614277


G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations

Authors: Zhang, Haoyang and Zhou, Yirui and Xue, Yuqi and Liu, Yiqi and Huang, Jian
Keywords: Unified Virtual Memory, Solid State Drives, GPUDirect Storage, GPU Memory, Deep Learning Compiler

Abstract

To break the GPU memory wall for scaling deep learning workloads, a variety of architecture and system techniques have been proposed recently. Their typical approaches include memory extension with flash memory and direct storage access. However, these techniques still suffer from suboptimal performance and introduce complexity to the GPU memory management, making them hard to meet the scalability requirement of deep learning workloads today. In this paper, we present a unified GPU memory and storage architecture named G10 driven by the fact that the tensor behaviors of deep learning workloads are highly predictable. G10 integrates the host memory, GPU memory, and flash memory into a unified memory space, to scale the GPU memory capacity while enabling transparent data migrations. Based on this unified GPU memory and storage architecture, G10 utilizes compiler techniques to characterize the tensor behaviors in deep learning workloads. Therefore, it can schedule data migrations in advance by considering the available bandwidth of flash memory and host memory. The cooperative mechanism between deep learning compilers and the unified memory architecture enables G10 to hide data transfer overheads in a transparent manner. We implement G10 based on an open-source GPU simulator. Our experiments demonstrate that G10 outperforms state-of-the-art GPU memory solutions by up to 1.75×…

DOI: 10.1145/3613424.3614309


Session details: Session 4A: ML Architecture

Authors: Tsai, Po-An
Keywords: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637188


MAICC : A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference

Authors: Fan, Renhao and Cui, Yikai and Chen, Qilin and Wang, Mingyu and Zhang, Youhui and Zheng, Weimin and Li, Zhaolin
Keywords: No keywords

Abstract

The growing complexity and diversity of neural networks in the fields of autonomous driving and intelligent robots have facilitated the research of many-core architectures, which can offer sufficient programming flexibility to simultaneously support multi-DNN parallel inference with different network structures and sizes compared to domain-specific architectures. However, due to the tight constraints of area and power consumption, many-core architectures typically use lightweight scalar cores without vector units and are almost unable to meet the high-performance computing needs of multi-DNN parallel inference. To solve the above problem, we design an area- and energy-efficient many-core architecture by integrating a large number of lightweight processor cores with the RV32IMA ISA. The architecture leverages the emerging SRAM-based computing-in-memory technology to implement vector instruction extensions by reusing memory cells in the data cache instead of conventional logic circuits. Thus, the data cache in each core can be reconfigured as the memory part and the computing part, with the latter tightly coupled with the core pipeline, enabling parallel execution of the basic RISC-V instructions and the extended multi-cycle vector instructions. Furthermore, a corresponding execution framework is proposed to effectively map DNN models onto the many-core architecture by using intra-layer and inter-layer pipelining, which potentially supports multi-DNN parallel inference. Experimental results show that the proposed MAICC architecture obtains a 4.3×…

DOI: 10.1145/3613424.3614268


Cambricon-U: A Systolic Random Increment Memory Architecture for Unary Computing

Authors: Guo, Hongrui and Zhao, Yongwei and Li, Zhangmai and Hao, Yifan and Liu, Chang and Song, Xinkai and Li, Xiaqing and Du, Zidong and Zhang, Rui and Guo, Qi and Chen, Tianshi and Xu, Zhiwei
Keywords: unary computing, systolic array, skew number

Abstract

Unary computing, whose arithmetics require only one logic gate, has enabled efficient DNN processing, especially on strictly power-constrained devices. However, unary computing still confronts the power efficiency bottleneck for buffering unary bitstreams. The buffering of unary bitstreams requires accumulating bits into large bitwidth binary numbers. The large bitwidth binary number needs to activate all bits per cycle in case of carry propagation. As a result, the accumulation process accounts for 32%-70% of the power budget. To push the boundary of power efficiency, we propose Cambricon-U, a systolic random increment memory architecture featuring efficient accumulation. By leveraging the skew number data format, Cambricon-U only activates no more than three bits (instead of all bits) from each number per accumulating cycle. Experimental results show that Cambricon-U reduces 51% power on unary accumulation, and improves 1.18-1.45×…

DOI: 10.1145/3613424.3614286


Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training

Authors: Kim, Jungwoo and Na, Seonjin and Lee, Sanghyeon and Lee, Sunho and Huh, Jaehyuk
Keywords: scheduling, on-chip memory, accelerators, DNN training

Abstract

During training tasks for machine learning models with neural processing units (NPUs), the most time-consuming part is the backward pass, which incurs significant overheads due to off-chip memory accesses. For NPUs, to mitigate the long latency and limited bandwidth of such off-chip DRAM accesses, the software-managed on-chip scratchpad memory (SPM) plays a crucial role. As the backward pass computation must be optimized to improve the effectiveness of SPM, this study identifies a new data reuse pattern specific to the backward computation. The backward pass includes independent input and weight gradient computations sharing the same output gradient in each layer. Conventional sequential processing does not exploit the potential inter-operation data reuse opportunity within SPM. With this new opportunity of data reuse in the backward pass, this study proposes a novel data flow transformation scheme called interleaved gradient order, consisting of three techniques to enhance the utilization of NPU scratchpad memory. The first technique shuffles the input and weight gradient computations by interleaving two operations into a single fused operation to reduce redundant output gradient accesses. The second technique adjusts the tile access order for the interleaved gradient computations to maximize the potential data locality. However, since the best order is not fixed for all tensors, we propose a selection algorithm to find the most suitable order based on the tensor dimensions. The final technique further improves data reuse chances by using the best partitioning and mapping scheme for two gradient computations for single-core and multi-core NPUs. The simulation-based evaluation with single-core edge and server NPUs shows that the combined techniques can improve performance by 29.3% and 14.5% for edge and server NPUs respectively. Furthermore, with a quad-core server NPU, the proposed techniques reduce the execution time by 23.7%.
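
The reuse opportunity is visible in the backward pass of a plain fully connected layer: the input-gradient and weight-gradient computations consume the same output gradient dY, so an interleaved schedule can read each dY tile once. A NumPy sketch of the layer math (not the NPU tiling itself):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 512)).astype(np.float32)   # layer input
W = rng.standard_normal((512, 128)).astype(np.float32)   # layer weights
dY = rng.standard_normal((256, 128)).astype(np.float32)  # output gradient from the next layer

# Sequential processing: dY is streamed through on-chip memory twice.
dX = dY @ W.T          # input-gradient computation
dW = X.T @ dY          # weight-gradient computation (same dY operand)

# Interleaved gradient order (conceptually): for each tile of dY brought on chip,
# compute its contribution to both dX and dW before evicting it.
dX2 = np.zeros_like(X); dW2 = np.zeros_like(W)
for r in range(0, dY.shape[0], 64):                       # tile over the batch dimension
    t = dY[r:r + 64]
    dX2[r:r + 64] = t @ W.T
    dW2 += X[r:r + 64].T @ t
assert np.allclose(dX, dX2, atol=1e-3) and np.allclose(dW, dW2, atol=1e-3)
print("gradients match")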

DOI: 10.1145/3613424.3614299


TT-GNN: Efficient On-Chip Graph Neural Network Training via Embedding Reformation and Hardware Optimization

Authors: Qu, Zheng and Niu, Dimin and Li, Shuangchen and Zheng, Hongzhong and Xie, Yuan
Keywords: Tensor-train Decomposition, Hardware Accelerator, Graph Neural Networks

Abstract

Training Graph Neural Networks on large graphs is challenging due to the need to store graph data and move them along the memory hierarchy. In this work, we tackle this by effectively compressing the graph embedding matrix such that the model training can be fully enabled with on-chip compute and memory resources. Specifically, we leverage the graph homophily property and consider using Tensor-train to represent the graph embedding. This allows nodes with similar neighborhoods to partially share the feature representation. While applying Tensor-train reduces the size of the graph embedding, it imposes several challenges to hardware design. On one hand, utilizing low-rank representation requires the features to be decompressed before being sent to GNN models, which introduces extra computation overhead. On the other hand, the decompressed features might still exceed on-chip memory capacity even with the minibatch setting, causing inefficient off-chip memory access. Thus, we propose the TT-GNN hardware accelerator with a specialized dataflow tailored for on-chip Tensor-train GNN learning. Based on the on-chip memory capacity and training configuration, TT-GNN adaptively breaks down a minibatch into smaller microbatches that can be fitted on-chip. The microbatch composition and scheduling order are designed to maximize data reuse and reduce redundant computations both across and within microbatches. To mitigate TT computation overhead, we further propose a unified algorithm to jointly handle TT decompression during forward propagation and TT gradient derivation during backward propagation. Evaluated on a series of benchmarks, the proposed software-hardware solution is able to outperform existing CPU-GPU training systems on both training performance (1.55–4210×…

DOI: 10.1145/3613424.3614305


Supporting Energy-Based Learning with an Ising Machine Substrate: A Case Study on RBM

作者: Vengalam, Uday Kumar Reddy and Liu, Yongchao and Geng, Tong and Wu, Hui and Huang, Michael
关键词: No keywords

Abstract

Nature apparently does a lot of computation constantly. If we can harness some of that computation at an appropriate level, we can potentially perform certain types of computation (much) faster and more efficiently than we can with a von Neumann computer. Indeed, many powerful algorithms are inspired by nature and are thus prime candidates for nature-based computation. One particular branch of this effort that has seen some recent rapid advances is Ising machines. Some Ising machines are already showing better performance and energy efficiency for optimization problems. Through design iterations and co-evolution between hardware and algorithm, we expect more benefits from nature-based computing systems in the future. In this paper, we make a case for an augmented Ising machine suitable for both training and inference using an energy-based machine learning algorithm. We show that with a small change, the Ising substrate accelerates key parts of the algorithm and achieves non-trivial speedup and efficiency gain. With a more substantial change, we can turn the machine into a self-sufficient gradient follower to virtually complete training entirely in hardware. This can bring about a 29x speedup and about a 1000x reduction in energy compared to a Tensor Processing Unit (TPU) host.
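
A toy NumPy sketch of one contrastive-divergence (CD-1) weight update for a binary RBM, the energy-based model targeted here. In the proposed system the Ising substrate would supply the negative-phase samples; a plain Gibbs step stands in for it below. All sizes and the training vector are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 16, 8, 0.1
W = 0.01 * rng.standard_normal((n_vis, n_hid))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

v0 = sample(np.full(n_vis, 0.5))            # a (fake) training visible vector
h0 = sample(sigmoid(v0 @ W))                # positive phase
v1 = sample(sigmoid(h0 @ W.T))              # negative phase: one Gibbs step
h1 = sigmoid(v1 @ W)                        #   (the Ising machine's job in the paper)

W += lr * (np.outer(v0, h0) - np.outer(v1, h1))   # CD-1 "gradient follower" step
```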

DOI: 10.1145/3613424.3614315


Session details: Session 4B: Quantum

作者: Kobayashi, Hiroaki
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637189


QuComm: Optimizing Collective Communication for Distributed Quantum Computing

作者: Wu, Anbang and Ding, Yufei and Li, Ang
关键词: quantum computing, quantum compiler

Abstract

Distributed quantum computing (DQC) is a scalable way to build a large-scale quantum computing system. Previous compilers for DQC focus on either qubit-to-qubit inter-node gates or qubit-to-node nonlocal circuit blocks, missing opportunities of optimizing collective communication which consists of nonlocal gates over multiple nodes. In this paper, we observe that by utilizing patterns of collective communication, we can greatly reduce the amount of inter-node communication required to implement a group of nonlocal gates. We propose QuComm, the first compiler framework which unveils and analyzes collective communication patterns hidden in distributed quantum programs and efficiently routes inter-node gates on any DQC architecture based on discovered patterns, cutting down the overall communication cost of the target program. We also provide the first formalization of the communication buffer concept in DQC compiling. The communication buffer utilizes data qubits to store remote entanglement so that we can ensure enough communication resources on any DQC architecture to support the proposed optimizations for collective communication. Experimental results show that, compared to the state-of-the-art baseline, QuComm reduces the amount of inter-node communication by 54.9% on average, over various distributed quantum programs and DQC hardware configurations.

DOI: 10.1145/3613424.3614253


QuCT: A Framework for Analyzing Quantum Circuit by Extracting Contextual and Topological Features

作者: Tan, Siwei and Lang, Congliang and Xiang, Liang and Wang, Shudi and Jia, Xinghui and Tan, Ziqi and Li, Tingting and Yin, Jieming and Shang, Yongheng and Python, Andre and Lu, Liqiang and Yin, Jianwei
关键词: quantum error correction, quantum computing, quantum circuit synthesis

Abstract

In the current Noisy Intermediate-Scale Quantum era, quantum circuit analysis is an essential technique for designing high-performance quantum programs. Current analysis methods exhibit either accuracy limitations or high computational complexity for obtaining precise results. To ease this tradeoff, we propose QuCT, a unified framework for extracting, analyzing, and optimizing quantum circuits. The main innovation of QuCT is to vectorize each gate, with each element quantitatively describing the degree of interaction with neighboring gates. Extending from the vectorization model, we propose two representative downstream models for fidelity prediction and unitary decomposition. The fidelity prediction model performs a linear transformation on all gate vectors and aggregates the results to estimate the overall circuit fidelity. By identifying critical weights in the transformation matrix, we propose two optimizations to improve the circuit fidelity. In the unitary decomposition model, we significantly reduce the search space by bridging the gap between unitary and circuit via gate vectors. Experiments show that QuCT improves the accuracy of fidelity prediction by 4.2×.
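
A hedged sketch of the fidelity-prediction idea as described above: each gate is a feature vector, a learned linear map scores it, and the per-gate scores are aggregated into a circuit-level estimate. The feature layout, weights, and multiplicative aggregation are assumptions for illustration only, not the paper's model.

```python
import numpy as np

D = 6                                   # length of each gate vector (assumed)
w = np.random.rand(D) * 0.01            # the learned linear transformation (here: random)

def predict_fidelity(gate_vectors):
    # score each gate, read the score as a per-gate error contribution,
    # then aggregate multiplicatively into a circuit fidelity estimate
    per_gate_error = np.clip(gate_vectors @ w, 0.0, 1.0)
    return float(np.prod(1.0 - per_gate_error))

circuit = np.random.rand(20, D)         # 20 gates, each already vectorized
print(predict_fidelity(circuit))
```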

DOI: 10.1145/3613424.3614274


ERASER: Towards Adaptive Leakage Suppression for Fault-Tolerant Quantum Computing

作者: Vittal, Suhas and Das, Poulami and Qureshi, Moinuddin
关键词: Quantum Error Correction, Leakage Suppression

Abstract

Quantum error correction (QEC) codes can tolerate hardware errors by encoding fault-tolerant logical qubits using redundant physical qubits and detecting errors using parity checks. Leakage errors occur in quantum systems when a qubit leaves its computational basis and enters higher energy states. These errors severely limit the performance of QEC due to two reasons. First, they lead to erroneous parity checks that obfuscate the accurate detection of errors. Second, the leakage spreads to other qubits and creates a pathway for more errors over time. Prior works tolerate leakage errors by using leakage reduction circuits (LRCs) that modify the parity check circuitry of QEC codes. Unfortunately, naively using LRCs at all times throughout a program is sub-optimal because LRCs incur additional two-qubit operations that (1) facilitate leakage transport, and (2) serve as new sources of errors. Ideally, LRCs should only be used if leakage occurs, so that errors from both leakage as well as additional LRC operations are simultaneously minimized. However, identifying leakage errors in real-time is challenging. To enable the robust and efficient usage of LRCs, we propose ERASER, which speculates the subset of qubits that may have leaked and only uses LRCs for those qubits. Our studies show that the majority of leakage errors typically impact the parity checks. We leverage this insight to identify the leaked qubits by analyzing the patterns in the failed parity checks. We propose ERASER+M, which enhances ERASER by detecting leakage more accurately using qubit measurement protocols that can classify qubits into |0⟩, |1⟩ and |L⟩ states. ERASER and ERASER+M improve the logical error rate by up to 4.3×.
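
A toy sketch of the speculation policy described above: watch which parity checks keep failing, and flag the data qubits that touch persistently failing checks as possibly leaked, so LRCs are applied only to them. The check-to-qubit map and the persistence threshold are illustrative assumptions, not ERASER's actual decision logic.

```python
from collections import defaultdict

# which data qubits each parity check touches (toy layout)
check_qubits = {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 4]}
PERSIST = 3                 # consecutive failures before we suspect leakage

fail_streak = defaultdict(int)

def select_lrc_targets(failed_checks):
    """Update failure streaks and return the set of qubits to apply LRCs to."""
    for c in check_qubits:
        fail_streak[c] = fail_streak[c] + 1 if c in failed_checks else 0
    suspects = set()
    for c, streak in fail_streak.items():
        if streak >= PERSIST:
            suspects.update(check_qubits[c])
    return suspects

for round_failures in [{1}, {1}, {1, 2}, set()]:   # failed checks per QEC round
    print(select_lrc_targets(round_failures))      # flags qubits 1 and 2 in round 3
```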

DOI: 10.1145/3613424.3614251


Systems Architecture for Quantum Random Access Memory

作者: Xu, Shifan and Hann, Connor T. and Foxman, Ben and Girvin, Steven M. and Ding, Yongshan
关键词: Quantum Random Access Memory, Quantum Computing

Abstract

Operating on the principles of quantum mechanics, quantum algorithms hold the promise for solving problems that are beyond the reach of the best-available classical algorithms. An integral part of realizing such speedup is the implementation of quantum queries, which read data into forms that quantum computers can process. Quantum random access memory (QRAM) is a promising architecture for realizing quantum queries. However, implementing QRAM in practice poses significant challenges, including query latency, memory capacity and fault-tolerance. In this paper, we propose the first end-to-end system architecture for QRAM. First, we introduce a novel QRAM that hybridizes two existing implementations and achieves asymptotically superior scaling in space (qubit number) and time (circuit depth). Like in classical virtual memory, our construction enables queries to a virtual address space larger than what is actually available in hardware. Second, we present a compilation framework to synthesize, map, and schedule QRAM circuits on realistic hardware. For the first time, we demonstrate how to embed large-scale QRAM on a 2D Euclidean space, such as a 2D square grid layout, with minimal routing overhead. Third, we show how to leverage the intrinsic biased-noise resilience of the proposed QRAM for implementation on either Noisy Intermediate-Scale Quantum (NISQ) or Fault-Tolerant Quantum Computing (FTQC) hardware. Finally, we validate these results numerically via both classical simulation and quantum hardware experimentation. Our novel Feynman-path-based simulator allows for efficient simulation of noisy QRAM circuits at a larger scale than previously possible. Collectively, our results outline the set of software and hardware controls needed to implement practical QRAM.

DOI: 10.1145/3613424.3614270


HetArch: Heterogeneous Microarchitectures for Superconducting Quantum Systems

作者: Stein, Samuel and Sussman, Sara and Tomesh, Teague and Guinn, Charles and Tureci, Esin and Lin, Sophia Fuhui and Tang, Wei and Ang, James and Chakram, Srivatsan and Li, Ang and Martonosi, Margaret and Chong, Fred and Houck, Andrew A. and Chuang, Isaac L. and Demarco, Michael
关键词: Superconducting Quantum Systems, Quantum Computing Architecture, Quantum Computing

Abstract

Noisy Intermediate-Scale Quantum Computing (NISQ) has dominated headlines in recent years, with the longer-term vision of Fault-Tolerant Quantum Computation (FTQC) offering significant potential albeit at currently intractable resource costs and quantum error correction (QEC) overheads. For problems of interest, FTQC will require millions of physical qubits with long coherence times, high-fidelity gates, and compact sizes to surpass classical systems. Just as heterogeneous specialization has offered scaling benefits in classical computing, it is likewise gaining interest in FTQC. However, systematic use of heterogeneity in either hardware or software elements of FTQC systems remains a serious challenge due to the vast design space and variable physical constraints. This paper meets the challenge of making heterogeneous FTQC design practical by introducing HetArch, a toolbox for designing heterogeneous quantum systems, and using it to explore heterogeneous design scenarios. Using a hierarchical approach, we successively break quantum algorithms into smaller operations (akin to classical application kernels), thus greatly simplifying the design space and resulting tradeoffs. Specializing to superconducting systems, we then design optimized heterogeneous hardware composed of varied superconducting devices, abstracting physical constraints into design rules that enable devices to be assembled into standard cells optimized for specific operations. Finally, we provide a heterogeneous design space exploration framework which reduces the simulation burden by a factor of 104 or more and allows us to characterize optimal design points. We use these techniques to design superconducting quantum modules for entanglement distillation, error correction, and code teleportation, reducing error rates by 2.6 \texttimes{

DOI: 10.1145/3613424.3614300


Session details: Session 4C: Emerging Technologies: Superconducting, Photonics, DNA

作者: Inoue, Koji
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637190


Efficiently Enabling Block Semantics and Data Updates in DNA Storage

作者: Sharma, Puru and Lim, Cheng-Kai and Lin, Dehui and Pote, Yash and Jevdjic, Djordje
关键词: Data Updates in DNA Storage, DNA Storage, Block Storage

Abstract

We propose a novel and flexible DNA-storage architecture, which divides the storage space into fixed-size units (blocks) that can be independently and efficiently accessed at random for both read and write operations, and further allows efficient sequential access to consecutive data blocks. In contrast to prior work, in our architecture a pair of random-access PCR primers of length 20 does not define a single object, but an independent storage partition, which is internally blocked and managed independently of other partitions. We expose the flexibility and constraints with which the internal address space of each partition can be managed, and incorporate them into our design to provide rich and functional storage semantics, such as block-storage organization, efficient implementation of data updates, and sequential access. To leverage the full power of the prefix-based nature of PCR addressing, we define a methodology for transforming the internal addressing scheme of a partition into an equivalent that is PCR-compatible. This allows us to run PCR with primers that can be variably elongated to include a desired part of the internal address, and thus narrow down the scope of the reaction to retrieve a specific block or a range of blocks within the partition with sufficiently high accuracy. Our wetlab evaluation demonstrates the practicality of the proposed ideas and a 140x reduction in sequencing cost and latency for retrieval of individual blocks within the partition.
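
A schematic of the prefix-based PCR addressing described above: the 20-nt partition primer can be elongated with part of a block's internal address so the reaction narrows to one block or a contiguous range of blocks. The 2-bits-per-base encoding and the primer sequence are illustrative assumptions, not the paper's coding scheme.

```python
BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}

def addr_to_bases(block_id, bits):
    """Encode the top `bits` bits of a block address as DNA bases (2 bits/base)."""
    out = []
    for shift in range(bits - 2, -1, -2):
        out.append(BASE[(block_id >> shift) & 0b11])
    return "".join(out)

partition_primer = "ACGTACGTACGTACGTACGT"          # hypothetical 20-nt partition primer

# selecting one block (full internal address) vs. a 4-block range (shorter
# prefix -> coarser selection), both by elongating the same partition primer
print(partition_primer + addr_to_bases(0b101101, bits=6))   # one specific block
print(partition_primer + addr_to_bases(0b1011,   bits=4))   # the range 1011xx
```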

DOI: 10.1145/3613424.3614308


ReFOCUS: Reusing Light for Efficient Fourier Optics-Based Photonic Neural Network Accelerator

作者: Li, Shurui and Yang, Hangbo and Wong, Chee Wei and Sorger, Volker J. and Gupta, Puneet
关键词: on-chip photonics, neural network accelerator, deep learning, Photonic neural network, Fourier optics, 4F system

Abstract

In recent years, there has been a significant focus on achieving low-latency and high-throughput convolutional neural network (CNN) inference. Integrated photonics offers the potential to substantially expedite neural networks due to its inherent low-latency properties. Recently, on-chip Fourier optics-based neural network accelerators have been demonstrated and achieved superior energy efficiency for CNN acceleration. By incorporating Fourier optics, computationally intensive convolution operations can be performed instantaneously through on-chip lenses at a significantly lower cost compared to other on-chip photonic neural network accelerators. This is thanks to the complexity reduction offered by the convolution theorem and the passive Fourier transforms computed by on-chip lenses. However, conversion overhead between optical and digital domains and memory access energy still hinder overall efficiency. We introduce ReFOCUS, a Joint Transform Correlator (JTC) based on-chip neural network accelerator that efficiently reuses light through optical buffers. By incorporating optical delay lines, wavelength-division multiplexing, dataflow, and memory hierarchy optimization, ReFOCUS minimizes both conversion overhead and memory access energy. As a result, ReFOCUS achieves a 2× improvement.
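
The abstract leans on the convolution theorem (convolution in the spatial domain equals pointwise multiplication in the Fourier domain), which is what the on-chip lenses of a 4F system compute passively. A minimal NumPy check of that identity for a 2D circular convolution, with a toy 8x8 size for brevity:

```python
import numpy as np

N = 8
img    = np.random.rand(N, N)
kernel = np.random.rand(N, N)

# direct circular convolution (what a sliding-window engine would compute)
direct = np.zeros((N, N))
for y in range(N):
    for x in range(N):
        for ky in range(N):
            for kx in range(N):
                direct[y, x] += img[(y - ky) % N, (x - kx) % N] * kernel[ky, kx]

# Fourier route: FFT -> pointwise product -> inverse FFT (the 4F lens path)
fourier = np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kernel)))

assert np.allclose(direct, fourier)
```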

DOI: 10.1145/3613424.3623798


SupeRBNN: Randomized Binary Neural Network Using Adiabatic Superconductor Josephson Devices

作者: Li, Zhengang and Yuan, Geng and Yamauchi, Tomoharu and Zabihi, Masoud and Xie, Yanyue and Dong, Peiyan and Tang, Xulong and Yoshikawa, Nobuyuki and Tiwari, Devesh and Wang, Yanzhi and Chen, Olivia
关键词: Superconducting, Stochastic Computing, Deep Learning, BNN, AQFP

Abstract

Adiabatic Quantum-Flux-Parametron (AQFP) is a superconducting logic with extremely high energy efficiency. By employing the distinct polarity of current to denote logic ‘0’ and ‘1’, AQFP devices serve as excellent carriers for binary neural network (BNN) computations. Although recent research has made initial strides toward developing an AQFP-based BNN accelerator, several critical challenges remain, preventing the design from being a comprehensive solution. In this paper, we propose SupeRBNN, an AQFP-based randomized BNN acceleration framework that leverages software-hardware co-optimization to eventually make the AQFP devices a feasible solution for BNN acceleration. Specifically, we investigate the randomized behavior of the AQFP devices and analyze the impact of crossbar size on current attenuation, subsequently formulating the current amplitude into the values suitable for use in BNN computation. To tackle the accumulation problem and improve overall hardware performance, we propose a stochastic computing-based accumulation module and a clocking scheme adjustment-based circuit optimization method. To effectively train the BNN models that are compatible with the distinctive characteristics of AQFP devices, we further propose a novel randomized BNN training solution that utilizes algorithm-hardware co-optimization, enabling simultaneous optimization of hardware configurations. In addition, we propose implementing batch normalization matching and the weight rectified clamp method to further improve the overall performance. We validate our SupeRBNN framework across various datasets and network architectures, comparing it with implementations based on different technologies, including CMOS, ReRAM, and superconducting RSFQ/ERSFQ. Experimental results demonstrate that our design achieves an energy efficiency improvement of approximately 7.8×.
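
For context on the stochastic-computing accumulation mentioned above, a quick refresher on the encoding it builds on: a value in [0, 1] is a random bitstream whose fraction of 1s equals the value, multiplication of independent streams is a bitwise AND, and accumulation can be done by counting 1s. The stream length is an assumption; this is background illustration, not the paper's module.

```python
import random

def to_stream(p, n=4096):
    """Encode probability p as a random bitstream of length n."""
    return [1 if random.random() < p else 0 for _ in range(n)]

a, b = 0.75, 0.5
sa, sb = to_stream(a), to_stream(b)
product_stream = [x & y for x, y in zip(sa, sb)]          # SC multiply
estimate = sum(product_stream) / len(product_stream)      # counter-based accumulate
print(round(estimate, 2), "≈", a * b)
```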

DOI: 10.1145/3613424.3623771


SuperBP: Design Space Exploration of Perceptron-Based Branch Predictors for Superconducting CPUs

作者: Zha, Haipeng and Tannu, Swamit and Annavaram, Murali
关键词: Single Flux Quantum, SFQ, Perceptron, Branch Prediction

Abstract

Single Flux Quantum (SFQ) superconducting technology has a considerable advantage over CMOS in power and performance. SFQ CPUs can also help scale quantum computing technologies, as SFQ circuits can be integrated with qubits due to their amenability to a cryogenic environment. Recently, there have been significant developments in VLSI design automation tools, making it feasible to design pipelined SFQ CPUs. SFQ technology, however, is constrained by the number of Josephson Junctions (JJs) integrated into a single chip. Prior works focused on JJ-efficient SFQ datapath designs. Pipelined SFQ CPUs also require branch predictors that provide the best prediction accuracy for a given JJ budget. In this paper, we design and evaluate the original Perceptron branch predictor and a later variant named the Hashed Perceptron predictor in terms of their accuracy and JJ usage. Since branch predictors, to date, have not been designed for SFQ CPUs, we first design a baseline predictor built using non-destructive readout (NDRO) cells for storing the perceptron weights. Given that NDRO cells are JJ intensive, we propose an enhanced JJ-efficient design, called SuperBP, that uses high-capacity destructive readout (HC-DRO) cells to store perceptron weights. HC-DRO is a recently introduced multi-bit fluxon storage cell that stores 2 bits per cell. HC-DRO cells double the weight storage density over basic DRO cells to improve prediction accuracy for a given JJ count. However, naive integration of HC-DRO with SFQ logic is inefficient as HC-DRO cells store multiple fluxons in a single cell, which needs a decoding step on a read and an encoding step on a write. SuperBP presents novel inference and prediction update circuits for the Perceptron predictor that can directly operate on the native 2-bit HC-DRO weights without decoding and encoding, thereby reducing the JJ use. SuperBP reduces the JJ count by 39% compared to the NDRO-based design. We evaluate the performance of Perceptron and its hashed variants with the HC-DRO cell design using a range of benchmarks, including several SPEC CPU 2017, mobile, and server traces from the 5th Championship Branch Predictor competition. Our evaluation shows that for a given JJ count, the basic Perceptron variant of SuperBP provides better accuracy than the hashed variant. The hashed variant uses multiple weight tables, each of which needs its own access decoder, and decoder designs in SFQ consume a significant number of JJs. Thus, the hashed variant of SuperBP wastes the JJ budget for accessing multiple tables, leaving a smaller weight storage capacity, which compromises prediction accuracy. The basic Perceptron variant of SuperBP improves prediction accuracy by 13.6% over the hashed perceptron variant for an exemplar 30K JJ budget.
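
A compact software model of the classic perceptron branch predictor that SuperBP implements in SFQ hardware. The 2-bit weight clamp below stands in for the paper's HC-DRO storage cells and is an illustrative assumption; real perceptron predictors typically use wider weights, longer histories, and larger tables.

```python
HIST_LEN, W_MIN, W_MAX, THETA = 8, -2, 1, 4        # 2-bit signed weight range (assumed)

table = [[0] * (HIST_LEN + 1) for _ in range(64)]  # per-PC weight rows (index 0 = bias)
history = [1] * HIST_LEN                           # global branch history as +/-1

def predict(pc):
    w = table[pc % len(table)]
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y, (y >= 0)                             # dot product and its sign

def update(pc, taken, y):
    t = 1 if taken else -1
    w = table[pc % len(table)]
    if (y >= 0) != taken or abs(y) <= THETA:       # train on mispredict / low confidence
        w[0] = max(W_MIN, min(W_MAX, w[0] + t))
        for i, hi in enumerate(history, start=1):
            w[i] = max(W_MIN, min(W_MAX, w[i] + t * hi))
    history.pop(0); history.append(t)              # shift in the new outcome

y, pred = predict(0x40)
update(0x40, taken=True, y=y)
```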

DOI: 10.1145/3613424.3614267


SUSHI: Ultra-High-Speed and Ultra-Low-Power Neuromorphic Chip Using Superconducting Single-Flux-Quantum Circuits

作者: Liu, Zeshi and Chen, Shuo and Qu, Peiyao and Liu, Huanli and Niu, Minghui and Ying, Liliang and Ren, Jie and Tang, Guangming and You, Haihang
关键词: Superconducting, Spiking Neural Networks, Single-Flux-Quantum, Neuromorphic

Abstract

The rapid single-flux-quantum (RSFQ) superconducting technology is highly promising due to its ultra-high-speed computation with ultra-low-power consumption, making it an ideal solution for the post-Moore era. In superconducting technology, information is encoded and processed based on pulses that resemble the neuronal pulses present in biological neural systems. This has led to a growing research focus on implementing neuromorphic processing using superconducting technology. However, current research on superconducting neuromorphic processing does not fully leverage the advantages of superconducting circuits due to incomplete neuromorphic design and approach. Although they have demonstrated the benefits of using superconducting technology for neuromorphic hardware, their designs are mostly incomplete, with only a few components validated, or based solely on simulation. This paper presents SUSHI (Superconducting neUromorphic proceSsing cHIp) to fully leverage the potential of superconducting neuromorphic processing. Based on three guiding principles and our architectural and methodological designs, we address existing challenges and enable the design of verifiable and fabricable superconducting neuromorphic chips. We fabricate and verify a SUSHI chip using superconducting circuit technology. The chip successfully produces correct inference results for a complete neural network, which is, to the best of our knowledge, the first time a neural network has been executed entirely on a superconducting chip. Our evaluation shows that using approximately 10^5 Josephson junctions, SUSHI achieves a peak neuromorphic processing performance of 1,355 giga-synaptic operations per second (GSOPS) and a power efficiency of 32,366 GSOPS per Watt (GSOPS/W). This power efficiency outperforms the state-of-the-art neuromorphic chips TrueNorth and Tianjic by 81 and 50 times, respectively.
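
A one-line sanity check on the headline numbers quoted above: peak throughput divided by power efficiency gives the implied chip power at peak.

```python
peak_gsops = 1355            # giga-synaptic operations per second (from the abstract)
gsops_per_watt = 32366       # power efficiency (from the abstract)
print(f"implied peak power ≈ {peak_gsops / gsops_per_watt * 1000:.0f} mW")   # about 42 mW
```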

DOI: 10.1145/3613424.3623787


Session details: Session 5A: Security Encryption/Confidentiality Support

作者: Saileshwar, Gururaj
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637191


AQ2PNN: Enabling Two-party Privacy-Preserving Deep Neural Network Inference with Adaptive Quantization

作者: Luo, Yukui and Xu, Nuo and Peng, Hongwu and Wang, Chenghong and Duan, Shijin and Mahmood, Kaleel and Wen, Wujie and Ding, Caiwen and Xu, Xiaolin
关键词: Two-party computing, Quantization, Privacy-Preserving machine learning, FPGA, Deep learning

Abstract

The growing prevalence of Machine Learning as a Service (MLaaS) enables a wide range of applications but simultaneously raises numerous security and privacy concerns. A key issue involves the potential privacy exposure of involved parties, such as the customer’s input data and the vendor’s model. Consequently, two-party computing (2PC) has emerged as a promising solution to safeguard the privacy of different parties during deep neural network (DNN) inference. However, the state-of-the-art (SOTA) 2PC-DNN techniques are tailored explicitly to traditional instruction set architecture (ISA) systems like CPUs and CPU+GPU. This reliance on ISA systems significantly constrains their energy efficiency, as these architectures typically employ 32- or 64-bit instruction sets. In contrast, the possibilities of harnessing dynamic and adaptive quantization to build high-performance 2PC-DNNs remain largely unexplored due to the lack of compatible algorithms and hardware accelerators. To mitigate the bottleneck of SOTA solutions and fill the existing research gaps, this work investigates the construction of 2PC-DNNs on field programmable gate arrays (FPGAs). We introduce AQ2PNN, an end-to-end framework that effectively employs adaptive quantization schemes to develop high-performance 2PC-DNNs on FPGAs. From an algorithmic perspective, AQ2PNN introduces an innovative 2PC-ReLU method to replace Yao’s Garbled Circuits (GC). Regarding hardware, AQ2PNN employs an extensive set of building blocks for linear operators and non-linear operators, and a specialized Oblivious Transfer (OT) module for secure data exchange. These algorithm-hardware co-designed modules make full use of the fine-grained reconfigurability of FPGAs to adapt the data bit-width of different DNN layers in the ciphertext domain, thereby reducing communication overhead between parties without compromising DNN performance, such as accuracy. We thoroughly assess AQ2PNN using widely adopted DNN architectures, including ResNet18, ResNet50, and VGG16, all trained on ImageNet and producing quantized models. Experimental results demonstrate that AQ2PNN outperforms SOTA solutions, achieving significantly reduced communication overhead and an energy efficiency improvement of 26.3×.
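
A minimal sketch of why bit-width matters for 2PC: values are additively secret-shared in a ring Z_{2^b}, and the traffic exchanged between the two parties scales with b, which is what adaptive quantization shrinks. The sharing below is textbook additive sharing, not AQ2PNN's protocol.

```python
import secrets

def share(x, bits):
    """Split integer x into two additive shares modulo 2**bits."""
    mod = 1 << bits
    s0 = secrets.randbelow(mod)
    s1 = (x - s0) % mod
    return s0, s1

def reconstruct(s0, s1, bits):
    return (s0 + s1) % (1 << bits)

x = 137
for bits in (32, 8, 4):                    # narrower rings -> fewer bits sent per value
    xs = x % (1 << bits)
    s0, s1 = share(xs, bits)
    assert reconstruct(s0, s1, bits) == xs
    print(f"{bits:>2}-bit ring: shares ({s0}, {s1}) reconstruct to {xs}")
```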

DOI: 10.1145/3613424.3614297


CHERIoT: Complete Memory Safety for Embedded Devices

作者: Amar, Saar and Chisnall, David and Chen, Tony and Filardo, Nathaniel Wesley and Laurie, Ben and Liu, Kunyan and Norton, Robert and Moore, Simon W. and Tao, Yucong and Watson, Robert N. M. and Xia, Hongyan
关键词: No keywords

Abstract

The ubiquity of embedded devices is apparent. The desire for increased functionality and connectivity drives ever larger software stacks, with components from multiple vendors and entities. These stacks should be replete with isolation and memory safety technologies, but existing solutions impinge upon development, unit cost, power, scalability, and/or real-time constraints, limiting their adoption and production-grade deployments. As memory safety vulnerabilities mount, the situation is clearly not tenable and a new approach is needed. To slake this need, we present a novel adaptation of the CHERI capability architecture, co-designed with a green-field, security-centric RTOS. It is scaled for embedded systems, is capable of fine-grained software compartmentalization, and provides affordances for full inter-compartment memory safety. We highlight central design decisions and offloads and summarize how our prototype RTOS uses these to enable memory-safe, compartmentalized applications. Unlike many state-of-the-art schemes, our solution deterministically (not probabilistically) eliminates memory safety vulnerabilities while maintaining source-level compatibility. We characterize the power, performance, and area microarchitectural impacts, run microbenchmarks of key facilities, and exhibit the practicality of an end-to-end IoT application. The implementation shows that full memory safety for compartmentalized embedded systems is achievable without violating resource constraints or real-time guarantees, and that hardware assists need not be expensive, intrusive, or power-hungry.

DOI: 10.1145/3613424.3614266


Accelerating Extra Dimensional Page Walks for Confidential Computing

作者: Du, Dong and Yang, Bicheng and Xia, Yubin and Chen, Haibo
关键词: No keywords

Abstract

To support highly scalable and fine-grained computing paradigms such as microservices and serverless computing better, modern hardware-assisted confidential computing systems, such as Intel TDX and ARM CCA, introduce permission tables to achieve fine-grained and scalable memory isolation among different domains. However, this also adds an extra dimension to page walks besides page tables, leading to significantly more memory references (e.g., 4 → 12 for RISC-V Sv39). We observe that most of the costs (about 75%) caused by the extra dimension of page walks are spent validating page table pages. Based on this observation, this paper proposes HPMP (Hybrid Physical Memory Protection), a hardware-software co-design (on RISC-V) that protects page table pages using segment registers and normal pages using permission tables to balance scalability and performance. We have implemented HPMP and Penglai-HPMP (a TEE system based on HPMP) on FPGA with two RISC-V cores (both in-order and out-of-order). Evaluation results show that HPMP can reduce costs by 23.1%–73.1% on BOOM and significantly improve performance on real-world applications, including serverless computing (FunctionBench) and Redis.

DOI: 10.1145/3613424.3614293


GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

作者: Shivdikar, Kaustubh and Bao, Yuhui and Agrawal, Rashmi and Shen, Michael and Jonatan, Gilbert and Mora, Evelio and Ingare, Alexander and Livesay, Neal and Abellán, José L.
关键词: Zero-trust frameworks, Modular reduction, Fully Homomorphic Encryption (FHE), Custom accelerators, CU-side interconnects

Abstract

Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude as compared to the same computation using plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×.
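
A plain-Python sketch of Barrett modular reduction, the kind of modular arithmetic the MOD-units accelerate natively. The abstract does not state which reduction algorithm the hardware uses, so treat this as an illustrative stand-in; the modulus below is an example word-sized prime.

```python
def barrett_setup(q):
    k = q.bit_length()
    return k, (1 << (2 * k)) // q          # precomputed mu = floor(2^{2k} / q)

def barrett_reduce(x, q, k, mu):
    """Reduce x (< q*q) modulo q without a hardware divide."""
    t = (x * mu) >> (2 * k)                # estimate of floor(x / q)
    r = x - t * q
    while r >= q:                          # at most two correction steps
        r -= q
    return r

q = (1 << 31) - 1                          # example word-sized prime modulus
k, mu = barrett_setup(q)
a, b = 123456789, 987654321
assert barrett_reduce(a * b, q, k, mu) == (a * b) % q
```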

DOI: 10.1145/3613424.3614279


MAD: Memory-Aware Design Techniques for Accelerating Fully Homomorphic Encryption

作者: Agrawal, Rashmi and De Castro, Leo and Juvekar, Chiraag and Chandrakasan, Anantha and Vaikuntanathan, Vinod and Joshi, Ajay
关键词: SimFHE, Memory Bottleneck Analysis, Hardware Acceleration, Fully Homomorphic Encryption, Cache Optimizations, CKKS Scheme, Bootstrapping

Abstract

Cloud computing has made it easier for individuals and companies to get access to large compute and memory resources. However, it has also raised privacy concerns about the data that users share with the remote cloud servers. Fully homomorphic encryption (FHE) offers a solution to this problem by enabling computations over encrypted data. Unfortunately, all known constructions of FHE require a noise term for security, and this noise grows during computation. To perform unlimited computations on the encrypted data, we need to perform a periodic noise reduction step known as bootstrapping. This bootstrapping operation is memory-bound as it requires several GBs of data. This leads to orders of magnitude increase in the time required for operating on encrypted data as compared to unencrypted data. In this work, we first present an in-depth analysis of the bootstrapping operation in the CKKS FHE scheme. Similar to other existing works, we observe that CKKS bootstrapping exhibits a low arithmetic intensity (< 1 Op/byte). We then propose memory-aware design (MAD) techniques to accelerate the bootstrapping operation of the CKKS FHE scheme. Our proposed MAD techniques are agnostic of the underlying compute platform and can be equally applied to GPUs, CPUs, FPGAs, and ASICs. Our MAD techniques make use of several caching optimizations that enable maximal data reuse and perform reordering of operations to reduce the amount of data that needs to be transferred to/from the main memory. In addition, our MAD techniques include several algorithmic optimizations that reduce the number of data access pattern switches and the expensive NTT operations. Applying our MAD optimizations for FHE improves bootstrapping arithmetic intensity by 3×.
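
A worked example of the arithmetic-intensity argument above (< 1 Op/byte): with numbers in this ballpark, runtime is set by memory bandwidth rather than compute, which is why MAD targets data movement. The operation counts, byte counts, and machine parameters below are illustrative assumptions, not measurements from the paper.

```python
ops_per_bootstrap   = 5e9           # assumed arithmetic operations per bootstrap
bytes_per_bootstrap = 8e9           # assumed bytes moved to/from DRAM per bootstrap

intensity = ops_per_bootstrap / bytes_per_bootstrap          # Op/byte
dram_bw   = 100e9                                            # bytes/s (assumed)
peak_ops  = 10e12                                            # op/s (assumed)

mem_bound_time     = bytes_per_bootstrap / dram_bw
compute_bound_time = ops_per_bootstrap / peak_ops
print(f"intensity = {intensity:.2f} Op/byte")
print(f"memory-bound time {mem_bound_time*1e3:.1f} ms vs compute-bound {compute_bound_time*1e3:.2f} ms")
```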

DOI: 10.1145/3613424.3614302


Session details: Session 5B: Prefetching

作者: Peled, Leeor
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637192


Micro-Armed Bandit: Lightweight & Reusable Reinforcement Learning for Microarchitecture Decision-Making

作者: Gerogiannis, Gerasimos and Torrellas, Josep
关键词: Simultaneous Multithreading, Reinforcement Learning, Prefetching, Multi-Armed Bandits, Microarchitecture, Machine Learning for Architecture

Abstract

Online Reinforcement Learning (RL) has been adopted as an effective mechanism in various decision-making problems in microarchitecture. Its high adaptability and the ability to learn at runtime are attractive characteristics in microarchitecture settings. However, although hardware RL agents are effective, they suffer from two main problems. First, they have high complexity and storage overhead. This complexity stems from decomposing the environment into a large number of states and then, for each of these states, bookkeeping many action values. Second, many RL agents are engineered for a specific application and are not reusable. In this work, we tackle both of these shortcomings by designing an RL agent that is both lightweight and reusable across different microarchitecture decision-making problems. We find that, in some of these problems, only a small fraction of the action space is useful in a given time window. We refer to this property as temporal homogeneity in the action space. Motivated by this property, we design an RL agent based on Multi-Armed Bandit algorithms, the simplest form of RL. We call our agent Micro-Armed Bandit. We showcase our agent in two use cases: data prefetching and instruction fetch in simultaneous multithreaded (SMT) processors. For prefetching, our agent outperforms non-RL prefetchers Bingo and MLOP by 2.6% and 2.3% (geometric mean), respectively, and attains similar performance as the state-of-the-art RL prefetcher Pythia—with the dramatically lower storage requirement of only 100 bytes. For SMT instruction fetch, our agent outperforms the Hill Climbing method by 2.2% (geometric mean).
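
A minimal epsilon-greedy multi-armed bandit of the kind the agent above builds on: arms could be candidate prefetch distances (or SMT fetch policies), and the reward a measured benefit over the last time window. The arm set, epsilon, and simulated reward source here are illustrative assumptions, not the paper's configuration.

```python
import random

class MicroBandit:
    def __init__(self, arms, epsilon=0.1):
        self.arms, self.epsilon = arms, epsilon
        self.counts = [0] * len(arms)
        self.values = [0.0] * len(arms)       # running mean reward per arm

    def choose(self):
        if random.random() < self.epsilon:    # explore occasionally
            return random.randrange(len(self.arms))
        return max(range(len(self.arms)), key=lambda i: self.values[i])  # exploit

    def update(self, i, reward):
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]

bandit = MicroBandit(arms=[1, 2, 4, 8])       # e.g. candidate prefetch distances
for _ in range(1000):
    i = bandit.choose()
    reward = random.gauss(float(bandit.arms[i] == 4), 0.1)   # pretend distance 4 is best
    bandit.update(i, reward)
print("chosen arm:", bandit.arms[bandit.choose()])
```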

DOI: 10.1145/3613424.3623780


CLIP: Load Criticality based Data Prefetching for Bandwidth-constrained Many-core Systems

作者: Panda, Biswabandan
关键词: Prefetching, Instruction criticality, DRAM, Cache

Abstract

Hardware prefetching is a latency-hiding technique that hides the costly off-chip DRAM accesses. However, state-of-the-art prefetchers fail to deliver performance improvement in the case of many-core systems with constrained DRAM bandwidth. For SPEC CPU2017 homogeneous workloads, the state-of-the-art Berti L1 prefetcher, on a 64-core system with four and eight DRAM channels, incurs performance slowdowns of 24% and 16%, respectively. However, Berti improves performance by 35% if we use an unrealistic configuration of 64 DRAM channels for a 64-core system (one DRAM channel per core). Prior approaches such as prefetch throttling and critical load prefetching are not effective in the presence of state-of-the-art prefetchers. Existing load criticality predictors fail to detect loads that are critical in the presence of hardware prefetching and the best predictor provides an average critical load prediction accuracy of 41%. Existing prefetch throttling techniques use prefetch accuracy as one of the primary metrics. However, these techniques offer limited benefits for state-of-the-art prefetchers that deliver high prefetch accuracy and use prefetcher-specific throttling and filtering. We propose CLIP, a novel load criticality predictor for hardware prefetching with constrained DRAM bandwidth. Our load criticality predictor provides an average accuracy of more than 93% and as high as 100%. CLIP also filters out the critical loads that lead to accurate prefetching. For a 64-core system with eight DRAM channels, CLIP improves the effectiveness of state-of-the-art Berti prefetcher by 24% and 9% for 45 and 200 64-core homogeneous and heterogeneous workload mixes, respectively. We show that CLIP is equally effective in the presence of other state-of-the-art L1 and L2 prefetchers. Overall, CLIP incurs a storage overhead of 1.56KB/core.

DOI: 10.1145/3613424.3614245


Snake: A Variable-length Chain-based Prefetching for GPUs

作者: Mostofi, Saba and Falahati, Hajar and Mahani, Negin and Lotfi-Kamran, Pejman and Sarbazi-Azad, Hamid
关键词: Prefetching, Performance, On-Chip Memory, GPU

Abstract

Graphics Processing Units (GPUs) utilize memory hierarchy and Thread-Level Parallelism (TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound applications. However, parallel threads generate a large number of memory requests, which increases the average memory latency and degrades cache performance due to high contention. Prefetching is an effective technique to reduce memory access latency, and prior research shows the positive impact of stride-based prefetching on GPU performance. However, existing prefetching methods only rely on fixed strides. To address this limitation, this paper proposes a new prefetching technique, Snake, which is built upon chains of variable strides, using throttling and memory decoupling strategies. Snake achieves 80% coverage and 75% accuracy in prefetching demand memory requests, resulting in a 17% improvement in total GPU performance and energy consumption for memory-bound General-Purpose Graphics Processing Unit (GPGPU) applications.

DOI: 10.1145/3613424.3623782


Treelet Prefetching For Ray Tracing

作者: Chou, Yuan Hsi and Nowicki, Tyler and Aamodt, Tor M.
关键词: ray tracing, prefetching, hardware accelerator, graphics, GPU

Abstract

Ray tracing is traditionally only used in offline rendering to produce images of high fidelity because it is computationally expensive. Recent Graphics Processing Units (GPUs) have included dedicated accelerators to bring ray tracing to real-time rendering for video games and other graphics applications. These accelerators focus on finding the closest intersection between a ray and a scene using a hierarchical tree data structure called a Bounding Volume Hierarchy (BVH) tree. However, BVH tree traversal is still very costly due to divergent rays accessing different parts of the tree, with each ray following a unique pointer-chasing sequence that is difficult to optimize with traditional methods. To address this, we propose treelet prefetching to reduce the latency of ray traversal. Treelets are smaller subtrees created by splitting the BVH tree. When a ray visits a treelet root node, we prefetch the corresponding treelet, enabling deeper levels of the tree to be fetched in advance. This reduces the latency associated with pointer-chasing during tree traversal. Our approach uses a hardware prefetcher with a two-stack treelet based traversal algorithm, maximizing the benefits of treelet prefetching. Our simulation results show treelet prefetching on average improves performance of the baseline RT Unit in Vulkan-Sim by 32.1% while maintaining the same power consumption.
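
A schematic of the prefetch trigger described above: BVH nodes are grouped into treelets, and when traversal first touches a treelet's root the whole treelet is requested ahead of time. The tiny tree, treelet assignment, and print-based "prefetch" are toy stand-ins for the hardware mechanism and two-stack traversal.

```python
from collections import deque

# node -> (children, treelet id); a tiny BVH split into two treelets
bvh = {
    0: ([1, 2], 0), 1: ([3, 4], 0), 2: ([5, 6], 1),
    3: ([], 0), 4: ([], 0), 5: ([], 1), 6: ([], 1),
}
treelet_nodes = {0: [0, 1, 3, 4], 1: [2, 5, 6]}
treelet_root = {0: 0, 1: 2}

prefetched = set()

def maybe_prefetch(node):
    _, tid = bvh[node]
    if node == treelet_root[tid] and tid not in prefetched:
        prefetched.add(tid)
        print(f"prefetch treelet {tid}: nodes {treelet_nodes[tid]}")

def traverse(root):
    stack = deque([root])
    while stack:
        node = stack.pop()
        maybe_prefetch(node)
        stack.extend(bvh[node][0])     # a real traversal would cull by ray/AABB tests

traverse(0)
```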

DOI: 10.1145/3613424.3614288


Session details: Session 5C: Processing-In-Memory

作者: Skarlatos, Dimitrios
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637193


NAS-SE: Designing A Highly-Efficient In-Situ Neural Architecture Search Engine for Large-Scale Deployment

作者: Wan, Qiyu and Wang, Lening and Wang, Jing and Song, Shuaiwen Leon and Fu, Xin
关键词: Evolutionary algorithm, Hardware accelerator, In-memory computing, Neural architecture search

Abstract

The emergence of Neural Architecture Search (NAS) enables an automated neural network development process that potentially replaces manually-enabled machine learning expertise. A state-of-the-art NAS method, namely One-Shot NAS, has been proposed to drastically reduce the lengthy search time for a wide spectrum of conventional NAS methods. Nevertheless, the search cost is still prohibitively expensive for practical large-scale deployment with real-world applications. In this paper, we reveal that the fundamental cause for inefficient deployment of One-Shot NAS in both single-device and large-scale scenarios originates from the massive redundant off-chip weight access during the numerous DNN inference in sequential searching. Inspired by its algorithmic characteristics, we depart from the traditional CMOS-based architecture designs and propose a promising processing-in-memory design alternative to perform in-situ architecture search, which helps fundamentally address the redundancy issue. Moreover, we further discovered two major performance challenges of directly porting the searching process onto the existing PIM-based accelerators: severe pipeline contention and resource under-utilization. By leveraging these insights, we propose the first highly-efficient in-situ One-Shot NAS search engine design, named NAS-SE, for both single-device and large-scale deployment scenarios. NAS-SE is equipped with a two-phased network diversification strategy for eliminating resource contention, and a novel hardware mapping scheme for boosting the resource utilization by an order of magnitude. Our extensive evaluation demonstrates that NAS-SE significantly outperforms the state-of-the-art digital-based customized NAS accelerator (NASA) with an average speedup of 8.8×.

DOI: 10.1145/3613424.3614265


XFM: Accelerated Software-Defined Far Memory

作者: Patel, Neel and Mamandipoor, Amin and Quinn, Derrick and Alian, Mohammad
关键词: Near-Memory Processing, Compression, Accelerator

Abstract

DRAM constitutes over 50% of server cost and 75% of the embodied carbon footprint of a server. To mitigate DRAM cost, far memory architectures have emerged. They can be separated into two broad categories: software-defined far memory (SFM) and disaggregated far memory (DFM). In this work, we compare the cost of SFM and DFM in terms of their required capital investment, operational expense, and carbon footprint. We show that, for applications whose data sets are compressible and have predictable memory access patterns, it takes several years for a DFM to break even with an equivalent capacity SFM in terms of cost and sustainability. We then introduce XFM, a near-memory accelerated SFM architecture, which exploits the coldness of data during SFM-initiated swap ins and outs. XFM leverages refresh cycles to seamlessly switch the access control of DRAM between the CPU and near-memory accelerator. XFM parallelizes near-memory accelerator accesses with row refreshes and removes the memory interference caused by SFM swap ins and outs. We modify an open source far memory implementation to implement a full-stack, user-level XFM. Our experimental results use a combination of an FPGA implementation, simulation, and analytical modeling to show that XFM eliminates memory bandwidth utilization when performing compression and decompression operations with SFMs of capacities up to 1TB. The memory and cache utilization reductions translate to 5 ∼ 27% improvement in the combined performance of co-running applications.

DOI: 10.1145/3613424.3623776


Affinity Alloc: Taming Not-So Near-Data Computing

作者: Wang, Zhengrong and Liu, Christopher and Beckmann, Nathan and Nowatzki, Tony
关键词: Near-Data Computing, Memory Allocation, Data Structure Co-Design, Data Placement, Data Layout

Abstract

To mitigate the data movement bottleneck on large multicore systems, the near-data computing paradigm (NDC) offloads computation to where the data resides on-chip. The benefit of NDC heavily depends on spatial affinity, where all relevant data are in the same location, e.g. same cache bank. However, existing NDC works lack a general and systematic solution: they either ignore the problem and abort NDC when there is no spatial affinity, or rely on error-prone manual data placement. Our insight is that the essential affinity relationship, i.e. data A should be close to data B, is orthogonal to microarchitecture details and input sizes. By co-optimizing the data structure and capturing this general affinity information in the data allocation interface, the allocator can automatically optimize for data affinity and load balance to make NDC computations truly near data. With this insight, we propose affinity alloc, a general framework to optimize data layout for near-data computing. It comprises an extended allocator runtime, co-optimized data structures, and lightweight extensions to the OS and microarchitecture. Evaluated on parallel workloads across broad domains, affinity alloc achieves a 2.26× improvement.
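
A sketch of the kind of allocation interface argued for above: the caller states only the affinity relationship ("allocate B near A"), and the allocator picks a placement that keeps related data in the same cache bank while balancing load. The API name, bank model, and policy are illustrative assumptions; the paper's interface is an allocator runtime for native code, not this Python toy.

```python
NUM_BANKS = 8
bank_load = [0] * NUM_BANKS
placement = {}                      # object name -> bank

def affinity_alloc(name, affinity_to=None):
    """Place `name`, honoring an affinity hint to an already-placed object."""
    if affinity_to is not None and affinity_to in placement:
        bank = placement[affinity_to]                               # keep them together
    else:
        bank = min(range(NUM_BANKS), key=bank_load.__getitem__)     # otherwise balance load
    placement[name] = bank
    bank_load[bank] += 1
    return bank

affinity_alloc("graph.row_ptr")
affinity_alloc("graph.col_idx", affinity_to="graph.row_ptr")   # e.g. keep CSR arrays together
print(placement)
```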

DOI: 10.1145/3613424.3623778


MVC: Enabling Fully Coherent Multi-Data-Views through the Memory Hierarchy with Processing in Memory

作者: Fujiki, Daichi
关键词: Processing-in-Memory, Caches, Cache Coherence Protocol

Abstract

Fusing computation and memory through Processing-in-Memory (PIM) provides a radical solution to the memory wall problem by minimizing communication overheads for data-intensive tasks, leading to a revolutionary shift in computer architecture. Although PIM has demonstrated promising results at different layers of the memory hierarchy, few studies have explored integrating compute memories into the memory management system, specifically in relation to coherence protocol. This paper presents MVC, a framework that leverages existing coherence protocols to enable fully coherent views throughout the memory hierarchy. By introducing coherent views, which are user-defined compact representations of conventional data structures, MVC can minimize data movement and harness the reusability of PIM output. The locality-aware MVC views significantly enhance the performance and energy efficiency of various irregular workloads.

DOI: 10.1145/3613424.3623784


AESPA: Asynchronous Execution Scheme to Exploit Bank-Level Parallelism of Processing-in-Memory

作者: Kal, Hongju and Yoo, Chanyoung and Ro, Won Woo
关键词: Processing-in-Memory, Matrix-Vector Multiplication, Execution Method

Abstract

This paper presents an asynchronous execution scheme to leverage the bank-level parallelism of near-bank processing-in-memory (PIM). We observe that performing memory operations underutilizes the parallelism of PIM computation because near-bank PIMs are designated to operate all banks synchronously. The all-bank computation can be delayed when one of the banks performs the basic memory commands, such as read/write requests and activation/precharge operations. We aim to mitigate the throughput degradation and especially focus on execution delay caused by activation/precharge operations. For all-bank execution accessing the same row of all banks, a large number of activation/precharge operations inevitably occur. Considering the timing parameter limiting the rate of row-open operations (tFAW), the throughput might decrease even further. To resolve this activation/precharge overhead, we propose AESPA, a new parallel execution scheme that operates banks asynchronously. AESPA is different from the previous synchronous execution in that (1) the compute command of AESPA targets a single bank, and (2) each processing unit computes data stored in multiple DRAM columns. By doing so, while one bank computes multiple DRAM columns, the memory controller issues activation/precharge or PIM compute commands to other banks. Thus, AESPA hides the activation latency of PIM computation and fully utilizes the aggregated bandwidth of the banks. For this, we modify hardware and software to support vector and matrix computation of previous near-bank PIM architectures. In particular, we change the matrix-vector multiplication based on an inner product to fit it on AESPA PIM. Previous matrix-vector multiplication requires data broadcasting and simultaneous computation across all processing units. By changing the matrix-vector multiplication method, AESPA PIM can transfer data to respective processing units and start computation asynchronously. As a result, the near-bank PIMs adopting AESPA achieve 33.5% and 59.5% speedup compared to two different state-of-the-art PIMs.

DOI: 10.1145/3613424.3614314


Session details: Session 6A: Security Hardware

作者: Ajorpaz, Samira Mirbagher
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637194


ReCon: Efficient Detection, Management, and Use of Non-Speculative Information Leakage

作者: Aimoniotis, Pavlos and Kvalsvik, Amund Bergland and Chen, Xiaoyue and Själander, Magnus
关键词: side-channels, non-speculative leakage, load pair, Speculation

Abstract

In a speculative side-channel attack, a secret is improperly accessed and then leaked by passing it to a transmitter instruction. Several proposed defenses effectively close this security hole by either delaying the secret from being loaded or propagated, or by delaying dependent transmitters (e.g., loads) from executing when fed with tainted input derived from an earlier speculative load. This results in a loss of memory-level parallelism and performance. A security definition proposed recently, in which data already leaked in non-speculative execution need not be considered secret during speculative execution, can provide a solution to the loss of performance. However, detecting and tracking non-speculative leakage carries its own cost, increasing complexity. The key insight of our work that enables us to exploit non-speculative leakage as an optimization to other secure speculation schemes is that the majority of non-speculative leakage is simply due to pointer dereferencing (or base-address indexing) — essentially what many secure speculation schemes prevent from taking place speculatively. We present ReCon that: i) efficiently detects non-speculative leakage by limiting detection to pairs of directly-dependent loads that dereference pointers (or index a base-address); and ii) piggybacks non-speculative leakage information on the coherence protocol. In ReCon, the coherence protocol remembers and propagates the knowledge of what has leaked and therefore what is safe to dereference under speculation. To demonstrate the effectiveness of ReCon, we show how two state-of-the-art secure speculation schemes, Non-speculative Data Access (NDA) and speculative Taint Tracking (STT), leverage this information to enable more memory-level parallelism both in a single core scenario and in a multicore scenario: NDA with ReCon reduces the performance loss by 28.7% for SPEC2017, 31.5% for SPEC2006, and 46.7% for PARSEC; STT with ReCon reduces the loss by 45.1%, 39%, and 78.6%, respectively.
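
A toy illustration of the detection rule described above: a pair of directly dependent loads, where the second load's address comes from the first load's data (pointer dereference or base-address indexing), means that value has already been exposed non-speculatively, so it is safe to dereference under later speculation. The trace format and one-level dependence tracking are illustrative assumptions, not ReCon's hardware.

```python
trace = [
    ("load", "r1", "addr:r2"),      # r1 <- MEM[r2]
    ("add",  "r3", "r1"),           # r3 derived from a loaded value
    ("load", "r4", "addr:r1"),      # r4 <- MEM[r1]  -> r1's value is dereferenced
    ("load", "r5", "addr:r3"),      # base-address indexing off r1 via r3
]

loaded_into = set()                 # registers holding freshly loaded values
derived = {}                        # reg -> the loaded reg it was derived from
safe_to_dereference = set()         # values (named by producing reg) known leaked

for op, dst, src in trace:
    if op == "load":
        base = src.split(":")[1]
        if base in loaded_into or base in derived:
            safe_to_dereference.add(derived.get(base, base))
        loaded_into.add(dst)
    else:
        if src in loaded_into:
            derived[dst] = src      # dst indexes off a loaded value

print(safe_to_dereference)          # {'r1'}
```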

DOI: 10.1145/3613424.3623770


Uncore Encore: Covert Channels Exploiting Uncore Frequency Scaling

作者: Guo, Yanan and Cao, Dingyuan and Xin, Xin and Zhang, Youtao and Yang, Jun
关键词: Side channel, Security, Cache

Abstract

Modern processors dynamically adjust clock frequencies and voltages to reduce energy consumption. Recent Intel processors separate the uncore frequency from the core frequency, using Uncore Frequency Scaling (UFS) to adapt the uncore frequency to various workloads. While UFS improves power efficiency, it also introduces security vulnerabilities. In this paper, we study the feasibility of covert channels exploiting UFS. First, we conduct a series of experiments to understand the details of UFS, such as the factors that can cause uncore frequency variations. Then, based on the results, we build the first UFS-based covert channel, UF-variation, which works both across-cores and across-processors. Finally, we analyze the robustness of UF-variation under known defense mechanisms against uncore covert channels, and show that UF-variation remains functional even with those defenses in place.

DOI: 10.1145/3613424.3614259


Hardware Support for Constant-Time Programming

作者: Miao, Yuanqing and Kandemir, Mahmut Taylan and Zhang, Danfeng and Zhang, Yingtian and Tan, Gang and Wu, Dinghao
关键词: Side channel leakage, Constant time programming, Cache

Abstract

Side-channel attacks are one of the rising security concerns in modern computing platforms. Observing this, researchers have proposed both hardware-based and software-based strategies to mitigate side-channel attacks, targeting not only on-chip caches but also other hardware components like memory controllers and on-chip networks. While hardware-based solutions to side-channel attacks are usually costly to implement as they require modifications to the underlying hardware, software-based solutions are more practical as they can work on unmodified hardware. One of the recent software-based solutions is constant-time programming, which tries to transform an input program to be protected against side-channel attacks such that an operation working on a data element/block to be protected would execute in an amount of time that is independent of the input. Unfortunately, while quite effective from a security angle, constant-time programming can lead to severe performance penalties. Motivated by this observation, in this paper, we explore novel hardware support to make constant-time programming much more efficient than its current implementations. Specifically, we present a new hardware component that can greatly improve the performance of constant-time programs with large memory footprints. The key idea in our approach is to add a small structure into the architecture and two accompanying instructions, which collectively expose the existence/dirtiness information of multiple cache lines to the application program, so that the latter can perform more efficient side-channel mitigation. Our experimental evaluation using three benchmark programs with secret data clearly shows the effectiveness of the proposed approach over a state-of-the-art implementation of constant-time programming. Specifically, in the three benchmark programs tested, our approach leads to about a 7x reduction in performance overheads over the state-of-the-art approach.
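
A textbook example of the constant-time style discussed above: a secret-dependent branch leaks through timing and the cache, while a masked select executes the same instructions and touches the same operands regardless of the secret. This is background illustration only, not the paper's hardware mechanism.

```python
def select_branchy(secret_bit, a, b):
    # timing and memory behavior depend on secret_bit
    if secret_bit:
        return a
    return b

def select_constant_time(secret_bit, a, b, width=32):
    # mask is all-ones when secret_bit == 1 and all-zeros when 0; both operands
    # are always read and combined, so behavior is secret-independent
    full = (1 << width) - 1
    mask = (-secret_bit) & full
    return (a & mask) | (b & ~mask & full)

assert select_constant_time(1, 7, 9) == select_branchy(1, 7, 9) == 7
assert select_constant_time(0, 7, 9) == select_branchy(0, 7, 9) == 9
```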

DOI: 10.1145/3613424.3623796


AutoCC: Automatic Discovery of Covert Channels in Time-Shared Hardware

作者: Orenes-Vera, Marcelo and Yun, Hyunsung and Wistoff, Nils and Heiser, Gernot and Benini, Luca and Wentzlaff, David and Martonosi, Margaret
关键词: verification, timing channel, temporal partitioning, microarchitectural, information flow, formal, flush, data leak, covert channel, FPV

Abstract

Covert channels enable information leakage between security domains that should be isolated by observing execution differences in shared hardware. These channels can appear in any stateful shared resource, including caches, predictors, and accelerators. Previous works have identified many vulnerable components, demonstrating and defending against attacks via reverse engineering. However, this approach requires much human effort and reasoning. With the Cambrian explosion of specialized hardware, it is becoming increasingly difficult to identify all vulnerabilities manually. To tackle this challenge, we propose AutoCC, a methodology that leverages formal property verification (FPV) to automatically discover covert channels in hardware that is shared between processes. AutoCC operates at the register-transfer level (RTL) to exhaustively examine any machine state left by a process after a context switch that creates an execution difference. Upon finding such a difference, AutoCC provides a precise execution trace showing how the information was encoded into the machine state and recovered. Leveraging AutoCC’s flow to generate FPV testbenches that apply our methodology, we evaluated it on four open-source hardware projects, including two RISC-V cores and two accelerators. Without hand-written code or directed tests, AutoCC uncovered known covert channels (within minutes instead of many hours of test-driven emulations) and unknown ones. Although AutoCC is primarily intended to find covert channels, our evaluation has also found RTL bugs, demonstrating that AutoCC is an effective tool to test both the security and reliability of hardware designs.

DOI: 10.1145/3613424.3614254


Session details: Session 6B: Datacenter Networks

作者: Carlson, Trevor E.
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637195


NeuroLPM - Scaling Longest Prefix Match Hardware with Neural Networks

作者: Rashelbach, Alon and de Paula, Igor and Silberstein, Mark
关键词: No keywords

Abstract

Longest Prefix Match (LPM) engines are broadly used in computer systems and especially in modern network devices such as Network Interface Cards (NICs), switches, and routers. However, existing LPM hardware fails to scale to the millions of rules required by modern systems, is often optimized for specific applications, and is thus performance-sensitive to the structure of LPM rules. We describe NeuroLPM, a new architecture for multi-purpose LPM hardware that replaces queries in traditional memory-intensive trie- and hash-table data structures with inference in a lightweight neural-network-based model, called RQRMI. NeuroLPM scales to millions of rules under a small on-die SRAM budget and achieves stable, rule-structure-agnostic performance, allowing its use in a variety of applications. We solve several unique challenges when implementing RQRMI inference in hardware, including minimizing the amount of floating-point computation while maintaining query correctness, and scaling the rule-set size while ensuring small, deterministic off-chip memory bandwidth. We prototype NeuroLPM in Verilog and evaluate it on real-world packet-forwarding rule-sets and network traces. NeuroLPM offers substantial scalability benefits without any application-specific optimizations. For example, it is the only algorithm that can serve a 950K-rule set at an average of 196M queries per second with 4.5MB of SRAM, within 2% of the best-case throughput that the state-of-the-art Tree Bitmap and SAIL achieve on smaller rule-sets. With 2MB of SRAM, it reduces the DRAM bandwidth per query, the dominant performance factor, by up to 9×.
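
As rough intuition for the range-query reduction behind this approach (a simplified toy, not the RQRMI model or its hardware pipeline), the sketch below flattens a small rule table into disjoint intervals with precomputed answers, fits a tiny monotone linear model that predicts the interval index, and uses a bounded local search around the prediction to keep lookups exact.

```python
# Toy "learned LPM" sketch under simplifying assumptions: the rule table is
# flattened into disjoint intervals whose answer is precomputed, and a monotone
# linear model predicts which interval a key falls in; a bounded local search
# around the prediction (using the recorded worst-case error) guarantees
# correctness without a full trie walk or global binary search.
import bisect

def build(rules):                      # rules: list of (prefix_int, prefix_len) for IPv4
    points = {0}
    for p, l in rules:
        lo = p & ~((1 << (32 - l)) - 1)
        points.update({lo, lo + (1 << (32 - l))})
    starts = sorted(x for x in points if x < 2**32)
    def lpm_bruteforce(ip):            # used only at build time, per elementary interval
        best = None
        for p, l in rules:
            lo = p & ~((1 << (32 - l)) - 1)
            if lo <= ip < lo + (1 << (32 - l)) and (best is None or l > best[1]):
                best = (p, l)
        return best
    answers = [lpm_bruteforce(s) for s in starts]
    n = len(starts)                    # fit index ~ a*start + b and record the worst error
    a = (n - 1) / max(starts[-1] - starts[0], 1)
    b = -a * starts[0]
    err = max(abs(a * s + b - i) for i, s in enumerate(starts))
    return starts, answers, (a, b, int(err) + 2)

def lookup(ip, table):
    starts, answers, (a, b, err) = table
    guess = int(a * ip + b)
    lo, hi = max(0, guess - err), min(len(starts), guess + err + 1)
    i = bisect.bisect_right(starts, ip, lo, hi) - 1   # bounded local search
    return answers[i]

rules = [(0x0A000000, 8), (0x0A010000, 16), (0x0A010100, 24)]   # 10/8, 10.1/16, 10.1.1/24
table = build(rules)
print(lookup(0x0A010203, table))       # 10.1.2.3 matches 10.1.0.0/16
```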

DOI: 10.1145/3613424.3623769


Space Microdatacenters

作者: Bleier, Nathaniel and Mubarik, Muhammad Husnain and Swenson, Gary R and Kumar, Rakesh
关键词: Micro datacenter, Compute in space, Computational satellite

Abstract

Earth observation (EO) has been a key task for satellites since the first satellite was put into space. The temporal and spatial resolution at which EO satellites take pictures has been increasing to support space-based applications, but this increases the amount of data each satellite generates. We observe that future EO satellites will generate so much data that it cannot be transmitted to Earth due to the limited communication capacity between space and Earth. We show that conventional data reduction techniques such as compression [126] and early discard [41] do not solve this problem, nor does a direct enhancement of today’s RF-based infrastructure [133, 153] for space-Earth communication. We instead explore an unorthodox solution: moving to space the computation that would otherwise happen on the ground. This alleviates the need for data transfer to Earth. We analyze ten non-longitudinal RGB and hyperspectral image processing Earth observation applications for their computation and power requirements and find that these requirements cannot be met by the small satellites that dominate today’s EO missions. We make a case for space microdatacenters: large computational satellites whose primary task is to support in-space computation of EO data. We show that one 4 kW space microdatacenter can support the computation needs of a majority of applications, especially when used in conjunction with early discard. We do find, however, that communication between EO satellites and space microdatacenters becomes a bottleneck. We propose three space microdatacenter and communication co-design strategies (k-list-based network topology, microdatacenter splitting, and moving space microdatacenters to geostationary orbit) that alleviate the bottlenecks and enable effective usage of space microdatacenters.

DOI: 10.1145/3613424.3614271


LogNIC: A High-Level Performance Model for SmartNICs

作者: Guo, Zerui and Lin, Jiaxin and Bai, Yuebin and Kim, Daehyeok and Swift, Michael and Akella, Aditya and Liu, Ming
关键词: SmartNIC, Programmable Networks, Architectural Modeling

Abstract

SmartNICs have become an indispensable communication fabric and computing substrate in today’s data centers and enterprise clusters, providing in-network computing capabilities for traversed packets and benefiting a range of applications across the system stack. Building an efficient SmartNIC-assisted solution is generally non-trivial and tedious as it requires programmers to understand the SmartNIC architecture, refactor application logic to match the device’s capabilities and limitations, and correlate an application execution with traffic characteristics. A high-level SmartNIC performance model can decouple the underlying SmartNIC hardware device from its offloaded software implementations and execution contexts, thereby drastically simplifying and facilitating the development process. However, prior architectural models can hardly be applied due to their limited capabilities in dissecting the SmartNIC-offloaded program’s complexity, capturing the nondeterministic overlapping between computation and I/O, and perceiving diverse traffic profiles. This paper presents the LogNIC model that systematically analyzes the performance characteristics of a SmartNIC-offloaded program. Unlike conventional execution flow-based modeling, LogNIC employs a packet-centric approach that examines SmartNIC execution based on how packets traverse heterogeneous computing domains, on-/off-chip interconnects, and memory subsystems. It abstracts away the low-level device details, represents a deployed program as an execution graph, retains a handful of configurable parameters, and generates latency/throughput estimation for a given traffic profile. It further exposes a couple of extensions to handle multi-tenancy, traffic interleaving, and accelerator peculiarity. We demonstrate the LogNIC model’s capabilities using both commodity SmartNICs and an academic prototype under five application scenarios. Our evaluations show that LogNIC can estimate performance bounds, explore software optimization strategies, and provide guidelines for new hardware designs.

DOI: 10.1145/3613424.3614291


Heterogeneous Die-to-Die Interfaces: Enabling More Flexible Chiplet Interconnection Systems

作者: Feng, Yinxiao and Xiang, Dong and Ma, Kaisheng
关键词: Routing, Network-on-Chip, Interface, Interconnection, Chiplet

Abstract

The chiplet architecture is one of the emerging methodologies and is believed to be scalable and economical. However, most current multi-chiplet systems are based on one uniform die-to-die interface, which severely limits flexibility. First, any interface has specific applicable workloads/scales/scenarios; therefore, chiplets with a uniform interface cannot be freely reused in different systems. Second, since modern computing systems must deal with complex and mixed tasks, a uniform interface does not cope well with flexible workloads, especially for large-scale systems. To deal with these inflexibilities, we propose the idea of the Heterogeneous Interface (Hetero-IF), which allows chiplets to use two different interfaces (parallel IF and serial IF) at the same time. Hetero-IF can combine the advantages of different interfaces and offset the disadvantages of each, thus improving flexibility and performance. However, adopting hetero-IF-based multi-chiplet interconnection systems still faces many challenges: the microarchitecture, scheduling, interconnection, and routing issues have not been discussed so far. In this paper, we put forward two typical hetero-IF implementations: Hetero-PHY and Hetero-Channel. Based on these two implementations, detailed usages and scheduling methods are discussed. We also present the interconnection methods for hetero-IF-based multi-chiplet systems and show how to apply deadlock-free routing algorithms. Extensive evaluations, including simulation and circuit verification, are performed on these systems. The experimental results show that hetero-IF provides more flexible interconnection and scheduling possibilities to achieve better performance and energy metrics under various workloads.

DOI: 10.1145/3613424.3614310


Session details: Session 6C: Reliability, Availability

作者: Gabbay, Freddy
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637196


Predicting Future-System Reliability with a Component-Level DRAM Fault Model

作者: Jung, Jeageun and Erez, Mattan
关键词: Memory reliability

Abstract

We introduce a new fault model for recent and future DRAM systems that uses empirical analysis to derive DRAM internal-component level fault models. This modeling level offers higher fidelity and greater predictive capability than prior models that rely on logical-address based characterization and modeling. We show how to derive the model, overcoming several challenges of using a publicly-available dataset of memory error logs. We then demonstrate the utility of our model by scaling it and analyzing the expected reliability of DDR5, HBM3, and LPDDR5 based systems. In addition to the novelty of the analysis and the model itself, we draw several insights regarding on-die ECC design and tradeoffs and the efficacy of repair/retirement mechanisms.

DOI: 10.1145/3613424.3614294


Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs

作者: Agiakatsikas, Dimitris and Papadimitriou, George and Karakostas, Vasileios and Gizopoulos, Dimitris and Psarakis, Mihalis and Belanger-Champagne, Camille and Blackmore, Ewart
关键词: voltage and frequency scaling, soft errors, silent data corruptions, power consumption, neutron radiation testing, error resilience, energy efficiency, Microprocessor reliability

Abstract

Microprocessor power consumption and dependability are both crucial challenges that designers have to cope with due to shrinking feature sizes and increasing transistor counts in a single chip. These two challenges are mutually destructive: microprocessor reliability deteriorates at the lower supply voltages that save power. An important dependability metric for microprocessors is their radiation-induced soft error rate (SER). This work goes beyond the state of the art by assessing the trade-offs between voltage scaling and SER on a microprocessor system executing workloads on real hardware with a full software stack. We analyze data from accelerated neutron radiation testing for nominal and reduced microprocessor operating voltages. We perform our experiments on a 64-bit Armv8 multicore microprocessor built on 28 nm process technology. We show that the SER of SRAM arrays can increase by up to 40.4% when the device operates at reduced supply voltage levels. To put our findings into context, we also estimate the radiation-induced Failures in Time (FIT) rate of various workloads for all the studied voltage levels. Our results show that the total and the Silent Data Corruptions (SDC) FIT of the microprocessor operating at voltage-scaled conditions can be 6.6×…

DOI: 10.1145/3613424.3614304


Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI

作者: Hanson, Edward and Li, Shiyu and Zhou, Guanglei and Cheng, Feng and Wang, Yitu and Bose, Rohan and Li, Hai and Chen, Yiran
关键词: workload scheduling, spatial network-on-chip, multi-core architectures, defective cores, AI acceleration

Abstract

The growing demand for higher compute and memory capacity driven by artificial intelligence (AI) applications pushes higher core counts in modern systems. Many-core architectures exhibiting spatial interconnects with high on-chip bandwidth are ideal for these workloads due to their data movement flexibility and sheer parallelism. However, the size of such platforms makes them particularly susceptible to manufacturing defects, prompting a need for designs and mechanisms that improve yield. Despite these techniques, nonfunctional cores and links are unavoidable. Although prior works address defective cores by disabling them and only scheduling work onto functional ones, communication latency through spatial interconnects is tightly associated with the locations of defective cores and of the cores with assigned work. Based on this observation, we present Si-Kintsugi, a defect-aware workload scheduling framework for spatial architectures with a mesh topology. First, we design a novel and generalizable workload mapping representation and cost function that integrate defect-pattern information. The mapping representation is a 1D vector with simple constraints, making it an ideal candidate for open-source, heuristic-based optimization algorithms. After a communication-latency-optimized workload mapping is found, dataflow between the mapped cores is automatically generated to balance communication and computation cost. Si-Kintsugi is extensively evaluated on various workloads (i.e., BERT, ResNet, GEMM) across a wide range of defect patterns and rates. Experimental results show that Si-Kintsugi generates a workload schedule that is on average 1.34×…
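
The sketch below illustrates the flavor of the mapping representation and cost function (the actual cost model, mesh size, defect pattern, and optimizer used in the paper are not reproduced here): a task-to-core assignment is a flat vector over the functional cores, the cost penalizes traffic whose XY route crosses a defective core, and a simple swap-based local search stands in for any off-the-shelf heuristic optimizer.

```python
# Illustrative sketch under assumed parameters, not Si-Kintsugi's exact cost
# model or optimizer: encode a task-to-core mapping as a flat 1D vector over the
# functional cores of a mesh, score it with a defect-aware communication cost,
# and improve it with a simple swap-based local search.
import random

W = H = 4
DEFECTIVE = {(1, 1), (2, 3)}                       # assumed defect pattern
FUNCTIONAL = [(x, y) for x in range(W) for y in range(H) if (x, y) not in DEFECTIVE]
EDGES = [(0, 1, 8), (1, 2, 8), (2, 3, 4), (0, 3, 2)]   # (task_a, task_b, traffic)

def xy_path(a, b):
    (x0, y0), (x1, y1) = a, b
    step = 1 if x1 >= x0 else -1
    path = [(x, y0) for x in range(x0, x1, step)]
    step = 1 if y1 >= y0 else -1
    return path + [(x1, y) for y in range(y0, y1 + step, step)]

def cost(mapping):                                  # mapping: task index -> functional-core index
    total = 0.0
    for a, b, traffic in EDGES:
        src, dst = FUNCTIONAL[mapping[a]], FUNCTIONAL[mapping[b]]
        hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
        detour = sum(4 for hop in xy_path(src, dst) if hop in DEFECTIVE)
        total += traffic * (hops + detour)          # a defect on the route forces a detour penalty
    return total

def optimize(num_tasks, iters=2000):
    mapping = random.sample(range(len(FUNCTIONAL)), num_tasks)
    best = cost(mapping)
    for _ in range(iters):                          # simple local search; the paper can plug in
        i, j = random.sample(range(num_tasks), 2)   # any off-the-shelf heuristic optimizer here
        mapping[i], mapping[j] = mapping[j], mapping[i]
        c = cost(mapping)
        if c <= best:
            best = c
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]
    return mapping, best

print(optimize(num_tasks=4))
```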

DOI: 10.1145/3613424.3614278


How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM

作者: Kim, Michael Jaemin and Wi, Minbok and Park, Jaehyun and Ko, Seoyoung and Choi, Jaeyoung and Nam, Hwayoung and Kim, Nam Sung and Ahn, Jung Ho and Lee, Eojin
关键词: security, row-hammer, main memory, Memory system, DRAM

Abstract

Error-correcting code (ECC) has been widely used in DRAM-based memory systems to address the exacerbating random errors following fabrication process scaling. However, ECCs, including the strong form of Chipkill, have not been so effective against Row Hammer (RH), which incurs bursts of errors that discretely corrupt a whole row beyond the ECC correction capability. We propose Cube, a novel chip-wise physical-to-DRAM address randomization scheme that leverages the abundant detection capability of on-die ECC (OECC) and the correction capability of Chipkill against RH. Cube allows for synergistic cooperation between ECC, probabilistic RH-protection schemes, and the system, with minimal to no modification to each. First, Cube scrambles the rows of each chip using a boot-time key in a way that distributes RH victims to multiple Chipkill codewords. Second, Cube utilizes newly observed, distinct RH error characteristics from real DRAM chips to swiftly diagnose the RH victim rows using the error profile from OECC scrubbing, and even correct them leveraging Chipkill. When combined, Cube decreases the failure probability of PARA and a state-of-the-art RH protection scheme, SRS, by up to 10^-25. At a target failure probability of 10^-10 per year on a DDR5 rank under the RH threshold of 2K, Cube reduces the performance and table size overheads of SRS by up to 24.3% and 39.9%, respectively.
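
The chip-wise scrambling idea can be illustrated as follows (a toy model with an XOR-based keyed permutation; the real scheme's permutation, key management, and ECC integration are more involved): because each chip maps logical rows to physical rows with its own boot-time key, the rows physically adjacent to a hammered row correspond to different logical rows on different chips, which spreads RH victims across Chipkill codewords.

```python
# Toy sketch of Cube-style chip-wise row scrambling, assuming an XOR permutation:
# each chip maps logical rows to physical rows with its own boot-time key, so the
# rows physically adjacent to a hammered row correspond to *different* logical
# rows on different chips, spreading RH victims across ECC codewords.
import secrets

ROW_BITS = 16
CHIPS = 8
keys = [secrets.randbits(ROW_BITS) for _ in range(CHIPS)]   # boot-time keys, one per chip

def to_physical(logical_row, chip):
    return logical_row ^ keys[chip]            # XOR is a simple keyed permutation of rows

def to_logical(physical_row, chip):
    return physical_row ^ keys[chip]

def rh_victim_logical_rows(aggressor_logical, chip):
    """Logical rows whose data in `chip` sits physically adjacent to the aggressor."""
    p = to_physical(aggressor_logical, chip)
    return {to_logical((p - 1) % 2**ROW_BITS, chip),
            to_logical((p + 1) % 2**ROW_BITS, chip)}

# Without scrambling, every chip's victims would be the same two logical rows, so
# one logical row could take errors in all chips at once (uncorrectable). With
# per-chip keys, the victim sets rarely coincide across chips:
victims = [rh_victim_logical_rows(0x1234, c) for c in range(CHIPS)]
print(victims[:2], "overlap across all chips:", set.intersection(*victims))
```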

DOI: 10.1145/3613424.3623777


Session details: Session 7A: Accelerators Various

作者: Jones, Alex K.
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637197


Bucket Getter: A Bucket-based Processing Engine for Low-bit Block Floating Point (BFP) DNNs

作者: Lo, Yun-Chen and Liu, Ren-Shuo
关键词: Floating-point architecture, Deep learning, Bucket-based accumulation

Abstract

Block floating point (BFP), an efficient numerical system for deep neural networks (DNNs), achieves a good trade-off between dynamic range and hardware cost. Specifically, prior works have demonstrated that BFP formats with 3- to 5-bit mantissas can achieve FP32-comparable accuracy for various DNN workloads. We find that the floating-point adder (FP-Acc), which contains modules for normalization, alignment, addition, and fixed-point-to-floating-point (FXP2FP) conversion, dominates the power and area overheads, hindering the hardware efficiency of state-of-the-art low-bit BFP processing engines (BFP-PEs). To mitigate this issue, we propose Bucket Getter, a novel architecture built on the following techniques for improving energy and area efficiency. 1) We propose a bucket-based accumulation unit placed before the FP-Acc, which a) uses multiple small accumulators (buckets), each responsible for a small range of exponent values to which intermediate results are distributed, and b) accumulates in the FXP domain; this reduces the activity of the power-hungry alignment and format-conversion units. 2) We propose inter-bucket carry propagation, which allows each bucket to transmit overflow to an adjacent bucket and further reduces the activity of the FP-Acc. 3) We propose an out-of-bound-aware, adaptive, and circular bucket accumulator to significantly reduce the overhead of the bucket-based accumulator. 4) We further propose a shared FP-Acc, which exploits the low activity of the FP-Acc in the bucket-based architecture and shares one FP-Acc across several MAC engines to reduce the area overhead of the FP-Acc. Experimental results based on TSMC 40 nm technology demonstrate that the proposed Bucket Getter architecture reduces the computational energy by up to 57% and improves the area efficiency by up to 1.4×.
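
A minimal software analogue of bucket-based accumulation is sketched below (illustrative only; the bucket span, fixed-point width, and the carry/overflow handling of the actual design are assumptions): addends are binned by exponent range and accumulated as scaled integers, so the expensive floating-point alignment and normalization happen once per bucket rather than once per addend.

```python
# Minimal sketch of bucket-based accumulation under assumed parameters (not the
# Bucket Getter microarchitecture): partial products are binned by exponent range
# and accumulated as scaled integers, so FP alignment and normalization happen
# once per bucket instead of once per addend.
import math
from collections import defaultdict

BUCKET_SPAN = 8          # each bucket covers 8 consecutive exponent values (assumed)
FRAC_BITS = 24           # fixed-point fraction bits inside a bucket (assumed)

def bucket_accumulate(values):
    buckets = defaultdict(int)                      # bucket id -> integer accumulator
    for v in values:
        if v == 0.0:
            continue
        e = math.frexp(v)[1]                        # exponent of v
        b = e // BUCKET_SPAN
        scale = b * BUCKET_SPAN                     # all addends in bucket b share this scale
        buckets[b] += round(math.ldexp(v, FRAC_BITS - scale))   # fixed-point add, no alignment
    # One FP alignment/normalization per bucket when merging (the inter-bucket step).
    return sum(math.ldexp(acc, b * BUCKET_SPAN - FRAC_BITS) for b, acc in buckets.items())

vals = [1.5e-3, 2.25, -0.75, 3.0e4, 1.0e-3, 7.5]
print(bucket_accumulate(vals), sum(vals))           # close, up to per-bucket rounding error
```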

DOI: 10.1145/3613424.3614249


ACRE: Accelerating Random Forests for Explainability

作者: McCrabb, Andrew and Ahmed, Aymen and Bertacco, Valeria
关键词: No keywords

Abstract

As machine learning models become more widespread, they are being increasingly applied in applications that heavily impact people’s lives (e.g., medical diagnoses, judicial system sentences, etc.). Several communities are thus calling for ML models to be not only accurate, but also explainable. To achieve this, recommendations must be augmented with explanations summarizing how each recommendation outcome is derived. Explainable Random Forest (XRF) models are popular choices in this space, as they are both very accurate and can be augmented with explainability functionality, allowing end-users to learn how and why a specific outcome was reached. However, the limitations of XRF models hamper their adoption, the foremost being the high computational demands associated with training such models to support high-accuracy classifications, while also annotating them with explainability meta-data. In response, we present ACRE, a hardware accelerator to support XRF model training. ACRE accelerates key operations that bottleneck performance, while maintaining meta-data critical to support explainability. It leverages a novel Processing-in-Memory hardware unit, co-located with banks of a 3D-stacked High-Bandwidth Memory (HBM). The unit locally accelerates the execution of key training computations, boosting effective data-transfer bandwidth. Our evaluation shows that, when ACRE augments HBM3 memory, it yields an average system-level training performance improvement of 26.6x, compared to a baseline multicore processor solution with DDR4 memory. Further, ACRE yields a 2.5x improvement when compared to an HBM3 architecture baseline, increasing to 5x when not bottlenecked by a 16k-thread limit in the host. Finally, due to much higher performance, we observe that ACRE provides a 16.5x energy reduction overall, over a DDR baseline.

DOI: 10.1145/3613424.3623788


δLTA: Decoupling Camera Sampling from Processing to Avoid Redundant Computations in the Vision Pipeline

作者: Taranco, Raúl
关键词: Image Similarity, Image Signal Processor, Computation Reuse

Abstract

Continuous Vision (CV) systems are essential for emerging applications like Autonomous Driving (AD) and Augmented/Virtual Reality (AR/VR). A standard CV System-on-a-Chip (SoC) pipeline includes a frontend for image capture and a backend for executing vision algorithms. The frontend typically captures successive similar images with gradual positional and orientational variations. As a result, many regions between consecutive frames yield nearly identical results when processed in the backend. Despite this, current systems process every image region at the camera’s sampling rate, overlooking the fact that the actual rate of change in these regions could be significantly lower. In this work, we introduce δLTA (Don’t Look Twice, it’s Alright), a novel frontend that decouples camera frame sampling from backend processing by extending the camera with the ability to discard redundant image regions before they enter subsequent CV pipeline stages. δLTA informs the backend about the image regions that have notably changed, allowing it to focus solely on processing these distinctive areas and reusing previous results to approximate the outcome for similar ones. As a result, the backend processes each image region using different processing rates based on its temporal variation. δLTA features a new Image Signal Processing (ISP) design providing similarity filtering functionality, seamlessly integrated with other ISP stages to incur zero-latency overhead in the worst-case scenario. It also offers an interface for frontend-backend collaboration to fine-tune similarity filtering based on the application requirements. To illustrate the benefits of this novel approach, we apply it to a state-of-the-art CV localization application, typically employed in AD and AR/VR. We show that δLTA removes a significant fraction of unneeded frontend and backend memory accesses and redundant backend computations, which reduces the application latency by 15.22% and its energy consumption by 17%.
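
A simplified version of the similarity-filtering step is sketched below (plain Python, not the ISP hardware; the tile size and threshold are assumed parameters): each region of the current frame is compared against the previous frame, and only regions whose change exceeds the threshold are forwarded to the backend, while cached results are reused for the rest.

```python
# Simplified sketch of the frontend similarity filter, assuming a mean-absolute-
# difference metric, an 8x8 tile size, and a fixed threshold (all illustrative).

TILE = 8
THRESH = 4.0          # mean absolute difference threshold (tunable per application)

def changed_regions(prev, cur, width, height):
    """Return the (tx, ty) tiles whose mean absolute pixel difference exceeds THRESH."""
    changed = []
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            diff = 0
            for y in range(ty, min(ty + TILE, height)):
                for x in range(tx, min(tx + TILE, width)):
                    diff += abs(cur[y][x] - prev[y][x])
            pixels = min(TILE, width - tx) * min(TILE, height - ty)
            if diff / pixels > THRESH:
                changed.append((tx, ty))
    return changed

def backend_step(prev, cur, width, height, cached_results, process_tile):
    """Process only the changed tiles; reuse cached results for the similar ones."""
    for key in changed_regions(prev, cur, width, height):
        cached_results[key] = process_tile(cur, *key)
    return cached_results
```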

DOI: 10.1145/3613424.3614261


Session details: Session 7B: Caches, Intermittent Computing, Persistency

作者: Ausavarungnirun, Rachata
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637198


McCore: A Holistic Management of High-Performance Heterogeneous Multicores

作者: Kwon, Jaewon and Lee, Yongju and Kal, Hongju and Kim, Minjae and Kim, Youngsok and Ro, Won Woo
关键词: reinforcement learning, multi-core architectures, memory hierarchy, heterogeneous computing, hardware-based scheduling, cache partitioning

Abstract

Heterogeneous multicore systems have emerged as a promising approach to scale performance in high-end desktops within limited power and die size constraints. Despite their advantages, these systems face three major challenges: memory bandwidth limitation, shared cache contention, and heterogeneity. Small cores in these systems tend to occupy a significant portion of shared LLC and memory bandwidth, despite their lower computational capabilities, leading to performance degradation of up to 18% in memory-intensive workloads. Therefore, it is crucial to address these challenges holistically, considering shared resources and core heterogeneity while managing shared cache and bandwidth. To tackle these issues, we propose McCore, a comprehensive solution that reorganizes the heterogeneous multicore memory hierarchy and effectively leverages this structure through a hardware-based reinforcement learning (RL) scheduler. The McCore structure aims to enhance performance by partitioning the shared LLC based on each cluster’s asymmetric computing power and conditionally enabling fine-grained access in small cores. The McCore RL agent holistically controls these structures, incorporating a hardware-based online RL scheduler that accounts for bandwidth utilization and caching effectiveness to consider the heterogeneity in McCore structures. By implementing the RL agent module as hardware that cooperates with existing hardware monitors and performance counters, low-latency scheduling is enabled without burdening the OS kernel. McCore achieves a 25.1% performance gain compared to the baseline and significantly outperforms existing state-of-the-art cache partitioning, sparse access managing schemes, and heterogeneous multicore schedulers, providing a comprehensive solution for high-performance heterogeneous multicore systems.

DOI: 10.1145/3613424.3614295


SweepCache: Intermittence-Aware Cache on the Cheap

作者: Zhou, Yuchen and Zeng, Jianping and Jeong, Jungi and Choi, Jongouk and Jung, Changhee
关键词: failure-atomic, energy harvesting, compiler/architecture co-design

Abstract

This paper presents SweepCache, a new compiler/architecture co-design scheme that can equip energy harvesting systems with a volatile cache in a performant yet lightweight way. Unlike prior just-in-time checkpointing designs, which persist volatile data just before power failure and thus must dedicate additional energy to checkpointing, SweepCache partitions the program into a series of recoverable regions and persists stores at region granularity to fully utilize harvested energy for computation. In particular, SweepCache introduces a persist buffer, a redo buffer resident in nonvolatile memory (NVM), to keep the main memory consistent across power failures while persisting each region’s stores in a failure-atomic manner. Specifically, for writebacks during region execution, SweepCache saves their cachelines to the persist buffer. At each region end, SweepCache first flushes dirty cachelines to the buffer, allowing the next region to start with a clean cache, and then moves all buffered cachelines to the corresponding NVM locations. In this way, no matter when a power failure occurs, the buffer contents or their memory locations always remain intact, which serves as the basis for correct recovery. To hide the persistence delay, SweepCache speculatively starts a region right after the prior region finishes its execution, as if its stores were already persisted, with each of the two regions having its own persist buffer (i.e., dual buffering). This region-level parallelism helps SweepCache achieve the full potential of a high-performance data cache. The experimental results show that, compared to the original cache-free nonvolatile processor, SweepCache delivers speedups of 14.60× and 14.86× (outperforming the state-of-the-art work by 3.47× and 3.49×) for two representative energy-harvesting power traces, respectively.
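
The region-granular redo-buffer protocol can be summarized with the toy model below (a conceptual sketch; the completion flag, buffer layout, and the dual-buffering overlap of the real design are simplified away): mid-region writebacks go to the persist buffer, the region end flushes the remaining dirty lines and drains the buffer to the home NVM locations, and recovery replays only a fully logged region.

```python
# Conceptual sketch of region-granular persistence, assuming a single persist
# buffer and a persistent completion flag (the real design is a compiler/
# architecture co-design with dual buffers to overlap persistence with execution).

class SweepCacheModel:
    def __init__(self):
        self.nvm = {}                 # home locations (consistent across failures)
        self.cache = {}               # volatile cache: addr -> value
        self.persist_buffer = {}      # redo buffer resident in NVM
        self.buffer_complete = False  # set once a region's stores are fully logged

    def store(self, addr, value):
        self.cache[addr] = value      # volatile until the region commits

    def writeback(self, addr):
        self.persist_buffer[addr] = self.cache[addr]   # mid-region eviction goes to the buffer

    def end_region(self):
        self.persist_buffer.update(self.cache)   # 1) log all remaining dirty lines
        self.cache.clear()
        self.buffer_complete = True               #    region is now recoverable
        self.nvm.update(self.persist_buffer)      # 2) drain the buffer to home locations
        self.persist_buffer.clear()
        self.buffer_complete = False

    def recover(self):                             # run after a power failure
        self.cache.clear()
        if self.buffer_complete:                   # replay only fully-logged regions;
            self.nvm.update(self.persist_buffer)   # a partial buffer is simply discarded
        self.persist_buffer.clear()
        self.buffer_complete = False
```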

DOI: 10.1145/3613424.3623781


Persistent Processor Architecture

作者: Zeng, Jianping and Jeong, Jungi and Jung, Changhee
关键词: No keywords

Abstract

This paper presents PPA (Persistent Processor Architecture), simple microarchitectural support for lightweight yet performant whole-system persistence. PPA offers fully transparent crash consistency to all sorts of programs across the entire computing stack, including legacy applications, without any source-code change or recompilation. As a basis for crash consistency, PPA leverages so-called store integrity, which preserves store operands during program execution, persists them on an impending power failure, and replays the stores when power comes back. In particular, PPA realizes store integrity in hardware by keeping the operands in the physical register file (PRF) even after the stores have committed. Such store-integrity enforcement leads to region-level persistence: whenever the PRF runs out of free registers, PPA starts a new region after ensuring that all stores of the prior region have already been written to persistent memory. To minimize the pipeline stall across regions, PPA writes back the stores of each region asynchronously, overlapping their persistence latency with the execution of other instructions in the region. The experimental results with 41 applications from SPEC CPU2006/2017, SPLASH3, STAMP, WHISPER, and DOE Mini-apps show that PPA incurs only a 2% average run-time overhead and a 0.005% area cost, while the state-of-the-art work suffers a 26% overhead along with prohibitively high hardware and energy costs.

DOI: 10.1145/3613424.3623772


Session details: Session 8A: Accelerators for Neural Nets / Accelerators for Matrix Processing

作者: Clemons, Jason
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637199


ADA-GP: Accelerating DNN Training By Adaptive Gradient Prediction

作者: Janfaza, Vahid and Mandal, Shantanu and Mahmud, Farabi and Muzahid, Abdullah
关键词: Training, Systolic arrays, Prediction, Hardware accelerators

Abstract

Neural network training is inherently sequential: the layers finish the forward propagation in succession, followed by the calculation and back-propagation of gradients (based on a loss function) starting from the last layer. These sequential computations significantly slow down neural network training, especially for deeper networks. Prediction has been successfully used in many areas of computer architecture to speed up sequential processing. Therefore, we propose ADA-GP, which uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy. ADA-GP works by incorporating a small neural network to predict gradients for different layers of a DNN model. ADA-GP uses a novel tensor reorganization method to make it feasible to predict a large number of gradients. ADA-GP alternates between DNN training using back-propagated gradients and DNN training using predicted gradients. ADA-GP adaptively adjusts when and for how long gradient prediction is used to strike a balance between accuracy and performance. Last but not least, we provide a detailed hardware extension in a typical DNN accelerator to realize the speed-up potential of gradient prediction. Our extensive experiments with fifteen DNN models show that ADA-GP can achieve an average speed-up of 1.47×.

DOI: 10.1145/3613424.3623779


HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity

作者: Wu, Yannan Nellie and Tsai, Po-An and Muralidharan, Saurav and Parashar, Angshuman and Sze, Vivienne and Emer, Joel
关键词: structured sparsity, hardware-software co-design, computer architecture, Deep learning accelerator

Abstract

Due to complex interactions among various deep neural network (DNN) optimization techniques, modern DNNs can have weights and activations that are dense or sparse with diverse sparsity degrees. To offer a good trade-off between accuracy and hardware performance, an ideal DNN accelerator should have high flexibility to efficiently translate DNN sparsity into reductions in energy and/or latency without incurring significant complexity overhead. This paper introduces hierarchical structured sparsity (HSS), with the key insight that we can systematically represent diverse sparsity degrees by having them hierarchically composed from multiple simple sparsity patterns. As a result, HSS simplifies the underlying hardware since it only needs to support simple sparsity patterns; this significantly reduces the sparsity acceleration overhead, which improves efficiency. Motivated by such opportunities, we propose a simultaneously efficient and flexible accelerator, named HighLight, to accelerate DNNs that have diverse sparsity degrees (including dense). Due to the flexibility of HSS, different HSS patterns can be introduced to DNNs to meet different applications’ accuracy requirements. Compared to existing works, HighLight achieves a geomean of up to 6.4×…

DOI: 10.1145/3613424.3623786


Exploiting Inherent Properties of Complex Numbers for Accelerating Complex Valued Neural Networks

作者: Lee, Hyunwuk and Jang, Hyungjun and Kim, Sungbin and Kim, Sungwoo and Cho, Wonho and Ro, Won Woo
关键词: Quantization, Complex Valued Neural Networks, Accelerators

Abstract

Since conventional Deep Neural Networks (DNNs) use real numbers as their data, they are unable to capture the imaginary values and the correlations between real and imaginary values in applications that use complex numbers. To address this limitation, Complex Valued Neural Networks (CVNNs) have been introduced, enabling the context of complex numbers to be captured for various applications such as Magnetic Resonance Imaging (MRI), radar, and sensing. CVNNs handle their data as complex numbers and adopt complex-number arithmetic in their layer operations, so they exhibit design challenges distinct from those of real-valued DNNs. The first challenge is the data representation of the complex number, which requires two values for a single data element, doubling the total data size of the networks. Moreover, due to the unique operations of the complex-valued layers, CVNNs require a specialized scheduling policy to fully utilize the hardware resources and achieve optimal performance. To mitigate these design challenges, we propose software and hardware co-design techniques that effectively resolve the memory and compute overheads of CVNNs. First, we propose Polar Form Aware Quantization (PAQ), which utilizes the characteristics of complex numbers and their unique value distribution in CVNNs. Then, we propose a hardware accelerator that supports PAQ and CVNN operations. Lastly, we design a CVNN-aware scheduling scheme that optimizes the performance and resource utilization of the accelerator by targeting the special layer operations of CVNNs. PAQ achieves 62.5% data compression over CVNNs using FP16 while retaining an error similar to INT8 quantization, and our hardware supports PAQ with only 2% area overhead over a conventional systolic-array architecture. In our evaluation, the PAQ hardware with the scheduling scheme achieves 32% lower latency and 30% lower energy consumption than other accelerators.
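
The quantization idea itself is easy to illustrate (a toy Python version; the bit widths and scaling policy here are assumptions, not PAQ's exact format): complex values are quantized as magnitude and phase rather than as separate real and imaginary parts.

```python
# Toy polar-form quantization for complex values, inspired by but not identical
# to PAQ: quantize magnitude and phase, which often matches the value
# distribution of CVNN weights/activations better than quantizing Re/Im parts.
import cmath

MAG_BITS, PHASE_BITS = 5, 3        # assumed bit budget

def quantize(z, max_mag):
    mag_levels, phase_levels = 2**MAG_BITS - 1, 2**PHASE_BITS
    q_mag = round(min(abs(z), max_mag) / max_mag * mag_levels)
    q_phase = round((cmath.phase(z) % (2 * cmath.pi)) / (2 * cmath.pi) * phase_levels) % phase_levels
    return q_mag, q_phase                      # packs into MAG_BITS + PHASE_BITS bits

def dequantize(q, max_mag):
    q_mag, q_phase = q
    mag = q_mag / (2**MAG_BITS - 1) * max_mag
    return cmath.rect(mag, q_phase / 2**PHASE_BITS * 2 * cmath.pi)

weights = [0.3 + 0.4j, -0.1 + 0.05j, 0.9j]
max_mag = max(abs(w) for w in weights)
for w in weights:
    print(w, "->", dequantize(quantize(w, max_mag), max_mag))
```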

DOI: 10.1145/3613424.3614287


Point Cloud Acceleration by Exploiting Geometric Similarity

作者: Chen, Cen and Zou, Xiaofeng and Shao, Hongen and Li, Yangfan and Li, Kenli
关键词: Software-Hardware Co-Design, Redundancy-Aware Computation, Point cloud, Hardware Accelerator

Abstract

Deep learning on point clouds has attracted increasing attention for various emerging 3D computer vision applications, such as autonomous driving, robotics, and virtual reality. These applications interact with people in real time on edge devices and thus require low latency and low energy. To accelerate the execution of deep neural networks (DNNs) on point clouds, customized accelerators have been proposed that achieve significantly higher performance with reduced energy consumption compared to GPUs and existing DNN accelerators. In this work, we reveal that DNN execution on geometrically adjacent points produces similar values and relations, and that these correlations lead to a large amount of redundant computation and communication. To address this issue, we propose GDPCA, a geometry-aware differential point cloud accelerator that exploits geometric similarity to reduce these redundancies for point cloud neural networks. GDPCA is supported by an algorithm and architecture co-design. Our proposed algorithm discovers and reduces computation and communication redundancies with geometry-aware and differential execution mechanisms. A novel architecture is then designed to support the proposed algorithm and translate the redundancy reduction into performance improvement. GDPCA performs the same computations and gives the same accuracy as traditional point cloud neural networks. To the best of our knowledge, GDPCA is the first accelerator that reduces execution redundancies for point cloud neural networks by exploiting geometric similarity. Our proposed GDPCA system gains an average of 2.9×…

DOI: 10.1145/3613424.3614290


HARP: Hardware-Based Pseudo-Tiling for Sparse Matrix Multiplication Accelerator

作者: Kim, Jinkwon and Jang, Myeongjae and Nam, Haejin and Kim, Soontae
关键词: Tiling, Sparse matrix multiplication, Sparse Matrix, Sparse matrix tiling, SpGEMM, Hardware accelerator, Application-specific hardware

Abstract

General sparse matrix-matrix multiplication (SpGEMM) is a memory-bound workload due to the compression formats used. To minimize data movement for the input matrices, outer-product accelerators have been proposed. Since these accelerators access the input matrices only once and then generate numerous partial products, managing the generated partial products is the key optimization factor. To reduce the number of partial products handled, the state-of-the-art accelerator uses software to tile an input matrix. However, software-based tiling has three limitations. First, a user must manually execute the tiling software and manage the tiles. Second, generating a compression format for each tile incurs memory-intensive operations. Third, an accelerator that uses the compression format cannot skip ineffectual accesses to the input matrices. To overcome these limitations, this paper proposes hardware-based pseudo-tiling (HARP), which enables logical tiling of the original compressed matrix without generating a compression format for each tile. To this end, HARP utilizes our proposed Runtime Operand Descriptor to point to an effectual column-row pair in a pseudo-tile. Consequently, HARP enables a user to invoke the accelerator as a normal SpGEMM operation, without tiling. Furthermore, HARP does not require a compression format for each tile and can skip ineffectual accesses to the input matrices. To further improve the efficiency of pseudo-tiling, HARP performs super-tiling to combine pseudo-tiles and sub-tiling to further partition a pseudo-tile. Experimental results show that HARP achieves 4×…

DOI: 10.1145/3613424.3623790


Session details: Session 8B: Virtual Memory (Translation)

作者: Alian, Mohammad
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637200


IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE Invalidations

作者: Li, Bingyao and Guo, Yanan and Wang, Yueqi and Jaleel, Aamer and Yang, Jun and Tang, Xulong
关键词: page table invalidation, page sharing, multi-GPU

Abstract

Multi-GPU systems have emerged as a desirable platform to deliver high computing capability and large memory capacity to accommodate large dataset sizes. However, naively employing multiple GPUs yields non-scalable performance. One major reason is that execution efficiency suffers from expensive address translation in multi-GPU systems. The data-sharing nature of GPU applications requires page migration between GPUs to mitigate non-uniform memory access overheads. Unfortunately, frequent page migration incurs substantial page table invalidation overheads to ensure translation coherence. A comprehensive investigation of multi-GPU address translation efficiency identifies two significant bottlenecks caused by page table invalidation requests: (i) increased latency for demand TLB miss requests and (ii) increased waiting latency for performing page migrations. Based on these observations, we propose IDYLL, which reduces the number of page table invalidations by maintaining an “in-PTE” directory and reduces invalidation latency by batching multiple invalidation requests to exploit spatial locality. We show that IDYLL improves overall performance by 69.9% on average.

DOI: 10.1145/3613424.3614269


Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

作者: Kanellopoulos, Konstantinos and Nam, Hong Chul and Bostanci, Nisa and Bera, Rahul and Sadrosadati, Mohammad and Kumar, Rakesh and Bartolini, Davide Basilio and Mutlu, Onur
关键词: Virtualization, Virtual Memory, TLB, Microarchitecture, Memory Systems, Memory Hierarchy, Cache, Address Translation

Abstract

Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power, and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Leveraging the PTW-CP, Victima uses the valuable cache space only for TLB entries that correspond to costly-to-translate pages, reducing the impact on cached application data. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima delivers performance similar to a system that employs an optimistic 128K-entry L2 TLB, while avoiding the associated area and power overheads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, (iii) unlike large software-managed TLBs, does not require contiguous physical allocations, (iv) is compatible with modern large-page mechanisms, and (v) incurs very small area and power overheads on a modern high-end CPU. The source code of Victima is freely available at https://github.com/CMU-SAFARI/Victima.

DOI: 10.1145/3613424.3614276


Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings

作者: Kanellopoulos, Konstantinos and Bera, Rahul and Stojiljkovic, Kosta and Bostanci, F. Nisa and Firtina, Can and Ausavarungnirun, Rachata and Kumar, Rakesh and Hajinazar, Nastaran and Sadrosadati, Mohammad and Vijaykumar, Nandita and Mutlu, Onur
关键词: Virtualization, Virtual Memory, TLB, Microarchitecture, Memory Systems, Cache, Address Translation

Abstract

Conventional virtual memory (VM) frameworks enable a virtual address to flexibly map to any physical address. This flexibility necessitates large data structures to store virtual-to-physical mappings, which leads to high address translation latency and large translation-induced interference in the memory hierarchy, especially in data-intensive workloads. On the other hand, restricting the address mapping so that a virtual address can only map to a specific set of physical addresses can significantly reduce address translation overheads by making use of compact and efficient translation structures. However, restricting the address mapping flexibility across the entire main memory severely limits data sharing across different processes and increases data accesses to the swap space of the storage device even in the presence of free memory. We propose Utopia, a new hybrid virtual-to-physical address mapping scheme that allows both flexible and restrictive hash-based address mapping schemes to harmoniously co-exist in the system. The key idea of Utopia is to manage physical memory using two types of physical memory segments: restrictive segments and flexible segments. A restrictive segment uses a restrictive, hash-based address mapping scheme that maps virtual addresses to only a specific set of physical addresses and enables faster address translation using compact translation structures. A flexible segment employs the conventional fully-flexible address mapping scheme. By mapping data to a restrictive segment, Utopia enables faster address translation with lower translation-induced interference. At the same time, Utopia retains the ability to use the flexible address mapping to (i) support conventional VM features such as data sharing and (ii) avoid storing data in the swap space of the storage device when program data does not fit inside a restrictive segment. Our evaluation using 11 diverse data-intensive workloads shows that Utopia improves performance by 24% in a single-core system over the baseline conventional four-level radix-tree page table design, whereas the best prior state-of-the-art contiguity-aware translation scheme improves performance by 13%. Utopia provides 95% of the performance benefits of an ideal address translation scheme where every translation request hits in the first-level TLB. All of Utopia’s benefits come at a modest cost of 0.64% area overhead and 0.72% power overhead compared to a modern high-end CPU. The source code of Utopia is freely available at https://github.com/CMU-SAFARI/Utopia.
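
The following sketch conveys the hybrid-mapping intuition (the set size, hash function, and fallback policy are assumptions, not Utopia's actual structures): a restrictive segment confines each virtual page to a small hash-determined set of physical frames, so its translation needs only a set index and a way, while pages that do not fit fall back to a conventional, fully flexible mapping.

```python
# Minimal sketch of the hybrid-mapping idea under assumed parameters: a
# restrictive segment maps a virtual page only to a small, hash-determined set
# of physical frames; pages that do not fit fall back to a flexible segment
# backed by a conventional page table.
RESTRICTIVE_FRAMES = 1 << 16     # frames managed by the restrictive segment
WAYS = 4                         # candidate frames per virtual page

restrictive = [None] * RESTRICTIVE_FRAMES     # frame -> owning virtual page number
flexible_pt = {}                              # conventional fully-flexible page table

def candidate_set(vpn):
    h = (vpn * 0x9E3779B1) & 0xFFFFFFFF       # cheap hash; the set is fixed per VPN
    base = (h % (RESTRICTIVE_FRAMES // WAYS)) * WAYS
    return range(base, base + WAYS)

def map_page(vpn, next_free_flexible_frame):
    for frame in candidate_set(vpn):
        if restrictive[frame] is None:
            restrictive[frame] = vpn
            return ("restrictive", frame)     # translation: hash + way lookup, no page walk
    flexible_pt[vpn] = next_free_flexible_frame
    return ("flexible", next_free_flexible_frame)

def translate(vpn):
    for frame in candidate_set(vpn):          # compact lookup for restrictive pages
        if restrictive[frame] == vpn:
            return frame
    return flexible_pt.get(vpn)               # falls back to the conventional mapping

print(map_page(0x1234, next_free_flexible_frame=RESTRICTIVE_FRAMES + 7))
print(translate(0x1234))
```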

DOI: 10.1145/3613424.3623789


Architectural Support for Optimizing Huge Page Selection Within the OS

作者: Manocha, Aninda and Yan, Zi and Tureci, Esin and Aragón
关键词: virtual memory, operating systems, memory management, hardware-software co-design, graph processing, cache architectures

Abstract

Irregular, memory-intensive applications often incur high translation lookaside buffer (TLB) miss rates that result in significant address translation overheads. Employing huge pages is an effective way to reduce these overheads; however, in real systems the number of available huge pages can be limited when system memory is nearly full and/or fragmented. Thus, huge pages must be used selectively to back application memory. This work demonstrates that choosing the memory regions that incur the most TLB misses for huge page promotion best reduces address translation overheads. We call these regions High reUse TLB-sensitive data (HUBs). Unlike prior work, which relies on expensive per-page software counters to identify promotion regions, we propose new architectural support to identify these regions dynamically at application runtime. We propose a promotion candidate cache (PCC) that identifies HUB candidates based on hardware page table walks after a last-level TLB miss. This small, fixed-size structure tracks huge-page-aligned regions (consisting of N base pages), ranks them based on observed page table walk frequency, and only keeps the most frequently accessed ones. Evaluated on applications of various memory intensity, our approach successfully identifies the application pages incurring the highest address translation overheads. Our approach demonstrates that, with the help of a PCC, the OS only needs to promote a small fraction of the application footprint to achieve most of the peak achievable performance, yielding 1.19–1.33×…
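
A software model of the promotion candidate cache is sketched below (the entry count, ranking policy, and OS interface are assumptions): each page table walk credits its 2MB-aligned region, the small fixed-size structure retains only the most frequently walked regions, and the OS promotes the top-ranked ones.

```python
# Sketch of a promotion candidate cache (PCC) in the spirit of this paper; the
# sizes, eviction policy, and promotion interface are illustrative assumptions.

HUGE_PAGE = 2 << 20
PCC_ENTRIES = 16

class PromotionCandidateCache:
    def __init__(self):
        self.counts = {}                          # region base -> observed PTW count

    def on_page_table_walk(self, vaddr):
        region = vaddr & ~(HUGE_PAGE - 1)         # 2MB-aligned region of this walk
        if region in self.counts:
            self.counts[region] += 1
        elif len(self.counts) < PCC_ENTRIES:
            self.counts[region] = 1
        else:                                     # structure is full: evict the least-walked region
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
            self.counts[region] = 1

    def promotion_candidates(self, top_n=4):
        return sorted(self.counts, key=self.counts.get, reverse=True)[:top_n]

pcc = PromotionCandidateCache()
for addr in [0x7f0000000123, 0x7f0000001456, 0x7f0000200789, 0x7f0000000abc]:
    pcc.on_page_table_walk(addr)
print([hex(r) for r in pcc.promotion_candidates()])
```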

DOI: 10.1145/3613424.3614296


Session details: Session 8C: Benchmarking and Methodology

作者: Moretó
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637201


Photon: A Fine-grained Sampled Simulation Methodology for GPU Workloads

作者: Liu, Changxi and Sun, Yifan and Carlson, Trevor E.
关键词: Workload sampling, Simulation, GPU

Abstract

GPUs, due to their massively-parallel computing architectures, provide high performance for data-parallel applications. However, existing GPU simulators are too slow to enable architects to quickly evaluate their hardware designs and software analysis studies. Sampled simulation methodologies are one common way to speed up CPU simulation. However, GPUs apply drastically different execution models that challenge the sampled simulation methods designed for CPU simulations. Recent GPU sampled simulation methodologies do not fully take advantage of the GPU’s special architecture features, such as limited types of basic blocks or warps. Moreover, these methods depend on up-front analysis via profiling tools or functional simulation, making them difficult to use. To address this, we extensively studied the execution patterns of a variety of GPU workloads and propose Photon, a sampled simulation methodology tailored to GPUs. Photon incorporates methodologies that automatically consider different levels of GPU execution, such as kernels, warps, and basic blocks. Photon does not require up-front profiling of GPU workloads and utilizes a light-weight online analysis method based on the identification of highly repetitive software behavior. We evaluate Photon using a variety of GPU workloads, including real-world applications like VGG and ResNet. The final result shows that Photon reduces the simulation time needed to perform one inference of ResNet-152 with batch size 1 from 7.05 days to just 1.7 hours with a low sampling error of 10.7%.

DOI: 10.1145/3613424.3623773


Rigorous Evaluation of Computer Processors with Statistical Model Checking

作者: Mazurek, Filip and Tschand, Arya and Wang, Yu and Pajic, Miroslav and Sorin, Daniel
关键词: statistical model checking, evaluation, confidence intervals

Abstract

Experiments with computer processors must account for the inherent variability in executions. Prior work has shown that real systems exhibit variability, and random effects must be injected into simulators to account for it. Thus, we can run multiple executions of a given benchmark and generate a distribution of results. Prior work uses standard statistical techniques that are not suitable. While the result distributions may take any forms that are unknown a priori, many works naively assume they are Gaussian, which can be far from the truth. To allow rigorous evaluation for arbitrary result distributions, we introduce statistical model checking (SMC) to the world of computer architecture. SMC is a statistical technique that is used in research communities that depend heavily on statistical guarantees. SMC provides a rigorous mathematical methodology that employs experimental sampling for probabilistic evaluation of properties of interest, such that one can determine with a desired confidence whether a property (e.g., System X is 1.1x faster than System Y) is true or not. SMC alone is not enough for computer architects to draw conclusions based on their data. We create an end-to-end framework called SMC for Processor Analysis (SPA) which utilizes SMC techniques to provide insightful conclusions given experimental data.
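
The style of guarantee SMC provides can be illustrated with a small distribution-free check (a generic Hoeffding-bound sketch, not the specific algorithms inside SPA): given paired runtime samples, it decides with confidence 1 − δ whether the property "System X is at least 1.1x faster than System Y" holds with probability at least θ, without assuming Gaussian result distributions.

```python
# Distribution-free sketch of an SMC-style property check under assumed
# parameters (this is a plain one-sided Hoeffding bound, not SPA itself).
import math, random

def smc_check(samples_x, samples_y, speedup=1.1, theta=0.9, delta=0.05):
    # Indicator per paired run: did X beat Y by the required speedup?
    indicators = [1.0 if y / x >= speedup else 0.0 for x, y in zip(samples_x, samples_y)]
    n = len(indicators)
    p_hat = sum(indicators) / n
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))    # one-sided Hoeffding radius
    lower = p_hat - eps                                   # true probability >= lower, w.p. >= 1 - delta
    return p_hat, lower, lower >= theta                   # accept the property only if the bound clears theta

random.seed(0)
runs_x = [random.gauss(1.00, 0.02) for _ in range(500)]   # synthetic runtimes, for illustration only
runs_y = [random.gauss(1.15, 0.03) for _ in range(500)]
print(smc_check(runs_x, runs_y))
```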

DOI: 10.1145/3613424.3623785


TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators

作者: Nayak, Nandeeka and Odemuyiwa, Toluwanimi O. and Ugare, Shubham and Fletcher, Christopher and Pellauer, Michael and Emer, Joel
关键词: No keywords

Abstract

Over the past few years, the explosion in sparse tensor algebra workloads has led to a corresponding rise in domain-specific accelerators to service them. Due to the irregularity present in sparse tensors, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on design-flexible sparse accelerator modeling does not express this full range of design features, making it difficult to understand the impact of each design choice and compare or extend the state-of-the-art. To address this, we propose TeAAL: a language and simulator generator for the concise and precise specification and evaluation of sparse tensor algebra accelerators. We use TeAAL to represent and evaluate four disparate state-of-the-art accelerators—ExTensor, Gamma, OuterSPACE, and SIGMA—and verify that it reproduces their performance with high accuracy. Finally, we demonstrate the potential of TeAAL as a tool for designing new accelerators by showing how it can be used to speed up vertex-centric programming accelerators, achieving 1.9×…

DOI: 10.1145/3613424.3623791


TileFlow: A Framework for Modeling Fusion Dataflow via Tree-based Analysis

作者: Zheng, Size and Chen, Siyuan and Gao, Siyuan and Jia, Liancheng and Sun, Guangyu and Wang, Runsheng and Liang, Yun
关键词: Tensor Programs, Simulation and modeling, Fusion, Accelerator

Abstract

With the increasing size of DNN models and the growing discrepancy between compute performance and memory bandwidth, fusing multiple layers together to reduce off-chip memory access has become a popular approach in dataflow design. However, designing such dataflows requires flexible and accurate performance models to facilitate evaluation, architecture analysis, and design space exploration. Unfortunately, current state-of-the-art performance models are limited to the dataflows of single-operator acceleration, making them inapplicable to operator-fusion dataflows. In this paper, we propose a framework called TileFlow that models dataflows for operator fusion. We first characterize the design space of fusion dataflows as a 3D space encompassing compute ordering, resource binding, and loop tiling. We then introduce a tile-centric notation to express dataflow designs within this space. Inspired by the tiling structure of fusion dataflows, we present a tree-based approach to analyze two critical performance metrics: data movement volume within the accelerator memory hierarchy and accelerator compute/memory resource usage. Finally, we leverage these metrics to calculate latency and energy consumption. Our evaluation validates TileFlow’s modeling accuracy against both real hardware and state-of-the-art performance models. We use TileFlow to aid in fusion dataflow design and analysis, and it helps us discover fusion dataflows that achieve an average runtime speedup of 1.85×.

DOI: 10.1145/3613424.3623792


Learning to Drive Software-Defined Solid-State Drives

作者: Li, Daixuan and Sun, Jinghan and Huang, Jian
关键词: Solid State Drive, Software-Defined Hardware, Machine Learning for Systems, Learning-Based Storage

Abstract

Thanks to mature manufacturing techniques, flash-based solid-state drives (SSDs) are highly customizable for today’s applications, which brings opportunities to further improve their storage performance and resource utilization. However, SSD efficiency is usually determined by many hardware parameters, making it hard for developers to tune them manually and determine the optimal SSD hardware configurations. In this paper, we present an automated learning-based SSD hardware configuration framework, named AutoBlox, that utilizes both supervised and unsupervised machine learning (ML) techniques to drive the tuning of hardware configurations for SSDs. AutoBlox automatically extracts the unique access patterns of a new workload using its block I/O traces, maps the workload to previous workloads to utilize the learned experience, and recommends an optimized SSD configuration based on the validated storage performance. AutoBlox accelerates the development of new SSD devices by automating the hardware parameter configuration and reducing manual effort. We develop AutoBlox with simple yet effective learning algorithms that can run efficiently on multi-core CPUs. Given a target storage workload, our evaluation shows that AutoBlox can deliver an optimized SSD configuration that improves the performance of the target workload by 1.30×.

DOI: 10.1145/3613424.3614281


Session details: Session 9A: Accelerators in Processors

作者: Liu, Sihang
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637202


Cambricon-R: A Fully Fused Accelerator for Real-Time Learning of Neural Scene Representation

作者: Song, Xinkai and Wen, Yuanbo and Hu, Xing and Liu, Tianbo and Zhou, Haoxuan and Han, Husheng and Zhi, Tian and Du, Zidong and Li, Wei and Zhang, Rui and Zhang, Chen and Gao, Lin and Guo, Qi and Chen, Tianshi
关键词: neural scene representation, hardware accelerator

Abstract

Neural scene representation (NSR) initiates a new methodology of encoding a 3D scene with neural networks by learning from dozens of photos taken from different camera positions. NSR not only achieves significant improvement in the quality of novel view synthesis and 3D reconstruction but also reduces camera cost from expensive laser cameras to cheap off-the-shelf color cameras. However, performing 3D scene encoding using NSR is far from real-time due to extremely low hardware utilization (only a small fraction of hardware peak performance), which greatly limits its applications in real-time AR/VR interactions. In this paper, we propose Cambricon-R, a fully fused on-chip processing architecture for real-time NSR learning. Initially, by performing a thorough characterization of the computing model of NSR on a GPU, we find that the extremely low hardware utilization is mainly caused by fragmentary stages and heavy irregular memory accesses. To address these issues, we propose the Cambricon-R architecture with a novel fully-fused ray-based execution model to eliminate the computing and memory inefficiencies for real-time NSR learning. Concretely, Cambricon-R features a ray-level fused architecture that not only eliminates intermediate memory traffic but also leverages the point sparsity in scenes to eliminate unnecessary computations. Additionally, a high-throughput on-chip memory system based on an Auto-Interpolation Bank Array (AIBA) is proposed to efficiently handle a large volume of irregular memory accesses. We evaluate Cambricon-R on 12 commonly used datasets with the state-of-the-art representative algorithm, instant-ngp. The results show that Cambricon-R achieves substantially higher PE utilization on average. Compared to the state-of-the-art solution on an A100 GPU, Cambricon-R achieves 373.8×…

DOI: 10.1145/3613424.3614250


Strix: An End-to-End Streaming Architecture with Two-Level Ciphertext Batching for Fully Homomorphic Encryption with Programmable Bootstrapping

作者: Putra, Adiwena and Prasetiyo and Chen, Yi and Kim, John and Kim, Joo-Young
关键词: programmable bootstrapping, fully homomorphic encryption, ciphertext batching, accelerator

Abstract

Homomorphic encryption (HE) is a type of cryptography that allows computations to be performed on encrypted data. The technique relies on the learning with errors problem, where data is hidden under noise for security. To avoid excessive noise, bootstrapping is used to reset the noise level in the ciphertext, but it requires a large key and is computationally expensive. The fully homomorphic encryption over the torus (TFHE) scheme offers a faster, programmable bootstrapping (PBS) algorithm, which is crucial for many privacy-focused applications. Nonetheless, the current TFHE scheme does not support ciphertext packing, resulting in low-throughput performance. To the best of our knowledge, this is the first work that thoroughly analyzes TFHE bootstrapping, identifies the TFHE acceleration bottleneck in GPUs, and proposes a hardware TFHE accelerator to solve the bottleneck. We begin by identifying the TFHE acceleration bottleneck in GPUs due to the blind rotation fragmentation problem. This can be improved by increasing the batch size in PBS. We propose a two-level batching approach to enhance the batch size in PBS. To implement this solution efficiently, we introduce Strix, utilizing a streaming and fully pipelined architecture with specialized units to accelerate ciphertext processing in TFHE. Specifically, we propose a novel microarchitecture for decomposition in TFHE, suitable for processing streaming data at high throughput. We also employ a fully-pipelined FFT microarchitecture to address the memory access bottleneck and improve its performance through a folding scheme, achieving a 2× improvement.

DOI: 10.1145/3613424.3614264


A Tensor Marshaling Unit for Sparse Tensor Algebra on General-Purpose Processors

作者: Siracusa, Marco and Soria-Pardos, Víctor, et al.
关键词: vectorization, tensor merging, sparse tensor algebra, parallel tensor traversal, Dataflow accelerator

Abstract

This paper proposes the Tensor Marshaling Unit (TMU), a near-core programmable dataflow engine for multicore architectures that accelerates tensor traversals and merging, the most critical operations of sparse tensor workloads running on today’s computing infrastructures. The TMU leverages a novel multi-lane design that enables parallel tensor loading and merging, which naturally produces vector operands that are marshaled into the core for efficient SIMD computation. The TMU supports all the necessary primitives to be tensor-format and tensor-algebra complete. We evaluate the TMU on a simulated multicore system using a broad set of tensor algebra workloads, achieving a 3.6× improvement.
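
A rough software analogue of the traversal-and-merge primitive helps make the idea concrete; this is purely illustrative, not the TMU's multi-lane hardware design:

```python
# Intersect two sorted (coordinate, value) streams and emit operand pairs in
# fixed-width groups, as a SIMD core would consume them after marshaling.
def intersect_streams(a, b, vec_width=4):
    """a, b: lists of (coord, value) sorted by coord. Yields lists of operand pairs."""
    i = j = 0
    batch = []
    while i < len(a) and j < len(b):
        ca, cb = a[i][0], b[j][0]
        if ca == cb:                      # matching nonzeros -> one multiply operand pair
            batch.append((a[i][1], b[j][1]))
            i += 1
            j += 1
            if len(batch) == vec_width:   # a full vector register worth of operands
                yield batch
                batch = []
        elif ca < cb:
            i += 1
        else:
            j += 1
    if batch:
        yield batch                       # final partial vector

# Example: sparse dot product of two CSR-style fibers.
A = [(0, 1.0), (3, 2.0), (7, 4.0), (9, 1.5)]
B = [(3, 3.0), (7, 0.5), (8, 2.0), (9, 2.0)]
dot = sum(x * y for vec in intersect_streams(A, B) for x, y in vec)
print(dot)   # 2*3 + 4*0.5 + 1.5*2 = 11.0
```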

DOI: 10.1145/3613424.3614284


Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity

作者: Xue, Zi Yu and Wu, Yannan Nellie and Emer, Joel S. and Sze, Vivienne
关键词: No keywords

Abstract

Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data reuse and improve throughput, but typically allocate tile size in a given buffer for the worst-case data occupancy. This severely limits the utilization of available memory resources and reduces data reuse. Other accelerators employ complex tiling during preprocessing or at runtime to determine the exact tile size based on its occupancy. This paper proposes a speculative tensor tiling approach, called overbooking, to improve buffer utilization by taking advantage of the distribution of nonzero elements in sparse tensors to construct larger tiles with greater data reuse. To ensure correctness, we propose a low-overhead hardware mechanism, Tailors, that can tolerate data overflow by design while ensuring reasonable data reuse. We demonstrate that Tailors can be easily integrated into the memory hierarchy of an existing sparse tensor algebra accelerator. To ensure high buffer utilization with minimal tiling overhead, we introduce a statistical approach, Swiftiles, to pick a tile size so that tiles usually fit within the buffer’s capacity, but can potentially overflow, i.e., it overbooks the buffers. Across a suite of 22 sparse tensor algebra workloads, we show that our proposed overbooking strategy introduces an average speedup of 52.7×.
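
The overbooking decision can be sketched in a few lines; the percentile target, candidate tile sizes, and occupancy model below are assumptions for illustration, not Swiftiles' actual statistics:

```python
# Pick the largest tile size whose occupancy at a target percentile still fits
# in the buffer, so most tiles fit and the rare overflow is handled in hardware.
import numpy as np

def pick_tile_size(nnz_per_row, buffer_capacity, candidate_sizes, percentile=95):
    """nnz_per_row: nonzeros of each row of the sparse operand (a sample is enough)."""
    best = candidate_sizes[0]
    for rows_per_tile in sorted(candidate_sizes):
        # occupancy of each candidate tile = nonzeros of the rows it covers
        n = len(nnz_per_row) // rows_per_tile * rows_per_tile
        occ = np.add.reduceat(nnz_per_row[:n], np.arange(0, n, rows_per_tile))
        if np.percentile(occ, percentile) <= buffer_capacity:
            best = rows_per_tile          # larger tiles -> more reuse, still mostly fits
    return best

rng = np.random.default_rng(0)
nnz = rng.poisson(lam=8, size=4096)       # synthetic skewed occupancy per row
print(pick_tile_size(nnz, buffer_capacity=512, candidate_sizes=[8, 16, 32, 64, 128]))
```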

DOI: 10.1145/3613424.3623793


Session details: Session 9B: ML Compiler Optimizations / Reconfigurable Architectures

作者: Huang, Jian
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637203


Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs

作者: Zheng, Bojian and Yu, Cody Hao and Wang, Jie and Ding, Yaoyao and Liu, Yizhi and Wang, Yida and Pekhimenko, Gennady
关键词: machine learning compilers, dynamic neural networks, CUDA graphs

Abstract

Achieving high performance in machine learning workloads is a crucial yet difficult task. To achieve high runtime performance on hardware platforms such as GPUs, graph-based executions such as CUDA graphs are often used to eliminate CPU runtime overheads by submitting jobs in the granularity of multiple kernels. However, many machine learning workloads, especially dynamic deep neural networks (DNNs) with varying-sized inputs or data-dependent control flows, face challenges when directly using CUDA graphs to achieve optimal performance. We observe that the use of graph-based executions poses three key challenges in terms of efficiency and even practicability: (1) Extra data movements when copying input values to graphs’ placeholders. (2) High GPU memory consumption due to the numerous CUDA graphs created to efficiently support dynamic-shape workloads. (3) Inability to handle data-dependent control flows. To address those challenges, we propose Grape, a new graph compiler that enables practical and efficient graph-based executions for dynamic DNNs on GPUs. Grape comprises three key components: (1) an alias predictor that automatically removes extra data movements by leveraging code positions at the Python frontend, (2) a metadata compressor that efficiently utilizes the data redundancy in CUDA graphs’ memory regions by compressing them, and (3) a predication rewriter that safely replaces control flows with predication contexts while preserving programs’ semantics. The three components improve the efficiency and broaden the optimization scope of graph-based executions while allowing machine learning practitioners to program dynamic DNNs at the Python level with minimal source code changes. We evaluate Grape on state-of-the-art text generation (GPT-2, GPT-J) and speech recognition (Wav2Vec2) workloads, which include both training and inference, using real systems with modern GPUs. Our evaluation shows that Grape achieves gains of up to 36.43×.
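
For context, this is the standard PyTorch CUDA Graphs capture/replay pattern (requires a CUDA-capable GPU); it is not Grape itself, but it shows the static placeholder copies and per-shape graphs that Grape's alias predictor and metadata compressor target:

```python
import torch

# Capture a small inference model into a CUDA graph with static placeholder
# tensors, then replay it. Note the explicit copy into the placeholder before
# each replay: the kind of extra data movement the alias predictor removes.
assert torch.cuda.is_available()
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).cuda().eval()
static_in = torch.zeros(8, 256, device="cuda")    # placeholder captured by the graph

# Warm up on a side stream so lazy initialization does not happen during capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(g):
    static_out = model(static_in)                 # work recorded, not executed

def run(batch):
    static_in.copy_(batch)                        # copy real input into the placeholder
    g.replay()                                    # launch all captured kernels at once
    return static_out.clone()

print(run(torch.randn(8, 256, device="cuda")).shape)
```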

DOI: 10.1145/3613424.3614248


PockEngine: Sparse and Efficient Fine-tuning in a Pocket

作者: Zhu, Ligeng and Hu, Lanxiang and Lin, Ji and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song
关键词: sparse update, on-device training, neural network, efficient finetuning

Abstract

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the optimizations for learning on the edge, which faces challenges of resource limitations and edge hardware diversity. We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality. Secondly, PockEngine is compilation first: the entire training graph (including forward, backward and optimization steps) is derived at compile-time, which reduces the runtime overhead and brings opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations, including operator reordering and backend switching, which further reduce training cost. PockEngine supports diverse applications, frontends and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPU/GPU/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to a 15× speedup.
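
A much-simplified stand-in for sparse backpropagation, using ordinary PyTorch parameter freezing rather than PockEngine's compile-time graph pruning; the choice of trainable layers is hypothetical:

```python
import torch

# Freeze most of the model and update only the layers chosen for fine-tuning,
# so the backward graph and optimizer state shrink accordingly.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
trainable = {"4"}                                   # hypothetical choice: last layer only
for name, p in model.named_parameters():
    p.requires_grad = name.split(".")[0] in trainable

opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                                     # gradients exist only for unfrozen params
opt.step()
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable params")
```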

DOI: 10.1145/3613424.3614307


Towards Efficient Control Flow Handling in Spatial Architecture via Architecting the Control Flow Plane

作者: Deng, Jinyi and Tang, Xinru and Zhang, Jiahao and Li, Yuxuan and Zhang, Linyun and Han, Boxiao and He, Hongjun and Tu, Fengbin and Liu, Leibo and Wei, Shaojun and Hu, Yang and Yin, Shouyi
关键词: spatial architecture, control plane, control flow, coarse-grained reconfigurable array

Abstract

Spatial architecture is a high-performance architecture that uses control flow graphs and data flow graphs as the computational model and producer/consumer models as the execution models. However, existing spatial architectures suffer from control flow handling challenges. Upon categorizing their PE execution models, we find that they lack autonomous, peer-to-peer, and temporally loosely-coupled control flow handling capability, which leads to limited performance in control-intensive programs. We propose Marionette, a spatial architecture with an explicitly designed control flow plane. The Control Flow Plane enables autonomous, peer-to-peer, and temporally loosely-coupled control flow handling. The Proactive PE Configuration ensures computation-overlapped and timely configuration to improve the handling of branch divergence. The Agile PE Assignment enhances the pipeline performance of imperfect loops. We develop the full stack of Marionette (ISA, compiler, simulator, RTL) and demonstrate that, in a variety of challenging control-intensive programs, Marionette outperforms the state-of-the-art spatial architectures Softbrain, TIA, REVEL, and RipTide by a geomean of 2.88×.

DOI: 10.1145/3613424.3614246


Pipestitch: An energy-minimal dataflow architecture with lightweight threads

作者: Serafin, Nathan and Ghosh, Souradip and Desai, Harsh and Beckmann, Nathan and Lucia, Brandon
关键词: No keywords

Abstract

Computing at the extreme edge allows systems with high-resolution sensors to be pushed well outside the reach of traditional communication and power delivery, requiring high-performance, high-energy-efficiency architectures to run complex ML, DSP, image processing, etc. Recent work has demonstrated the suitability of CGRAs for energy-minimal computation, but has focused strictly on energy optimization, neglecting performance. Pipestitch is an energy-minimal CGRA architecture that adds lightweight hardware threads to ordered dataflow, exploiting abundant, untapped parallelism in the complex workloads needed to meet the demands of emerging sensing applications. Pipestitch introduces a programming model, control-flow operator, and synchronization network to allow lightweight hardware threads to pipeline on the CGRA fabric. Across 5 important sparse workloads, Pipestitch achieves a 3.49× speedup.

DOI: 10.1145/3613424.3614283


Session details: Session 9C: Domain Specific Genomics

作者: Bose, Pradip
关键词: No keywords

Abstract

No abstract available.

DOI: 10.1145/3637204


CASA: An Energy-Efficient and High-Speed CAM-based SMEM Seeding Accelerator for Genome Alignment

作者: Huang, Yi and Kong, Lingkun and Chen, Dibei and Chen, Zhiyu and Kong, Xiangyu and Zhu, Jianfeng and Mamouras, Konstantinos and Wei, Shaojun and Yang, Kaiyuan and Liu, Leibo
关键词: SMEM Seeding, Genome Alignment, Filtering, CAM

Abstract

Genome analysis is a critical tool in medical and bioscience research, clinical diagnostics and treatment, and disease control and prevention. Seed-and-extension-based alignment is the main approach in the genome analysis pipeline, and BWA-MEM2, a widely acknowledged tool for genome alignment, performs seeding by searching for super-maximal exact matches (SMEMs). The computation of SMEM searching requires high memory bandwidth and energy consumption, which becomes the main performance bottleneck in BWA-MEM2. State-of-the-art designs like ERT and GenAx have achieved impressive speed-ups of SMEM-based genome alignment. However, they are constrained by frequent DRAM fetches or computationally intensive intersection calculations for all possible k-mers at every read position. We present a CAM-based SMEM seeding accelerator for genome alignment (CASA), which circumvents the major throughput and power bottlenecks brought by data fetches and frequent position intersections through the co-design of a novel CAM-based computing architecture and a new SMEM search algorithm. CASA mainly consists of a pre-seeding filter table and an SMEM computing unit. The former expands the k-mer size to 19 using limited on-chip memory, which enables the efficient filtration of non-SMEM pivots. The latter applies a new algorithm to filter out disposable SMEMs that are already contained in other SMEMs. We evaluated a 28nm CASA implementation using the human and mouse genome references. CASA achieves a 1.2× improvement.
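
The pre-seeding filter idea can be sketched with a plain hash set standing in for the on-chip CAM; the real filter table, pivot definition, and SMEM algorithm differ:

```python
# Positions whose 19-mer never occurs in the reference cannot anchor a long
# exact match, so the expensive SMEM search can skip them.
K = 19

def build_filter(reference):
    return {reference[i:i + K] for i in range(len(reference) - K + 1)}

def candidate_pivots(read, kmer_set):
    return [i for i in range(len(read) - K + 1) if read[i:i + K] in kmer_set]

reference = "ACGT" * 64                     # stand-in for a real genome
read = "ACGTACGTACGTACGTACGT" + "T" * 20
kmers = build_filter(reference)
print(candidate_pivots(read, kmers))        # only the prefix positions survive the filter
```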

DOI: 10.1145/3613424.3614313


Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors

作者: Shahroodi, Taha and Singh, Gagandeep and Zahedi, Mahdi and Mao, Haiyu and Lindegger, Joel and Firtina, Can and Wong, Stephan and Mutlu, Onur and Hamdioui, Said
关键词: processing in memory (PIM), non-ideality, memristors, memory systems, genome analysis, deep neural networks (DNNs), computation in memory (CIM), basecalling

Abstract

Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately, these DNNs are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of DNNs. However, inherent device non-idealities and architectural limitations of such designs can greatly degrade the basecalling accuracy, which is critical for accurate genome analysis. To facilitate the adoption of memristor-based CIM designs for basecalling, it is important to (1) conduct a comprehensive analysis of potential CIM architectures and (2) develop effective strategies for mitigating the possible adverse effects of inherent device non-idealities and architectural limitations. This paper proposes Swordfish, a novel hardware/software co-design framework that can effectively address the two aforementioned issues. Swordfish incorporates seven circuit and device restrictions or non-idealities from characterized real memristor-based chips. Swordfish leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities. To demonstrate the effectiveness of Swordfish, we take Bonito, the state-of-the-art (i.e., accurate and fast), open-source basecaller as a case study. Our experimental results using Swordfish show that a CIM architecture can realistically accelerate Bonito for a wide range of real datasets by an average of 25.7×.
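
One class of non-ideality such a framework models can be illustrated with a toy weight-perturbation experiment; the quantization levels and variation model below are assumptions, not the characterized device data used in the paper:

```python
import numpy as np

# Weights mapped to memristor conductances suffer programming variation and
# limited precision, which perturbs the analog matrix-vector products a CIM
# crossbar computes.
rng = np.random.default_rng(1)

def cim_matvec(W, x, levels=16, sigma=0.05):
    """Quantize weights to `levels` conductance states and add multiplicative variation."""
    w_max = np.abs(W).max()
    Wq = np.round(W / w_max * (levels - 1)) / (levels - 1) * w_max   # conductance quantization
    Wn = Wq * (1.0 + sigma * rng.standard_normal(W.shape))           # device-to-device variation
    return Wn @ x

W = rng.standard_normal((64, 64)) * 0.1
x = rng.standard_normal(64)
ideal = W @ x
noisy = cim_matvec(W, x)
print("relative error:", np.linalg.norm(noisy - ideal) / np.linalg.norm(ideal))
```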

DOI: 10.1145/3613424.3614252


DASH-CAM: Dynamic Approximate SearcH Content Addressable Memory for genome classification

作者: Jahshan, Zuher and Merlin, Itay and Garzón, et al.
关键词: Pathogen detection, Pathogen classification, GC-eDRAM, Dynamic approximate search, Content Addressable Memory, Approximate search

Abstract

We propose a novel dynamic storage-based approximate search content addressable memory (DASH-CAM) for computational genomics applications, particularly for identification and classification of viral pathogens of epidemic significance. DASH-CAM provides a 5.5× improvement.

DOI: 10.1145/3613424.3614262


GMX: Instruction Set Extensions for Fast, Scalable, and Efficient Genome Sequence Alignment

作者: Doblas, Max and Lostes-Cazorla, Oscar and Aguado-Puig, Quim and Cebry, Nick and Fontova-Musté, et al.
关键词: sequence alignment, microarchitecture, hardware acceleration, genomics, edit-distance, bioinformatics, ISA extensions

Abstract

Sequence alignment remains a fundamental problem in computer science with practical applications ranging from pattern matching to computational biology. The ever-increasing volumes of genomic data produced by modern DNA sequencers motivate improved software and hardware sequence alignment accelerators that scale with longer sequence lengths and high error rates without losing accuracy. Furthermore, the wide variety of use cases requiring sequence alignment demands flexible and efficient solutions that can match or even outperform expensive application-specific accelerators. To address these challenges, we propose GMX, a set of ISA extensions that enable efficient sequence alignment computations based on dynamic programming (DP). GMX extensions provide the basic building-block operations to perform fast tile-wise computations of the DP matrix, reducing the memory footprint and allowing easy integration into widely-used algorithms and tools. Furthermore, we provide an efficient hardware implementation that integrates GMX extensions in a RISC-V-based edge system-on-chip (SoC). Compared to widely-used software implementations, our hardware-software co-design leveraging GMX extensions obtains speed-ups from 25× to 265×.
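
A plain-Python tile-wise edit-distance DP shows the kind of building block such ISA extensions accelerate; this is a sketch only, and GMX's actual primitives, scoring, and data layout differ:

```python
# The full DP matrix is computed block by block, mirroring fixed-size tile
# computation in hardware.
def edit_distance_tiled(a, b, tile=4):
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for bi in range(1, m + 1, tile):           # iterate over row tiles
        for bj in range(1, n + 1, tile):       # iterate over column tiles
            for i in range(bi, min(bi + tile, m + 1)):
                for j in range(bj, min(bj + tile, n + 1)):
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    D[i][j] = min(D[i - 1][j] + 1,         # deletion
                                  D[i][j - 1] + 1,         # insertion
                                  D[i - 1][j - 1] + cost)  # match / substitution
    return D[m][n]

print(edit_distance_tiled("GATTACA", "GCATGCU"))   # classic example, distance 3
```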

DOI: 10.1145/3613424.3614306


Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

作者: Bera, Rahul and Kanellopoulos, Konstantinos and Balachandran, Shankar and Novo, David and Olgun, Ataberk and Sadrosadati, Mohammad and Mutlu, Onur
关键词: No keywords

Abstract

Long-latency load requests continue to limit the performance of modern high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: (1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and (2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy to solely determine that it needs to go off-chip. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: (1) accurately predict which load requests might go off-chip, and (2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters, byte offset of a load request). For every load request generated by the processor, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative load request directly to the main memory controller once the load’s physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative load request to finish, and thus Hermes completely hides the on-chip cache hierarchy access latency from the critical path of the correctly-predicted off-chip load. Our extensive evaluation using a wide range of workloads shows that Hermes provides consistent performance improvement on top of a state-of-the-art baseline system across a wide range of configurations with varying core count, main memory bandwidth, high-performance data prefetchers, and on-chip cache hierarchy access latencies, while incurring only modest storage overhead. The source code of Hermes is freely available at: https://github.com/CMU-SAFARI/Hermes.
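
The predictor can be sketched as a hashed perceptron; the features, table sizes, and thresholds below are illustrative, not the paper's tuned design:

```python
# Each feature indexes a small weight table, the weights are summed, and the
# sign of the sum predicts whether the load will go off-chip.
TABLE_SIZE = 1024
THRESHOLD = 14          # training threshold, as in classic perceptron predictors

class OffChipPredictor:
    def __init__(self, num_features=3):
        self.tables = [[0] * TABLE_SIZE for _ in range(num_features)]

    def _indices(self, pc, addr, last_pcs):
        # Hypothetical features: load PC, cacheline byte offset XOR PC, recent PC history.
        return [pc % TABLE_SIZE,
                ((addr & 0x3F) ^ pc) % TABLE_SIZE,
                (hash(tuple(last_pcs)) ^ pc) % TABLE_SIZE]

    def predict(self, pc, addr, last_pcs):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(pc, addr, last_pcs)))
        return s >= 0, s                      # True -> issue speculative DRAM request

    def train(self, pc, addr, last_pcs, went_offchip):
        pred, s = self.predict(pc, addr, last_pcs)
        if pred != went_offchip or abs(s) < THRESHOLD:
            for t, i in zip(self.tables, self._indices(pc, addr, last_pcs)):
                t[i] = max(-32, min(31, t[i] + (1 if went_offchip else -1)))

p = OffChipPredictor()
p.train(pc=0x400A10, addr=0x7FFF1234, last_pcs=[0x4009F0, 0x400A00], went_offchip=True)
print(p.predict(pc=0x400A10, addr=0x7FFF1234, last_pcs=[0x4009F0, 0x400A00]))
```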

DOI: 10.1109/MICRO56248.2022.00015


Whisper: Profile-Guided Branch Misprediction Elimination for Data Center Applications

作者: Khan, Tanvir Ahmed and Ugur, Muhammed and Nathella, Krishnendra and Sunwoo, Dam and Litz, Heiner and Jiménez, et al.
关键词: No keywords

Abstract

Modern data center applications experience frequent branch mispredictions - degrading performance, increasing cost, and reducing energy efficiency in data centers. Even the state-of-the-art branch predictor, TAGE-SC-L, suffers from an average branch Mispredictions Per Kilo Instructions (branch-MPKI) of 3.0 (0.5–7.2) for these applications since their large code footprints exhaust TAGE-SC-L’s intended capacity. In this work, we propose Whisper, a novel profile-guided mechanism to avoid branch mispredictions. Whisper investigates the in-production profile of data center applications to identify precise program contexts that lead to branch mispredictions. Corresponding prediction hints are then inserted into code to strategically avoid those mispredictions during program execution. Whisper presents three novel profile-guided techniques: (1) hashed history correlation which efficiently encodes hard-to-predict correlations in branch history using lightweight Boolean formulas, (2) randomized formula testing which selects a locally-optimal Boolean formula from a randomly selected subset of possible formulas to predict a branch, and (3) the extension of Read-Once Monotone Boolean Formulas with Implication and Converse Non-Implication to improve the branch history coverage of these formulas with minimal overhead. We evaluate Whisper on 12 widely-used data center applications and demonstrate that Whisper enables traditional branch predictors to achieve a speedup close to that of an ideal branch predictor. Specifically, Whisper achieves an average speedup of 2.8% (0.4%-4.6%) by reducing 16.8% (1.7%-32.4%) of branch mispredictions over TAGE-SC-L and outperforms the state-of-the-art profile-guided branch prediction mechanisms by 7.9% on average.
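
The randomized-formula-testing step can be sketched as follows; the formula family and features here are simplified stand-ins for Whisper's read-once monotone Boolean formulas:

```python
import random

# Sample candidate Boolean formulas over selected branch-history bits and keep
# the one that best predicts the profiled outcomes.
random.seed(0)
HIST_BITS = 8

def random_formula():
    """A random 2-term AND/OR over single history bits (possibly negated)."""
    (i, ni), (j, nj) = [(random.randrange(HIST_BITS), random.random() < 0.5) for _ in range(2)]
    op = random.choice(["and", "or"])
    def f(hist):
        a = hist[i] ^ ni
        b = hist[j] ^ nj
        return (a & b) if op == "and" else (a | b)
    f.desc = f"{'!' if ni else ''}h{i} {op} {'!' if nj else ''}h{j}"
    return f

def pick_formula(profile, trials=200):
    """profile: list of (history_bits, taken) pairs collected in production."""
    def accuracy(f):
        return sum(f(h) == t for h, t in profile) / len(profile)
    return max((random_formula() for _ in range(trials)), key=accuracy)

# Synthetic profile where the branch is taken iff history bits 1 and 6 are both set.
profile = [(h, h[1] & h[6]) for h in
           [[random.randrange(2) for _ in range(HIST_BITS)] for _ in range(500)]]
best = pick_formula(profile)
print(best.desc, sum(best(h) == t for h, t in profile) / len(profile))
```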

DOI: 10.1109/MICRO56248.2022.00017


OverGen: Improving FPGA Usability through Domain-Specific Overlay Generation

作者: Liu, Sihao and Weng, Jian and Kupsh, Dylan and Sohrabizadeh, Atefeh and Wang, Zhengrong and Guo, Licheng and Liu, Jiuyang and Zhulin, Maxim and Mani, Rishabh and Zhang, Lucheng and Cong, Jason and Nowatzki, Tony
关键词: design automation, CGRA, FPGA, domain-specific accelerators, reconfigurable architectures

Abstract

FPGAs have been proven to be powerful computational accelerators across many types of workloads. The mainstream programming approach is high level synthesis (HLS), which maps high-level languages (e.g. C + #pragmas) to hardware. Unfortunately, HLS leaves a significant programmability gap in terms of reconfigurability, customization and versatility: Although HLS compilation is fast, the downstream physical design takes hours to days; FPGA reconfiguration time limits the time-multiplexing ability of hardware, and tools do not reason about cross-workload flexibility. Overlay architectures mitigate the above by mapping a programmable design (e.g. CPU, GPU, etc.) on top of FPGAs. However, the abstraction gap between overlay and FPGA leads to low efficiency/utilization. Our essential idea is to develop a hardware generation framework targeting a highly-customizable overlay, so that the abstraction gap can be lowered by tuning the design instance to applications of interest. We leverage and extend prior work on customizable spatial architectures, SoC generation, accelerator compilers, and design space explorers to create an end-to-end FPGA acceleration system. Our novel techniques address inefficient networks between on-chip memories and processing elements, as well as improve DSE by reducing the amount of recompilation required. Our framework, OverGen, is highly competitive with fixed-function HLS-based designs, even though the generated designs are programmable with fast reconfiguration. We compared to a state-of-the-art DSE-based HLS framework, AutoDSE. Without kernel-tuning for AutoDSE, OverGen achieves a 1.2× performance advantage.

DOI: 10.1109/MICRO56248.2022.00018


Cambricon-P: A Bitflow Architecture for Arbitrary Precision Computing

作者: Hao, Yifan and Zhao, Yongwei and Liu, Chenxiao and Du, Zidong and Cheng, Shuyao and Li, Xiaqing and Hu, Xing and Guo, Qi and Xu, Zhiwei and Chen, Tianshi
关键词: bit-serial, bitflow architecture, arbitrary precision

Abstract

Arbitrary precision computing (APC), where the digits vary from tens to millions of bits, is fundamental for scientific applications, such as mathematics, physics, chemistry, and biology. APC on existing platforms (e.g., CPUs and GPUs) is achieved by decomposing the original data into small pieces to accommodate the low-bitwidth (e.g., 32-/64-bit) functional units. However, such fine-grained decomposition inevitably introduces large amounts of intermediates, bringing in intensive on-chip data traffic and long, complex dependency chains, thus causing low hardware utilization. To address this issue, we propose Cambricon-P, a bitflow architecture supporting monolithic large and flexible bitwidth operations for efficient APC processing, which avoids generating large amounts of intermediates from decomposition. Cambricon-P features a tightly-integrated computational architecture for processing different bitflows in parallel, where full bit-serial data paths are deployed. The bit-serial scheme still needs to eliminate the dependency chain of APC for exploiting parallelism within one monolithic large-bitwidth operation. For this purpose, Cambricon-P adopts a carry parallel computing mechanism, which enables recursively transforming the multiplication into smaller inner-products that can be performed in parallel between bit-indexed IPUs (Inner-Product Units). Furthermore, to improve the computing efficiency of APC, Cambricon-P employs a bit-indexed inner-product processing scheme, namely BIPS, to eliminate intra-IPU bit-level redundancy. Compared to an Intel Xeon 6134 CPU, Cambricon-P achieves a 100.98× speedup.
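
The inner-product view of wide multiplication is easy to demonstrate in software; the hardware operates on bit-serial flows rather than Python integers, so this is only a conceptual sketch:

```python
# Every output column is an independent inner product of the operands' limbs,
# so columns can be computed in parallel and carries resolved afterwards.
BASE = 1 << 16                                     # 16-bit limbs for the sketch

def to_limbs(x, n):
    return [(x >> (16 * i)) & (BASE - 1) for i in range(n)]

def from_limbs(limbs):
    return sum(d << (16 * i) for i, d in enumerate(limbs))

def mul_by_column_inner_products(a, b):
    la, lb = to_limbs(a, 8), to_limbs(b, 8)
    cols = [sum(la[i] * lb[k - i] for i in range(max(0, k - 7), min(k, 7) + 1))
            for k in range(15)]                    # each column: an independent inner product
    out, carry = [], 0
    for c in cols:                                 # single carry-normalization pass
        carry += c
        out.append(carry & (BASE - 1))
        carry >>= 16
    out.append(carry)
    return from_limbs(out)

a = 0x1234_5678_9ABC_DEF0_1122_3344_5566_7788
b = 0x0FED_CBA9_8765_4321_8899_AABB_CCDD_EEFF
print(mul_by_column_inner_products(a, b) == a * b)   # True
```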

DOI: 10.1109/MICRO56248.2022.00016


Revisiting Residue Codes for Modern Memories

作者: Manzhosov, Evgeny and Hastings, Adam and Pancholi, Meghna and Piersma, Ryan and Ziad, Mohamed Tarek Ibn and Sethumadhavan, Simha
关键词: rowhammer, metadata, memory tagging, error correcting codes

Abstract

Residue codes have been traditionally used for compute error correction rather than storage error correction. In this paper, we use these codes for storage error correction with surprising results. We find that adapting residue codes to modern memory systems offers a level of error correction comparable to traditional schemes such as Reed-Solomon with fewer bits of storage. For instance, our adaptation of residue code - MUSE ECC - can offer ChipKill protection using approximately 30% fewer bits. We show that the storage gains can be used to hold metadata needed for emerging security functionality such as memory tagging or to provide better detection capabilities against Rowhammer attacks. Our evaluation shows that memory tagging in a MUSE-enabled system shows a 12% reduction in memory bandwidth utilization while providing the same level of error correction as a traditional ECC baseline without a noticeable loss of performance. Thus, our work demonstrates a new, flexible primitive for co-designing reliability with security and performance.
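
For intuition, a toy residue code can already correct a single bit flip in a 16-bit word, assuming the residue check itself is intact; MUSE ECC's actual construction and ChipKill-level guarantees are more involved:

```python
# Protect a 16-bit word with its residue mod 37: every single-bit flip produces
# a distinct nonzero syndrome, which pins down the flipped bit.
M, BITS = 37, 16
SYNDROMES = {}
for i in range(BITS):
    SYNDROMES[pow(2, i, M)] = (i, +1)          # a 0 -> 1 flip adds 2^i
    SYNDROMES[(-pow(2, i, M)) % M] = (i, -1)   # a 1 -> 0 flip subtracts 2^i

def encode(data):
    return data, data % M                       # (stored word, stored residue check)

def decode(word, residue):
    syndrome = (word - residue) % M
    if syndrome == 0:
        return word                             # clean read
    bit, _ = SYNDROMES[syndrome]                # assumes the residue itself is intact
    return word ^ (1 << bit)                    # flip the offending bit back

word, check = encode(0xBEEF)
corrupted = word ^ (1 << 9)                     # inject a single-bit upset
print(hex(decode(corrupted, check)))            # 0xbeef
```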

DOI: 10.1109/MICRO56248.2022.00020


PageORAM: An Efficient DRAM Page Aware ORAM Strategy

作者: Rajat, Rachit and Wang, Yongqin and Annavaram, Murali
关键词: memory access pattern, security, ORAM, memory

Abstract

Leaking memory access addresses can significantly empower an adversary in several computing usage scenarios - from key extraction to disclosing private information. Oblivious RAM has been proposed as a solution to this problem. Oblivious RAM involves reading multiple blocks instead of a single block for each memory access and changing the address of the data block being read after each access. State-of-the-art ORAMs (PathORAM and RingORAM) consider a tree-based structure for storing the data. However, ORAM designs pay a performance penalty. One reason is the strict requirement to evict stash blocks on a particular path that was previously read from. In tree-based ORAMs, each memory block is assigned a random path number, and when accessing a single block, one must fetch all the blocks in that path. Once the path is fetched into a client, the computation on the block is performed and that block is assigned a new random path number. All the blocks that were fetched into the client should be evicted to the memory. However, the eviction process must place the unmodified blocks on the same path that the prior read has fetched data from. This eviction requirement may cause a block not to be placed back in the ORAM tree due to limited space on a given tree node. As a result, the client must temporarily hold the block in its stash, which is a secure storage. Every fetch request for a block must search the stash before issuing a request to the ORAM. As the stash size grows, the stash search process becomes a substantial latency hurdle. On the other hand, if the stash is small, then the client has to issue dummy reads, which are useless reads in the tree for the sole purpose of creating more opportunities to place the stash data back in the tree. An alternate approach used in prior works is to embed dummy data blocks to create large bucket sizes at each tree level to enable better stash eviction probability. Neither of the above two solutions is palatable in practice. Dummy reads increase memory access latency, while dummy blocks increase the fetch bandwidth to bring large buckets from each level into the stash. Furthermore, dummy blocks also decrease the effective memory size available. To solve this problem, we propose PageORAM, a novel block eviction and placement strategy. PageORAM makes the critical observation that DRAM is accessed at the granularity of a page (also referred to as row buffer), which is at least an order of magnitude larger than the tree node size. Thus, a page may hold data blocks from multiple sub-trees. Hence, when fetching a path, PageORAM fetches a few additional sub-paths from the tree that are already present in an open DRAM page. These additional fetches vastly increase stash eviction options by opening up exponentially more data block placement choices. Thus, PageORAM enables a dramatic reduction in stash size without increasing page access counts in DRAM. While this observation may be counter-intuitive, we note that PageORAM reduces the overall bandwidth even after accounting for the increased fetches along the sub-paths. The reason is that by vastly improving stash block placement possibilities, PageORAM can significantly reduce the bucket size of the tree. Our implementation of PageORAM demonstrates an order of magnitude slower stash growth, increased bucket occupancy with useful data, and correspondingly improved memory access latency and reduced memory bandwidth.
In our experiments, we find that PageORAM can either reduce the memory space requirement of tree-based ORAMs by up to 40% compared to a baseline tree-based ORAM, or give a performance improvement of up to 7.8× for a tree-based ORAM with the same structure.
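
A small helper shows the tree-path layout this builds on (not PageORAM's eviction policy): which bucket indices a single path fetch touches in an implicit-array binary tree:

```python
# Nodes of a complete binary tree stored in an array; root is node 0.
def path_node_indices(leaf, levels):
    """Leaf `leaf` (0-based) lives at level `levels - 1`."""
    node = (1 << (levels - 1)) - 1 + leaf       # array index of the leaf node
    path = []
    while True:
        path.append(node)
        if node == 0:
            return path[::-1]                   # root -> leaf order
        node = (node - 1) // 2                  # parent in the implicit array layout

# A 4-level tree has 8 leaves; fetching leaf 5 touches these 4 buckets:
print(path_node_indices(5, levels=4))           # [0, 2, 5, 12]
```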

DOI: 10.1109/MICRO56248.2022.00021


AQUA: Scalable Rowhammer Mitigation by Quarantining Aggressor Rows at Runtime

作者: Saxena, Anish and Saileshwar, Gururaj and Nair, Prashant J. and Qureshi, Moinuddin
关键词: isolation, rowhammer, security, DRAM

Abstract

Rowhammer allows an attacker to induce bit flips in a row by rapidly accessing neighboring rows. Rowhammer is a severe security threat as it can be used to escalate privilege or break confidentiality. Moreover, the threshold of activations needed to induce Rowhammer continues to decrease, and new attacks like Half-Double break existing solutions that refresh victim rows. The recently proposed Randomized Row-Swap (RRS) scheme is resilient to Half-Double as it provides mitigation by swapping an aggressor row with a random row. However, to ensure security, the threshold for triggering a row-swap must be set much lower than the Rowhammer threshold, leading to a significant performance loss of 20% on average, at a Rowhammer threshold of 1K. Furthermore, the SRAM overhead for storing the indirection table of RRS becomes prohibitively large - 2.4MB per rank at a Rowhammer threshold of 1K. Our goal is to develop a scalable Rowhammer mitigation that incurs negligible performance and storage overheads. To this end, we propose AQUA, a Rowhammer mitigation that breaks the spatial correlation between aggressor and victim rows by dynamically quarantining the aggressor row in a dedicated region of memory. AQUA allows for an effective row migration threshold much higher than in RRS, leading to an order of magnitude less slowdown and SRAM overhead. As the security of AQUA is not reliant on keeping the destination row a secret, we further reduce the SRAM overheads of the indirection table by storing it in DRAM, and accessing it on-demand. We derive the size of the quarantine region required to ensure security for AQUA and show that reserving about 1% of DRAM is sufficient to mitigate Rowhammer at a threshold of 1K. Our evaluations show that AQUA incurs an average slowdown of 2% and an SRAM overhead (for mapping and migration) of only 41KB per rank at a Rowhammer threshold of 1K.
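
A behavioral sketch of the quarantine idea; the threshold, tracking structure, and indirection format are placeholders rather than AQUA's hardware design:

```python
# Once a row's activation count crosses the migration threshold, its contents
# move to a reserved quarantine region and an indirection entry redirects
# future accesses, breaking adjacency with the victim rows.
MIGRATION_THRESHOLD = 512      # set below the Rowhammer threshold with some margin

class QuarantineController:
    def __init__(self, quarantine_rows):
        self.free = list(quarantine_rows)      # reserved ~1% of DRAM rows
        self.remap = {}                        # aggressor row -> quarantine row
        self.activations = {}

    def access(self, row):
        row = self.remap.get(row, row)         # follow the indirection table
        self.activations[row] = self.activations.get(row, 0) + 1
        if self.activations[row] >= MIGRATION_THRESHOLD and self.free:
            dst = self.free.pop()
            self.remap[row] = dst              # data would be copied here in DRAM
            self.activations[dst] = 0
            self.activations.pop(row, None)
        return row                             # physical row actually activated

ctrl = QuarantineController(quarantine_rows=range(1_000_000, 1_000_100))
for _ in range(600):                           # a hammering access pattern on row 42
    physical = ctrl.access(42)
print(physical)                                # row 42 now lives in the quarantine region
```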

DOI: 10.1109/MICRO56248.2022.00022


Cronus: Fault-Isolated, Secure and High-Performance Heterogeneous Computing for Trusted Execution Environment

作者: Jiang, Jianyu and Qi, Ji and Shen, Tianxiang and Chen, Xusheng and Zhao, Shixiong and Wang, Sen and Chen, Li and Zhang, Gong and Luo, Xiapu and Cui, Heming
关键词: security isolation, fault isolation, GPU, accelerator, trusted execution environment, ARM TrustZone

Abstract

With the trend of processing a large volume of sensitive data on PaaS services (e.g., DNN training), a TEE architecture that supports general heterogeneous accelerators, enables spatial sharing on one accelerator, and enforces strong isolation across accelerators is highly desirable. However, none of the existing TEE solutions meet all three requirements. In this paper, we propose Cronus, the first TEE architecture that achieves the three crucial requirements. The key idea of Cronus is to partition heterogeneous computation into isolated TEE enclaves, where each enclave encapsulates only one kind of computation (e.g., GPU computation), and multiple enclaves can spatially share an accelerator. Then, Cronus constructs heterogeneous computing using remote procedure calls (RPCs) among enclaves. With Cronus, each accelerator’s hardware and its software stack are strongly isolated from others’, and each enclave trusts only its own hardware. To tackle the security challenge caused by inter-enclave interactions, we design a new streaming remote procedure call abstraction to enable secure RPCs with high performance. Cronus is software-based, making it general to diverse accelerators. We implemented Cronus on ARM TrustZone. Evaluation on diverse workloads with CPUs, GPUs and NPUs shows that Cronus achieves less than 7.1% extra computation time compared to native (unprotected) executions.

DOI: 10.1109/MICRO56248.2022.00019


Reconstructing Out-of-Order Issue Queue

作者: Jeong, Ipoom and Lee, Jiwon and Yoon, Myung Kuk and Ro, Won Woo
关键词: steering, data dependence, dynamic scheduling

Abstract

Out-of-order cores provide high performance at the cost of energy efficiency. Dynamic scheduling is one of the major contributors to this: generating highly optimized issue schedules considering both data dependences and underlying execution resources, but relying heavily on complex wakeup and select operations of an out-of-order issue queue (IQ). For decades, researchers have proposed several complexity-effective dynamic scheduling schemes by leveraging the energy efficiency of an in-order IQ. However, they are either costly or not capable of delivering sufficient performance to substitute for a conventional wide-issue out-of-order IQ. In this work, we revisit two previous designs: one classical dependence-based design and the other state-of-the-art readiness-based design. We observe that they are complementary to each other, and thus their synergistic integration has the potential to be a good alternative to an out-of-order IQ. We first combine these two designs, and further analyze the main architectural bottlenecks that incur the underutilization of aggregate issue capability, thereby limiting the exploitation of instruction-level and memory-level parallelisms: 1) memory dependences not exposed by the register-based dependence analysis and 2) wide and shallow nature of dynamic dependence chains due to the long-latency memory accesses. To this end, we propose Ballerino, a novel microarchitecture that performs balanced and cache-miss-tolerable dynamic scheduling via a complementary combination of cascaded and clustered in-order IQs. Ballerino is built upon three key functionalities: 1) speculatively filtering out ready-at-dispatch instructions, 2) eliminating wasteful wakeup operations via a simple steering technique leveraging the awareness of memory dependences, and 3) reacting to program phase changes by allowing different load-dependent chains to share a single IQ while guaranteeing their out-of-order issue. The net effect is minimal scheduling energy consumption per instruction while providing comparable scheduling performance to a fully out-of-order IQ. In our analysis, Ballerino achieves comparable performance to an 8-wide out-of-order core by using twelve in-order IQs, improving core-wide energy efficiency by 20%.

DOI: 10.1109/MICRO56248.2022.00023


Speculative Code Compaction: Eliminating Dead Code via Speculative Microcode Transformations

作者: Moody, Logan and Qi, Wei and Sharifi, Abdolrasoul and Berry, Layne and Rudek, Joey and Gaur, Jayesh and Parkhurst, Jeff and Subramoney, Sreenivas and Skadron, Kevin and Venkat, Ashish
关键词: optimization, speculation, microarchitecture

Abstract

The computing landscape has been increasingly characterized by processor architectures with increasing core counts, while a majority of the software applications remain inherently sequential. Although state-of-the-art compilers feature sophisticated optimizations, a significant chunk of wasteful computation persists due to the presence of data-dependent operations and irregular control-flow patterns that are unpredictable at compile-time. This work presents speculative code compaction (SCC), a novel microarchitectural technique that significantly enhances the capabilities of the microcode engine to aggressively and speculatively eliminate dead code from hot code regions resident in the micro-op cache, and further generate a compact stream of micro-ops, based on dynamically predicted machine code invariants. SCC also extends existing micro-op cache designs to co-host multiple versions of unoptimized and speculatively optimized micro-op sequences, providing the fetch engine with significant flexibility to dynamically choose from and stream the appropriate set of micro-ops, as and when deemed profitable. SCC is a minimally-invasive technique that can be implemented at the processor front-end using a simple ALU and a register context table, and is yet able to substantially accelerate the performance of already compile-time optimized and machine-tuned code by an average of 6% (and as much as 30%), with an average of 12% (and as much as 24%) savings in energy consumption, while eliminating the need for profiling and offering increased adaptability to changing datasets and workload patterns.

DOI: 10.1109/MICRO56248.2022.00024


big.VLITTLE: On-Demand Data-Parallel Acceleration for Mobile Systems on Chip

作者: Ta, Tuan and Al-Hawaj, Khalid and Cebry, Nick and Ou, Yanghui and Hall, Eric and Golden, Courtney and Batten, Christopher
关键词: No keywords

Abstract

Single-ISA heterogeneous multi-core architectures offer a compelling high-performance and high-efficiency solution to executing task-parallel workloads in mobile systems on chip (SoCs). In addition to task-parallel workloads, many data-parallel applications, such as machine learning, computer vision, and data analytics, increasingly run on mobile SoCs to provide real-time user interactions. Next-generation scalable vector architectures, such as the RISC-V Vector Extension and Arm SVE, have recently emerged as unified vector abstractions for both large- and small-scale systems. In this paper, we propose novel area-efficient high-performance architectures called big.VLITTLE that support next-generation vector architectures to efficiently accelerate data-parallel workloads in conventional big.LITTLE systems. big.VLITTLE architectures reconfigure multiple little cores on demand to work as a decoupled vector engine when executing data-parallel workloads. Our results show that a big.VLITTLE system can achieve a 1.6× performance improvement.

DOI: 10.1109/MICRO56248.2022.00025


Exploring Instruction Fusion Opportunities in General Purpose Processors

作者: Singh, Sawan and Perais, Arthur and Jimborean, Alexandra and Ros, Alberto
关键词: instruction fusion, microarchitecture, general purpose

Abstract

The Complex Instruction Set Computer (CISC) paradigm has led to the introduction of instruction cracking, in which an architectural instruction is divided into multiple microarchitectural instructions (μ-ops). However, the dual concept, instruction fusion, is also prevalent in modern microarchitectures to maximize resource utilization. In essence, some architectural instructions are too complex to be executed as a unit, so they should be cracked, while others are too simple to waste resources on executing them as a unit, so they should be fused with others. In this paper, we focus on instruction fusion and explore opportunities for fusing additional instructions in a high-performance general purpose pipeline. We show that enabling fusion for common RISC-V idioms improves performance by 7%. Then, we determine experimentally that enabling fusion only for memory instructions achieves 86% of the potential of fusion in this particular case. Finally, we propose the Helios microarchitecture, able to fuse non-consecutive and non-contiguous memory instructions, and discuss microarchitectural changes required to do so efficiently while preserving correctness. Helios fuses an additional 5.5% of dynamic instructions, yielding a 14.2% performance uplift over no fusion (8.2% over baseline fusion).
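
A toy pass conveys the kind of pairing Helios targets; the fields, window, and fusion rule below are assumptions for illustration, not the paper's actual criteria:

```python
# Scan a window of decoded memory micro-ops and pair loads that share a base
# register and touch adjacent addresses, even when other instructions sit
# between them.
def find_fusable_loads(window, reg_width_bytes=8):
    """window: list of dicts like {"op": "ld", "base": "x5", "offset": 16}."""
    pairs, used = [], set()
    for i, a in enumerate(window):
        if i in used or a["op"] != "ld":
            continue
        for j in range(i + 1, len(window)):
            b = window[j]
            if (j not in used and b["op"] == "ld" and b["base"] == a["base"]
                    and abs(b["offset"] - a["offset"]) == reg_width_bytes):
                pairs.append((i, j))           # candidate for a single paired load
                used.update((i, j))
                break
    return pairs

window = [
    {"op": "ld",  "base": "x5", "offset": 0},
    {"op": "add", "base": None, "offset": 0},
    {"op": "ld",  "base": "x5", "offset": 8},    # non-consecutive but contiguous
    {"op": "ld",  "base": "x7", "offset": 24},
]
print(find_fusable_loads(window))               # [(0, 2)]
```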

DOI: 10.1109/MICRO56248.2022.00026


DTexL: Decoupled Raster Pipeline for Texture Locality

作者: Joseph, Diya and Aragón, et al.
关键词: low-power, texture locality, scheduling, graphics, caches, GPU

Abstract

Contemporary GPU architectures have multiple shader cores and a scheduler that distributes work (threads) among them, focusing on load balancing. These load balancing techniques favor thread distributions that are detrimental to texture memory locality for graphics applications in the L1 texture caches. Texture memory accesses make up the majority of the traffic to the memory hierarchy in typical low-power graphics architectures. This paper improves L1 texture cache locality through a new workload scheduler, exploring various methods to group threads, assign the groups to shader cores, and reorder threads without violating the correctness of the pipeline. To overcome the resulting load imbalance, we also propose a minor modification in the GPU architecture that helps translate the improvement in cache locality to an improvement in the GPU’s performance. We propose DTexL, which envelops these ideas, and evaluate it over a benchmark suite of ten commercial games to obtain a 46.8% decrease in L2 accesses, a 19.3% increase in performance, and a 6.3% decrease in total GPU energy. All this with a negligible overhead.

DOI: 10.1109/MICRO56248.2022.00028


Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources

作者: Darabi, Sina and Sadrosadati, Mohammad and Akbarzadeh, Negar and Lindegger, Joël, et al.
关键词: No keywords

Abstract

Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel applications. In many GPU applications, GPU memory bandwidth bottlenecks performance, causing underutilization of GPU cores. Hence, disabling many cores does not affect the performance of memory-bound workloads. While simply power-gating unused GPU cores would save energy, prior works attempt to better utilize GPU cores for other applications (ideally compute-bound), which increases the GPU’s total throughput. In this paper, we introduce Morpheus, a new hardware/software co-designed technique to boost the performance of memory-bound applications. The key idea of Morpheus is to exploit unused core resources to extend the GPU last level cache (LLC) capacity. In Morpheus, each GPU core has two execution modes: compute mode and cache mode. Cores in compute mode operate conventionally and run application threads. However, for the cores in cache mode, Morpheus invokes a software helper kernel that uses the cores’ on-chip memories (i.e., register file, shared memory, and L1) in a way that extends the LLC capacity for a running memory-bound workload. Morpheus adds a controller to the GPU hardware to forward LLC requests to either the conventional LLC (managed by hardware) or the extended LLC (managed by the helper kernel). Our experimental results show that Morpheus improves the performance and energy efficiency of a baseline GPU architecture by an average of 39% and 58%, respectively, across several memory-bound workloads. Morpheus’ performance is within 3% of a GPU design that has a quadruple-sized conventional LLC. Morpheus can thus contribute to reducing the hardware dedicated to a conventional LLC by exploiting idle cores’ on-chip memory resources as additional cache capacity.

DOI: 10.1109/MICRO56248.2022.00029


Featherweight Soft Error Resilience for GPUs

作者: Zhang, Yida and Jung, Changhee
关键词: No keywords

Abstract

This paper presents Flame, a hardware/software co-designed resilience scheme for protecting GPUs against soft errors. For low-cost yet high-performance resilience, Flame uses acoustic sensors and idempotent processing for error detection and recovery, respectively. That is, Flame seeks to correct any sensor-detected errors by re-executing the idempotent region where they occurred. To achieve this, it is essential for each idempotent region to ensure the absence of errors before moving on to the next region. This so-called soft error verification must wait for the sensors' worst-case detection latency (WCDL) to confirm that each finished region is error-free. Rather than waiting for WCDL at each region end, which incurs too much performance overhead, Flame proposes WCDL-aware warp scheduling that can hide the error verification delay (i.e., WCDL) with the GPU's inherent massive warp-level parallelism. When a warp hits each idempotent region boundary, Flame deschedules the warp and switches to one of the other ready warps, as if the region boundary were a regular long-latency operation triggering the warp switching. By leveraging the GPU's inherent latency-hiding ability, Flame can completely eliminate the verification delay without significant hardware modification. The experimental results demonstrate that the performance overhead of Flame is near zero, i.e., 0.6% on average for 34 GPU benchmark applications.

DOI: 10.1109/MICRO56248.2022.00030


Vulkan-Sim: A GPU Architecture Simulator for Ray Tracing

作者: Saed, Mohammadreza and Chou, Yuan Hsi and Liu, Lufei and Nowicki, Tyler and Aamodt, Tor M.
关键词: modeling and simulation, computer graphics, ray tracing, GPU

Abstract

Ray tracing can generate photorealistic images with more convincing visual effects compared to rasterization. Recent hardware advances have enabled ray tracing to be applied in real-time. Current GPUs feature a dedicated ray tracing acceleration unit, and game developers have started to make use of ray tracing APIs to bring more realistic graphics to their players. Industry cooperatively contributed to Vulkan, which recently introduced an open-standard API for ray tracing. However, little has been disclosed about the mapping of this API to hardware. In this paper, we introduce Vulkan-Sim, a detailed cycle-level simulator for enabling architecture research for ray tracing. We extend GPGPU-Sim, integrating it with Mesa, an open-source graphics library to support the Vulkan API, and add dedicated ray traversal and intersection units. We also demonstrate an explicit mapping of the Vulkan ray tracing pipeline to a modern GPU using a technique we call delayed intersection and any-hit execution. Additionally we evaluate several ray tracing workloads with Vulkan-Sim, identifying bottlenecks and inefficiencies of the ray tracing hardware we model. To demonstrate the utility of Vulkan-Sim we conduct two case studies evaluating techniques recently proposed or deployed by industry targeting enhanced ray tracing performance.

DOI: 10.1109/MICRO56248.2022.00027


Pushing Point Cloud Compression to the Edge

作者: Ying, Ziyu and Zhao, Shulin and Bhuyan, Sandeepa and Mishra, Cyan Subhra and Kandemir, Mahmut T. and Das, Chita R.
关键词: energy-efficiency, video processing, edge computing, point cloud compression

Abstract

As Point Clouds (PCs) gain popularity in processing millions of data points for 3D rendering in many applications, efficient data compression becomes a critical issue. This is because compression is the primary bottleneck in minimizing the latency and energy consumption of existing PC pipelines. Data compression becomes even more critical as PC processing is pushed to edge devices with limited compute and power budgets. In this paper, we propose and evaluate two complementary schemes, intra-frame compression and inter-frame compression, to speed up PC compression without losing much quality or compression efficiency. Unlike existing techniques that use sequential algorithms, our first design, intra-frame compression, exploits parallelism for boosting the performance of both geometry and attribute compression. The proposed parallelism brings an approximately 43.7× speedup.

DOI: 10.1109/MICRO56248.2022.00031


Automatic Domain-Specific SoC Design for Autonomous Unmanned Aerial Vehicles

作者: Krishnan, Srivatsan and Wan, Zishen and Bhardwaj, Kshitij and Whatmough, Paul and Faust, Aleksandra and Neuman, Sabrina and Wei, Gu-Yeon and Brooks, David and Reddi, Vijay Janapa
关键词: mobile systems, IoT and edge computing, autonomous machines, ML for systems, domain-specific architectures, robotics

Abstract

Building domain-specific accelerators is becoming increasingly paramount to meet the high-performance requirements under stringent power and real-time constraints. However, emerging application domains like autonomous vehicles are complex systems with constraints extending beyond the computing stack. Manually selecting and navigating the design space to design custom and efficient domain-specific SoCs (DSSoC) is tedious and expensive. Hence, there is a need for automated DSSoC design methodologies. In this paper, we use agile and autonomous UAVs as a case study to understand how to automate domain-specific SoCs design for autonomous vehicles. Architecting a UAV DSSoC requires consideration of parameters such as sensor rate, compute throughput, and other physical characteristics (e.g., payload weight, thrust-to-weight ratio) that affect overall performance. Iterating over several component choices results in a combinatorial explosion of the number of possible combinations: from tens of thousands to billions, depending on implementation details. To navigate the DSSoC design space efficiently, we introduce AutoPilot, a systematic methodology for automatically designing DSSoC for autonomous UAVs. AutoPilot uses machine learning to navigate the large DSSoC design space and automatically select a combination of autonomy algorithm and hardware accelerator while considering the cross-product effect across different UAV components. AutoPilot consistently outperforms general-purpose hardware selections like Xavier NX and Jetson TX2, as well as dedicated hardware accelerators built for autonomous UAVs. DSSoC designs generated by AutoPilot increase the number of missions on average by up to 2.25×.

DOI: 10.1109/MICRO56248.2022.00033


An Architectural Charge Management Interface for Energy-Harvesting Systems

作者: Ruppel, Emily and Surbatovich, Milijana and Desai, Harsh and Maeng, Kiwan and Lucia, Brandon
关键词: equivalent series resistance, energy-harvesting power system, intermittent computing

Abstract

Energy-harvesting devices eliminate batteries, instead collecting their operating energy from environmental sources. A device stores energy into a capacitor, drawing energy to perform tasks and powering off to recharge when the energy is exhausted. State-of-the-art charge management systems for these devices aim to avoid power failure during task execution by reasoning about task energy cost. We identify that the innate equivalent series resistance (ESR) in energy storage capacitors breaks energy-based systems’ guarantees; running a high-current load on a high-ESR capacitor causes a substantial voltage drop that rebounds once the load is removed. This voltage drop is disregarded by systems that only reason about energy. If the drop lowers the voltage below the system’s operating threshold, however, the device powers off while stored energy remains. Though ESR is well understood in hardware design, this is the first work to argue that software for batteryless devices must also be aware of ESR. This work presents Culpeo, a hardware/software mechanism and architectural interface to relay the effect of ESR in the power system to software. We develop static and dynamic implementations of Culpeo and demonstrate on real batteryless devices that considering ESR restores correctness guarantees broken by energy-only charge management. We then demonstrate how to integrate Culpeo’s safe voltage into state-of-the-art schedulers, restoring task deadline guarantees for applications with predictable energy harvesting. Finally, we propose an on-chip Culpeo hardware implementation that allows for runtime monitoring of the effects of ESR to respond to changes in harvestable power.
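
The ESR effect reduces to one line of arithmetic; the device values below are made up for illustration:

```python
# A capacitor can hold plenty of energy and still brown out, because the load
# current times the ESR drops the visible voltage below the operating threshold
# the moment the load switches on.
ESR = 3.0            # ohms, equivalent series resistance of the storage capacitor
V_CAP = 2.1          # volts actually stored on the capacitor
V_MIN = 1.8          # volts, regulator / MCU operating threshold
I_LOAD = 0.120       # amps drawn by a burst task (e.g., radio transmission)

v_seen = V_CAP - I_LOAD * ESR                 # voltage visible to the regulator under load
print(f"voltage under load: {v_seen:.2f} V")  # 1.74 V -> device powers off early
# A Culpeo-style interface would report the load-aware safe voltage instead:
v_safe = V_MIN + I_LOAD * ESR
print(f"charge until at least {v_safe:.2f} V before starting this task")   # 2.16 V
```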

DOI: 10.1109/MICRO56248.2022.00034


ROG: A High Performance and Robust Distributed Training System for Robotic IoT

作者: Guan, Xiuxian and Sun, Zekai and Deng, Shengliang and Chen, Xusheng and Zhao, Shixiong and Zhang, Zongyuan and Duan, Tianyang and Wang, Yuexuan and Wu, Chenshu and Cui, Yong and Zhang, Libo and Wu, Yanjun and Wang, Rui and Cui, Heming
关键词: energy efficient, robust, training throughput, wireless networks, distributed training

Abstract

Critical robotic tasks such as rescue and disaster response increasingly leverage ML (machine learning) models deployed on a team of wireless robots, where data-parallel (DP) training over the Internet of Things formed by these robots (robotic IoT) can harness the distributed hardware resources to adapt their models to changing environments as quickly as possible. Unfortunately, due to the need for DP synchronization across all robots, the instability of wireless networks (i.e., fluctuating bandwidth due to occlusion and varying communication distance) often leads to severe stalls, which hurt training accuracy under a tight time budget and waste energy while the robots stall. Existing methods that cope with the instability of datacenter networks are incapable of handling this straggler effect, because they schedule transmissions at model granularity, which is much coarser than the granularity of transient network instability in real-world robotic IoT networks, so a previously computed schedule no longer matches the bandwidth that varies during transmission. We present Rog, the first ROw-Granulated distributed training system optimized for ML training over unstable wireless networks. Rog confines the granularity of transmission and synchronization to each row of a layer's parameters and schedules the transmission of each row adaptively to the fluctuating bandwidth. In this way, the ML training process can apply the partial but most important gradients of a stale robot to avoid triggering stalls, while provably guaranteeing convergence. The evaluation shows that, given the same training time, Rog achieved about a 4.9%–6.5% training accuracy gain compared with the baselines and saved 20.4%–50.7% of the energy needed to achieve the same training accuracy.

DOI: 10.1109/MICRO56248.2022.00032


ASSASIN: Architecture Support for Stream Computing to Accelerate Computational Storage

作者: Zou, Chen and Chien, Andrew A.
关键词: general-purpose, memory wall, memory hierarchy, stream computing, ssd, computational storage

Abstract

Computational storage adds computing to storage devices, providing potential benefits in offload, data reduction, and lower energy. Successful computational SSD architectures should match growing flash bandwidth, which in turn requires high SSD DRAM memory bandwidth. This creates a memory wall scaling problem, resulting from SSDs’ stringent power and cost constraints. A survey of recent computational SSD research shows that many computational storage offloads are suited to stream computing. To exploit this opportunity, we propose a novel general-purpose computational SSD and core architecture, called ASSASIN (Architecture Support for Stream computing to Accelerate computatIoNal Storage). ASSASIN provides a unified set of compute engines between SSD DRAM and the flash array. This eliminates the SSD DRAM bottleneck by enabling direct computing on flash data streams. ASSASIN further employs a crossbar to achieve performance even when the flash data layout is uneven and to preserve independence for page layout decisions in the flash translation layer. With stream buffers and scratchpad memories, the ASSASIN core’s memory hierarchy and instruction set extensions provide superior low-latency access at low power and effectively keep streaming flash data out of the in-SSD cache-DRAM memory hierarchy, thereby solving the memory wall. Evaluation shows that ASSASIN delivers 1.5x–2.4x speedup for offloaded functions compared to state-of-the-art computational SSD architectures. Further, ASSASIN’s streaming approach yields 2.0x higher power efficiency and 3.2x higher area efficiency. These performance benefits at the level of computational SSDs translate to 1.1x–1.5x end-to-end speedups on data analytics workloads.

DOI: 10.1109/MICRO56248.2022.00035


DaxVM: Stressing the Limits of Memory as a File Interface

作者: Alverti, Chloe and Karakostas, Vasileios and Kunati, Nikhita and Goumas, Georgios and Swift, Michael
关键词: file systems, persistent memory, virtual memory

Abstract

Persistent memory (PMem) is a low-latency storage technology connected to the processor memory bus. The Direct Access (DAX) interface promises fast access to PMem, mapping it directly into processes’ virtual address spaces. However, virtual memory operations (e.g., paging) limit its performance and scalability. Through an analysis of Linux/x86 memory mapping, we find that current systems fall short of what the hardware can provide due to numerous software inefficiencies stemming from OS assumptions that memory mapping is for DRAM. In this paper we propose DaxVM, a design that extends the OS virtual memory and file system layers, leveraging persistent memory attributes to provide a fast and scalable DAX-mmap interface. DaxVM eliminates paging costs through pre-populated file page tables, supports faster and scalable virtual address space management for ephemeral mappings, performs unmappings asynchronously, bypasses kernel-space dirty-page tracking support, and adopts asynchronous block pre-zeroing. We implement DaxVM in Linux and the ext4 file system targeting the x86-64 architecture. DaxVM mmap achieves 4.9×
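
For context, the snippet below shows the user-space side of the DAX-mmap pattern that DaxVM targets: mapping a file on a hypothetical persistent-memory mount and touching its pages. The kernel-side mechanisms the paper adds (pre-populated page tables, asynchronous unmap, pre-zeroing) are invisible at this level; the path is an assumption.

```python
# User-space view of the DAX-mmap interface that DaxVM accelerates. The path
# below assumes a file system mounted with -o dax on persistent memory; the
# kernel-side page-table pre-population is transparent to this code.
import mmap, os

PATH = "/mnt/pmem0/example.dat"   # hypothetical DAX-mounted ext4 file

def touch_pages(path, size=4 << 20):
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    os.ftruncate(fd, size)
    with mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE) as m:
        # Each first-touch store below would normally take a page fault; DaxVM's
        # pre-populated file page tables are designed to remove that cost.
        for off in range(0, size, 4096):
            m[off] = 0x5A
    os.close(fd)

if __name__ == "__main__":
    touch_pages(PATH)
```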

DOI: 10.1109/MICRO56248.2022.00037


Networked SSD: Flash Memory Interconnection Network for High-Bandwidth SSD

作者: Kim, Jiho and Kang, Seokwon and Park, Yongjun and Kim, John
关键词: garbage collection, interconnection networks, solid state drive

Abstract

As flash memory performance increases with more bandwidth, the flash memory channel or interconnect is becoming a bigger bottleneck in enabling a high-performance SSD system. However, the bandwidth of the flash memory interconnect is not increasing at the same rate as the flash memory. In addition, the current flash memory bus is based on dedicated signaling, where separate control signals are used for communication between the flash channel controller and the flash memory chip. In this work, we propose to exploit packetized communication to improve the effective flash memory interconnect bandwidth and propose a packetized SSD (pSSD) system architecture. We first show how packetized communication can be exploited and the microarchitectural changes required. We then propose the Omnibus topology for the flash memory interconnect to enable a packetized network SSD (pnSSD) among the flash memory - a 2D bus-based organization that maintains a “bus” organization for the interconnect while enabling direct communication between the flash memory chips. The pnSSD architecture enables a new type of garbage collection that we refer to as spatial garbage collection, which significantly reduces the interference between I/O requests and garbage collection. Our detailed evaluation of pnSSD shows 82% improvement in I/O latency with no garbage collection (GC) while improving I/O latency by 9.71×

DOI: 10.1109/MICRO56248.2022.00038


Designing Virtual Memory System of MCM GPUs

作者: B, Pratheek and Jawalkar, Neha and Basu, Arkaprava
关键词: chiplet, multi-chip module, graphics processing units, translation look-aside buffers, page table walkers, address translation, virtual memory

Abstract

Multi-Chip Module (MCM) designs have emerged as a key technique to scale up a GPU’s compute capabilities in the face of slowing transistor technology. However, the disaggregated nature of MCM GPUs with many chiplets connected via in-package interconnects leads to non-uniformity. We explore the implications of MCM’s non-uniformity on the GPU’s virtual memory. We quantitatively demonstrate that an MCM-aware virtual memory system should aim to (1) leverage aggregate TLB capacity across chiplets while limiting accesses to the L2 TLB on remote chiplets, and (2) reduce accesses to page table entries resident on a remote chiplet’s memory during page walks. We propose MCM-aware GPU virtual memory (MGvm) that leverages static analysis techniques, previously used for thread and data placement, to map virtual addresses to chiplets and to place the page tables. At runtime, MGvm balances its objective of limiting the number of remote L2 TLB lookups with that of reducing the number of remote page table accesses to achieve good speedups (52%, on average) across diverse application behaviors.

DOI: 10.1109/MICRO56248.2022.00036


Altocumulus: Scalable Scheduling for Nanosecond-Scale Remote Procedure Calls

作者: Zhao, Jiechen and Uwizeyimana, Iris and Ganesan, Karthik and Jeffrey, Mark C. and Jerger, Natalie Enright
关键词: queuing theory, migration, load balancing, networked systems, datacenters, scheduling, remote procedure calls

Abstract

Online services in modern datacenters use Remote Procedure Calls (RPCs) to communicate between different software layers. Although RPCs typically invoke just a few small functions, inefficient RPC handling can cause delays to propagate across the system and degrade end-to-end performance. Prior work has reduced RPC processing time to less than 1 μs, which now shifts the bottleneck to the scheduling of RPCs. Existing RPC schedulers suffer from high overheads, fail to effectively utilize high-core-count CPUs, or do not adapt to different traffic patterns. To address these shortcomings, we present Altocumulus, a scalable, software-hardware co-design to schedule RPCs at nanosecond scales. Altocumulus provides a proactive scheduling scheme and a low-overhead messaging mechanism on top of a decentralized user runtime. Altocumulus also offers direct access from user space to a set of simple hardware primitives to quickly migrate long-latency RPCs. We evaluate Altocumulus with synthetic workloads and an end-to-end in-memory key-value store application under real-world traffic patterns. Altocumulus improves throughput by 1.3–24.6×

DOI: 10.1109/MICRO56248.2022.00039


SIMR: Single Instruction Multiple Request Processing for Energy-Efficient Data Center Microservices

作者: Khairy, Mahmoud and Alawneh, Ahmad and Barnes, Aaron and Rogers, Timothy G.
关键词: GPU, microservices, data center, SIMT

Abstract

Contemporary data center servers process thousands of similar, independent requests per minute. In the interest of programmer productivity and ease of scaling, workloads in data centers have shifted from single monolithic processes toward a micro- and nanoservice software architecture. As a result, single servers are now packed with many threads executing the same, relatively small task on different data. State-of-the-art data centers run these microservices on multi-core CPUs. However, the flexibility offered by traditional CPUs comes at an energy-efficiency cost. The Multiple Instruction Multiple Data execution model misses opportunities to aggregate the similarity in contemporary microservices. We observe that the Single Instruction Multiple Thread execution model, employed by GPUs, provides better thread scaling and has the potential to reduce frontend and memory system energy consumption. However, contemporary GPUs are ill-suited for the latency-sensitive microservice space. To exploit the similarity in contemporary microservices while maintaining acceptable latency, we propose the Request Processing Unit (RPU). The RPU combines elements of out-of-order CPUs with lockstep thread aggregation mechanisms found in GPUs to execute microservices in a Single Instruction Multiple Request (SIMR) fashion. To complement the RPU, we also propose a SIMR-aware software stack that uses novel mechanisms to batch requests based on their predicted control flow, split batches based on predicted latency divergence, and map per-request memory allocations to maximize coalescing opportunities. Our resulting RPU system processes 5.7×

DOI: 10.1109/MICRO56248.2022.00040


Patching up Network Data Leaks with Sweeper

作者: Vemmou, Marina and Cho, Albert and Daglis, Alexandros
关键词: No keywords

Abstract

Datacenters have witnessed a staggering evolution in networking technologies, driven by insatiable application demands for larger datasets and inter-server data transfers. Modern NICs can already handle 100s of Gbps of traffic, a bandwidth capability equivalent to several memory channels. Direct Cache Access mechanisms like DDIO that contain network traffic inside the CPU’s caches are therefore essential to effectively handle growing network traffic rates. However, a growing body of work reveals instances of a critical DDIO weakness known as “leaky DMA”, occurring when a significant fraction of network traffic leaks from the CPU’s caches to memory. We find that such network data leaks cap the network bandwidth a server can effectively utilize. We identify that a major culprit for such network data leaks is the eviction of already-consumed dirty network buffers. Our key insight is that buffers already consumed by the application typically need not be written back to memory, as their next reuse will be a full overwrite with new network data by the NIC. We introduce Sweeper, a hardware extension and API that allows applications to mark such consumed network buffers. Hardware then skips writing marked buffers back to memory, drastically reducing memory bandwidth consumption and mitigating the performance penalty of network data leaks. Sweeper boosts a 24-core server’s peak sustainable network bandwidth by up to 2.6×
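
The buffer-marking idea can be sketched in software as follows: once the application has consumed a receive buffer, it marks the buffer so that the proposed hardware could skip writing it back to memory before the NIC overwrites it. The sweeper_mark() call and the ring structure below are hypothetical stand-ins for the paper's API, not a real kernel or NIC interface.

```python
# A software-level sketch of the buffer-marking idea: after consuming a receive
# buffer, the application marks it so that (hypothetical) hardware support could
# skip writing its dirty cache lines back to memory before the NIC reuses it.
from dataclasses import dataclass, field

@dataclass
class RxBuffer:
    data: bytearray
    consumed: bool = False            # would map to a per-buffer Sweeper mark

@dataclass
class RxRing:
    buffers: list = field(default_factory=list)

    def sweeper_mark(self, idx: int) -> None:
        """Tell the memory system this buffer needs no writeback (stand-in API)."""
        self.buffers[idx].consumed = True

def process_packet(ring: RxRing, idx: int) -> int:
    buf = ring.buffers[idx]
    checksum = sum(buf.data) & 0xFFFF   # "consume" the payload
    ring.sweeper_mark(idx)              # safe: the NIC fully overwrites it on reuse
    return checksum

if __name__ == "__main__":
    ring = RxRing([RxBuffer(bytearray(b"\x01" * 1500)) for _ in range(4)])
    print(process_packet(ring, 0), ring.buffers[0].consumed)
```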

DOI: 10.1109/MICRO56248.2022.00041


IDIO: Network-Driven, Inbound Network Data Orchestration on Server Processors

作者: Alian, Mohammad and Agarwal, Siddharth and Shin, Jongmin and Patel, Neel and Yuan, Yifan and Kim, Daehoon and Wang, Ren and Kim, Nam Sung
关键词: datacenter network, DDIO, non-inclusive cache

Abstract

High-bandwidth network interface cards (NICs), each capable of transferring 100s of Gigabits per second, are making inroads into the servers of next-generation datacenters. Such unprecedented data delivery rates impose immense pressure, especially on the server’s memory subsystem, as NICs first transfer network data to DRAM before processing. To alleviate the pressure, the cache hierarchy has evolved, supporting a direct data I/O (DDIO) technology to directly place network data in the last-level cache (LLC). Subsequently, various policies have been explored to manage such an LLC and have proven to effectively reduce service latency and memory bandwidth consumption of network applications. However, the more recent evolution of the cache hierarchy decreased the size of the LLC per core but significantly increased that of the mid-level cache (MLC), with a non-inclusive policy. This calls for a re-examination of the aforementioned DDIO technology and management policies. In this paper, first, we identify three shortcomings of the current static data placement policy, which places network data in the LLC first, and of the non-inclusive policy on a commercial server system: (1) ineffectively using the large MLC, (2) suffering from high rates of writebacks from MLC to LLC, and (3) breaking the isolation between application and network data enforced by limiting cache ways for DDIO. Second, to tackle the three shortcomings, we propose an intelligent direct I/O (IDIO) technology that extends DDIO to the MLC and provides three synergistic mechanisms: (1) self-invalidating I/O buffer, (2) network-driven MLC prefetching, and (3) selective direct DRAM access. Our detailed experiments using a full-system simulator — capable of running modern DPDK userspace network functions while sustaining 100Gbps+ network bandwidth — show that IDIO significantly reduces data movement (up to 84% MLC and LLC writeback reduction), provides LLC isolation (up to 22% performance improvement), and improves tail latency (up to 38% reduction in 99th-percentile latency) for receive-intensive network applications.

DOI: 10.1109/MICRO56248.2022.00042


Treebeard: An Optimizing Compiler for Decision Tree Based ML Inference

作者: Prasad, Ashwin and Rajendra, Sampath and Rajan, Kaushik and Govindarajan, R and Bondhugula, Uday
关键词: machine learning, vectorization, decision tree inference, decision tree ensemble, optimizing compiler

Abstract

Decision tree ensembles are among the most commonly used machine learning models. These models are used in a wide range of applications and are deployed at scale. Decision tree ensemble inference is usually performed with libraries such as XGBoost, LightGBM, and Sklearn. These libraries incorporate a fixed set of optimizations for the hardware targets they support. However, maintaining these optimizations is prohibitively expensive as hardware evolves. Further, they do not specialize the inference code to the model being used, leaving significant performance on the table. This paper presents Treebeard, an optimizing compiler that progressively lowers the inference computation to optimized CPU code through multiple intermediate abstractions. By applying model-specific optimizations at the higher levels, tree walk optimizations at the middle level, and machine-specific optimizations lower down, Treebeard can specialize inference code for each model on each supported CPU target. Treebeard combines several novel optimizations at various abstraction levels to mitigate architectural bottlenecks and enable SIMD vectorization of tree walks. We implement Treebeard using the MLIR compiler infrastructure and demonstrate its utility by evaluating it on a diverse set of benchmarks. Treebeard is significantly faster than the state-of-the-art systems XGBoost, Treelite, and Hummingbird, by 2.6×
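
A flavor of the vectorized tree walks mentioned above: storing a complete decision tree as flat arrays lets a whole batch of rows advance one tree level per step with data-parallel comparisons. The layout and NumPy formulation below are illustrative; Treebeard emits specialized CPU code rather than anything like this sketch.

```python
# A minimal sketch of a data-parallel tree walk for decision-tree inference:
# a complete tree stored as flat arrays, with all rows in the batch advanced
# one level per step. Illustrates the idea of vectorizing tree walks only.
import numpy as np

def predict(feature_idx, thresholds, leaf_values, X, depth):
    """feature_idx/thresholds are indexed by node id (root = 0); internal nodes
    occupy ids [0, 2**depth - 1), and leaves follow them."""
    node = np.zeros(X.shape[0], dtype=np.int64)
    for _ in range(depth):
        go_right = X[np.arange(X.shape[0]), feature_idx[node]] > thresholds[node]
        node = 2 * node + 1 + go_right            # vectorized child selection
    return leaf_values[node - (2 ** depth - 1)]   # map leaf node ids to leaf slots

if __name__ == "__main__":
    depth = 2                                      # 3 internal nodes, 4 leaves
    feat = np.array([0, 1, 1])
    thr = np.array([0.5, 0.3, 0.7])
    leaves = np.array([-1.0, -0.5, 0.5, 1.0])
    X = np.array([[0.2, 0.1], [0.9, 0.9]])
    print(predict(feat, thr, leaves, X, depth))    # [-1.  1.]
```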

DOI: 10.1109/MICRO56248.2022.00043


GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs

作者: Niu, Wei and Guan, Jiexiong and Shen, Xipeng and Wang, Yanzhi and Agrawal, Gagan and Ren, Bin
关键词: mobile devices, deep neural network, compiler optimization, VLIW instruction packing

Abstract

More specialized chips are exploiting available high transistor density to expose parallelism at a large scale with more intricate instruction sets. This paper reports on a compilation system, GCD2, developed to support complex Deep Neural Network (DNN) workloads on mobile DSP chips. We observe several challenges in fully exploiting this architecture, related to SIMD width, more complex SIMD/vector instructions, and a VLIW pipeline with the notion of soft dependencies. GCD2 comprises the following contributions: 1) development of matrix layout formats that support the use of different novel SIMD instructions, 2) formulation and solution of a global optimization problem related to choosing the best instruction (and associated layout) for the implementation of each operator in a complete DNN, and 3) SDA, an algorithm for packing instructions with consideration for soft dependencies. These solutions are incorporated in a complete compilation system that is extensively evaluated against other systems using 10 large DNN models. Evaluation results show that GCD2 outperforms two product-level state-of-the-art end-to-end DNN execution frameworks (TFLite and Qualcomm SNPE) that support mobile DSPs by up to 6.0×

DOI: 10.1109/MICRO56248.2022.00044


OCOLOS: Online COde Layout OptimizationS

作者: Zhang, Yuxuan and Khan, Tanvir Ahmed and Pokam, Gilles and Kasikci, Baris and Litz, Heiner and Devietti, Joseph
关键词: No keywords

Abstract

The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application’s code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities. In this paper, we propose Ocolos, the first online code layout optimization system for unmodified applications written in unmanaged languages. Ocolos allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be relaunched. Because it runs online, profile data is always relevant to the current execution and always maps perfectly to the running code. Ocolos demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that Ocolos can accelerate MySQL by up to 1.41×

DOI: 10.1109/MICRO56248.2022.00045


RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture

作者: Gobieski, Graham and Ghosh, Souradip and Heule, Marijn and Mowry, Todd and Nowatzki, Tony and Beckmann, Nathan and Lucia, Brandon
关键词: compiler, dataflow, CGRA, reconfigurable, general-purpose, programmable, ultra-low-power, energy-minimal

Abstract

Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors. To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally. Recent work has shown coarse-grained reconfigurable arrays (CGRAs) to be a promising architecture for this domain. Unfortunately, nearly all prior CGRAs support only computations with simple control flow and no memory aliasing (e.g., affine inner loops), causing an Amdahl efficiency bottleneck as non-trivial fractions of programs must run on an inefficient von Neumann core. RipTide is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency, eliminating this bottleneck. RipTide provides a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. RipTide implements these primitives without tagged tokens to save energy; this requires careful ordering analysis in the compiler to guarantee correctness. RipTide further saves energy and area by offloading most control operations into its programmable on-chip network, where they can re-use existing network switches. RipTide’s compiler is implemented in LLVM, and its hardware is synthesized in Intel 22FFL. RipTide compiles applications written in C while saving 25% energy vs. the state-of-the-art energy-minimal CGRA and 6.6×

DOI: 10.1109/MICRO56248.2022.00046


Skipper: Enabling Efficient SNN Training through Activation-Checkpointing and Time-Skipping

作者: Singh, Sonali and Sarma, Anup and Lu, Sen and Sengupta, Abhronil and Kandemir, Mahmut T. and Neftci, Emre and Narayanan, Vijaykrishnan and Das, Chita R.
关键词: compute and memory, BPTT, SNN

Abstract

Spiking neural networks (SNNs) are a highly efficient signal processing mechanism in biological systems that has inspired a plethora of research efforts aimed at translating their energy efficiency to computational platforms. Efficient training approaches are critical for the successful deployment of SNNs. Compared to mainstream deep neural networks (ANNs), training SNNs is far more challenging due to complex neural dynamics that evolve with time and their discrete, binary computing paradigm. Back-propagation-through-time (BPTT) with surrogate gradients has recently emerged as an effective technique to train deep SNNs directly. SNN-BPTT, however, has a major drawback in that it has a high memory requirement that increases with the number of timesteps. SNNs generally result from the discretization of Ordinary Differential Equations, due to which the sequence length must typically be longer than for RNNs, compounding the time-dependence problem. It therefore becomes hard to train deep SNNs on a single- or multi-GPU setup with sufficiently large batch sizes or time-steps, and extended periods of training are required to achieve reasonable network performance. In this work, we reduce the memory requirements of BPTT in SNNs to enable the training of deeper SNNs with more timesteps (T). For this, we leverage the notion of activation re-computation in the context of SNN training, which enables GPU memory to scale sub-linearly with increasing time-steps. We observe that naively deploying the re-computation-based approach leads to a considerable computational overhead. To solve this, we propose a time-skipped BPTT approximation technique for SNNs, called Skipper, that not only alleviates this computation overhead but also lowers memory consumption further with little to no loss of accuracy. We show the efficacy of our proposed technique by comparing it against a popular method for memory footprint reduction during training. Our evaluations on 5 state-of-the-art networks and 4 datasets show that, for a constant batch size and number of time-steps, Skipper reduces memory usage by 3.3×
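
The memory/recompute trade-off behind activation checkpointing can be sketched as follows: keep only every k-th timestep's state during the forward pass and recompute intermediate states from the nearest checkpoint when the backward pass needs them. The LIF dynamics, the value of k, and the bookkeeping below are assumptions made for illustration; Skipper additionally approximates BPTT by skipping timesteps, which this sketch does not show.

```python
# A sketch of checkpointed BPTT bookkeeping for a spiking layer: only every
# k-th timestep's state is kept during the forward pass, and intermediate
# states are recomputed from the nearest checkpoint when needed.
import numpy as np

def lif_step(v, x, decay=0.9, v_th=1.0):
    v = decay * v + x
    spike = (v >= v_th).astype(v.dtype)
    return v * (1.0 - spike), spike            # reset on spike

def forward_with_checkpoints(xs, v0, k=4):
    checkpoints = {0: v0}
    v = v0
    for t, x in enumerate(xs):
        v, _ = lif_step(v, x)
        if (t + 1) % k == 0:
            checkpoints[t + 1] = v             # O(T/k) memory instead of O(T)
    return checkpoints

def state_at(t, xs, checkpoints, k=4):
    """Recompute the state at timestep t from the nearest earlier checkpoint."""
    base = (t // k) * k
    v = checkpoints[base]
    for step in range(base, t):
        v, _ = lif_step(v, xs[step])
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    xs = rng.uniform(0, 0.5, size=(16, 8)).astype(np.float32)   # T=16, 8 neurons
    cps = forward_with_checkpoints(xs, np.zeros(8, dtype=np.float32))
    print(state_at(10, xs, cps)[:3])
```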

DOI: 10.1109/MICRO56248.2022.00047


Going Further with Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tiles

作者: Andri, Renzo and Bussolino, Beatrice and Cipolletta, Antonio and Cavigelli, Lukas and Wang, Zhe
关键词: ML system design, winograd convolution, machine learning acceleration

Abstract

Most of today’s computer vision pipelines are built around deep neural networks, where convolution operations require most of the generally high compute effort. The Winograd convolution algorithm computes convolutions with fewer multiply-accumulate operations (MACs) compared to the standard algorithm, reducing the operation count by a factor of 2.25×
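
The operation-count arithmetic behind these factors: a standard convolution spends m*m*r*r multiplies per m-by-m output tile with an r-by-r kernel, while Winograd F(m x m, r x r) spends (m + r - 1)^2, giving 2.25x for 2x2 tiles and 4x for the 4x4 tiles targeted above.

```python
# Multiplication counts behind the speedup figures quoted above.
def standard_muls(m, r):
    return m * m * r * r

def winograd_muls(m, r):
    return (m + r - 1) ** 2

for m in (2, 4):
    r = 3
    print(f"F({m}x{m},{r}x{r}): {standard_muls(m, r)} -> {winograd_muls(m, r)} "
          f"multiplies ({standard_muls(m, r) / winograd_muls(m, r):.2f}x fewer)")
# F(2x2,3x3): 36 -> 16 multiplies (2.25x fewer)
# F(4x4,3x3): 144 -> 36 multiplies (4.00x fewer)
```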

DOI: 10.1109/MICRO56248.2022.00048


Adaptable Butterfly Accelerator for Attention-Based NNs via Hardware and Algorithm Co-Design

作者: Fan, Hongxiang and Chau, Thomas and Venieris, Stylianos I. and Lee, Royson and Kouris, Alexandros and Luk, Wayne and Lane, Nicholas D. and Abdelfattah, Mohamed S.
关键词: algorithm and hardware co-design, butterfly sparsity, attention-based neural networks, adaptable butterfly accelerator

Abstract

Attention-based neural networks have become pervasive in many AI tasks. Despite their excellent algorithmic performance, the use of the attention mechanism and feedforward network (FFN) demands excessive computational and memory resources, which often compromises their hardware performance. Although various sparse variants have been introduced, most approaches only focus on mitigating the quadratic scaling of attention on the algorithm level, without explicitly considering the efficiency of mapping their methods onto real hardware designs. Furthermore, most efforts only focus on either the attention mechanism or the FFNs without jointly optimizing both parts, causing most of the current designs to lack scalability when dealing with different input lengths. This paper systematically considers the sparsity patterns in different variants from a hardware perspective. On the algorithmic level, we propose FABNet, a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs. On the hardware level, a novel adaptable butterfly accelerator is proposed that can be configured at runtime via dedicated hardware control to accelerate different butterfly layers using a single unified hardware engine. On the Long-Range-Arena dataset, FABNet achieves the same accuracy as the vanilla Transformer while reducing the amount of computation by 10–66×
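
For intuition about the butterfly sparsity pattern, the sketch below applies a butterfly linear operator as log2(N) stages of 2x2 mixing, which costs O(N log N) work instead of the O(N^2) of a dense layer. The weight layout and the plain-loop formulation are illustrative, not FABNet's implementation.

```python
# A sketch of a butterfly linear operator: log2(N) stages of 2x2 mixing.
import numpy as np

def butterfly_apply(x, stage_weights):
    """x: (N,) vector; stage_weights[s]: (N//2, 2, 2) blocks for stage s."""
    n = x.shape[0]
    y = x.copy()
    for s, w in enumerate(stage_weights):      # stride doubles each stage
        stride = 1 << s
        out = np.empty_like(y)
        block = 0
        for start in range(0, n, 2 * stride):
            for i in range(start, start + stride):
                a, b = y[i], y[i + stride]
                out[i]          = w[block, 0, 0] * a + w[block, 0, 1] * b
                out[i + stride] = w[block, 1, 0] * a + w[block, 1, 1] * b
                block += 1
        y = out
    return y

if __name__ == "__main__":
    n = 8
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((n // 2, 2, 2)) for _ in range(int(np.log2(n)))]
    print(butterfly_apply(rng.standard_normal(n), weights))
```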

DOI: 10.1109/MICRO56248.2022.00050


DFX: A Low-Latency Multi-FPGA Appliance for Accelerating Transformer-Based Text Generation

作者: Hong, Seongmin and Moon, Seungjae and Kim, Junsoo and Lee, Sungjae and Kim, Minsub and Lee, Dongsoo and Kim, Joo-Young
关键词: model parallelism, multi-FPGA acceleration, datacenter, text generation, GPT, natural language processing

Abstract

Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58×
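
The two-stage behavior described above can be sketched with a stand-in forward pass: the prompt is processed once in a wide, parallel step, after which tokens must be produced one at a time, which is what makes the generation stage latency-bound. model_forward() and the vocabulary size are placeholders, not the GPT-2 pipeline that DFX accelerates.

```python
# A sketch of the summarization and generation stages of text generation.
import numpy as np

def model_forward(token_ids, rng):
    """Placeholder for a GPT-style forward pass; returns logits for the last position."""
    return rng.standard_normal(50257)          # hypothetical vocabulary size

def generate(prompt_ids, max_new_tokens, seed=0):
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    _ = model_forward(ids, rng)                # summarization: one wide, parallel pass
    for _ in range(max_new_tokens):            # generation: strictly sequential steps
        logits = model_forward(ids, rng)
        ids.append(int(np.argmax(logits)))     # greedy decoding for simplicity
    return ids

if __name__ == "__main__":
    print(generate(prompt_ids=[464, 3290, 318], max_new_tokens=5))
```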

DOI: 10.1109/MICRO56248.2022.00051


HARMONY: Heterogeneity-Aware Hierarchical Management for Federated Learning System

作者: Tian, Chunlin and Li, Li and Shi, Zhan and Wang, Jun and Xu, ChengZhong
关键词: mobile device, heterogeneous systems, federated learning

Abstract

Federated learning (FL) enables multiple devices to collaboratively train a shared model while preserving data privacy. However, despite its emerging applications in many areas, real-world deployment of on-device FL is challenging due to wildly diverse training capability and data distribution across heterogeneous edge devices, which heavily impact both model performance and training efficiency. This paper proposes Harmony, a high-performance FL framework with heterogeneity-aware hierarchical management of training devices and training data. Unlike previous work that mainly focuses on heterogeneity in either training capability or data distribution, Harmony adopts a hierarchical structure to jointly handle both heterogeneities in a unified manner. Specifically, the two core components of Harmony are a global coordinator hosted by the central server and a local coordinator deployed on each participating device. Without accessing the raw data, the global coordinator first selects the participants, and then further reorganizes their training samples based on an accurate estimation of the runtime training capability and data distribution of each device. The local coordinator keeps monitoring the local training status and conducts efficient training with guidance from the global coordinator. We conduct extensive experiments to evaluate Harmony using both hardware and simulation testbeds on representative datasets. The experimental results show that Harmony improves accuracy by 1.67%–27.62%. In addition, Harmony effectively accelerates the training process by up to 3.29×

DOI: 10.1109/MICRO56248.2022.00049


Leaky Way: A Conflict-Based Cache Covert Channel Bypassing Set Associativity

作者: Guo, Yanan and Xin, Xin and Zhang, Youtao and Yang, Jun
关键词: cache security, replacement policy, side channels

Abstract

Modern x86 processors feature many prefetch instructions that developers can use to enhance performance. However, with some prefetch instructions, users can more directly manipulate cache states, which may result in powerful cache covert-channel and side-channel attacks. In this work, we reverse-engineer the detailed cache behavior of PREFETCHNTA on various Intel processors. Based on the results, we first propose a new conflict-based cache covert channel named NTP+NTP. Prior conflict-based channels often require priming the cache set in order to cause cache conflicts. In contrast, in NTP+NTP, the data of the sender and receiver can compete for one specific way in the cache set, achieving cache conflicts without cache set priming for the first time. As a result, NTP+NTP has higher bandwidth than prior conflict-based channels such as Prime+Probe. The channel capacity of NTP+NTP is 302 KB/s. Second, we found that PREFETCHNTA can also be used to boost the performance of existing side-channel attacks that utilize cache replacement states, making those attacks much more efficient than before.

DOI: 10.1109/MICRO56248.2022.00053


SwiftDir: Secure Cache Coherence without Overprotection

作者: Miao, Chenlu and Bu, Kai and Li, Mengming and Mao, Shaowu and Jia, Jianwei
关键词: address translation, shared data, timing-channel attack, cache coherence

Abstract

Cache coherence states have recently been exploited to leak secrets through timing-channel attacks. The root cause lies in the fact that shared data in state Exclusive (E) and state Shared (S) are served from different cache layers. The state-of-the-art countermeasure—S-MESI—serves both E- and S-state shared data from the last-level cache (LLC) by explicitly synchronizing the Modified (M) state across private caches and the LLC. This has to sacrifice the silent-upgrade feature that MESI introduces for speedup. Moreover, it enforces protection not only on exploitable shared data but also on unshared data. This further slows down performance, especially for write-after-read-intensive applications. In this paper, we propose SwiftDir to efficiently secure cache coherence against covert-channel attacks without overprotection. SwiftDir fundamentally narrows down the protection scope to write-protected data. Such exploitable shared data can be uniquely identified by the write-protection permission in the memory management unit (MMU) and do not necessarily transit to state M. We validate this idea by tracing system calls of shared libraries on Linux. We then investigate all three commercial cache architectures (i.e., PIPT, VIPT, and VIVT) and find it feasible to piggyback on the address translation process to transmit the write-protection information from the MMU to the coherence controller. SwiftDir then enforces protection over only write-protected data by serving all requests toward them directly from the LLC with a constant latency. This not only simplifies how MESI handles write-protected data but also avoids how S-MESI overprotects it. Meanwhile, SwiftDir still preserves silent upgrades for efficient handling of unshared data. Extensive experiments demonstrate that SwiftDir can secure cache coherence while outperforming not only secure S-MESI but also unprotected MESI.

DOI: 10.1109/MICRO56248.2022.00052


Self-Reinforcing Memoization for Cryptography Calculations in Secure Memory Systems

作者: Wang, Xin and Talapkaliyev, Daulet and Hicks, Matthew and Jian, Xun
关键词: memoization, memory subsystem, counter-mode AES, memory confidentiality and integrity

Abstract

Modern memory systems use encryption and message authentication codes to ensure confidentiality and integrity. Encryption and integrity verification rely on cryptography calculations, which are slow. To hide the latency of cryptography calculations, prior works exploit the fact that many cryptography steps only require a memory block’s write counter (i.e., a value that increases whenever the block is written to memory), but not the block itself. As such, the memory controller (MC) caches counters so that the MC can start calculating before missing blocks arrive from memory. Irregular workloads suffer from high counter miss rates, however, just like they suffer from high miss rates of page table entries. Many prior works have looked at the problem of page table entry misses for irregular workloads, but not the problem of counter misses for these workloads. This paper addresses the memory latency overheads that irregular workloads suffer due to their high counter miss rate. We observe that many (in principle, an unlimited number of) counters can have the same value. As such, we propose memoizing cryptography calculations for hot counter values. When a counter arrives from memory, the MC can use the counter value to look up a memoization table to quickly obtain the counter’s memoized results instead of slowly recalculating them. To maximize the memoization table hit rate, we observe that, whenever writing a block to memory, increasing its counter to any value higher than the current counter value satisfies the security requirement of always using different counter values to encrypt the same block. As such, we also propose a memoization-aware counter update: when writing a block to memory, increase its counter to a value whose cryptography calculation is currently memoized. We refer to memoizing the calculation results of counters and the corresponding memoization-aware counter update collectively as Self-Reinforcing Memoization for Cryptography Calculations (RMCC). Our evaluations show that RMCC improves average performance by 6% compared to the state-of-the-art. On average across the lifetimes of different workloads, RMCC accelerates decryption and verification for 92% of counter misses.
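
A minimal sketch of the two ideas, with a stand-in hash in place of the real counter-mode cryptography: a table memoizes per-counter results, and the counter update on a writeback jumps to an already-memoized value larger than the current one, which still never reuses a counter value for the same block. The table size, eviction policy, and names are assumptions for illustration.

```python
# Sketch of counter-value memoization plus a memoization-aware counter update.
# The "crypto" is a stand-in hash, not the real counter-mode AES pipeline.
import hashlib

class CounterMemo:
    def __init__(self, capacity=1024):
        self.table = {}                  # counter value -> memoized pad/MAC material
        self.capacity = capacity

    def lookup_or_compute(self, counter):
        if counter not in self.table:
            if len(self.table) >= self.capacity:
                self.table.pop(next(iter(self.table)))     # naive eviction
            self.table[counter] = hashlib.sha256(counter.to_bytes(8, "little")).digest()
        return self.table[counter]

    def next_counter(self, current):
        """Memoization-aware update: prefer an already-memoized value > current."""
        memoized = [c for c in self.table if c > current]
        return min(memoized) if memoized else current + 1

if __name__ == "__main__":
    memo = CounterMemo()
    for hot in (100, 200, 300):
        memo.lookup_or_compute(hot)
    ctr = 150
    ctr = memo.next_counter(ctr)          # jumps to 200, whose result is memoized
    print(ctr, memo.lookup_or_compute(ctr)[:4].hex())
```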

DOI: 10.1109/MICRO56248.2022.00055


Eager Memory Cryptography in Caches

作者: Wang, Xin and Kotra, Jagadish B. and Jian, Xun
关键词: network-on-chip, cache hierarchy, counter-mode AES, memory encryption and verification

Abstract

To protect memory values from adversaries with physical access to data centers, secure memory systems ensure memory confidentiality and integrity via memory encryption and verification. The corresponding cryptography calculations require a memory block’s write counter as input. As such, CPUs today cache counters in the memory controller (MC). Due to the large memory footprint and irregular access patterns of many real-world applications, the MC’s counter cache is too small to achieve a high hit rate. A promising solution is also caching counters in the much bigger last-level cache (LLC). As such, many prior works use the LLC as a second-level cache for counters to back up the smaller counter cache in the MC. Caching counters in the LLC introduces a new problem, however. Modern server CPUs have a long LLC access latency that not only can diminish the benefit of caching counters in the LLC, but also can sometimes significantly increase counter access latency compared to not caching counters in the LLC. We note the problem lies with the MC sitting behind the LLC; due to its physical location, the MC can only see LLC misses and, therefore, can only serially access and use counters after a data miss in the LLC has completed. However, prior designs without caching counters in the LLC can access and use counters in parallel with accessing data. If a block’s counter misses in the MC’s counter cache, the MC can fetch the counter from DRAM in parallel with the data; if the counter hits in the MC’s counter cache, the MC can use counters for cryptography calculation in parallel with the data traveling from DRAM to the MC. To parallelize the access and use of counters with data access while caching counters in the LLC, we observe that in modern CPUs, L2 is typically the first place that caches data from DRAM (i.e., L2 and L3 are non-inclusive); as such, data from DRAM need not be decrypted and verified until they reach L2. So it is possible to offload some decryption and verification tasks from the MC to L2. Since L2 sits before L3, L2 can access counter and data in parallel from L3; L2 can also use counters for cryptography calculation in parallel with data traveling from DRAM to L2, instead of just from DRAM to the MC. As such, we propose caching and using counters directly in L2 and refer to this idea as Eager Memory Cryptography in Caches (EMCC). Our evaluation shows that, when applied to the state-of-the-art baseline, EMCC improves the performance of large and/or irregular workloads by 7%, on average.

DOI: 10.1109/MICRO56248.2022.00054


GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping

作者: Mao, Haiyu and Alser, Mohammed and Sadrosadati, Mohammad and Firtina, Can and Baranwal, Akanksha and Cali, Damla Senol and Manglik, Aditya and Alserr, Nour Almadhoun and Mutlu, Onur
关键词: No keywords

Abstract

Nanopore sequencing is a widely-used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally-costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that such separate execution of the two most time-consuming steps inherently leads to (1) significant data movement and (2) redundant computations on the data, slowing down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory fine-grained collaborative execution of the major genome analysis steps in parallel; (2) a new technique for early rejection of low-quality and unmapped reads to stop the execution of genome analysis for such reads in a timely manner, reducing wasted computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6×

DOI: 10.1109/MICRO56248.2022.00056


BEACON: Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support

作者: Huangfu, Wenqin and Malladi, Krishna T. and Chang, Andrew and Xie, Yuan
关键词: memory dis-aggregation, accelerator, software-hardware co-design, near-data-processing, genome analysis

Abstract

Genome analysis benefits precise medical care, wildlife conservation, pandemic treatment (e.g., COVID-19), and so on. Unfortunately, in genome analysis, the speed of data processing lags far behind the speed of data generation. Thus, hardware acceleration turns out to be necessary. As many applications in genome analysis are memory-bound, Processing-In-Memory (PIM) and Near-Data-Processing (NDP) solutions have been explored to tackle this problem. In particular, Dual-Inline-Memory-Module (DIMM) based designs are very promising due to their non-invasive nature with respect to the cost-sensitive DRAM dies. However, they have two critical limitations, i.e., performance bottlenecked by communication and limited potential for memory expansion. In this paper, we address these two limitations by designing novel DIMM-based accelerators located near the disaggregated memory pool with support from the Compute Express Link (CXL), aiming to leverage the abundant memory within the memory pool and the high communication bandwidth provided by CXL. We propose BEACON, Scalable Near-Data-Processing Accelerators for Genome Analysis near Memory Pool with the CXL Support. BEACON adopts a software-hardware co-design approach to tackle the above two limitations. The BEACON architecture builds the foundation for efficient communication and memory expansion by reducing data movement and leveraging the high communication bandwidth provided by CXL. Based on the BEACON architecture, we propose a memory management framework to enable memory expansion with unmodified CXL-DIMMs and further optimize communication by improving data locality. We also propose algorithm-specific optimizations to further boost the performance of BEACON. In addition, BEACON provides two design choices, i.e., BEACON-D and BEACON-S. BEACON-D and BEACON-S perform the computation within the enhanced CXL-DIMMs and enhanced CXL-Switches, respectively. Experimental results show that, compared with state-of-the-art DIMM-based NDP accelerators, BEACON-D and BEACON-S improve performance by 4.70x and 4.13x on average, respectively.

DOI: 10.1109/MICRO56248.2022.00057


Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

作者: Yazdanbakhsh, Amir and Moradifirouzabadi, Ashkan and Li, Zheng and Kang, Mingu
关键词: hardware-software co-design, deep learning, ReRAM, neural processing units, in-memory computing, model compression, sparsity, self-attention, attention mechanism, transformer

Abstract

As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the benefits of runtime pruning of elements with low attention scores, the quadratic complexity of self-attention mechanisms and their on-chip memory capacity demands are overlooked. This work addresses these constraints by architecting an accelerator, called Sprint, which leverages the inherent parallelism of ReRAM crossbar arrays to compute attention scores in an approximate manner. Our design prunes the low attention scores using lightweight analog thresholding circuitry within ReRAM, enabling Sprint to fetch only a small subset of relevant data to on-chip memory. To mitigate potential negative repercussions for model accuracy, Sprint re-computes the attention scores for the few fetched data in the digital domain. The combined in-memory pruning and on-chip recomputation of the relevant attention scores enables Sprint to transform the quadratic complexity into a merely linear one. In addition, we identify and leverage a dynamic spatial locality between adjacent attention operations even after pruning, which eliminates costly yet redundant data fetches. We evaluate our proposed technique on a wide range of state-of-the-art transformer models. On average, Sprint yields 7.5×
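
The prune-then-recompute flow can be sketched as follows: approximate scores (computed in ReRAM in the paper, emulated here with coarsely rounded keys) select which keys survive a threshold, and only the survivors' scores are recomputed exactly before the softmax. The shapes, the rounding, and the threshold value are illustrative choices.

```python
# Sketch of threshold-pruned attention with exact recomputation of survivors.
import numpy as np

def sparse_attention(q, K, V, approx_K, threshold):
    approx_scores = approx_K @ q                     # cheap, low-precision estimate
    keep = np.flatnonzero(approx_scores >= threshold)
    exact = K[keep] @ q / np.sqrt(q.shape[0])        # recompute only the survivors
    weights = np.exp(exact - exact.max())
    weights /= weights.sum()
    return weights @ V[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 64, 512
    q = rng.standard_normal(d)
    K = rng.standard_normal((n, d))
    V = rng.standard_normal((n, d))
    approx_K = np.round(K * 4) / 4                    # stand-in for reduced precision
    out, kept = sparse_attention(q, K, V, approx_K, threshold=5.0)
    print(f"kept {kept.size}/{n} keys", out[:3])
```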

DOI: 10.1109/MICRO56248.2022.00059


ICE: An Intelligent Cognition Engine with 3D NAND-Based In-Memory Computing for Vector Similarity Search Acceleration

作者: Hu, Han-Wen and Wang, Wei-Chen and Chang, Yuan-Hao and Lee, Yung-Chun and Lin, Bo-Rong and Wang, Huai-Mu and Lin, Yen-Po and Huang, Yu-Ming and Lee, Chong-Ying and Su, Tzu-Hsiang and Hsieh, Chih-Chang and Hu, Chia-Ming and Lai, Yi-Ting and Chen, Chung-Kuang and Chen, Han-Sung and Li, Hsiang-Pang and Kuo, Tei-Wei and Chang, Meng-Fan and Wang, Keh-Chung and Hung, Chun-Hsiung and Lu, Chih-Yuan
关键词: unstructured data search, vector similarity search, in-memory computing, 3D NAND

Abstract

Vector similarity search (VSS) for unstructured vectors generated via machine learning methods is a promising solution for many applications, such as face search. With increasing awareness of and concern about data security requirements, there is a compelling need to store data and process VSS applications locally on edge devices rather than send data to servers for computation. However, the explosive amount of data movement from NAND storage to DRAM across the memory hierarchy and the data processing of the entire dataset consume enormous energy and require long latency for VSS applications. Specifically, edge devices with insufficient DRAM capacity will trigger data swaps and degrade execution performance. To overcome this crucial hurdle, we propose an intelligent cognition engine (ICE) with cognitive 3D NAND, featuring non-volatile in-memory computing (nvIMC) to accelerate the processing, suppress the data movement, and reduce data swaps between the processor and storage. This cognitive 3D NAND features digital nvIMC techniques (i.e., an ADC/DAC-free approach), high-density 3D NAND, and compatibility with standard 3D NAND products with minor modifications. To facilitate parallel INT8/INT4 vector-vector multiplication (VVM) and mitigate the reliability issues of 3D NAND, we develop a bit-error-tolerant data encoding and a two’s-complement-based digital accumulator. VVM can support similarity computations (e.g., cosine similarity and Euclidean distance), which are required to search for “the most similar data” right where they are stored. In addition, the proposed solution can be realized on edge storage products, e.g., embedded MultiMedia Card (eMMC). The measured and simulated results on real 3D NAND chips show that ICE improves system execution time by 17×

DOI: 10.1109/MICRO56248.2022.00058


CORUSCANT: Fast Efficient Processing-in-Racetrack Memories

作者: Ollivier, Sebastien and Longofono, Stephen and Dutta, Prayash and Hu, Jingtong and Bhanja, Sanjukta and Jones, Alex K.
关键词: machine learning, novel memories, domain-wall memory, processing-in-memory

Abstract

The growth in data needs of modern applications has created significant challenges for modern systems, leading to a “memory wall.” Spintronic Domain-Wall Memory (DWM) provides near-SRAM read/write performance, energy savings, non-volatility, the potential for extremely high storage density, and no significant endurance limitations. However, DWM’s benefits cannot directly address the data access latency and throughput limitations of memory bus bandwidth. Processing-in-memory (PIM) is a popular solution to reduce the demands of memory-to-processor communication by offloading computation directly to the memory. PIM has been proposed in multiple technologies including DRAM, Phase-Change Memory (PCM), resistive memory (ReRAM), and Spin-Transfer Torque Memory (STT-MRAM). DRAM PIM provides solutions for a restricted set of two-operand bulk-bitwise operations. PIM in PCM and ReRAM raises concerns about their effective endurance, and PIM in STT-MRAM has insufficient density for main-memory applications. We propose CORUSCANT, a DWM-based in-memory computing solution that leverages the properties of DWM nanowires and allows them to serve as polymorphic gates. While DWM is normally accessed by applying spin-polarized currents orthogonal to the nanowire at access points to read individual bits, transverse access along the DWM nanowire allows the differentiation of the aggregate resistance of multiple bits in the nanowire, akin to a multi-level cell. CORUSCANT leverages this transverse reading to directly provide multi-operand bulk-bitwise logic. Leveraging this multi-operand concept enabled by transverse access, CORUSCANT provides techniques to conduct multi-operand addition and two-operand multiplication much more efficiently than prior digital PIM solutions. CORUSCANT provides a 1.6×

DOI: 10.1109/MICRO56248.2022.00060


IDLD: Instantaneous Detection of Leakage and Duplication of Identifiers Used for Register Renaming

作者: Sazeides, Yiannakis and Gerber, Alex and Gabor, Ron and Bramnik, Arkady and Papadimitriou, George and Gizopoulos, Dimitris and Nicopoulos, Chrysostomos and Dimitrakopoulos, Giorgos and Patsidis, Karyofyllis
关键词: design bugs, pipeline, microarchitecture, merged register file, register renaming, post-silicon validation

Abstract

In this paper, we propose a cost-effective microarchitectural technique capable of Instantaneously Detecting the Leakage and Duplication (IDLD) of the physical register identifiers used for register renaming in modern out-of-order processor cores. Leakage occurs when a physical register identifier disappears, whereas duplication occurs when a physical register identifier appears twice throughout the renaming logic. IDLD checks, each cycle, that a code calculated by XORing the physical register identifiers read from and written to the arrays used for managing physical register allocation, renaming, and reclamation is zero. This invariance is intrinsic to the register renaming subsystem’s functionality and allows detecting identifier leakage and duplication instantaneously. Detection of bugs in the complex register renaming subsystem is challenging, since: (a) its operation is not directly observable in program- or architecturally-visible locations, (b) it lies in time-critical paths in the heart of every modern out-of-order core, and (c) it is often the target of optimizations in new core designs, and thus more susceptible to bugs than legacy subsystems. We demonstrate that bugs in the renaming logic can be very difficult to root-cause because, in numerous cases, it takes excessive time, e.g., millions of cycles, for a duplication or leakage to become an architecturally observable error. Even worse, activations of such bugs, depending on microarchitectural state, are often masked by subsequent hardware operations. Hence, an activation of a rarely occurring leakage or duplication bug during post-silicon validation can go undetected and escape into the field. The difficulty of root-causing register identifier duplication and leakage without IDLD is demonstrated using detailed bug modeling at the microarchitecture level, whereas the low overhead of IDLD is confirmed using RTL design analysis.
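
A toy model of the XOR invariant: if every physical register identifier appears exactly once across the renaming structures, XORing all of them together must equal the XOR of the full identifier set, so any leak or duplication makes the check non-zero in the same cycle. The free list, map table, and reclaim queue below are simplified stand-ins for the arrays the paper instruments.

```python
# Toy model of the XOR invariant over physical register identifiers.
from functools import reduce

NUM_PREGS = 16
ALL_XOR = reduce(lambda a, b: a ^ b, range(NUM_PREGS), 0)

def idld_check(free_list, map_table, reclaim_queue):
    acc = 0
    for ident in (*free_list, *map_table, *reclaim_queue):
        acc ^= ident
    return acc ^ ALL_XOR == 0           # zero iff no identifier leaked or duplicated

if __name__ == "__main__":
    free_list = list(range(8, 16))
    map_table = list(range(0, 6))       # architectural regs r0..r5 -> p0..p5
    reclaim_queue = [6, 7]
    print("healthy:", idld_check(free_list, map_table, reclaim_queue))   # True
    map_table[3] = 7                    # bug: p7 duplicated, p3 leaked
    print("after bug:", idld_check(free_list, map_table, reclaim_queue)) # False
```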

DOI: 10.1109/MICRO56248.2022.00061


HiRA: Hidden Row Activation for Reducing Refresh Latency of Off-the-Shelf DRAM Chips

作者: Yağlıkçı, A. Giray
关键词: No keywords

Abstract

DRAM is the building block of modern main memory systems. DRAM cells must be periodically refreshed to prevent data loss. Refresh operations degrade system performance by interfering with memory accesses. As DRAM chip density increases with technology node scaling, refresh operations also increase because: 1) the number of DRAM rows in a chip increases; and 2) DRAM cells need additional refresh operations to mitigate bit failures caused by RowHammer, a failure mechanism that becomes worse with technology node scaling. Thus, it is critical to enable refresh operations at low performance overhead. To this end, we propose a new operation, Hidden Row Activation (HiRA), and the HiRA Memory Controller (HiRA-MC) to perform HiRA operations. HiRA hides a refresh operation’s latency by refreshing a row concurrently with accessing or refreshing another row within the same bank. Unlike prior works, HiRA achieves this parallelism without any modifications to off-the-shelf DRAM chips. To do so, it leverages the new observation that two rows in the same bank can be activated without data loss if the rows are connected to different charge restoration circuitry. We experimentally demonstrate on 56 real off-the-shelf DRAM chips that HiRA can reliably parallelize a DRAM row’s refresh operation with the refresh or activation of any of the 32% of the rows within the same bank. By doing so, HiRA reduces the overall latency of two refresh operations by 51.4%. HiRA-MC modifies the memory request scheduler to perform HiRA when a refresh operation can be performed concurrently with a memory access or another refresh. Our system-level evaluations show that HiRA-MC increases system performance by 12.6% and 3.73×

DOI: 10.1109/MICRO56248.2022.00062


AgileWatts: An Energy-Efficient CPU Core Idle-State Architecture for Latency-Sensitive Server Applications

作者: Yahya, Jawad Haj and Volos, Haris and Bartolini, Davide B. and Antoniou, Georgia and Kim, Jeremie S. and Wang, Zhe and Kalaitzidis, Kleovoulos and Rollet, Tom and Chen, Zhirui and Geng, Ye and Mutlu, Onur and Sazeides, Yiannakis
关键词: No keywords

Abstract

User-facing applications running in modern datacenters exhibit irregular request patterns and are implemented using a multitude of services with tight latency requirements (30–250μs). These characteristics render existing energy-conserving techniques ineffective when processors are idle, due to the long transition time (on the order of 100μs) from a deep CPU core idle power state (C-state). While prior works propose management techniques to mitigate this inefficiency, we tackle it at its root with AgileWatts (AW): a new deep CPU core C-state architecture optimized for datacenter server processors targeting latency-sensitive applications. AW drastically reduces the transition latency from deep CPU core idle power states while retaining most of their power savings, based on three key ideas. First, AW eliminates the latency (several microseconds) of saving/restoring the core context when powering off/on the core in a deep idle state by i) implementing medium-grained power-gates, carefully distributed across the CPU core, and ii) retaining context in the power-ungated domain. Second, AW eliminates the flush latency (several tens of microseconds) of the L1/L2 caches when entering a deep idle state by keeping L1/L2 content power-ungated. A small control logic also remains ungated to serve cache coherence traffic. AW implements cache sleep-mode and leakage reduction for the power-ungated domain by lowering a core’s voltage to the minimum operational level. Third, using a state-of-the-art power-efficient all-digital phase-locked loop (ADPLL) clock generator, AW keeps the PLL active and locked during the idle state, cutting microseconds of wake-up latency at negligible power cost. Our evaluation with an accurate industrial-grade simulator calibrated against an Intel Skylake server shows that AW reduces the energy consumption of Memcached by up to 71% (35% on average) with <1% end-to-end performance degradation. We observe similar trends for other evaluated services (MySQL and Kafka). AW’s new deep C-states C6A and C6AE reduce transition time by up to 900×

DOI: 10.1109/MICRO56248.2022.00063


AgilePkgC: An Agile System Idle State Architecture for Energy Proportional Datacenter Servers

作者: Antoniou, Georgia and Volos, Haris and Bartolini, Davide B. and Rollet, Tom and Sazeides, Yiannakis and Yahya, Jawad Haj
关键词: No keywords

Abstract

Modern user-facing applications deployed in datacenters use a distributed system architecture that exacerbates the latency requirements of their constituent microservices (30–250μs). Existing CPU power-saving techniques degrade the performance of these applications due to the long transition latency (on the order of 100μs) to wake up from a deep CPU idle state (C-state). For this reason, server vendors recommend only enabling shallow core C-states (e.g., CC1) for idle CPU cores, thus preventing the system from entering deep package C-states (e.g., PC6) when all CPU cores are idle. This choice, however, impairs server energy proportionality since power-hungry resources (e.g., IOs, uncore, DRAM) remain active even when there is no active core to use them. As we show, it is common for all cores to be idle due to the low average utilization (e.g., 5–20%) of datacenter servers running user-facing applications. We propose to reap this opportunity with AgilePkgC (APC), a new package C-state architecture that improves the energy proportionality of server processors running latency-critical applications. APC implements PC1A (package C1 agile), a new deep package C-state that a system can enter once all cores are in a shallow C-state (i.e., CC1) and that has a nanosecond-scale transition latency. PC1A is based on four key techniques. First, a hardware-based agile power management unit (APMU) rapidly detects when all cores enter a shallow core C-state (CC1) and triggers the system-level power savings control flow. Second, an IO Standby Mode (IOSM) places IO interfaces (e.g., PCIe, DMI, UPI, DRAM) in shallow (nanosecond-scale transition latency) low-power modes. Third, a CLM Retention (CLMR) mode rapidly reduces the CLM (Cache-and-home-agent, Last-level-cache, and Mesh network-on-chip) domain’s voltage to its retention level, drastically reducing its power consumption. Fourth, APC keeps all system PLLs active in PC1A to allow nanosecond-scale exit latency by avoiding PLL re-locking overhead. Combining these techniques enables significant power savings while requiring less than 200ns transition latency, >250×

DOI: 10.1109/MICRO56248.2022.00065


Realizing Emotional Interactions to Learn User Experience and Guide Energy Optimization for Mobile Architectures

作者: Li, Xueliang and Shi, Zhuobin and Chen, Junyang and Liu, Yepang
关键词: No keywords

Abstract

In the age of AI, mobile architectures such as smartphones are still “cold machines”; machines do not feel. If the architecture were able to sense users’ feelings and runtime user experience (UX), it could adapt performance/energy accordingly to find the optimal system-operating state that consumes the least energy to satisfy users. In this paper, we utilize users’ facial expressions (FEs) to learn their runtime UX. FEs are the natural and direct way for humans to convey their emotions and feelings, and our study reveals that FEs also reflect UX. Our research quantifies the link between FEs and UX for the first time. Leveraging this link, the architecture can use the front camera to see FEs and infer users’ UX, and based on UX it can appropriately provision computing resources. We propose the Vi-energy system to realize this idea. Our evaluation shows that Vi-energy reduces energy consumption by up to 52.9% while preserving UX.

DOI: 10.1109/MICRO56248.2022.00064


FracDRAM: Fractional Values in Off-the-Shelf DRAM

作者: Gao, Fei and Tziantzioulis, Georgios and Wentzlaff, David
关键词: PUF, memory controller, processing-with-memory, PIM, DRAM

Abstract

As one of the cornerstones of computing, dynamic random-access memory (DRAM) is prevalent across digital systems. Over the years, researchers have proposed modifications to DRAM macros or explored alternative uses of existing DRAM chips to extend the functionality of this ubiquitous medium. This work expands on the latter, providing new insights and demonstrating new functionalities in unmodified, commodity DRAM. FracDRAM is the first work to show how fractional values can be stored in off-the-shelf DRAM. We propose two primitive operations built with specially timed DRAM command sequences, to either store fractional values to the entire DRAM row or to masked bits in a row. Utilizing fractional values, this work enables more modules to perform the in-memory majority operation, increases the stability of the existing in-memory majority operation, and builds a state-of-the-art DRAM-based PUF with unmodified DRAM. In total, 582 DDR3 chips from seven major vendors are evaluated and characterized under different environments in this work. FracDRAM breaks through the conventional binary abstraction of DRAM logic, and brings new functions to the existing DRAM macro.
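For readers unfamiliar with the in-memory majority primitive mentioned above, the snippet below is a purely functional software model of the three-input bitwise majority that such charge-sharing schemes expose (the DRAM mechanism itself obviously cannot be reproduced in software; the array sizes and values are arbitrary).

```python
# Functional model of the bitwise three-input majority exposed by in-DRAM compute
# schemes; a purely software illustration, not the charge-sharing mechanism itself.
import numpy as np

def maj3(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Bitwise MAJ over packed 64-bit words: each output bit is 1 iff >= 2 inputs are 1."""
    return (a & b) | (b & c) | (a & c)

rng = np.random.default_rng(0)
a, b, c = (rng.integers(0, 2**63, size=1024, dtype=np.uint64) for _ in range(3))
out = maj3(a, b, c)

# Fixing the third operand to all-zeros or all-ones degenerates MAJ into AND or OR,
# which is how MAJ-based substrates implement bulk bitwise logic.
zeros, ones = np.zeros_like(c), ~np.zeros_like(c)
assert np.array_equal(maj3(a, b, zeros), a & b)
assert np.array_equal(maj3(a, b, ones), a | b)
```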

DOI: 10.1109/MICRO56248.2022.00066


pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

作者: Ferreira, João et al.
关键词: No keywords

Abstract

Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high performance and energy efficiency, but existing PuM techniques support a limited range of operations. As a result, current PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without large increases in chip area and design complexity. To overcome these limitations of existing PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The key idea of pLUTo is to replace complex operations with low-cost, bulk memory reads (i.e., LUT queries) instead of relying on complex extra logic. We evaluate pLUTo across 11 real-world workloads that showcase the limitations of prior PuM approaches and show that our solution outperforms optimized CPU and GPU baselines by an average of 713×…
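The core idea, replacing an arithmetic operator with a bulk table query, can be illustrated in a few lines. The sketch below is a software analogue only: the table lives in a numpy array rather than in DRAM rows, and the 8-bit multiply is my own choice of example operation.

```python
# Software analogue of LUT-based computation: an 8-bit multiply answered purely
# by table lookups. In pLUTo the table would live in DRAM rows; here it is a
# numpy array, and the operation choice (multiply) is just an example.
import numpy as np

a_vals = np.arange(256, dtype=np.uint16)
LUT = np.outer(a_vals, a_vals)            # LUT[a, b] == a * b for all 8-bit a, b

def lut_multiply(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Element-wise uint8 multiply performed only through table queries."""
    return LUT[a.astype(np.intp), b.astype(np.intp)]

rng = np.random.default_rng(1)
a = rng.integers(0, 256, size=1 << 20, dtype=np.uint8)
b = rng.integers(0, 256, size=1 << 20, dtype=np.uint8)
assert np.array_equal(lut_multiply(a, b), a.astype(np.uint16) * b.astype(np.uint16))
```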

DOI: 10.1109/MICRO56248.2022.00067


Multi-Layer In-Memory Processing

作者: Fujiki, Daichi and Khadem, Alireza and Mahlke, Scott and Das, Reetuparna
关键词: GNN, accelerator, processing in memory, in-memory computing

Abstract

In-memory computing provides revolutionary changes to computer architecture by fusing memory and computation, allowing data-intensive computations to reduce data communications. Despite promising results of in-memory computing in each layer of the memory hierarchy, an integrated approach to a system with multiple computable memories has not been examined. This paper presents a holistic and application-driven approach to building Multi-Layer In-Memory Processing (MLIMP) systems, enabling applications with variable computation demands to reap the benefits of heterogeneous compute resources in an integrated MLIMP system. By introducing concurrent task scheduling to MLIMP, we achieve improved performance and energy efficiency for graph neural networks and multiprogramming of data parallel applications.

DOI: 10.1109/MICRO56248.2022.00068


Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory

作者: Park, Jisung and Azizi, Roknoddin and Oliveira, Geraldo F. and Sadrosadati, Mohammad and Nadig, Rakesh and Novo, David and Gómez-Luna, Juan et al.
关键词: No keywords

Abstract

Bulk bitwise operations, i.e., bitwise operations on large bit vectors, are prevalent in a wide range of important application domains, including databases, graph processing, genome analysis, cryptography, and hyper-dimensional computing. In conventional systems, the performance and energy efficiency of bulk bitwise operations are bottlenecked by data movement between the compute units (e.g., CPUs and GPUs) and the memory hierarchy. In-flash processing (i.e., processing data inside NAND flash chips) has a high potential to accelerate bulk bitwise operations by fundamentally reducing data movement through the entire memory hierarchy, especially when the processed data does not fit into main memory. We identify two key limitations of the state-of-the-art in-flash processing technique for bulk bitwise operations: (i) it falls short of maximally exploiting the bit-level parallelism of bulk bitwise operations that could be enabled by leveraging the unique cell-array architecture and operating principles of NAND flash memory; (ii) it is unreliable because it is not designed to take into account the highly error-prone nature of NAND flash memory. We propose Flash-Cosmos (Flash Computation with One-Shot Multi-Operand Sensing), a new in-flash processing technique that significantly increases the performance and energy efficiency of bulk bitwise operations while providing high reliability. Flash-Cosmos introduces two key mechanisms that can be easily supported in modern NAND flash chips: (i) Multi-Wordline Sensing (MWS), which enables bulk bitwise operations on a large number of operands (tens of operands) with a single sensing operation, and (ii) Enhanced SLC-mode Programming (ESP), which enables reliable computation inside NAND flash memory. We demonstrate the feasibility of performing bulk bitwise operations with high reliability in Flash-Cosmos by testing 160 real 3D NAND flash chips. Our evaluation shows that Flash-Cosmos improves average performance and energy efficiency by 3.5×…
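Functionally, what MWS computes is just a bitwise reduction across many operands in one shot; the sketch below models that semantics in software (operand counts, sizes, and the reduce-based implementation are illustrative, not the sensing mechanism itself).

```python
# Functional model of multi-operand bulk bitwise operations: the result of
# AND/OR across tens of packed bit vectors, which MWS obtains with a single
# sensing step. The software reduction below is only a semantic stand-in.
from functools import reduce
import numpy as np

def bulk_and(operands: list) -> np.ndarray:
    return reduce(np.bitwise_and, operands)

def bulk_or(operands: list) -> np.ndarray:
    return reduce(np.bitwise_or, operands)

rng = np.random.default_rng(2)
rows = [rng.integers(0, 2**63, size=4096, dtype=np.uint64) for _ in range(24)]
and_result = bulk_and(rows)   # 24 operands combined in one logical operation
or_result = bulk_or(rows)
```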

DOI: 10.1109/MICRO56248.2022.00069


Page Size Aware Cache Prefetching

作者: Vavouliotis, Georgios and Chacon, Gino and Alvarez, Lluc and Gratz, Paul V. and Jiménez, Daniel A. et al.
关键词: memory wall, memory management, large pages, address translation, virtual memory, hardware, microarchitecture, spatial correlation, prefetching, cache hierarchy

Abstract

The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial cache prefetchers operating in the physical address space leave significant performance on the table by limiting their pattern detection within 4KB physical page boundaries when modern systems use page sizes larger than 4KB to mitigate the address translation overheads. This paper exploits the high usage of large pages in modern systems to increase the effectiveness of spatial cache prefetching. We design and propose the Page-size Propagation Module (PPM), a μarchitectural scheme that propagates the page size information to the lower-level cache prefetchers, enabling safe prefetching beyond 4KB physical page boundaries when the accessed blocks reside in large pages, at the cost of augmenting the first-level caches’ Miss Status Holding Register (MSHR) entries with one additional bit. PPM is compatible with any cache prefetcher without implying design modifications. We capitalize on PPM’s benefits by designing a module that consists of two page size aware prefetchers that inherently use different page sizes to drive prefetching. The composite module uses adaptive logic to dynamically enable the most appropriate page size aware prefetcher. Finally, we show that the proposed designs are transparent to which cache prefetcher is used. We apply the proposed page size exploitation techniques to four state-of-the-art spatial cache prefetchers. Our evaluation shows that our proposals improve single-core geomean performance by up to 8.1% (2.1% at minimum) over the original implementation of the considered prefetchers, across 80 memory-intensive workloads. In multi-core contexts, we report geomean speedups up to 7.7% across different cache prefetchers and core configurations.
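The safety condition being relaxed here is simply "does the candidate line stay inside the trigger's physical page?". The toy check below makes that concrete (the line size, delta range, and example address are illustrative assumptions, not the PPM design).

```python
# Toy version of the boundary check that page-size propagation relaxes: a spatial
# prefetcher may only prefetch lines inside the trigger's physical page, so knowing
# the page is 2MB instead of assuming 4KB legalizes many more candidates.
# Line size, delta range, and the example address are illustrative assumptions.
BLOCK = 64  # cache line size in bytes

def same_page(addr_a: int, addr_b: int, page_size: int) -> bool:
    return addr_a // page_size == addr_b // page_size

def legal_prefetches(trigger_addr: int, deltas: list, page_size: int) -> list:
    """Keep only candidate lines that do not cross the trigger's page boundary."""
    candidates = (trigger_addr + d * BLOCK for d in deltas)
    return [c for c in candidates if same_page(trigger_addr, c, page_size)]

trigger = 0x1234_FF00                  # a demand access near the end of a 4KB frame
deltas = list(range(1, 33))            # 32 next-line candidates
print(len(legal_prefetches(trigger, deltas, 4 << 10)))   # 3: most candidates are dropped
print(len(legal_prefetches(trigger, deltas, 2 << 20)))   # 32: all are legal inside a 2MB page
```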

DOI: 10.1109/MICRO56248.2022.00070


Berti: An Accurate Local-Delta Data Prefetcher

作者: Navarro-Torres, Agustín et al.
关键词: timeliness, accuracy, local deltas, first-level cache, hardware prefetching, data prefetching

Abstract

Data prefetching is a technique that plays a crucial role in modern high-performance processors by hiding long latency memory accesses. Several state-of-the-art hardware prefetchers exploit the concept of deltas, defined as the difference between the cache line addresses of two demand accesses. Existing delta prefetchers, such as best offset prefetching (BOP) and multi-lookahead prefetching (MLOP), train and predict future accesses based on global deltas. We observed that the use of global deltas results in missed opportunities to anticipate memory accesses. In this paper, we propose Berti, a first-level data cache prefetcher that selects the best local deltas, i.e., those that consider only demand accesses issued by the same instruction. Thanks to a high-confidence mechanism that precisely detects the timely local deltas with high coverage, Berti generates accurate prefetch requests. Then, it orchestrates the prefetch requests to the memory hierarchy, using the selected deltas. Our empirical results using ChampSim and SPEC CPU2017 and GAP workloads show that, with a storage overhead of just 2.55 KB, Berti improves performance by 8.5% compared to a baseline IP-stride and 3.5% compared to IPCP, a state-of-the-art prefetcher. Our evaluation also shows that Berti reduces dynamic energy at the memory hierarchy by 33.6% compared to IPCP, thanks to its high prefetch accuracy.
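As a rough mental model of "local deltas": deltas are learned per load instruction (per PC) rather than across all loads, and only deltas that cover a large fraction of that load's accesses are used for prefetching. The sketch below is a simplified software model under those assumptions; the history depth, coverage threshold, and the omission of Berti's timeliness tracking are my simplifications, not the actual microarchitecture.

```python
# Simplified model of per-IP ("local") delta selection. Deltas are trained only
# against the same load instruction's recent accesses, and a delta is used for
# prefetching only if it would have covered enough of that load's accesses.
from collections import Counter, defaultdict, deque

class LocalDeltaPrefetcher:
    def __init__(self, history: int = 16, coverage: float = 0.5, max_deltas: int = 2):
        self.hist = defaultdict(lambda: deque(maxlen=history))  # IP -> recent line addrs
        self.delta_hits = defaultdict(Counter)                  # IP -> delta -> hit count
        self.trainings = Counter()                              # IP -> trained accesses
        self.coverage, self.max_deltas = coverage, max_deltas

    def access(self, ip: int, line_addr: int) -> list:
        # Training: which deltas from this IP's own history would have predicted this access?
        for prev in self.hist[ip]:
            if prev != line_addr:
                self.delta_hits[ip][line_addr - prev] += 1
        self.trainings[ip] += 1
        self.hist[ip].append(line_addr)
        # Prediction: issue the best local deltas whose coverage clears the threshold.
        best = self.delta_hits[ip].most_common(self.max_deltas)
        return [line_addr + d for d, hits in best
                if hits / self.trainings[ip] >= self.coverage]

pf = LocalDeltaPrefetcher()
for line in range(0, 256, 4):                 # one load IP striding by 4 cache lines
    prefetches = pf.access(ip=0x401a30, line_addr=line)
```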

DOI: 10.1109/MICRO56248.2022.00072


Translation-Optimized Memory Compression for Capacity

作者: Panwar, Gagandeep and Laghari, Muhammad and Bears, David and Liu, Yuqing and Jearls, Chandler and Choukse, Esha and Cameron, Kirk W. and Butt, Ali R. and Jian, Xun
关键词: compression ASIC, memory subsystem, address translation, hardware memory compression, memory

Abstract

The demand for memory is ever increasing. Many prior works have explored hardware memory compression to increase effective memory capacity. However, prior works compress and pack/migrate data at a small, memory-block-level granularity; this introduces an additional block-level translation after the page-level virtual address translation. In general, the smaller the granularity of address translation, the higher the translation overhead. As such, this additional block-level translation exacerbates the well-known address translation problem for large and/or irregular workloads. A promising solution is to only save memory from cold (i.e., less recently accessed) pages without saving memory from hot (i.e., more recently accessed) pages (e.g., keep the hot pages uncompressed); this avoids block-level translation overhead for hot pages. However, it still faces two challenges. First, after a compressed cold page becomes hot again, migrating the page to a full 4KB DRAM location still adds another level (albeit page-level, instead of block-level) of translation on top of existing virtual address translation. Second, compressing only cold data requires compressing it very aggressively to achieve high overall memory savings; decompressing very aggressively compressed data is very slow (e.g., > 800ns assuming the latest Deflate ASIC in industry). This paper presents Translation-optimized Memory Compression for Capacity (TMCC) to tackle the two challenges above. To address the first challenge, we propose compressing page table blocks in hardware to opportunistically embed compression translations into them in a software-transparent manner to effectively prefetch compression translations during a page walk, instead of serially fetching them after the walk. To address the second challenge, we perform a large design space exploration across many hardware configurations and diverse workloads to derive and implement in HDL an ASIC Deflate that is specialized for memory; for memory pages, it is 4X as fast as the state-of-the-art ASIC Deflate, with little to no sacrifice in compression ratio. Our evaluations show that for large and/or irregular workloads, TMCC can either improve performance by 14% without sacrificing effective capacity or provide 2.2x the effective capacity without sacrificing performance compared to a state-of-the-art hardware memory compression for capacity.

DOI: 10.1109/MICRO56248.2022.00073


Merging Similar Patterns for Hardware Prefetching

作者: Jiang, Shizhi and Yang, Qiusong and Ci, Yiwei
关键词: hardware data prefetching, cache

Abstract

One critical challenge of designing an efficient prefetcher is to strike a balance between performance and hardware overhead. Some state-of-the-art prefetchers achieve very high performance at the price of a very large storage requirement, which makes them not amenable to hardware implementations in commercial processors. We argue that merging memory access patterns can be a feasible solution to reducing storage overhead while obtaining high performance, although no existing prefetchers, to the best of our knowledge, have succeeded in doing so because of the difficulty of designing an effective merging strategy. After analysis of a large number of patterns, we find that the address offset of the first access in a certain memory region is a good feature for clustering highly similar patterns. Based on this observation, we propose a novel hardware data prefetcher, named Pattern Merging Prefetcher (PMP), which achieves high performance at a low cost. The storage requirement for storing patterns is largely reduced and, at the same time, the prefetch accuracy is guaranteed by merging similar patterns in the training process. In the prefetching process, a strategy based on access frequencies of prefetch candidates is applied to accurately extract prefetch targets from merged patterns. According to the experimental results on a wide range of various workloads, PMP outperforms the enhanced Bingo by 2.6% with 30×…
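A minimal way to picture the merging strategy: treat a region's access footprint as a bit vector, bucket footprints by the offset of the region's trigger access, and accumulate per-offset frequency counters that prefetch decisions are drawn from. The sketch below follows that picture; the region size, threshold, and data layout are assumptions for illustration, not PMP's actual tables.

```python
# Sketch of merging spatial footprints by trigger offset. A pattern is a bit
# vector over a memory region; patterns that share the same first-access offset
# are merged into frequency counters, and offsets whose merged frequency is high
# enough become prefetch candidates. Region size and threshold are assumptions.
from collections import defaultdict
import numpy as np

REGION_BLOCKS = 32   # cache blocks per spatial region (assumed)

class MergedPatternTable:
    def __init__(self, threshold: float = 0.6):
        self.counts = defaultdict(lambda: np.zeros(REGION_BLOCKS, dtype=np.int32))
        self.trainings = defaultdict(int)
        self.threshold = threshold

    def train(self, trigger_offset: int, footprint: np.ndarray) -> None:
        """Merge a finished region's 0/1 footprint into the bucket of its trigger offset."""
        self.counts[trigger_offset] += footprint.astype(np.int32)
        self.trainings[trigger_offset] += 1

    def predict(self, trigger_offset: int) -> list:
        """Offsets whose access frequency in the merged pattern clears the threshold."""
        n = self.trainings[trigger_offset]
        if n == 0:
            return []
        freq = self.counts[trigger_offset] / n
        return [off for off in range(REGION_BLOCKS) if freq[off] >= self.threshold]
```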

DOI: 10.1109/MICRO56248.2022.00071


AutoComm: A Framework for Enabling Efficient Communication in Distributed Quantum Programs

作者: Wu, Anbang and Zhang, Hezi and Li, Gushu and Shabani, Alireza and Xie, Yuan and Ding, Yufei
关键词: quantum compiler, quantum computing

Abstract

Distributed quantum computing (DQC) is a promising approach to extending the computational power of near-term quantum hardware. However, the non-local quantum communication between quantum nodes is much more expensive and error-prone than the local quantum operation within each quantum device. Previous DQC compilers focus on optimizing the implementation of each non-local gate and adopt similar compilation designs to single-node quantum compilers. The communication patterns in distributed quantum programs remain unexplored, leading to a far-from-optimal communication cost. In this paper, we identify burst communication, a specific qubit-node communication pattern that widely exists in various distributed quantum programs and can be leveraged to guide communication overhead optimization. We then propose AutoComm, an automatic compiler framework to extract burst communication patterns from input programs and then optimize the communication steps of burst communication discovered. Compared to state-of-the-art DQC compilers, experimental results show that our proposed AutoComm can reduce the communication resource consumption and the program latency by 72.9% and 69.2% on average, respectively.

DOI: 10.1109/MICRO56248.2022.00074


Let Each Quantum Bit Choose Its Basis Gates

作者: Lin, Sophia Fuhui and Sussman, Sara and Duckering, Casey and Mundada, Pranav S. and Baker, Jonathan M. and Kumar, Rohan S. and Houck, Andrew A. and Chong, Frederic T.
关键词: two-qubit gates, quantum computing

Abstract

Near-term quantum computers are primarily limited by errors in quantum operations (or gates) between two quantum bits (or qubits). A physical machine typically provides a set of basis gates that include primitive 2-qubit (2Q) and 1-qubit (1Q) gates that can be implemented in a given technology. 2Q entangling gates, coupled with some 1Q gates, allow for universal quantum computation. In superconducting technologies, the current state of the art is to implement the same 2Q gate between every pair of qubits (typically an XX- or XY-type gate). This strict hardware uniformity requirement for 2Q gates in a large quantum computer has made scaling up a time- and resource-intensive endeavor in the lab. We propose a radical idea: allow the 2Q basis gate(s) to differ between every pair of qubits, selecting the best entangling gates that can be calibrated between given pairs of qubits. This work aims to give quantum scientists the ability to run meaningful algorithms with qubit systems that are not perfectly uniform. Scientists will also be able to use a much broader variety of novel 2Q gates for quantum computing. We develop a theoretical framework for identifying good 2Q basis gates on “nonstandard” Cartan trajectories that deviate from “standard” trajectories like XX. We then introduce practical methods for calibration and compilation with nonstandard 2Q gates, and discuss possible ways to improve the compilation. To demonstrate our methods in a case study, we simulated both standard XY-type trajectories and faster, nonstandard trajectories using an entangling gate architecture with far-detuned transmon qubits. We identify efficient 2Q basis gates on these nonstandard trajectories and use them to compile a number of standard benchmark circuits such as QFT and QAOA. Our results demonstrate an 8x improvement over the baseline 2Q gates with respect to speed and coherence-limited gate fidelity.

DOI: 10.1109/MICRO56248.2022.00075


COMPAQT: Compressed Waveform Memory Architecture for Scalable Qubit Control

作者: Maurya, Satvik and Tannu, Swamit
关键词: quantum control hardware, quantum computer architecture, qubit control

Abstract

On superconducting architectures, the state of a qubit is manipulated by using microwave pulses. Typically, the pulses are stored in the waveform memory and then streamed to the Digital-to-Analog Converter (DAC) to synthesize the gate operations. The waveform memory requires tens of Gigabytes per second of bandwidth to manipulate the qubit. Unfortunately, the required memory bandwidth grows linearly with the number of qubits. As a result, the bandwidth demand limits the number of qubits we can control concurrently. For example, on current RFSoC-based qubit control platforms, we can control less than 40 qubits. In addition, the high memory bandwidth for cryogenic ASIC controllers designed to operate within a tight power budget translates to significant power dissipation, thus limiting scalability. In this paper, we show that waveforms are highly compressible, and we leverage this property to enable a scalable and efficient microarchitecture, COMPAQT (Compressed Waveform Memory Architecture for Qubit Control). Waveform memory is read-only, and COMPAQT leverages this to compress waveforms at compile time and store the compressed waveform in the on-chip memory. To generate the pulse, COMPAQT decompresses the waveform at runtime and then streams the decompressed waveform to the DACs. Using the hardware-efficient discrete cosine transform, COMPAQT can achieve, on average, a 5x increase in the waveform memory bandwidth, which can enable a 5x increase in the total number of qubits controlled in an RFSoC setup. Moreover, the COMPAQT microarchitecture for cryogenic CMOS ASIC controllers can result in a 2.5x power reduction over an uncompressed baseline. We also propose an adaptive compression scheme to further reduce the power consumed by the decompression engine, enabling up to 4x power reduction. Qubits are sensitive, and even a slight change in the control waveform can increase the gate error rate. We evaluate the impact of COMPAQT on gate and circuit fidelity using IBM quantum computers. We see less than 0.1% degradation in fidelity when using COMPAQT.
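The compressibility claim is easy to play with offline. The sketch below applies the same idea to a toy pulse: take a DCT, keep only the largest coefficients, and reconstruct. The pulse shape, keep ratio, and error metric are illustrative choices, not the paper's codec.

```python
# Toy DCT-based waveform compression: keep only the largest-magnitude DCT
# coefficients of a pulse and reconstruct from them. Pulse shape, keep ratio,
# and the error metric are illustrative, not COMPAQT's actual encoder.
import numpy as np
from scipy.fft import dct, idct

t = np.linspace(0.0, 1.0, 256)
pulse = np.exp(-((t - 0.5) ** 2) / 0.02) * np.cos(2 * np.pi * 40 * t)  # envelope * carrier

coeffs = dct(pulse, norm="ortho")
keep = np.argsort(np.abs(coeffs))[-len(coeffs) // 5:]   # retain the top 20% of coefficients
sparse = np.zeros_like(coeffs)
sparse[keep] = coeffs[keep]

reconstructed = idct(sparse, norm="ortho")
print("max reconstruction error:", np.max(np.abs(reconstructed - pulse)))
```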

DOI: 10.1109/MICRO56248.2022.00076


Qubit Mapping and Routing via MaxSAT

作者: Molavi, Abtin and Xu, Amanda and Diges, Martin and Pick, Lauren and Tannu, Swamit and Albarghouthi, Aws
关键词: qubit mapping, quantum computing

Abstract

Near-term quantum computers will operate in a noisy environment, without error correction. A critical problem for near-term quantum computing is laying out a logical circuit onto a physical device with limited connectivity between qubits. This is known as the qubit mapping and routing (QMR) problem, an intractable combinatorial problem. It is important to solve QMR as optimally as possible to reduce the amount of added noise, which may render a quantum computation useless. In this paper, we present a novel approach for optimally solving the QMR problem via a reduction to maximum satisfiability (MAXSAT). Additionally, we present two novel relaxation ideas that shrink the size of the MAXSAT constraints by exploiting the structure of a quantum circuit. Our thorough empirical evaluation demonstrates (1) the scalability of our approach compared to state-of-the-art optimal QMR techniques (solves more than 3x benchmarks with 40x speedup), (2) the significant cost reduction compared to state-of-the-art heuristic approaches (an average of ~5x swap reduction), and (3) the power of our proposed constraint relaxations.

DOI: 10.1109/MICRO56248.2022.00077


Scaling Superconducting Quantum Computers with Chiplet Architectures

作者: Smith, Kaitlin N. and Ravi, Gokul Subramanian and Baker, Jonathan M. and Chong, Frederic T.
关键词: superconducting quantum computers, quantum architecture, quantum computing

Abstract

Fixed-frequency transmon quantum computers (QCs) have advanced in coherence times, addressability, and gate fidelities. Unfortunately, these devices are restricted by the number of on-chip qubits, capping processing power and slowing progress toward fault-tolerance. Although emerging transmon devices feature over 100 qubits, building QCs large enough for meaningful demonstrations of quantum advantage requires overcoming many design challenges. For example, today’s transmon qubits suffer from significant variation due to limited precision in fabrication. As a result, barring significant improvements in current fabrication techniques, scaling QCs by building ever larger individual chips with more qubits is hampered by device variation. Severe device variation that degrades QC performance is referred to as a defect. Here, we focus on a specific defect known as a frequency collision. When transmon frequencies collide, their difference falls within a range that limits two-qubit gate fidelity. Frequency collisions occur with greater probability on larger QCs, causing collision-free yields to decline as the number of on-chip qubits increases. As a solution, we propose exploiting the higher yields associated with smaller QCs by integrating quantum chiplets within quantum multi-chip modules (MCMs). Yield, gate performance, and application-based analysis show the feasibility of QC scaling through modularity. Our results demonstrate that chiplet architectures, relative to monolithic designs, benefit from average yield improvements ranging from 9.6 – 92.6×…
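The yield argument is essentially an exponential in disguise: if each potentially colliding qubit pair is collision-free with some probability, a monolithic die must win that bet for every pair at once, while small chiplets can be tested and binned before assembly. The toy calculation below shows the effect; the per-pair probability and pair counts are made-up numbers, not figures from the paper.

```python
# Back-of-the-envelope collision-free yield model. The per-pair probability and
# the pair counts are made-up illustrative numbers, not figures from the paper.
def collision_free_yield(n_pairs: int, p_pair_ok: float) -> float:
    """Probability that no coupled qubit pair on a die has a frequency collision,
    assuming independent pairs."""
    return p_pair_ok ** n_pairs

p = 0.995
print(collision_free_yield(200, p))   # monolithic die with 200 coupled pairs: ~0.37
print(collision_free_yield(50, p))    # one of four chiplets with 50 pairs each: ~0.78
# Chiplets can be screened before assembly, so an MCM is built only from
# collision-free dies and tracks the per-chiplet yield, not the monolithic one.
```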

DOI: 10.1109/MICRO56248.2022.00078


Q3DE: A Fault-Tolerant Quantum Computer Architecture for Multi-Bit Burst Errors by Cosmic Rays

作者: Suzuki, Yasunari and Sugiyama, Takanori and Arai, Tomochika and Liao, Wang and Inoue, Koji and Tanimoto, Teruo
关键词: fault-tolerant quantum computing, quantum error correction, quantum computing

Abstract

Demonstrating small error rates by integrating quantum error correction (QEC) into an architecture of quantum computing is the next milestone towards scalable fault-tolerant quantum computing (FTQC). Encoding logical qubits with superconducting qubits and surface codes is considered a promising candidate for FTQC architectures. In this paper, we propose an FTQC architecture, which we call Q3DE, that enhances the tolerance to multi-bit burst errors (MBBEs) by cosmic rays with moderate changes and overhead. There are three core components in Q3DE: in-situ anomaly DEtection, dynamic code DEformation, and optimized error DEcoding. In this architecture, MBBEs are detected only from syndrome values for error correction. The effect of MBBEs is immediately mitigated by dynamically increasing the encoding level of logical qubits and re-estimating probable recovery operation with the rollback of the decoding process. We investigate the performance and overhead of the Q3DE architecture with quantum-error simulators and demonstrate that Q3DE effectively reduces the period of MBBEs by 1000 times and halves the size of their region. Therefore, Q3DE significantly relaxes the requirement of qubit density and qubit chip size to realize FTQC. Our scheme is versatile for mitigating MBBEs, i.e., temporal variations of error properties, on a wide range of physical devices and FTQC architectures since it relies only on the standard features of topological stabilizer codes.

DOI: 10.1109/MICRO56248.2022.00079


RemembERR: Leveraging Microprocessor Errata for Design Testing and Validation

作者: Solt, Flavien and Jattke, Patrick and Razavi, Kaveh
关键词: No keywords

Abstract

Microprocessors are constantly increasing in complexity, but to remain competitive, their design and testing cycles must be kept as short as possible. This trend inevitably leads to design errors that eventually make their way into commercial products. Major microprocessor vendors such as Intel and AMD regularly publish and update errata documents describing these errata after their microprocessors are launched. The abundance of errata suggests the presence of significant gaps in the design testing of modern microprocessors. We argue that while a specific erratum provides information about only a single issue, the aggregated information from the body of existing errata can shed light on existing design testing gaps. Unfortunately, errata documents are not systematically structured. We formalize that each erratum describes, in human language, a set of triggers that, when applied in specific contexts, cause certain observations that pertain to a particular bug. We present RemembERR, the first large-scale database of microprocessor errata collected among all Intel Core and AMD microprocessors since 2008, comprising 2,563 individual errata. Each RemembERR entry is annotated with triggers, contexts, and observations, extracted from the original erratum. To generalize these properties, we classify them on multiple levels of abstraction that describe the underlying causes and effects. We then leverage RemembERR to study gaps in design testing by making the key observation that triggers are conjunctive, while observations are disjunctive: to detect a bug, it is necessary to apply all triggers and sufficient to observe only a single deviation. Based on this insight, one can rely on partial information about triggers across the entire corpus to draw consistent conclusions about the best design testing and validation strategies to cover the existing gaps. As a concrete example, our study shows that we need testing tools that exert power level transitions under MSR-determined configurations while operating custom features.
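The conjunctive-trigger / disjunctive-observation structure maps naturally onto a tiny data model, sketched below. The field names and the example entry are invented for illustration; they are not RemembERR's actual schema.

```python
# Minimal data model for the conjunctive-trigger / disjunctive-observation view
# of an erratum. Field names and the example entry are invented, not RemembERR's schema.
from dataclasses import dataclass, field

@dataclass
class Erratum:
    triggers: set          # ALL triggers must be applied to expose the bug
    observations: set      # ANY single observation suffices to detect it
    contexts: set = field(default_factory=set)

    def exposed_by(self, applied: set) -> bool:
        return self.triggers <= applied

    def detected_by(self, observed: set) -> bool:
        return bool(self.observations & observed)

e = Erratum(triggers={"power-level transition", "custom MSR configuration"},
            observations={"machine check", "wrong register value"})
assert e.exposed_by({"power-level transition", "custom MSR configuration", "other activity"})
assert not e.exposed_by({"power-level transition"})          # triggers are conjunctive
assert e.detected_by({"wrong register value"})               # observations are disjunctive
```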

DOI: 10.1109/MICRO56248.2022.00081


Datamime: Generating Representative Benchmarks by Automatically Synthesizing Datasets

作者: Lee, Hyun Ryong and Sanchez, Daniel
关键词: workload generation, benchmarking

Abstract

Benchmarks that closely match the behavior of production workloads are crucial to design and provision computer systems. However, current approaches fall short: First, open-source benchmarks use public datasets that cause different behavior from production workloads. Second, black-box workload cloning techniques generate synthetic code that imitates the target workload, but the resulting program fails to capture most workload characteristics, such as microarchitectural bottlenecks or time-varying behavior. Generating code that mimics a complex application is an extremely hard problem. Instead, we propose a different and easier approach to benchmark synthesis. Our key insight is that, for many production workloads, the program is publicly available or there is a reasonably similar open-source program. In this case, generating the right dataset is sufficient to produce an accurate benchmark. Based on this observation, we present Datamime, a profile-guided approach to generate representative benchmarks for production workloads. Datamime uses the performance profiles of a target workload to generate a dataset that, when used by a benchmark program, behaves very similarly to the target workload in terms of its microarchitectural characteristics. We evaluate Datamime on several datacenter workloads. Datamime generates synthetic benchmarks that closely match the microarchitectural features of these workloads, with a mean absolute percentage error of 3.2% on IPC. Microarchitectural behavior stays close across processor types. Finally, time-varying behaviors are also replicated, making these benchmarks useful to e.g. characterize and optimize tail latency.

DOI: 10.1109/MICRO56248.2022.00082


An Architecture Interface and Offload Model for Low-Overhead, Near-Data, Distributed Accelerators

作者: Baskaran, Saambhavi and Kandemir, Mahmut Taylan and Sampson, Jack
关键词: heterogeneous architecture interface, energy efficiency, near-data offload, distributed accelerator

Abstract

The performance and energy costs of coordinating and performing data movement have led to proposals adding compute units and/or specialized access units to the memory hierarchy. However, current on-chip offload models are restricted to fixed compute and access pattern types, which limits software-driven optimizations and the applicability of such an offload interface to heterogeneous accelerator resources. This paper presents a computation offload interface for multi-core systems augmented with distributed on-chip accelerators. With energy-efficiency as the primary goal, we define mechanisms to identify offload partitioning, create a low-overhead execution model to sequence these fine-grained operations, and evaluate a set of workloads to identify the complexity needed to achieve distributed near-data execution. We demonstrate that our model and interface, combining features of dataflow in parallel with near-data processing engines, can be profitably applied to memory hierarchies augmented with either specialized compute substrates or lightweight near-memory cores. We differentiate the benefits stemming from each of elevating data access semantics, near-data computation, inter-accelerator coordination, and compute/access logic specialization. Experimental results indicate a geometric mean (energy efficiency improvement; speedup; data movement reduction) of (3.3; 1.59; 2.4)×…

DOI: 10.1109/MICRO56248.2022.00083


Towards Developing High Performance RISC-V Processors Using Agile Methodology

作者: Xu, Yinan and Yu, Zihao and Tang, Dan and Chen, Guokai and Chen, Lu and Gou, Lingrui and Jin, Yue and Li, Qianruo and Li, Xin and Li, Zuojun and Lin, Jiawei and Liu, Tong and Liu, Zhigang and Tan, Jiazhan and Wang, Huaqiang and Wang, Huizhe and Wang, Kaifan and Zhang, Chuanqi and Zhang, Fawang and Zhang, Linjuan and Zhang, Zifei and Zhao, Yangyang and Zhou, Yaoyang and Zhou, Yike and Zou, Jiangrui and Cai, Ye and Huan, Dandan and Li, Zusong and Zhao, Jiye and Chen, Zihao and He, Wei and Quan, Qiyuan and Liu, Xingwu and Wang, Sa and Shi, Kan and Sun, Ninghui and Bao, Yungang
关键词: microarchitecture, open-source hardware, agile development

Abstract

While research has shown that the agile chip design methodology is promising to sustain the scaling of computing performance in a more efficient way, it is still of limited usage in actual applications due to two major obstacles: 1) Lack of a tool chain and development framework supporting agile chip design, especially for large-scale modern processors. 2) The conventional verification methods are less agile and become a major bottleneck of the entire process. To tackle both issues, we propose MinJie, an open-source platform supporting an agile processor development flow. MinJie integrates a broad set of tools for logic design, functional verification, performance modelling, pre-silicon validation and debugging for better development efficiency of state-of-the-art processor designs. We demonstrate the usage and effectiveness of MinJie by building two generations of an open-source superscalar out-of-order RISC-V processor code-named XiangShan using agile methodologies. We quantify the performance of XiangShan using SPEC CPU2006 benchmarks and demonstrate that XiangShan achieves industry-competitive performance.

DOI: 10.1109/MICRO56248.2022.00080


DiVa: An Accelerator for Differentially Private Machine Learning

作者: Park, Beomsik and Hwang, Ranggi and Yoon, Dongho and Choi, Yoonhyuk and Rhu, Minsoo
关键词: deep learning, machine learning, accelerator, differential privacy

Abstract

The widespread deployment of machine learning (ML) is raising serious concerns on protecting the privacy of users who contributed to the collection of training data. Differential privacy (DP) is rapidly gaining momentum in the industry as a practical standard for privacy protection. Despite DP’s importance, however, little has been explored within the computer systems community regarding the implication of this emerging ML algorithm on system designs. In this work, we conduct a detailed workload characterization on a state-of-the-art differentially private ML training algorithm named DP-SGD. We uncover several unique properties of DP-SGD (e.g., its high memory capacity and computation requirements vs. non-private ML), root-causing its key bottlenecks. Based on our analysis, we propose an accelerator for differentially private ML named DiVa, which provides a significant improvement in compute utilization, leading to 2.6×…
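For context, the DP-SGD step being characterized is the standard per-example clipping plus noisy aggregation of Abadi et al.; a compact numpy version is sketched below (the gradients here are a placeholder array and the hyperparameters are arbitrary). The need to materialize and clip each example's gradient individually is exactly where the extra memory and compute come from.

```python
# Standard DP-SGD update (per-example clipping + Gaussian noise), written with
# numpy for illustration. The per-example gradient tensor is a placeholder and
# the hyperparameters are arbitrary; the per-example materialization is the
# source of DP-SGD's extra memory and compute relative to plain SGD.
import numpy as np

def dpsgd_update(per_example_grads: np.ndarray,   # shape: (batch, n_params)
                 params: np.ndarray,              # shape: (n_params,)
                 clip_norm: float, noise_multiplier: float, lr: float,
                 rng: np.random.Generator) -> np.ndarray:
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale                       # clip each example separately
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    grad = (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]
    return params - lr * grad

rng = np.random.default_rng(3)
params = np.zeros(1000)
grads = rng.normal(size=(256, 1000))              # batch of 256 per-example gradients
params = dpsgd_update(grads, params, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=rng)
```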

DOI: 10.1109/MICRO56248.2022.00084


Evax: Towards a Practical, Pro-Active &amp; Adaptive Architecture for High Performance &amp; Security

作者: Ajorpaz, Samira Mirbagher and Moghimi, Daniel and Collins, Jeffrey Neal and Pokam, Gilles and Abu-Ghazaleh, Nael and Tullsen, Dean
关键词: zero day attack defense, ML interpretability, linearized neural network, microarchitectural attack detection, automated hardware performance counter engineering, adversarial machine learning attacks, automatic attack sample generation, generative adversarial networks, side channel, hardware security

Abstract

This paper provides an end-to-end solution to defend against known microarchitectural attacks such as speculative execution attacks, fault-injection attacks, covert and side channel attacks, and unknown or evasive versions of these attacks. Current defenses are attack-specific and can have unacceptably high performance overhead. We propose an approach that reduces the overhead of state-of-the-art defenses by over 95%, by applying defenses only when attacks are detected. Many currently proposed mitigations are not practical for deployment; for example, InvisiSpec has 27% overhead and Fencing has 74% overhead while protecting against only Spectre attacks. Other mitigations carry similar performance penalties. We reduce the overhead for InvisiSpec to 1.26% and for Fencing to 3.45%, offering performance and security not only for Spectre attacks but also for other known transient attacks, including the dangerous class of LVI and Rowhammer attacks, as well as covering a large set of future evasive and zero-day attacks. Critical to our approach is an accurate detector that is not fooled by evasive attacks and that can generalize to novel zero-day attacks. We use a novel generative framework, Evasion Vaccination (Evax), for training ML models and engineering new security-centric performance counters. Evax significantly increases sensitivity to detect and classify attacks in time for mitigation to be deployed with low false positives (4 FPs in every 1M instructions in our experiments). Such performance enables efficient and timely mitigations, allowing the processor to automatically switch between performance and security as needed.

DOI: 10.1109/MICRO56248.2022.00085


ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse

作者: Kim, Jongmin and Lee, Gwangho and Kim, Sangpyo and Sohn, Gina and Rhu, Minsoo and Kim, John and Ahn, Jung Ho
关键词: algorithm-architecture co-design, domain-specific architecture, fully homomorphic encryption

Abstract

Homomorphic Encryption (HE) is one of the most promising post-quantum cryptographic schemes that enable privacy-preserving computation on servers. However, noise accumulates as we perform operations on HE-encrypted data, restricting the number of possible operations. Fully HE (FHE) removes this restriction by introducing the bootstrapping operation, which refreshes the data; however, FHE schemes are highly memory-bound. Bootstrapping, in particular, requires loading GBs of evaluation keys and plaintexts from off-chip memory, which makes FHE acceleration fundamentally bottlenecked by the off-chip memory bandwidth. In this paper, we propose ARK, an Accelerator for FHE with Runtime data generation and inter-operation Key reuse. ARK enables practical FHE workloads with a novel algorithm-architecture co-design to accelerate bootstrapping. We first eliminate the off-chip memory bandwidth bottleneck through runtime data generation and inter-operation key reuse. This approach enables ARK to fully exploit on-chip memory by substantially reducing the size of the working set. On top of such algorithmic enhancements, we build the ARK microarchitecture, which minimizes on-chip data movement through an efficient, alternating data distribution policy based on the data access patterns and a streamlined dataflow organization of the tailored functional units, including base conversion, number-theoretic transform, and automorphism units. Overall, our co-design effectively handles the heavy computation and data movement overheads of FHE, drastically reducing the cost of HE operations, including bootstrapping.

DOI: 10.1109/MICRO56248.2022.00086


Horus: Persistent Security for Extended Persistence-Domain Memory Systems

作者: Han, Xijing and Tuck, James and Awad, Amro
关键词: secure memory, eADR, non-volatile memory

Abstract

Persistent memory presents a great opportunity for crash-consistent computing in large-scale computing systems. The ability to recover data upon power outage or crash events can significantly improve the availability of large-scale systems, while improving the performance of persistent data applications (e.g., database applications). However, persistent memory suffers from high write latency and requires a specific programming model (e.g., Intel’s PMDK) to guarantee crash consistency, which results in long latency to persist data. To mitigate these problems, recent standards advocate for sufficient back-up power that can flush the whole cache hierarchy to the persistent memory upon detection of an outage, i.e., extending the persistence domain to include the cache hierarchy. In a secure NVM with an extended persistence domain (EPD), in addition to flushing the cache hierarchy, extra actions need to be taken to protect the flushed cache data. These extra secure operations could place a significant burden on energy costs and battery size. We demonstrate that naive implementations could lead to significantly expanding the required power holdup budget (e.g., 10.3x more operations than an EPD system without secure memory support). The significant overhead is caused by memory accesses of secure metadata. In this paper, we present Horus, a novel EPD-aware secure memory implementation. Horus reduces the overhead during the draining period of an EPD system by reducing memory accesses of secure metadata. Experimental results show that Horus reduces the draining time by 5x compared with the naive baseline design.

DOI: 10.1109/MICRO56248.2022.00087


Mint: An Accelerator for Mining Temporal Motifs

作者: Talati, Nishil and Ye, Haojie and Vedula, Sanketh and Chen, Kuan-Yu and Chen, Yuhan and Liu, Daniel and Yuan, Yichao and Blaauw, David and Bronstein, Alex and Mudge, Trevor and Dreslinski, Ronald
关键词: hardware accelerator, programming model, temporal motif mining

Abstract

A variety of complex systems, including social and communication networks, financial markets, biology, and neuroscience are modeled using temporal graphs that contain a set of nodes and directed timestamped edges. Temporal motifs in temporal graphs are generalized from subgraph patterns in static graphs in that they also account for edge ordering and time duration, in addition to the graph structure. Mining temporal motifs is a fundamental problem used in several application domains. However, existing software frameworks offer sub-optimal performance due to high algorithmic complexity and irregular memory accesses of temporal motif mining. This paper presents Mint—a novel accelerator architecture and a programming model for mining temporal motifs efficiently. We first divide this workload into three fundamental tasks: search, book-keeping, and backtracking. Based on this, we propose a task-centric programming model that enables decoupled, asynchronous execution. This model unlocks massive opportunities for parallelism, and allows storing task context information on-chip. To best utilize the proposed programming model, we design a domain-specific hardware accelerator using its data path and memory subsystem design to cater to the unique workload characteristics of temporal motif mining. To further improve performance, we propose a novel optimization called search index memoization that significantly reduces memory traffic. We comprehensively compare the performance of Mint with state-of-the-art temporal motif mining software frameworks (both approximate and exact) running on both CPU and GPU, and show 9×…

DOI: 10.1109/MICRO56248.2022.00089


DPU-v2: Energy-Efficient Execution of Irregular Directed Acyclic Graphs

作者: Shah, Nimish and Meert, Wannes and Verhelst, Marian
关键词: graphs, irregular computation graphs, parallel processor, hardware-software codesign, DAG processing unit, probabilistic circuits, sparse matrix triangular solve, spatial datapath, interconnection network, bank conflicts, design space exploration

Abstract

A growing number of applications like probabilistic machine learning, sparse linear algebra, robotic navigation, etc., exhibit irregular data flow computation that can be modeled with directed acyclic graphs (DAGs). The irregularity arises from the seemingly random connections of nodes, which makes the DAG structure unsuitable for vectorization on CPU or GPU. Moreover, the nodes usually represent a small number of arithmetic operations that cannot amortize the overhead of launching tasks/kernels for each node, further posing challenges for parallel execution. To enable energy-efficient execution, this work proposes DAG processing unit (DPU) version 2, a specialized processor architecture optimized for irregular DAGs with static connectivity. It consists of a tree-structured datapath for efficient data reuse, a customized banked register file, and interconnects tuned to support irregular register accesses. DPU-v2 is utilized effectively through a targeted compiler that systematically maps operations to the datapath, minimizes register bank conflicts, and avoids pipeline hazards. Finally, a design space exploration identifies the optimal architecture configuration that minimizes the energy-delay product. This hardware-software co-optimization approach results in a speedup of 1.4×…

DOI: 10.1109/MICRO56248.2022.00090


XPGraph: XPline-Friendly Persistent Memory Graph Stores for Large-Scale Evolving Graphs

作者: Wang, Rui and He, Shuibing and Zong, Weixu and Li, Yongkun and Xu, Yinlong
关键词: graph processing, persistent/non-volatile memory, storage systems

Abstract

Traditional in-memory graph storage systems have limited scalability due to the limited capacity and volatility of DRAM. Emerging persistent memory (PMEM), with large capacity and non-volatility, provides an opportunity to realize scalable and high-performance graph stores. However, directly moving existing DRAM-based graph storage systems to PMEM would cause serious PMEM access inefficiency issues, including high read and write amplification in PMEM and costly remote PMEM accesses across NUMA nodes, thus leading to a performance bottleneck. In this paper, we propose XPGraph, a PMEM-based graph storage system for managing large-scale evolving graphs, by developing an XPLine-friendly graph access model with vertex-centric graph buffering, hierarchical vertex buffer managing, and NUMA-friendly graph accessing. Experimental results show that XPGraph achieves 3.01×…

DOI: 10.1109/MICRO56248.2022.00091


A Data-Centric Accelerator for High-Performance Hypergraph Processing

作者: Wang, Qinggang and Zheng, Long and Hu, Ao and Huang, Yu and Yao, Pengcheng and Gui, Chuangyi and Liao, Xiaofei and Jin, Hai and Xue, Jingling
关键词: hypergraph, data-centric, accelerator

Abstract

Hypergraph processing has emerged as a powerful approach for analyzing complex multilateral relationships among multiple entities. Past research on building hypergraph systems suggests that changing the scheduling order of bipartite edge tasks can improve the overlap-induced data locality in hypergraph processing. However, due to the complex intertwined connections between vertices and hyperedges, it is almost impossible to find a locality-optimal scheduling order. Thus, these task-centric hypergraph systems often suffer from substantial off-chip communications. In this paper, we first propose a novel data-centric Load-Trigger-Reduce (LTR) execution model to fully exploit the locality in hypergraph processing. Unlike a task-centric model that loads the required data along with a task, our LTR model invokes tasks as per the data used. Specifically, once the hypergraph data is loaded into the on-chip memory, all of its relevant computation tasks will be triggered simultaneously to output intermediate results, which are finally reduced to update the final results. Our LTR model enables all hypergraph data to be accessed once in each iteration. To fully exploit the LTR performance potential, we further architect an LTR-driven hypergraph accelerator, XuLin, which features an adaptive data loading mechanism to minimize the loading cost via chunk merging at runtime. XuLin is also equipped with a priority-based differential data reduction scheme to reduce the impact of conflicting updates on performance. We have implemented XuLin both on a Xilinx Alveo U250 FPGA card and using a cycle-accurate simulator. The results show that XuLin outperforms the state-of-the-art hypergraph processing solutions Hygra and ChGraph by 20.47×…

DOI: 10.1109/MICRO56248.2022.00088


ReGraph: Scaling Graph Processing on HBM-Enabled FPGAs with Heterogeneous Pipelines

作者: Chen, Xinyu and Chen, Yao and Cheng, Feng and Tan, Hongshi and He, Bingsheng and Wong, Weng-fai
关键词: graph processing, FPGA, HBM, heterogeneity

Abstract

The use of FPGAs for efficient graph processing has attracted significant interest. Recent memory subsystem upgrades, including the introduction of HBM in FPGAs, promise to further alleviate memory bottlenecks. However, modern multi-channel HBM requires many more processing pipelines to fully utilize its bandwidth potential. Due to insufficient resource efficiency, existing designs do not scale well, resulting in underutilization of the HBM facilities even when all other resources are fully consumed. In this paper, we propose ReGraph, which customizes heterogeneous pipelines for diverse workloads in graph processing, achieving better resource efficiency, instantiating more pipelines and improving performance. We first identify that workload diversity exists in processing graph partitions and classify partitions into two types: dense partitions established with good locality and sparse partitions with poor locality. Subsequently, we design two types of pipelines: Little pipelines with a burst memory access technique to process dense partitions and Big pipelines tolerating random memory access latency to handle sparse partitions. Unlike existing monolithic pipeline designs, our heterogeneous pipelines are tailored for more specific workload characteristics and hence more lightweight, allowing the architecture to scale up more effectively with limited resources. We also present a graph-aware task scheduling method that schedules partitions to the right pipeline types, generates the most efficient pipeline combination and balances workloads. ReGraph surpasses state-of-the-art FPGA accelerators by 1.6×…

DOI: 10.1109/MICRO56248.2022.00092


3D-FPIM: An Extreme Energy-Efficient DNN Acceleration System Using 3D NAND Flash-Based In-Situ PIM Unit

作者: Lee, Hunjun and Kim, Minseop and Min, Dongmoon and Kim, Joonsung and Back, Jongwon and Yoo, Honam and Lee, Jong-Ho and Kim, Jangwoo
关键词: DNN, 3D NAND flash, mixed-signal accelerator

Abstract

The crossbar structure of the nonvolatile memory enables highly parallel and energy-efficient analog matrix-vector-multiply (MVM) operations. To exploit its efficiency, existing works design a mixed-signal deep neural network (DNN) accelerator, which offloads low-precision MVM operations to the memory array. However, they fail to accurately and efficiently support the low-precision networks due to their naive ADC designs. In addition, they cannot be applied to the latest technology nodes due to their premature RRAM-based memory array. In this work, we present 3D-FPIM, an energy-efficient and robust mixed-signal DNN acceleration system. 3D-FPIM is a full-stack 3D NAND flash-based architecture to accurately deploy low-precision networks. We design the hardware stack by carefully architecting a specialized analog-to-digital conversion method and utilizing the three-dimensional structure to achieve high accuracy, energy efficiency, and robustness. To accurately and efficiently deploy the networks, we provide a DNN retraining framework and a customized compiler. For evaluation, we implement an industry-validated circuit-level simulator. The result shows that 3D-FPIM achieves an average of 2.09x higher performance per area and 13.18x higher energy efficiency compared to the baseline 2D RRAM-based accelerator.

DOI: 10.1109/MICRO56248.2022.00093


Sparseloop: An Analytical Approach to Sparse Tensor Accelerator Modeling

作者: Wu, Yannan Nellie and Tsai, Po-An and Parashar, Angshuman and Sze, Vivienne and Emer, Joel S.
关键词: tensor computation, hardware accelerator, analytical modeling

Abstract

In recent years, many accelerators have been proposed to efficiently process sparse tensor algebra applications (e.g., sparse neural networks). However, these proposals are single points in a large and diverse design space. The lack of systematic description and modeling support for these sparse tensor accelerators impedes hardware designers from efficient and effective design space exploration. This paper first presents a unified taxonomy to systematically describe the diverse sparse tensor accelerator design space. Based on the proposed taxonomy, it then introduces Sparseloop, the first fast, accurate, and flexible analytical modeling framework to enable early-stage evaluation and exploration of sparse tensor accelerators. Sparseloop comprehends a large set of architecture specifications, including various dataflows and sparse acceleration features (e.g., elimination of zero-based compute). Using these specifications, Sparseloop evaluates a design’s processing speed and energy efficiency while accounting for data movement and compute incurred by the employed dataflow, including the savings and overhead introduced by the sparse acceleration features using stochastic density models. Across representative accelerator designs and workloads, Sparseloop achieves over 2000×…
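A first-order feel for what a stochastic density model buys you: if nonzeros in the two operands are modeled as independently distributed, the expected number of surviving multiplications under zero-value gating is just the dense count scaled by both densities. The toy estimate below uses that assumption; the shapes and densities are arbitrary, and real models, including Sparseloop's, are considerably richer.

```python
# First-order analytical estimate of useful MACs in a sparse matrix multiply
# under zero-value gating, assuming independently distributed nonzeros.
# Shapes and densities are arbitrary; real density models are far richer.
def effective_macs(M: int, K: int, N: int, density_a: float, density_b: float) -> float:
    dense_macs = M * K * N
    return dense_macs * density_a * density_b   # expected multiplications with both operands nonzero

dense = effective_macs(512, 512, 512, 1.0, 1.0)
sparse = effective_macs(512, 512, 512, 0.3, 0.5)
print(f"estimated compute reduction: {dense / sparse:.1f}x")   # ~6.7x
```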

DOI: 10.1109/MICRO56248.2022.00096


DeepBurning-SEG: Generating DNN Accelerators of Segment-Grained Pipeline Architecture

作者: Cai, Xuyi and Wang, Ying and Ma, Xiaohan and Han, Yinhe and Zhang, Lei
关键词: No keywords

Abstract

The growing complexity and diversity of deep neural network (DNN) applications have inspired intensive research on specialized DNN accelerators and also the design automation frameworks. Previous specialized NN accelerators roughly fall into two categories of implementation: either the non-pipelined architecture that relies on a generic processing unit (PU) to sequentially execute the DNN layers in a layer-wise way, or the fully-pipelined architecture that dedicates interconnected customized PUs to the corresponding DNN layers in the model. Thus, such designs often suffer from either the resource under-utilization issue faced by non-pipelined accelerators or the resource scalability problem brought by the over-deep pipeline designs. In this work, we propose a novel class of design solution for DNN acceleration, segment-grained pipeline architecture (SPA). In the SPA accelerator, the targeted workload of DNN models will be divided into many segments and each segment will be sequentially executed on the shared interconnected PUs in a pipeline manner, so that they will benefit from both the efficiency of pipelined execution and also the flexibility of sharing PUs across different model layers. Particularly, we found that the efficiency of the implemented SPA accelerator significantly depends on the segmentation strategies of the models and the hardware resources assignment policy for PUs. Therefore, we introduce an automated design framework, AutoSeg, that includes a parameterized SPA accelerator template and a co-design engine that will generate the efficient model segmentation solution and hardware pipeline design parameters for the acceleration workload. Experimental results show that the SPA solutions generated by the AutoSeg framework achieve 1.2×…

DOI: 10.1109/MICRO56248.2022.00094


ANT: Exploiting Adaptive Numerical Data Type for Low-Bit Deep Neural Network Quantization

作者: Guo, Cong and Zhang, Chen and Leng, Jingwen and Liu, Zihan and Yang, Fan and Liu, Yunxin and Guo, Minyi and Zhu, Yuhao
关键词: No keywords

Abstract

Quantization is a technique to reduce the computation and memory cost of DNN models, which are getting increasingly large. Existing quantization solutions use fixed-point integer or floating-point types, which have limited benefits, as both require more bits to maintain the accuracy of original models. On the other hand, variable-length quantization uses low-bit quantization for normal values and high-precision for a fraction of outlier values. Even though this line of work brings algorithmic benefits, it also introduces significant hardware overheads due to variable-length encoding and decoding. In this work, we propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads. Our data type ANT leverages two key innovations to exploit the intra-tensor and inter-tensor adaptive opportunities in DNN models. First, we propose a particular data type, flint, that combines the advantages of float and int for adapting to the importance of different values within a tensor. Second, we propose an adaptive framework that selects the best type for each tensor according to its distribution characteristics. We design a unified processing element architecture for ANT and show its ease of integration with existing DNN accelerators. Our design results in 2.8×…

DOI: 10.1109/MICRO56248.2022.00095


Ristretto: An Atomized Processing Architecture for Sparsity-Condensed Stream Flow in CNN

作者: Li, Gang and Xu, Weixiang and Song, Zhuoran and Jing, Naifeng and Cheng, Jian and Liang, Xiaoyao
关键词: mixed precision, dual-sided sparsity, condensed streaming computation, atomized processing architecture

Abstract

Low-precision quantization and sparsity have been widely explored in CNN acceleration due to their effectiveness in reducing computational complexity and memory requirements. However, to support variable numerical precision and sparse computation, prior accelerators design flexible multipliers or sparse dataflows separately. A uniform solution that simultaneously exploits mixed-precision and dual-sided irregular sparsity for CNN acceleration is still lacking. Through an in-depth review of existing precision-scalable and sparse accelerators, we observe that a direct combination of low-level multipliers and high-level sparse dataflows from both sides is challenging due to their orthogonal design spaces. To this end, in this paper, we propose condensed streaming computation. By representing non-zero weights and activations as atomized streams, the low-level mixed-precision multiplication and high-level sparse convolution can be unified into a shared dataflow through hierarchical data reuse. Based on the condensed streaming computation, we propose Ristretto, an atomized architecture that exploits both mixed-precision and dual-sided irregular sparsity for CNN inference. We implement Ristretto in a 28nm technology node. Extensive evaluations show that Ristretto consistently outperforms three state-of-the-art CNN accelerators, including Bit Fusion, Laconic, and SparTen, in terms of performance and energy efficiency.
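
A high-level way to picture the condensed streaming computation described above is a dual-sided sparse dot product over streams of non-zero (index, value) atoms, where each surviving multiplication is further atomized into nibble-level partial products, so that sparsity handling and mixed precision share one dataflow. The sketch below is only that mental model in software, under assumed names (`Atom`, `to_stream`, `nibble_split`, `sparse_mixed_dot`); the actual Ristretto dataflow and hardware are considerably more involved.

```python
# Hedged software sketch of dual-sided sparse, nibble-atomized dot product.
from typing import Iterable, List, Tuple

Atom = Tuple[int, int]  # (channel index, integer value)

def to_stream(dense: Iterable[int]) -> List[Atom]:
    """Condense a dense vector into a stream of non-zero atoms (index-sorted)."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def nibble_split(v: int) -> List[Tuple[int, int]]:
    """Atomize an 8-bit magnitude into 4-bit pieces paired with their bit shifts."""
    mag, sign = abs(v), (1 if v >= 0 else -1)
    return [(sign * (mag & 0xF), 0), (sign * (mag >> 4), 4)]

def sparse_mixed_dot(acts: List[Atom], wgts: List[Atom]) -> int:
    """Two-pointer merge over both index-sorted streams: only indices that are
    non-zero on both sides generate nibble-level partial products."""
    acc, a, w = 0, 0, 0
    while a < len(acts) and w < len(wgts):
        ia, va = acts[a]
        iw, vw = wgts[w]
        if ia == iw:
            for na, sa in nibble_split(va):
                for nw, sw in nibble_split(vw):
                    acc += (na * nw) << (sa + sw)
            a, w = a + 1, w + 1
        elif ia < iw:
            a += 1
        else:
            w += 1
    return acc

if __name__ == "__main__":
    activations = [0, 3, 0, 0, 17, 0, 0, 5]
    weights     = [2, 0, 0, 0, -6, 0, 0, 4]
    ref = sum(x * y for x, y in zip(activations, weights))
    assert sparse_mixed_dot(to_stream(activations), to_stream(weights)) == ref
    print("dot product:", ref)
```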

DOI: 10.1109/MICRO56248.2022.00097


