Session details: Session 1: Best Paper Session
Authors: Wilkerson, Chris
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492782
APOLLO: An Automated Power Modeling Framework for Runtime Power Introspection in High-Volume Commercial Microprocessors
Authors: Xie, Zhiyao and Xu, Xiaoqing and Walker, Matt and Knebel, Joshua and Palaniswamy, Kumaraguru and Hebert, Nicolas and Hu, Jiang and Yang, Huanrui and Chen, Yiran and Das, Shidhartha
Keywords: voltage droop, on-chip power meter, machine learning, commercial microprocessors, Power modeling and estimation
Abstract
Accurate power modeling is crucial for energy-efficient CPU design and runtime management. An ideal power modeling framework needs to be accurate yet fast, achieve high temporal resolution (ideally cycle-accurate) yet with low runtime computational overheads, and be easily extensible to diverse designs through automation. Simultaneously satisfying such conflicting objectives is challenging and largely unattained despite significant prior research. In this paper, we propose APOLLO, an automated per-cycle power modeling framework that serves as the basis for both a design-time power estimator and a low-overhead runtime on-chip power meter (OPM). APOLLO uses the minimax concave penalty (MCP)-based feature selection algorithm to automatically select less than 0.05% of RTL signals as power proxies. The power estimation achieves R² > 0.95 on Arm Neoverse N1 [3] and R² > 0.94 on Arm Cortex-A77 [2] microprocessors, respectively. When integrated with an emulator-assisted flow, APOLLO finishes per-cycle power estimation on millions-of-cycles benchmarks in minutes for million-gate industrial CPU designs. Furthermore, the power model is synthesized and integrated into the microprocessor implementation as a runtime OPM. APOLLO’s accuracy further improves when coarse-grained temporal resolution is preferred. To the best of our knowledge, this is the first runtime OPM that simultaneously achieves per-cycle temporal resolution and low area/power overhead without compromising accuracy, which is validated on high-performance, out-of-order industrial CPU designs.
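The core of the approach is a sparse linear per-cycle power model over a handful of proxy signals. Below is a minimal numpy sketch of that idea; it is illustrative only: the toy data, the correlation-based proxy selection (a stand-in for the paper's MCP solver), and all sizes are assumptions rather than APOLLO's actual flow.

    import numpy as np

    # Toy per-cycle toggle activity of RTL signals (cycles x signals) and a
    # synthetic per-cycle power trace; both are placeholders for real traces.
    rng = np.random.default_rng(0)
    cycles, signals, n_proxies = 2000, 500, 8
    activity = rng.integers(0, 2, size=(cycles, signals)).astype(float)
    true_w = np.zeros(signals)
    true_w[:n_proxies] = rng.uniform(0.5, 2.0, n_proxies)
    power = activity @ true_w + rng.normal(0, 0.1, cycles)

    # Stand-in for MCP feature selection: keep the signals most correlated with
    # power (APOLLO itself solves a minimax-concave-penalty regression).
    corr = np.abs(np.corrcoef(activity.T, power)[-1, :-1])
    proxies = np.argsort(corr)[-n_proxies:]

    # Fit per-cycle power as a weighted sum of the selected proxy signals.
    X = activity[:, proxies]
    w, *_ = np.linalg.lstsq(X, power, rcond=None)
    pred = X @ w
    r2 = 1 - np.sum((power - pred) ** 2) / np.sum((power - power.mean()) ** 2)
    print(f"selected {len(proxies)}/{signals} signals, R^2 = {r2:.3f}")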
TIP: Time-Proportional Instruction Profiling
Authors: Gottschall, Björn
Keywords: No keywords
Abstract
A fundamental part of developing software is to understand what the application spends time on. This is typically determined using a performance profiler which essentially captures how execution time is distributed across the instructions of a program. At the same time, the highly parallel execution model of modern high-performance processors means that it is difficult to reliably attribute time to instructions — resulting in performance analysis being unnecessarily challenging. In this work, we first propose the Oracle profiler which is a golden reference for performance profilers. Oracle is golden because (i) it accounts every clock cycle and every dynamic instruction, and (ii) it is time-proportional, i.e., it attributes a clock cycle to the instruction(s) that the processor exposes the latency of. We use Oracle to, for the first time, quantify the error of software-level profiling, the dispatch-tagging heuristic used in AMD IBS and Arm SPE, the Last-Committing Instruction (LCI) heuristic used in external monitors, and the Next-Committing Instruction (NCI) heuristic used in Intel PEBS, resulting in average instruction-level profile errors of 61.8%, 53.1%, 55.4%, and 9.3%, respectively. The reason for these errors is that all existing profilers have cases in which they systematically attribute execution time to instructions that are not the root cause of performance loss. To overcome this issue, we propose Time-Proportional Instruction Profiling (TIP) which combines Oracle’s time attribution policies with statistical sampling to enable practical implementation. We implement TIP within the Berkeley Out-of-Order Machine (BOOM) and find that TIP is highly accurate. More specifically, TIP’s instruction-level profile error is only 1.6% on average (maximally 5.0%) versus 9.3% on average (maximally 21.0%) for state-of-the-art NCI. TIP’s improved accuracy matters in practice, as we exemplify by using TIP to identify a performance problem in the SPEC CPU2017 benchmark Imagick that, once addressed, improves performance by 1.93×
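The instruction-level profile error quoted above compares a profiler's time attribution against the Oracle's. The exact formula is not given here, so the sketch below uses an assumed (but common) metric: half the L1 distance between the two normalized per-instruction time distributions.

    def profile_error(profile, oracle):
        """Half L1 distance between two normalized per-instruction time profiles:
        0.0 means identical attribution, 1.0 means fully disjoint attribution.
        (Assumed metric for illustration; the paper's definition may differ.)"""
        keys = set(profile) | set(oracle)
        p_total = sum(profile.values()) or 1.0
        o_total = sum(oracle.values()) or 1.0
        return 0.5 * sum(abs(profile.get(k, 0) / p_total - oracle.get(k, 0) / o_total)
                         for k in keys)

    # Hypothetical example: a heuristic profiler shifts blame from a long-latency
    # load onto the instruction that follows it.
    oracle = {"0x400a10:load": 700, "0x400a14:add": 100, "0x400a18:br": 200}
    biased = {"0x400a10:load": 300, "0x400a14:add": 500, "0x400a18:br": 200}
    print(profile_error(biased, oracle))  # 0.4, i.e. a 40% profile error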
NDS: N-Dimensional Storage
Authors: Liu, Yu-Chia and Tseng, Hung-Wei
Keywords: storage interface, heterogeneous computing, hardware accelerators, data storage systems
Abstract
Demands for efficient computing among applications that use high-dimensional datasets have led to multi-dimensional computers—computers that leverage heterogeneous processors/accelerators offering various processing models to support multi-dimensional compute kernels. Yet the front-end for these processors/accelerators is inefficient, as memory/storage systems often expose only entrenched linear-space abstractions to an application, and they often ignore the benefits of modern memory/storage systems, such as support for multi-dimensionality through different types of parallel access. This paper presents N-Dimensional Storage (NDS), a novel, multi-dimensional memory/storage system that fulfills the demands of modern hardware accelerators and applications. NDS abstracts memory arrays as native storage that applications can use to describe data locations using coordinates in any application-defined multi-dimensional space, thereby avoiding the software overhead associated with data-object transformations. NDS gauges the application demand and the underlying memory-device architectures in order to intelligently determine the physical data layout that maximizes access bandwidth and minimizes the overhead of presenting objects for arbitrary applications. This paper demonstrates an efficient architecture for supporting NDS. We evaluate a set of linear/tensor algebra workloads along with graph and data-mining algorithms on custom-built systems using each architecture. Our result shows a 5.73×
GPS: A Global Publish-Subscribe Model for Multi-GPU Memory Management
Authors: Muthukrishnan, Harini and Lustig, Daniel and Nellans, David and Wenisch, Thomas
Keywords: strong scaling, multi-GPU, heterogeneous systems, communication, GPU memory management, GPGPU
Abstract
Suboptimal management of memory and bandwidth is one of the primary causes of low performance on systems comprising multiple GPUs. Existing memory management solutions like Unified Memory (UM) offer simplified programming but come at the cost of performance: applications can even exhibit slowdown with increasing GPU count due to their inability to leverage system resources effectively. To solve this challenge, we propose GPS, a HW/SW multi-GPU memory management technique that efficiently orchestrates inter-GPU communication using proactive data transfers. GPS offers the programmability advantage of multi-GPU shared memory with the performance of GPU-local memory. To enable this, GPS automatically tracks the data accesses performed by each GPU, maintains duplicate physical replicas of shared regions in each GPU’s local memory, and pushes updates to the replicas in all consumer GPUs. GPS is compatible with the existing NVIDIA GPU memory consistency model but takes full advantage of its relaxed nature to deliver high performance. We evaluate GPS in the context of a 4-GPU system with varying interconnects and show that GPS achieves an average speedup of 3.0×
Session details: Session 2A: Non-Volatile Memory
Authors: Hu, Xing
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492783
ParaBit: Processing Parallel Bitwise Operations in NAND Flash Memory based SSDs
Authors: Gao, Congming and Xin, Xin and Lu, Youyou and Zhang, Youtao and Yang, Jun and Shu, Jiwu
Keywords: near data processing, in-storage computing, flash memory, bitwise operation
Abstract
Processing-in-memory (PIM) and in-storage-computing (ISC) architectures have been constructed to implement computation inside memory and near storage, respectively. While effectively mitigating the overhead of data movement from memory and storage to the processor, due to the limited bandwidth of existing systems, these architectures still suffer from the large data movement overhead between storage and memory, in particular, if the amount of required data is large. It has become a major constraint for further improving the computation efficiency in PIM and ISC architectures. In this paper, we propose ParaBit, a scheme that enables Parallel Bitwise operations in NAND flash storage where data reside. By adjusting the latching circuit control and the sequence of sensing operations, ParaBit enables in-flash bitwise operation with no or little extra hardware, which effectively reduces the overhead of data movement between storage and memory. We exploit the massive parallelism in NAND flash based SSDs to mitigate the long latency of flash operations. Our experimental results show that the proposed ParaBit design achieves significant performance improvements over the state-of-the-art PIM and ISC architectures.
Distributed Data Persistency
Authors: Kokolis, Apostolos and Psistakis, Antonis and Reidys, Benjamin and Huang, Jian and Torrellas, Josep
Keywords: Non-volatile memory, Memory persistency, Distributed architecture, Data consistency
Abstract
Distributed applications such as key-value stores and databases avoid frequent writes to secondary storage devices to minimize performance degradation. They provide fault tolerance by replicating variables in the memories of different nodes, and using data consistency protocols to ensure consistency across replicas. Unfortunately, the reduced data durability guarantees provided can cause data loss or slow data recovery. In this environment, non-volatile memory (NVM) offers the ability to attain both high performance and data durability in distributed applications. However, it is unclear how to tie NVM memory persistency models to the existing data consistency frameworks, and what durability guarantees the combination will offer to distributed applications. In this paper, we propose the concept of the Distributed Data Persistency (DDP) model, which is the binding of the memory persistency model with the data consistency model in a distributed system. We reason about the interaction between consistency and persistency by using the concepts of Visibility Point and Durability Point. We design low-latency distributed protocols for DDP models that combine five consistency models with five persistency models. For the resulting DDP models, we investigate the trade-offs between performance, durability, and intuition provided to the programmer.
COSPlay: Leveraging Task-Level Parallelism for High-Throughput Synchronous Persistence
Authors: Vemmou, Marina and Daglis, Alexandros
Keywords: task-level parallelism, persistent memory, persist ordering, crash consistency, coroutines
Abstract
A key challenge in programming crash-consistent applications for Persistent Memory (PM) is achieving high performance while controlling the order of PM updates. Managing persist ordering from the CPU typically requires frequent synchronization points, which expose the PM’s high persist latency on the execution’s critical path. To mitigate this overhead, prior proposals relax the persistency model and decouple persistence from the program’s volatile execution, delegating persistence ordering to specialized hardware mechanisms such that persistent state lags behind volatile state. In this work, we identify the opportunity to mitigate the effect of persist latency by leveraging the task-level parallelism available in many PM applications, while preserving the stricter semantics of synchronous persistence and the familiar x86 persistency model. We introduce COSPlay, a software-hardware co-design that employs coroutines and rapid userspace context switching to hide persist latency by overlapping persist operations across concurrent tasks. Modest CPU extensions enable the hardware to fully overlap persists of different contexts, while preserving intra-context ordering to meet crash consistency requirements. COSPlay boosts the throughput of crash-consistent applications by up to 1.7×
RACER: Bit-Pipelined Processing Using Resistive Memory
Authors: Truong, Minh S. Q. and Chen, Eric and Su, Deanyone and Shen, Liting and Glass, Alexander and Carley, L. Richard and Bain, James A. and Ghose, Saugata
Keywords: No keywords
Abstract
To combat the high energy costs of moving data between main memory and the CPU, recent works have proposed to perform processing-using-memory (PUM), a type of processing-in-memory where operations are performed on data in situ (i.e., right at the memory cells holding the data). Several common and emerging memory technologies offer the ability to perform bitwise Boolean primitive functions by having interconnected cells interact with each other, eliminating the need to use discrete CMOS compute units for several common operations. Recent PUM architectures extend upon these Boolean primitives to perform bit-serial computation using memory. Unfortunately, several practical limitations of the underlying memory devices restrict how large emerging memory arrays can be, which hinders the ability of conventional bit-serial computation approaches to deliver high performance in addition to large energy savings. In this paper, we propose RACER, a cost-effective PUM architecture that delivers high performance and large energy savings using small arrays of resistive memories. RACER makes use of a bit-pipelining execution model, which can pipeline bit-serial w-bit computation across w small tiles. We fully design efficient control and peripheral circuitry, whose area can be amortized over small memory tiles without sacrificing memory density, and we propose an ISA abstraction for RACER to allow for easy program/compiler integration. We evaluate an implementation of RACER using NOR-capable ReRAM cells across a range of microbenchmarks extracted from data-intensive applications, and find that RACER provides 107×
LADDER: Architecting Content and Location-aware Writes for Crossbar Resistive Memories
Authors: Chowdhuryy, Md Hafizul Islam and Rashed, Muhammad Rashedul Haq and Awad, Amro and Ewetz, Rickard and Yao, Fan
Keywords: RESET Latency, Performance Optimization, Non-volatile Memory, Metadata Management, Crossbar ReRAM, Architecture Support
Abstract
Resistive memories (ReRAM) organized in the form of crossbars are promising for main memory integration. While offering high cell density, crossbar-based ReRAMs suffer from variable write latency requirement for RESET operations due to the varying impact of IR drop, which jointly depends on the data pattern of the crossbar and the location of target cells being RESET. The exacerbated worst-case RESET latencies can significantly limit system performance. In this paper, we propose LADDER, an effective and low-cost processor-side framework that performs writes with variable latency by exploiting both content and location dependencies. To enable content awareness, LADDER incorporates a novel scheme that maintains metadata for per-row data pattern (i.e., number of 1’s) in memory, and performs efficient metadata management and caching through the memory controller. LADDER does not require hardware changes to the ReRAM chip. We design several optimizations that further boost the performance of LADDER, including LRS-metadata estimation that eliminates stale memory block reads, intra-line bit-level shifting that reduces the worst-case LRS-counter values and multi-granularity LRS-metadata design that optimizes the number of counters to maintain. We evaluate the efficacy of LADDER using 16 single- and multi-programmed workloads. Our results show that LADDER exhibits on average 46% performance improvement as compared to a baseline scheme and up to 33% over state-of-the-art designs. Furthermore, LADDER achieves 28.8% average dynamic memory energy saving compared to the existing architecture schemes and has less than 3% impact on device lifetime.
Session details: Session 2B: Energy Efficiency & Low Power
Authors: Ozer, Emre
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492784
GreenDIMM: OS-assisted DRAM Power Management for DRAM with a Sub-array Granularity Power-Down State
Authors: Lee, Seunghak and Kang, Ki-Dong and Lee, Hwanjun and Park, Hyungwon and Son, Younghoon and Kim, Nam Sung and Kim, Daehoon
Keywords: memory off-lining, DRAM power management
Abstract
Power and energy consumed by DRAM comprising main memory of data-center servers have increased substantially as the capacity and bandwidth of memory increase. Especially, the fraction of DRAM background power in DRAM total power is already high, and it will continue to increase with the decelerating DRAM technology scaling as we will have to plug more DRAM modules in servers or stack more DRAM dies in a DRAM package to provide necessary DRAM capacity in the future. To reduce the background power, we may exploit low average utilization of the DRAM capacity in data-center servers (i.e., 40–60%) for DRAM power management. Nonetheless, the current DRAM power management supports low-power states only at the rank granularity, which becomes ineffective with memory interleaving techniques devised to disperse memory requests across ranks. That is, ranks need to be frequently woken up from low-power states with aggressive power management, which can significantly degrade system performance, or they do not get a chance to enter low-power states with conservative power management. To tackle such limitations of the current DRAM power management, we propose GreenDIMM, OS-assisted DRAM power management. Specifically, GreenDIMM first takes a memory block in physical address space mapped to a group of DRAM sub-arrays across every channel, rank, and bank as a unit of DRAM power management. This facilitates fine-grained DRAM power management while keeping the benefit of memory interleaving techniques. Second, GreenDIMM exploits memory on-/off-lining operations of the modern OS to dynamically remove/add memory blocks from/to the physical address space, depending on the utilization of memory capacity at run-time. Third, GreenDIMM implements a deep power-down state at the sub-array granularity to reduce the background power of the off-lined memory blocks. As the off-lined memory blocks are removed from the physical address space, the sub-arrays will not receive any memory request and stay in the power-down state until the memory blocks are explicitly on-lined by the OS. Our evaluation with a commercial server running diverse workloads shows that GreenDIMM can reduce DRAM and system power by 36% and 20%, respectively, with ∼ 1% performance degradation.
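GreenDIMM leans on the OS's existing memory on-/off-lining support. On Linux this is exposed per memory block through sysfs, as in the sketch below; the block number is illustrative, root privileges are required, and the DRAM-side sub-array power-down described above is additional hardware support that this snippet does not trigger.

    import pathlib

    SYSFS_MEMORY = pathlib.Path("/sys/devices/system/memory")

    def set_memory_block_state(block: int, state: str) -> None:
        """Offline or online one Linux memory hotplug block ('offline'/'online')."""
        (SYSFS_MEMORY / f"memory{block}" / "state").write_text(state)

    # Illustrative usage: remove block 42 from the physical address space when
    # capacity utilization is low, and bring it back under memory pressure.
    # set_memory_block_state(42, "offline")
    # set_memory_block_state(42, "online")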
NMAP: Power Management Based on Network Packet Processing Mode Transition for Latency-Critical Workloads
Authors: Kang, Ki-Dong and Park, Gyeongseo and Kim, Hyosang and Alian, Mohammad and Kim, Nam Sung and Kim, Daehoon
Keywords: Tail latency, Power management, Dynamic voltage and frequency scaling, Data-center server
Abstract
Processor power management exploiting Dynamic Voltage and Frequency Scaling (DVFS) plays a crucial role in improving the data-center’s energy efficiency. However, we observe that current power management policies in Linux (i.e., governors) often considerably increase tail response time (i.e., violate a given Service Level Objective (SLO)) and energy consumption of latency-critical applications. Furthermore, the previously proposed SLO-aware power management policies oversimplify network request processing and ignore the fact that network requests arrive at the application layer in bursts. Considering the complex interplay between the OS and network devices, we propose a power management framework exploiting network packet processing mode transitions in the OS to quickly react to the processing demands from the received network requests. Our proposed power management framework tracks the transitions between polling and interrupt in the network software stack to detect excessive packet processing on the cores and immediately react to the load changes by updating the voltage and frequency (V/F) states. Our experimental results show that our framework does not violate SLO and reduces energy consumption by up to 35.7% and 14.8% compared to Linux governors and state-of-the-art SLO-aware power management techniques, respectively.
BurstLink: Techniques for Energy-Efficient Video Display for Conventional and Virtual Reality Systems
Authors: Haj-Yahya, Jawad and Park, Jisung and Bera, Rahul and Gómez Luna, Juan
Keywords: video streaming, video display, mobile systems, memory, energy efficiency, display panels, data movement, DRAM
Abstract
Conventional planar video streaming is the most popular application in mobile systems. The rapid growth of 360° video content and virtual reality (VR) devices is accelerating the adoption of VR video streaming. Unfortunately, video streaming consumes significant system energy due to high power consumption of major system components (e.g., DRAM, display interfaces, and display panel) involved in the video streaming process. For example, in conventional planar video streaming, the video decoder (in the processor) decodes video frames and stores them in the DRAM main memory before the display controller (in the processor) transfers decoded frames from DRAM to the display panel. This system architecture causes a large amount of data movement to/from DRAM as well as high DRAM bandwidth usage. As a result, DRAM by itself consumes more than 30% of the video streaming energy. We propose BurstLink, a novel system-level technique that improves the energy efficiency of planar and VR video streaming. BurstLink is based on two key ideas. First, BurstLink directly transfers a decoded video frame from the video decoder or the GPU to the display panel, completely bypassing the host DRAM. To this end, we extend the display panel with a double remote frame buffer (DRFB) instead of DRAM’s double frame buffer so that the system can directly update the DRFB with a new frame while updating the display panel’s pixels with the current frame stored in the DRFB. Second, BurstLink transfers a complete decoded frame to the display panel in a single burst, using the maximum bandwidth of modern display interfaces. Unlike conventional systems where the frame transfer rate is limited by the pixel-update throughput of the display panel, BurstLink can always take full advantage of the high bandwidth of modern display interfaces by decoupling the frame transfer from the pixel update as enabled by the DRFB. This direct and burst frame transfer capability of BurstLink significantly reduces energy consumption of video display by 1) reducing accesses to DRAM, 2) increasing the system’s residency at idle power states, and 3) enabling temporal power gating of several system components after quickly transferring each frame into the DRFB. BurstLink can be easily implemented in modern mobile systems with minimal changes to the video display pipeline. We evaluate BurstLink using an analytical power model that we rigorously validate on an Intel Skylake mobile system. Our evaluation shows that BurstLink reduces system energy consumption for 4K planar and VR video streaming by 41% and 33%, respectively. BurstLink provides an even higher energy reduction in future video streaming systems with higher display resolutions and/or display refresh rates.
ReplayCache: Enabling Volatile Caches for Energy Harvesting Systems
Authors: Zeng, Jianping and Choi, Jongouk and Fu, Xinwei and Shreepathi, Ajay Paddayuru and Lee, Dongyoon and Min, Changwoo and Jung, Changhee
Keywords: No keywords
Abstract
Energy harvesting systems have shown their unique benefit of ultra-long operation time without maintenance and are expected to be more prevalent in the era of Internet of Things. However, due to the batteryless nature, they suffer unpredictable frequent power outages. They thus require a lightweight mechanism for crash consistency since saving/restoring checkpoints across the outages can limit forward progress by consuming hard-won energy. For this reason, energy harvesting systems have been designed with a non-volatile memory (NVM) only. The use of a volatile data cache has been assumed to be not viable or at least challenging due to the difficulty of ensuring cacheline persistence. In this paper, we propose ReplayCache, a software-only crash consistency scheme that enables commodity energy harvesting systems to exploit a volatile data cache. ReplayCache does not have to ensure the persistence of dirty cachelines or record their logs at run time. Instead, the ReplayCache recovery runtime re-executes the potentially unpersisted stores in the wake of power failure to restore the consistent NVM state, from which the interrupted program can safely resume. To support store replay during recovery, ReplayCache partitions the program into a series of regions in a way that store operand registers remain intact within each region, and checkpoints all registers just before power failure using the crash consistency mechanism of the commodity systems. For performance, ReplayCache enables region-level persistence that allows the stores in a region to be asynchronously persisted until the region ends, exploiting ILP. The evaluation with 23 benchmark applications shows that compared to the baseline with no caches, ReplayCache can achieve about 10.72x and 8.5x-8.9x speedup (on geometric mean) for the scenarios without and with power outages, respectively.
AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning
Authors: Kim, Young Geun and Wu, Carole-Jean
Keywords: reinforcement learning, mobile devices, heterogeneity, energy efficiency, Federated learning
Abstract
Federated learning enables a cluster of decentralized mobile devices at the edge to collaboratively train a shared machine learning model, while keeping all the raw training samples on device. This decentralized training approach is demonstrated as a practical solution to mitigate the risk of privacy leakage. However, enabling efficient FL deployment at the edge is challenging because of non-IID training data distribution, wide system heterogeneity and stochastic-varying runtime effects in the field. This paper jointly optimizes time-to-convergence and energy efficiency of state-of-the-art FL use cases by taking into account the stochastic nature of edge execution. We propose AutoFL by tailor-designing a reinforcement learning algorithm that learns and determines which K participant devices to select and the per-device execution targets for each FL model aggregation round, in the presence of stochastic runtime variance, system and data heterogeneity. By considering the unique characteristics of FL edge deployment judiciously, AutoFL achieves 3.6 times faster model convergence time and 4.7 and 5.2 times higher energy efficiency for local clients and globally over the cluster of K participants, respectively.
Session details: Session 3A: Security & Privacy I
Authors: Szefer, Jakub
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492785
IceClave: A Trusted Execution Environment for In-Storage Computing
Authors: Kang, Luyi and Xue, Yuqi and Jia, Weiwei and Wang, Xiaohao and Kim, Jongryool and Youn, Changhwan and Kang, Myeong Joon and Lim, Hyung Jin and Jacob, Bruce and Huang, Jian
Keywords: Trusted Execution Environment, Security Isolation, In-Storage Computing, ARM TrustZone
Abstract
In-storage computing with modern solid-state drives (SSDs) enables developers to offload programs from the host to the SSD. It has been proven to be an effective approach to alleviate the I/O bottleneck. To facilitate in-storage computing, many frameworks have been proposed. However, few of them treat in-storage security as a first-class citizen. Specifically, since modern SSD controllers do not have a trusted execution environment, an offloaded (malicious) program could steal, modify, and even destroy the data stored in the SSD. In this paper, we first investigate the attacks that could be conducted by offloaded in-storage programs. To defend against these attacks, we build a lightweight trusted execution environment, named IceClave, for in-storage computing. IceClave enables security isolation between in-storage programs and flash management functions that include flash address translation, data access control, and garbage collection, with TrustZone extensions. IceClave also achieves security isolation between in-storage programs by enforcing memory integrity verification of in-storage DRAM with low overhead. To protect data loaded from flash chips, IceClave develops a lightweight data encryption/decryption mechanism in flash controllers. We develop IceClave with a full system simulator. We evaluate IceClave with a variety of data-intensive applications such as databases. Compared to state-of-the-art in-storage computing approaches, IceClave introduces only 7.6% performance overhead, while enforcing security isolation in the SSD controller with minimal hardware cost. IceClave still keeps the performance benefit of in-storage computing by delivering up to 2.31×
DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware
Authors: Hashemi, Hanieh and Wang, Yongqin and Annavaram, Murali
Keywords: trusted execution environment, neural networks, deep learning, data privacy, data encoding, Intel SGX
Abstract
Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train or infer with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. Cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. Tackling such a challenge requires unifying theoretical privacy algorithms with hardware security capabilities. This paper presents DarKnight, a framework for large DNN training while protecting input privacy and computation integrity. DarKnight relies on cooperative execution between trusted execution environments (TEE) and accelerators, where the TEE provides privacy and integrity verification, while accelerators perform the bulk of the linear algebraic computation to optimize the performance. In particular, DarKnight uses a customized data encoding strategy based on matrix masking to create input obfuscation within a TEE. The obfuscated data is then offloaded to GPUs for fast linear algebraic computation. DarKnight’s data obfuscation strategy provides provable data privacy and computation integrity in the cloud servers. While prior works tackle inference privacy and cannot be utilized for training, DarKnight’s encoding scheme is designed to support both training and inference.
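Because a linear layer commutes with linear mixing, a TEE can blind a batch of inputs, let an untrusted GPU do the matrix multiply, and undo the mixing afterwards. The numpy sketch below illustrates that principle only; the encoding coefficients, noise handling, and integrity checks in DarKnight itself are more involved.

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_out, K = 64, 32, 4           # layer sizes and number of private inputs
    W = rng.normal(size=(d_out, d_in))   # linear layer weights (not secret here)
    X = rng.normal(size=(d_in, K))       # private inputs, one per column
    r = rng.normal(size=(d_in, 1))       # random blinding column kept in the TEE

    # TEE: encode. Each encoded column is a random mix of the inputs plus noise,
    # so no individual input is exposed to the GPU.
    A = rng.normal(size=(K + 1, K + 1))  # invertible with probability ~1
    X_enc = np.hstack([X, r]) @ A.T

    # Untrusted GPU: performs the heavy linear algebra on encoded data only.
    Y_enc = W @ X_enc

    # TEE: decode. Linearity lets it undo the mixing and drop the noise column.
    Y = (Y_enc @ np.linalg.inv(A.T))[:, :K]
    assert np.allclose(Y, W @ X)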
2-in-1 Accelerator: Enabling Random Precision Switch for Winning Both Adversarial Robustness and Efficiency
Authors: Fu, Yonggan and Zhao, Yang and Yu, Qixuan and Li, Chaojian and Lin, Yingyan
Keywords: precision-scalable accelerators, neural networks, model robustness
Abstract
The recent breakthroughs of deep neural networks (DNNs) and the advent of billions of Internet of Things (IoT) devices have excited an explosive demand for intelligent IoT devices equipped with domain-specific DNN accelerators. However, the deployment of DNN accelerator enabled intelligent functionality into real-world IoT devices still remains particularly challenging. First, powerful DNNs often come at prohibitive complexities, whereas IoT devices often suffer from stringent resource constraints. Second, while DNNs are vulnerable to adversarial attacks especially on IoT devices exposed to complex real-world environments, many IoT applications require strict security. Existing DNN accelerators mostly tackle only one of the two aforementioned challenges (i.e., efficiency or adversarial robustness) while neglecting or even sacrificing the other. To this end, we propose a 2-in-1 Accelerator, an integrated algorithm-accelerator co-design framework aiming at winning both the adversarial robustness and efficiency of DNN accelerators. Specifically, we first propose a Random Precision Switch (RPS) algorithm that can effectively defend DNNs against adversarial attacks by enabling random DNN quantization as an in-situ model switch during training and inference. Furthermore, we propose a new precision-scalable accelerator featuring (1) a new precision-scalable MAC unit architecture which spatially tiles the temporal MAC units to boost both the achievable efficiency and flexibility and (2) a systematically optimized dataflow that is searched by our generic accelerator optimizer. Extensive experiments and ablation studies validate that our 2-in-1 Accelerator can not only aggressively boost both the adversarial robustness and efficiency of DNN accelerators under various attacks, but also naturally support instantaneous robustness-efficiency trade-offs adapting to varied resources without the necessity of DNN retraining. We believe our 2-in-1 Accelerator has opened up an exciting perspective for robust and efficient accelerator design.
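The Random Precision Switch idea is that the quantization bit-width seen by an attacker changes on every inference, so adversarial noise tuned to one fixed grid loses effectiveness. The toy numpy sketch below assumes a small ReLU MLP and a uniform symmetric quantizer; the candidate precisions, model, and quantizer are illustrative, not the paper's.

    import numpy as np

    def quantize(x, bits):
        """Uniform symmetric quantization of a tensor to `bits` bits (illustrative)."""
        levels = 2 ** (bits - 1) - 1
        scale = (np.max(np.abs(x)) / levels) or 1.0
        return np.round(x / scale) * scale

    def rps_forward(x, weights, precisions=(4, 6, 8), rng=None):
        """Forward pass that draws a fresh bit-width per layer per inference."""
        rng = rng or np.random.default_rng()
        for W in weights:
            bits = int(rng.choice(precisions))
            x = np.maximum(quantize(W, bits) @ quantize(x, bits), 0.0)  # ReLU layer
        return x

    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(16, 8)), rng.normal(size=(4, 16))]
    print(rps_forward(rng.normal(size=8), weights, rng=rng))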
F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption
Authors: Samardzic, Nikola and Feldmann, Axel and Krastev, Aleksandar and Devadas, Srinivas and Dreslinski, Ronald and Peikert, Christopher and Sanchez, Daniel
Keywords: hardware acceleration, fully homomorphic encryption
Abstract
Fully Homomorphic Encryption (FHE) allows computing on encrypted data, enabling secure offloading of computation to untrusted servers. Though it provides ideal security, FHE is expensive when executed in software, 4 to 5 orders of magnitude slower than computing on unencrypted data. These overheads are a major barrier to FHE’s widespread adoption. We present F1, the first FHE accelerator that is programmable, i.e., capable of executing full FHE programs. F1 builds on an in-depth architectural analysis of the characteristics of FHE computations that reveals acceleration opportunities. F1 is a wide-vector processor with novel functional units deeply specialized to FHE primitives, such as modular arithmetic, number-theoretic transforms, and structured permutations. This organization provides so much compute throughput that data movement becomes the key bottleneck. Thus, F1 is primarily designed to minimize data movement. Hardware provides an explicitly managed memory hierarchy and mechanisms to decouple data movement from execution. A novel compiler leverages these mechanisms to maximize reuse and schedule off-chip and on-chip data movement. We evaluate F1 using cycle-accurate simulation and RTL synthesis. F1 is the first system to accelerate complete FHE programs, and outperforms state-of-the-art software implementations by gmean 5,400×
Cryptographic Capability Computing
Authors: LeMay, Michael and Rakshit, Joydeep and Deutsch, Sergej and Durham, David M. and Ghosh, Santosh and Nori, Anant and Gaur, Jayesh and Weiler, Andrew and Sultana, Salmin and Grewal, Karanvir and Subramoney, Sreenivas
Keywords: memory safety, memory encryption, capabilities
Abstract
Capability architectures for memory safety have traditionally required expanding pointers and radically changing microarchitectural structures throughout processors, while only providing superficial hardening. We hence propose Cryptographic Capability Computing (C3) - the first memory safety mechanism that is stateless to avoid requiring extra metadata storage. C3 retains 64-bit pointer sizes providing legacy binary compatibility while imposing minimal touchpoints. Pointers are encrypted to unforgeably (within cryptographic bounds) reference each object. Data is encrypted even in caches and entangled with pointers for both spatial and temporal object-granular protection. Pointers become like unique keys for each allocation. C3 deploys a novel form of prediction for address translation that mitigates performance overheads even when addresses are partially encrypted. Use of a low-latency, low-area cipher from the NIST Lightweight Cryptography project avoids delaying loads by readying a data keystream by the time data is returned from the L1 cache. C3 is compatible with legacy binaries. Simulated performance overhead on SPEC CPU2006 is negligible with no memory overhead, which is a big leap forward compared to the overheads imposed by past memory safety approaches. C3 effectively replaces inefficient metadata with efficient cryptography.
Session details: Session 3B: Processing In/Near Memory
Authors: Alameldeen, Alaa
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492786
TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory
Authors: Park, Jaehyun and Kim, Byeongho and Yun, Sungmin and Lee, Eojin and Rhu, Minsoo and Ahn, Jung Ho
Keywords: near-data processing, main memory, Memory system, DRAM
Abstract
Personalized recommendation systems are gaining significant traction due to their industrial importance. An important building block of recommendation systems consists of the embedding layers, which exhibit a highly memory-intensive characteristic. A fundamental primitive of embedding layers is the embedding vector gathers followed by vector reductions, exhibiting low arithmetic intensity and becoming bottlenecked by the memory throughput. To tackle such a challenge, recent proposals employ a near-data processing (NDP) solution at the DRAM rank-level, achieving impressive performance speedups. We observe that prior rank-level-parallelism-based NDP solutions leave significant performance potential on the table as they do not fully reap the abundant transfer throughput inherent in DRAM datapaths. We propose TRiM, an NDP architecture for accelerating recommendation systems. Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with “in-DRAM” reduction units at the DDR4/5 rank/bank-group/bank level. We modify the interface of DRAM to provide commands effectively to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to a 7.7×
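The embedding primitive being accelerated is a gather of a few table rows followed by a reduction, which moves many bytes per arithmetic operation. A plain numpy version of that primitive (with made-up table and bag sizes) is shown below; TRiM's contribution is performing the summation inside the DRAM rank/bank-group/bank datapath rather than on the host.

    import numpy as np

    rng = np.random.default_rng(0)
    table = rng.normal(size=(100_000, 64)).astype(np.float32)  # embedding table

    def gather_reduce(table, indices):
        """SparseLengthsSum-style lookup: gather rows by index, then sum them."""
        return table[indices].sum(axis=0)

    # One mini-batch of 8 lookup "bags", 40 random indices each (sizes made up).
    bags = [rng.integers(0, table.shape[0], size=40) for _ in range(8)]
    pooled = np.stack([gather_reduce(table, idx) for idx in bags])
    print(pooled.shape)  # (8, 64): one pooled embedding vector per bag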
SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems
Authors: Besta, Maciej and Kanakagiri, Raghavendra and Kwasniewski, Grzegorz and Ausavarungnirun, Rachata and Beránek, Jakub
Keywords: Subgraph Isomorphism, Processing Near Memory, Processing In Memory, Parallel Graph Algorithms, Instruction Set Architecture, Graph Pattern Matching, Graph Mining, Graph Learning, Graph Accelerators, Clique Mining, Clique Listing, Clique Enumeration
Abstract
Simple graph algorithms such as PageRank have been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by hardware techniques such as Processing-in-Memory (PIM). However, they also come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this problem with a simple yet surprisingly powerful observation: operations on sets of vertices, such as intersection or union, form a large part of many complex graph mining algorithms, and can offer rich and simple parallelism at multiple levels. This observation drives our cross-layer design, in which we (1) expose set operations using a novel programming paradigm, (2) express and execute these operations efficiently with carefully designed set-centric ISA extensions called SISA, and (3) use PIM to accelerate SISA instructions. The key design idea is to alleviate the bandwidth needs of SISA instructions by mapping set operations to two types of PIM: in-DRAM bulk bitwise computing for bitvectors representing high-degree vertices, and near-memory logic layers for integer arrays representing low-degree vertices. Set-centric SISA-enhanced algorithms are efficient and outperform hand-tuned baselines, offering more than 10×
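Triangle counting is a small example of the set-centric view: each undirected edge contributes the intersection of its endpoints' neighbor sets, and every triangle is counted once per edge. The Python sketch below (simple adjacency-set graph, no self-loops assumed) shows the kind of intersection SISA would execute as a single set instruction mapped onto PIM.

    def triangle_count(adj):
        """Count triangles via per-edge neighbor-set intersections."""
        edges = {tuple(sorted((u, v))) for u, nbrs in adj.items() for v in nbrs}
        common = sum(len(adj[u] & adj[v]) for u, v in edges)
        return common // 3  # each triangle is counted once per edge (3 edges)

    # Example graph: a 4-clique on {0,1,2,3} plus a pendant vertex 4 -> 4 triangles.
    adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
    print(triangle_count(adj))  # 4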
OrderLight: Lightweight Memory-Ordering Primitive for Efficient Fine-Grained PIM Computations
Authors: Nag, Anirban and Balasubramonian, Rajeev
Keywords: Processing-in-Memory, PIM Taxonomy, Memory-centric Ordering, Fine-grain Offload, Fine-grain Arbitration
Abstract
Modern workloads such as neural networks, genomic analysis, and data analytics exhibit significant data-intensive phases (low compute to byte ratio) and, as such, stand to gain considerably by using processing-in-memory (PIM) solutions along with more traditional accelerators. While PIM has been researched extensively, the granularity of computation offload to PIM and the granularity of memory access arbitration between host and PIM, as well as their implications, have received relatively little attention. In this work, we first introduce a taxonomy to study the design space whilst considering these two aspects. Based on this taxonomy, we observe that much of PIM research to date has largely relied on coarse-grained approaches which, we argue, have steep costs (incompatibility with mainstream memory interfaces, prohibition of concurrent host accesses, and more). To this end, we believe that better support for fine-grained approaches is warranted in accelerators coupled with PIM-enabled memories. A key challenge in the adoption of fine-grained PIM approaches is enforcing memory ordering. We discuss how existing memory ordering primitives (fences) are not only insufficient but their large overheads render them impractical to support fine-grain computation offloads and arbitration. To address this challenge, we make the key observation that the core-centric nature of memory ordering is unnecessary for PIM computations. We propose a novel lightweight memory ordering primitive for PIM use cases, OrderLight, which moves away from core-centric ordering enforcement and considerably reduces the overheads of enforcing correctness. For a suite of key computations from machine learning, data analytics, and genomics, we demonstrate that OrderLight delivers 5.5×
Sunder: Enabling Low-Overhead and Scalable Near-Data Pattern Matching Acceleration
Authors: Sadredini, Elaheh and Rahimi, Reza and Imani, Mohsen and Skadron, Kevin
Keywords: reconfigurable computing, pattern matching, near-data computing, in-SRAM processing, hardware accelerator, Automata processing
Abstract
Automata processing is an efficient computation model for regular expressions and other forms of sophisticated pattern matching. The demand for high-throughput and real-time pattern matching in many applications, including network intrusion detection and spam filters, has motivated several in-memory architectures for automata processing. Existing in-memory architectures focus on accelerating the pattern-matching kernel, but either fail to support a practical reporting solution or optimistically assume that the reporting stage is not the performance bottleneck. However, gathering and processing the reports can be the major bottleneck, especially when the reporting frequency is high. Moreover, all the existing in-memory architectures work with a fixed processing rate (mostly 8-bit/cycle), and they do not adjust the input consumption rate based on the properties of the applications, which can lead to throughput and capacity loss. To address these issues, we present Sunder, an in-SRAM pattern matching architecture, to process a reconfigurable number of nibbles (4-bit symbols) in parallel, instead of fixed-rate processing, by adopting an algorithm/architecture methodology to perform hardware-aware transformations. Inspired by prior work, we transform the commonly-used 8-bit processing to nibble-processing (4-bit processing) to reduce hardware requirements exponentially and achieve higher information density. This frees up space for storing reporting data in place, which largely eliminates host communication and reporting overhead. Our proposed reporting architecture supports in-place report summarization and provides an easy access mechanism to read the reporting data. As a result, Sunder enables a low-overhead, high-performance, and flexible in-memory pattern-matching and reporting solution. Our results confirm that Sunder’s reporting architecture has zero performance overhead for 95% of the applications and incurs only 2% additional hardware overhead.
SAM: Accelerating Strided Memory Accesses
Authors: Xin, Xin and Guo, Yanan and Zhang, Youtao and Yang, Jun
Keywords: strided access, main memory, DRAM
Abstract
Strided memory accesses are an important type of operations for In-Memory Databases (IMDB) applications. Strided memory accesses often demand data at word granularity with fixed strides. Hence, they tend to produce sub-optimal performance on DRAM memory (the de facto standard memory in modern computer systems) that accesses data at cacheline granularity. Recently proposed optimizations either introduce significant reliability degradation or are limited to non-volatile crossbar memory structures. In this paper, we propose a low-cost DRAM-based optimization scheme SAM for accelerating strided memory accesses. SAM consists of several designs. The primary design, termed SAM-IO, is to exploit under-utilized I/O resources in commodity DRAM chips to support high-performance strided memory accesses with near-zero hardware overhead. Based on SAM-IO, an enhanced design, termed SAM-en, is further proposed by combining several innovations to achieve overall efficiency on energy and area. Our evaluation of the proposed designs shows that SAM not only achieves high performance improvement (up to ∼ 4.2×
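The waste SAM targets comes from requesting small, fixed-stride words through a cacheline-granularity interface. A back-of-the-envelope sketch with made-up table dimensions:

    # A column scan over an in-memory table: one 8-byte field per 128-byte row.
    WORD, CACHELINE, STRIDE, ROWS = 8, 64, 128, 1_000_000

    useful_bytes = ROWS * WORD
    # With STRIDE > CACHELINE, every requested field falls in a different
    # cacheline, so a conventional DRAM access moves a full line per word.
    transferred_bytes = ROWS * CACHELINE
    print(f"useful fraction of moved data: {useful_bytes / transferred_bytes:.1%}")  # 12.5%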
Session details: Session 4A: Parallelism
Authors: Manerkar, Yatin
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492787
Efficient, Distributed, and Non-Speculative Multi-Address Atomic Operations
Authors: Gómez-Hernández, Eduardo José
Keywords: synchronization, multi-address atomics, critical sections, atomicity, Multi-core architectures
Abstract
Critical sections that read, modify, and write (RMW) a small set of addresses are common in parallel applications and concurrent data structures. However, to escape from the intricacies of fine-grained locks, which require reasoning about all possible thread interleavings, programmers often resort to coarse-grained locks to ensure atomicity. This results in atomic protection of a much larger set of potentially conflicting addresses, and, consequently, increased lock contention and unneeded serialization. As many before us have observed, these problems would be solved if only general RMW multi-address atomic operations were available, but current proposals are impractical because of deadlock scenarios that appear due to resource limitations. Alternatively, transactional memory can detect conflicts at run-time aiming to maximize concurrency, but it has significant overheads in highly-contended critical sections. In this work, we propose multi-address atomic operations (MAD atomics). MAD atomics achieve complexity-effective, non-speculative, non-deadlocking, fine-grained locking for multiple addresses, relying solely on the coherence protocol and a predetermined locking order. Unlike prior works, MAD atomics address the challenge of enabling atomic modification over a set of cachelines with arbitrary addresses, simultaneously locking all of them while side-stepping deadlock. MAD atomics only require a small storage per core (around 68 bytes), while significantly outperforming typical lock implementations. Indeed, our evaluation using gem5-20 shows that MAD atomics can improve performance by up to 18×
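The deadlock-avoidance ingredient, acquiring the locks for an arbitrary address set in one predetermined global order, can be illustrated with an ordinary software analogue. The sketch below (plain Python threads and per-address locks, not the paper's hardware mechanism) shows why two threads whose address sets overlap can never wait on each other in a cycle.

    import threading

    locks = {addr: threading.Lock() for addr in range(16)}
    counters = {addr: 0 for addr in range(16)}

    def multi_address_rmw(addrs, update):
        """Read-modify-write a set of addresses atomically, deadlock-free."""
        ordered = sorted(set(addrs))          # the predetermined locking order
        for a in ordered:
            locks[a].acquire()
        try:
            update(counters, addrs)
        finally:
            for a in reversed(ordered):
                locks[a].release()

    # Example: atomically move one unit from address 9 to address 3.
    def move_one(state, addrs):
        src, dst = addrs
        state[src] -= 1
        state[dst] += 1

    multi_address_rmw((9, 3), move_one)
    print(counters[9], counters[3])  # -1 1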
Cohmeleon: Learning-Based Orchestration of Accelerator Coherence in Heterogeneous SoCs
Authors: Zuckerman, Joseph and Giri, Davide and Kwon, Jihye and Mantovani, Paolo and Carloni, Luca P.
Keywords: system-on-chip, q-learning, hardware accelerators, cache coherence
Abstract
One of the most critical aspects of integrating loosely-coupled accelerators in heterogeneous SoC architectures is orchestrating their interactions with the memory hierarchy, especially in terms of navigating the various cache-coherence options: from accelerators accessing off-chip memory directly, bypassing the cache hierarchy, to accelerators having their own private cache. By running real-size applications on FPGA-based prototypes of many-accelerator multi-core SoCs, we show that the best cache-coherence mode for a given accelerator varies at runtime, depending on the accelerator’s characteristics, the workload size, and the overall SoC status. Cohmeleon applies reinforcement learning to select the best coherence mode for each accelerator dynamically at runtime, as opposed to statically at design time. It makes these selections adaptively, by continuously observing the system and measuring its performance. Cohmeleon is accelerator-agnostic, architecture-independent, and it requires minimal hardware support. Cohmeleon is also transparent to application programmers and has a negligible software overhead. FPGA-based experiments show that our runtime approach offers, on average, a 38% speedup with a 66% reduction of off-chip memory accesses compared to state-of-the-art design-time approaches. Moreover, it can match runtime solutions that are manually tuned for the target architecture.
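At its core the runtime keeps a value estimate per (observed SoC state, coherence mode) pair and picks modes epsilon-greedily. The sketch below is a bandit-style simplification with placeholder state features, mode names, and reward; Cohmeleon's actual state encoding and reward are defined in the paper.

    import random
    from collections import defaultdict

    MODES = ["non-coherent", "llc-coherent", "coherent-dma", "fully-coherent"]
    Q = defaultdict(float)        # value estimate per (state, mode)
    ALPHA, EPSILON = 0.2, 0.1     # learning rate and exploration probability

    def pick_mode(state):
        if random.random() < EPSILON:
            return random.choice(MODES)
        return max(MODES, key=lambda m: Q[(state, m)])

    def update(state, mode, reward):
        Q[(state, mode)] += ALPHA * (reward - Q[(state, mode)])

    # One accelerator invocation: observe the SoC, choose a mode, run, then feed
    # back a reward derived from measured runtime and off-chip traffic counters.
    state = ("fft_accel", "large_workload", "2_active_cpus")   # placeholder features
    mode = pick_mode(state)
    update(state, mode, reward=-1.0)   # e.g., negative normalized execution time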
Fat Loads: Exploiting Locality Amongst Contemporaneous Load Operations to Optimize Cache Accesses
Authors: Baoni, Vanshika and Mittal, Adarsh and Sohi, Gurindar S.
Keywords: early branch resolution, cache energy, address pretranslation, Fat loads
Abstract
This paper considers locality among load instructions that are being processed contemporaneously within a processor, in order to optimize the number of accesses to the memory hierarchy. A simple technique is used to learn and predict the number of contemporaneous accesses to a region of memory and classify a particular dynamic load into a normal or a fat load. Fat loads bring additional data into Contemporaneous Load Access Registers (CLARs), from where other contemporaneous loads could be serviced without accessing the L1 cache. Experimental results indicate that with fat loads, along with 4 or 8 cache-line-sized CLARs (256 or 512 bytes), the number of L1 cache accesses could be reduced by 50-60%, resulting in significant energy savings for the L1 cache operations. Further, in several cases the reduced latency for loads serviced from a CLAR results in an earlier resolution of some mispredicted branches, and a reduction in the number of wrong-path instructions, especially loads.
Criticality Driven Fetch
Authors: Deshmukh, Aniket and Patt, Yale N.
Keywords: OoO execution, instruction criticality, memory level parallelism
Abstract
Modern OoO cores achieve high levels of performance using large instruction windows. Scaling the window size improves performance by making visible more of the parallelism present in programs. However, this leads to an exponential increase in area and power. We specify Criticality Driven Fetch (CDF), a new execution paradigm that preferentially fetches, allocates, and executes instructions on the critical path of the program. By skipping over non-critical instructions, critical instructions in the ROB can span a sequential instruction window larger than the size of the ROB. This increases the amount of parallelism that can be extracted from critical instructions, thereby improving performance. In our implementation, CDF improves performance by (a) increasing the MLP for independent loads executing concurrently, (b) fetching critical path loads past hard-to-predict branches (by resolving them earlier), and (c) initiating last level cache misses that cannot be parallelized earlier. Accelerating critical loads using CDF achieves a 6.1% IPC improvement over a baseline OoO core with prefetching. Compared to Precise Runahead, the prior state-of-the-art work on accelerating last level cache misses on the core, we provide better performance and reduce memory traffic and energy consumption by 4.0% and 7.2%, respectively.
Software-Defined Vector Processing on Manycore Fabrics
Authors: Bedoukian, Philip and Adit, Neil and Peguero, Edwin and Sampson, Adrian
Keywords: SIMD, Reconfigurable, Manycore
Abstract
We describe a tiled architecture that can fluidly transition between manycore (MIMD) and vector (SIMD) execution. The hardware provides a software-defined vector programming model that lets applications aggregate groups of manycore tiles into logical vector engines. In manycore mode, the machine behaves as a standard parallel processor. In vector mode, groups of tiles repurpose their functional units as vector execution lanes and scratchpads as vector memory banks. The key mechanism is an instruction forwarding network: a single tile fetches instructions and sends them to other trailing cores. Most cores disable their frontends and instruction caches, so vector groups amortize the intrinsic hardware costs of von Neumann control. Vector groups also use a decoupled access/execute scheme to centralize their memory requests and issue coalesced, wide loads. We augment an existing RISC-V manycore design with a minimal hardware extension to implement software-defined vectors. Cycle-level simulation results show that software-defined vectors improve performance by an average of 1.7×
Session details: Session 4B: Accelerators I
Authors: Ghose, Saugata
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492788
Cerebros: Evading the RPC Tax in Datacenters
Authors: Pourhabibi, Arash and Sutherland, Mark and Daglis, Alexandros and Falsafi, Babak
Keywords: Remote Procedure Calls, Networked Systems, Microservices, Hardware Accelerators, Datacenters
Abstract
The emerging paradigm of microservices decomposes online services into fine-grained software modules frequently communicating over the datacenter network, often using Remote Procedure Calls (RPCs). Ongoing advancements in the network stack have exposed the RPC layer itself as a bottleneck that we show accounts for 40–90% of a microservice’s total execution cycles. We break down the underlying modules that comprise production RPC layers and demonstrate, based on prior evidence, that CPUs can only expect limited improvements for such tasks, mandating a shift to hardware to remove the RPC layer as a limiter of microservice performance. Although recently proposed accelerators can efficiently handle a portion of the RPC layer, their overall benefit is limited by unnecessary CPU involvement, which occurs because the accelerators are architected as co-processors under the CPU’s control. Instead, we show that conclusively removing the RPC layer bottleneck requires all of the RPC layer’s modules to be executed by a NIC-attached hardware accelerator. We introduce Cerebros, a dedicated RPC processor that executes the Apache Thrift RPC layer and acts as an intermediary stage between the NIC and the microservice running on the CPU. Our evaluation using the DeathStarBench microservice suite shows that Cerebros reduces the CPU cycles spent in the RPC layer by 37–64×
Equinox: Training (for Free) on a Custom Inference Accelerator
Authors: Drumond, Mario and Coulon, Louis and Pourhabibi, Arash and Yüzügüler, Ahmet Caner
Keywords: systolic arrays, DNN inference, DNN accelerators
Abstract
DNN inference accelerators executing online services exhibit low average loads because of service demand variability, leading to poor resource utilization. Unfortunately, reclaiming idle inference cycles is difficult as other workloads cannot execute on a custom accelerator. With recent proposals for the use of fixed-point arithmetic in training, there are opportunities for training services to piggyback on inference accelerators. We make the observation that a key challenge in doing so is maintaining service-level latency constraints for inference. We show that relaxing latency constraints in an inference accelerator with ALU arrays that are batching-optimized achieves near-optimal throughput for a given area and power envelope while maintaining inference services’ tail latency goals. We present Equinox, a custom inference accelerator designed to piggyback training. Equinox employs a uniform arithmetic encoding to accommodate inference and training and a priority hardware scheduler with adaptive batching that interleaves training during idle inference cycles. For a 500μs inference service time constraint, Equinox achieves 6.67×
MithriLog: Near-Storage Accelerator for High-Performance Log Analytics
Authors: Kang, Seongyoung and An, Jiyoung and Kim, Jinpyo and Jun, Sang-Woo
Keywords: No keywords
Abstract
This paper presents MithriLog, a log analytics platform with near-storage accelerators for high-performance, cost- and power-efficient unstructured log processing. MithriLog offloads log analytics queries to an efficient near-storage FPGA implementation of a token querying engine, which can take advantage of the high internal bandwidth of storage devices within the available chip resource limitations. This engine is flexible enough to handle complex queries including template search based on user-defined tree-based template libraries, as well as concurrent execution of multiple queries. MithriLog also uses a log-optimized version of a simple, high-throughput compression algorithm in order to further improve the effective bandwidth of backing storage. Evaluated with complex search queries on large real-world log datasets, MithriLog achieves an order of magnitude higher performance over software systems, even against more expensive machines with enough DRAM to stage the entire dataset. Furthermore, MithriLog delivers constant performance regardless of query complexity, resulting in further improved performance benefits with more complex queries. By replacing costly DRAM with storage and power-hungry CPU threads with FPGAs, MithriLog dramatically improves the cost-effectiveness and accessibility of log analytics.
PointAcc: Efficient Point Cloud Accelerator
Authors: Lin, Yujun and Zhang, Zhekai and Tang, Haotian and Wang, Hanrui and Han, Song
Keywords: sparse convolution, point cloud, neural network accelerator
Abstract
Deep learning on point clouds plays a vital role in a wide range of applications such as autonomous driving and AR/VR. These applications interact with people in real time on edge devices and thus require low latency and low energy. Compared to projecting the point cloud to 2D space, directly processing the 3D point cloud yields higher accuracy and lower #MACs. However, the extremely sparse nature of point clouds poses challenges to hardware acceleration. For example, we need to explicitly determine the nonzero outputs and search for the nonzero neighbors (mapping operation), which is unsupported in existing accelerators. Furthermore, explicit gather and scatter of sparse features are required, resulting in large data movement overhead. In this paper, we comprehensively analyze the performance bottleneck of modern point cloud networks on CPU/GPU/TPU. To address the challenges, we then present PointAcc, a novel point cloud deep learning accelerator. PointAcc maps diverse mapping operations onto one versatile ranking-based kernel, streams the sparse computation with configurable caching, and temporally fuses consecutive dense layers to reduce the memory footprint. Evaluated on 8 point cloud models across 4 applications, PointAcc achieves 3.7×
A Hardware Accelerator for Protocol Buffers
Authors: Karandikar, Sagar and Leary, Chris and Kennelly, Chris and Zhao, Jerry and Parimi, Dinesh and Nikolic, Borivoje and Asanovic, Krste and Ranganathan, Parthasarathy
Keywords: warehouse-scale computing, serialization, profiling, hyperscale systems, hardware-acceleration, deserialization
Abstract
Serialization frameworks are a fundamental component of scale-out systems, but introduce significant compute overheads. However, they are amenable to acceleration with specialized hardware. To understand the trade-offs involved in architecting such an accelerator, we present the first in-depth study of serialization framework usage at scale by profiling Protocol Buffers (“protobuf”) usage across Google’s datacenter fleet. We use this data to build HyperProtoBench, an open-source benchmark representative of key serialization-framework user services at scale. In doing so, we identify key insights that challenge prevailing assumptions about serialization framework usage. We use these insights to develop a novel hardware accelerator for protobufs, implemented in RTL and integrated into a RISC-V SoC. Applications can easily harness the accelerator, as it integrates with a modified version of the open-source protobuf library and is wire-compatible with standard protobufs. We have fully open-sourced our RTL, which, to the best of our knowledge, is the only such implementation currently available to the community. We also present a first-of-its-kind, end-to-end evaluation of our entire RTL-based system running hyperscale-derived benchmarks and microbenchmarks. We boot Linux on the system using FireSim to run these benchmarks and implement the design in a commercial 22nm FinFET process to obtain area and frequency metrics. We demonstrate an average 6.2×
Session details: Session 5A: Accelerators II
Authors: Mahajan, Divya
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492789
Archytas: A Framework for Synthesizing and Dynamically Optimizing Accelerators for Robotic Localization
Authors: Liu, Weizhuang and Yu, Bo and Gan, Yiming and Liu, Qiang and Tang, Jie and Liu, Shaoshan and Zhu, Yuhao
Keywords: run-time system, robotics, optimization, localization, hardware synthesis, bundle adjustment, accelerator, SLAM, FPGA
Abstract
Despite many recent efforts, accelerating robotic computing is still fundamentally challenging for two reasons. First, the robotics software stack is extremely complicated. Manually designing an accelerator while meeting the latency, power, and resource specifications is unscalable. Second, the environment in which an autonomous machine operates constantly changes; a static accelerator design leads to wasteful computation. This paper takes a first step in tackling these two challenges using localization as a case study. We describe Archytas, a framework that automatically generates a synthesizable accelerator from the high-level algorithm description while meeting design constraints. The accelerator continuously optimizes itself at run time according to the operating environment to save power while sustaining performance and accuracy. Archytas is able to generate FPGA-based accelerator designs that cover a large design space and achieve orders of magnitude performance improvement and/or energy savings compared to state-of-the-art baselines.
HoloAR: On-the-fly Optimization of 3D Holographic Processing for Augmented Reality
Authors: Zhao, Shulin and Zhang, Haibo and Mishra, Cyan Subhra and Bhuyan, Sandeepa and Ying, Ziyu and Kandemir, Mahmut Taylan and Sivasubramaniam, Anand and Das, Chita
Keywords: Holographic Processing, Energy-efficiency, Augmented Reality, Approximation
Abstract
Hologram processing is the primary bottleneck and contributes to more than 50% of energy consumption in battery-operated augmented reality (AR) headsets. Thus, improving the computational efficiency of the holographic pipeline is critical. The objective of this paper is to maximize its energy efficiency without jeopardizing the hologram quality for AR applications. Towards this, we take the approach of analyzing the workloads to identify approximation opportunities. We show that, by considering various parameters like region of interest and depth of view, we can approximate the rendering of the virtual object to minimize the amount of computation without affecting the user experience. Furthermore, by optimizing the software design flow, we propose HoloAR, which intelligently renders the most important object in sight to the clearest detail, while approximating the computations for the others, thereby significantly reducing the amount of computation, saving energy, and gaining performance at the same time. We implement our design in an edge GPU platform to demonstrate the real-world applicability of our research. Our experimental results show that, compared to the baseline, HoloAR achieves, on average, 2.7×
NOVIA: A Framework for Discovering Non-Conventional Inline Accelerators
Authors: Trilla, David and Wellman, John-David and Buyuktosunoglu, Alper and Bose, Pradip
Keywords: inline accelerator, hardware-software co-design, accelerator discovery
Abstract
Accelerators provide an increasingly valuable source of performance in modern computing systems. In most cases, accelerators are implemented as stand-alone, offload engines to which the processor can send large computation tasks. For many edge devices, as performance needs increase accelerators become essential, but the tight constraints on these devices limit the extent to which offload engines can be incorporated. An alternative is inline accelerators, which can be integrated as part of the core and provide performance with much smaller start-up times and area overheads. While inline accelerators allow greater flexibility in the interface and acceleration of finer grain code, determining good inline candidate accelerators is non-trivial. In this paper, we present NOVIA, a framework to derive inline accelerators by examining the workload source code and identifying inline accelerator candidates that provide benefits across many different regions of the workload. These NOVIA-derived accelerators are then integrated into an embedded core. For this core, NOVIA produces inline accelerators that improve the performance of various benchmark suites like EEMBC Autobench 2.0 and Mediabench by 1.37x with only a 3% core area increase.
Noema: Hardware-Efficient Template Matching for Neural Population Pattern Detection
Authors: Abdelhadi, Ameer M. S. and Sha, Eugene and Bannon, Ciaran and Steenland, Hendrik and Moshovos, Andreas
Keywords: No keywords
Abstract
Repeating patterns of activity across neurons are thought to be key to understanding how the brain represents, reacts, and learns. Advances in imaging and electrophysiology allow us to observe activities of groups of neurons in real-time, with ever increasing detail. Detecting patterns over these activity streams is an effective means to explore the brain, and to detect memories, decisions, and perceptions in real-time while driving effectors such as robotic arms, or augmenting and repairing brain function. Template matching is a popular algorithm for detecting recurring patterns in neural populations and has primarily been implemented on commodity systems. Unfortunately, template matching is memory intensive and computationally expensive. This has prevented its use in portable applications, such as neuroprosthetics, which are constrained by latency, form-factor, and energy. We present Noema, a dedicated template matching hardware accelerator that overcomes these limitations. Noema is designed to overcome the key bottlenecks of existing implementations: binning that converts the incoming bit-serial neuron activity streams into a stream of aggregate counts, memory storage and traffic for the templates and the binned stream, and the extensive use of floating-point arithmetic. The key innovation in Noema is a reformulation of template matching that enables computations to proceed progressively as data is received without binning while generating numerically identical results. This drastically reduces latency, and most computations can now use simple, area- and energy-efficient bit- and integer-arithmetic units. Furthermore, Noema implements template encoding to greatly reduce template memory storage and traffic. Noema is a hierarchical and scalable design where the bulk of its units are low-cost and can be readily replicated and their frequency can be adjusted to meet a variety of energy, area, and computation constraints.
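To make the binning-versus-progressive distinction concrete, the following Python sketch (not Noema's hardware) shows that adding a template weight per incoming spike reproduces exactly the score obtained by first binning the stream and then correlating it with the template; the neuron count, bin width, template, and random spike stream are all invented for illustration.

```python
# Illustrative sketch (not Noema's hardware): per-spike incremental
# accumulation reproduces the result of bin-then-correlate template matching.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_bins, bin_width = 8, 5, 10            # 10 time steps per bin
template = rng.integers(0, 4, size=(n_neurons, n_bins)).astype(float)

# Spike stream: (time step, neuron id) events
spikes = [(t, int(rng.integers(n_neurons)))
          for t in range(n_bins * bin_width) if rng.random() < 0.3]

# Baseline: bin first, then score (inner product with the template)
binned = np.zeros((n_neurons, n_bins))
for t, n in spikes:
    binned[n, t // bin_width] += 1
score_binned = float((binned * template).sum())

# Progressive: each spike immediately adds its template weight to the score,
# so no per-bin counts ever have to be stored.
score_progressive = sum(template[n, t // bin_width] for t, n in spikes)

assert np.isclose(score_binned, score_progressive)
print(score_binned, score_progressive)
```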
SquiggleFilter: An Accelerator for Portable Virus Detection
Authors: Dunn, Tim and Sadasivan, Harisankar and Wadden, Jack and Goliya, Kush and Chen, Kuan-Yu and Blaauw, David and Das, Reetuparna and Narayanasamy, Satish
Keywords: No keywords
Abstract
The MinION is a recent-to-market handheld nanopore sequencer. It can be used to determine the whole genome of a target virus in a biological sample. Its Read Until feature allows us to skip sequencing a majority of non-target reads (DNA/RNA fragments), which constitute more than 99% of all reads in a typical sample. However, it does not have any on-board computing, which significantly limits its portability. We analyze the performance of a Read Until metagenomic pipeline for detecting target viruses and identifying strain-specific mutations. We find new sources of performance bottlenecks (basecaller in classification of a read) that are not addressed by past genomics accelerators. We present SquiggleFilter, a novel hardware accelerated dynamic time warping (DTW) based filter that directly analyzes MinION’s raw squiggles and filters everything except target viral reads, thereby avoiding the expensive basecalling step. We show that our 14.3W, 13.25mm2 accelerator has 274×
Session details: Session 5B: Security & Privacy II
Authors: Lee, Dongyoon
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492790
UC-Check: Characterizing Micro-operation Caches in x86 Processors and Implications in Security and Performance
Authors: Kim, Joonsung and Jang, Hamin and Lee, Hunjun and Lee, Seungho and Kim, Jangwoo
Keywords: micro-operation cache, cache side-channel attack, cache reverse engineering, cache partitioning
Abstract
The modern x86 processor (e.g., Intel, AMD) translates CISC-style x86 instructions to RISC-style micro operations (uops) as RISC pipelines are more efficient than CISC pipelines. However, this x86 decoding process requires complex hardware logic (i.e., x86 decoder) to identify variable-length x86 instructions, which incurs high translation overhead. To avoid this overhead, the x86 processors adopt a micro-operation cache (uop cache) to bypass the expensive x86 decoder by caching the decoded uops. In this paper, we find that modern uop caches suffer from (1) security vulnerability and (2) severe cache contention between co-located SMT cores. To understand these security and performance implications of the uop cache, we propose UC-Check to extract various undisclosed features by using carefully designed microbenchmarks. With the extracted features, (1) we present two attack scenarios exploiting the uop cache as a new timing side-channel and propose a secure architecture to mitigate these attacks with negligible overhead. In addition, (2) we propose a logical uop cache allocation technique to alleviate the cache contention problem. For the evaluation, we extract many undocumented features on a wide spectrum of modern x86 processors and show that our proposed schemes (e.g., security attack/defense, performance optimization) are directly applicable to commodity x86 processors. For example, our logical uop cache allocation improves uop cache hit ratios by up to 1.33×.
Network-on-Chip Microarchitecture-based Covert Channel in GPUs
Authors: Ahn, Jaeguk and Kim, Jiho and Kasan, Hans and Delshadtehrani, Leila and Song, Wonjun and Joshi, Ajay and Kim, John
Keywords: Network-on-Chip, GPU, Covert Channel
Abstract
As GPUs are becoming widely deployed in the cloud infrastructure to support different application domains, the security concerns of GPUs are becoming increasingly important. In particular, the support for multiprogramming in modern GPUs has led to new vulnerabilities since multiple kernels in a GPU can be executed at the same time. In this work, we propose a new microarchitectural timing covert channel for GPUs that can be established based on the shared, on-chip interconnect channels. We first reverse-engineer the organization of the on-chip networks in modern GPUs to understand the core placements throughout the GPU. The hierarchical organization of the GPU results in the sharing of interconnect bandwidth between neighboring cores. Based on this understanding, we identify how contention for the interconnect bandwidth can be exploited for a novel covert channel attack. We propose two types of interconnect-based covert channels that exploit the on-chip network hierarchy. Unlike cache-based covert channels, no states of the on-chip network need to be modified for communication in our interconnect-based covert channel and the impact of contention is very predictable. By exploiting the parallelism of GPUs, our proposed covert channel achieves very high bandwidth – approximately 24 Mbps on NVIDIA Volta GPUs, one of the highest known microarchitectural covert channel bandwidths.
Validation of Side-Channel Models via Observation Refinement
Authors: Buiras, Pablo and Nemati, Hamed and Lindner, Andreas and Guanciale, Roberto
Keywords: Testing, Side channels, Model validation, Microarchitectures, Information flow security
Abstract
Observational models enable the analysis of information flow properties against side channels. Relational testing has been used to validate the soundness of these models by measuring the side channel on states that the model considers indistinguishable. However, unguided search can generate test states that are too similar to each other to invalidate the model. To address this we introduce observation refinement, a technique to guide the exploration of the state space to focus on hardware features of interest. We refine observational models to include fine-grained observations that characterize behavior that we want to exclude. States that yield equivalent refined observations are then ruled out, reducing the size of the space. We have extended an existing model validation framework, Scam-V, to support refinement. We have evaluated the usefulness of refinement for search guidance by analyzing cache coloring and speculative leakage in the ARMv8-A architecture. As a surprising result, we have exposed SiSCLoak, a new vulnerability linked to speculative execution in Cortex-A53.
GhostMinion: A Strictness-Ordered Cache System for Spectre Mitigation
Authors: Ainsworth, Sam
Keywords: microarchitectural security, caches, Spectre
Abstract
Out-of-order speculation, a technique ubiquitous since the early 1990s, remains a fundamental security flaw. Via attacks such as Spectre and Meltdown, an attacker can trick a victim, in an otherwise entirely correct program, into leaking its secrets through the effects of misspeculated execution, in a way that is entirely invisible to the programmer’s model. This has serious implications for application sandboxing and inter-process communication. Designing efficient mitigations that preserve the performance of out-of-order execution has been a challenge. The speculation-hiding techniques in the literature have been shown to not close such channels comprehensively, allowing adversaries to redesign attacks. Strong, precise guarantees are necessary, but mitigations must achieve high performance to be adopted. We present Strictness Ordering, a new constraint system that shows how we can comprehensively eliminate transient side channel attacks, while still allowing complex speculation and data forwarding between speculative instructions. We then present GhostMinion, a cache modification built using a variety of new techniques designed to provide Strictness Order at only 2.5% overhead.
Speculative Privacy Tracking (SPT): Leaking Information From Speculative Execution Without Compromising Privacy
Authors: Choudhary, Rutvik and Yu, Jiyong and Fletcher, Christopher and Morrison, Adam
Keywords: No keywords
Abstract
Speculative execution attacks put a dangerous new twist on information leakage through microarchitectural side channels. Ordinarily, programmers can reason about leakage based on the program’s semantics, and prevent said leakage by carefully writing the program to not pass secrets to covert channel-creating “transmitter” instructions, such as branches and loads. Speculative execution breaks this defense, because a transmitter might mis-speculatively execute with a secret operand even if it can never execute with said operand in valid executions. This paper proposes a new security definition that enables hardware to provide comprehensive, low-overhead and transparent-to-software protection against these attacks. The key idea is that it is safe to speculatively execute a transmitter without any protection if its operands were already leaked by the non-speculative execution. Based on this definition we design Speculative Privacy Tracking (SPT), a hardware protection that delays execution of every transmitter until it can prove that the transmitter’s operands leak during the program’s non-speculative execution. Using a novel dynamic information flow analysis microarchitecture, SPT efficiently proves when such an operand declassification implies that other data becomes declassified, which enables other delayed transmitters to be executed safely. We evaluate SPT on SPEC2017 and constant-time code benchmarks, and find that it adds only 45%/11% overhead on average (depending on the attack model) relative to an insecure processor. Compared to a secure baseline with the same protection scope, SPT reduces overhead by an average 3.6×.
Session details: Session 6A: Reliability & Verification
Authors: Awad, Amro
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492791
HARP: Practically and Effectively Identifying Uncorrectable Errors in Memory Chips That Use On-Die Error-Correcting Codes
Authors: Patel, Minesh and de Oliveira, Geraldo Francisco and Mutlu, Onur
Keywords: Repair, Reliability, On-Die ECC, Memory Test, Memory Scaling, Fault Tolerance, Error Profiling, Error Modeling, DRAM
Abstract
Aggressive storage density scaling in modern main memories causes increasing error rates that are addressed using error-mitigation techniques. State-of-the-art techniques for addressing high error rates identify and repair bits that are at risk of error from within the memory controller. Unfortunately, modern main memory chips internally use on-die error correcting codes (on-die ECC) that obfuscate the memory controller’s view of errors, complicating the process of identifying at-risk bits (i.e., error profiling). To understand the problems that on-die ECC causes for error profiling, we analytically study how on-die ECC changes the way that memory errors appear outside of the memory chip (e.g., to the memory controller). We show that on-die ECC introduces statistical dependence between errors in different bit positions, raising three key challenges for practical and effective error profiling: on-die ECC (1) exponentially increases the number of at-risk bits the profiler must identify; (2) makes individual at-risk bits more difficult to identify; and (3) interferes with commonly-used memory data patterns that are designed to make at-risk bits easier to identify. To address the three challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new error profiling algorithm that rapidly achieves full coverage of at-risk bits based on two key insights. First, errors that on-die ECC fails to correct have two sources: (1) direct errors from raw bit errors in the data portion of the ECC word and (2) indirect errors that on-die ECC introduces when facing uncorrectable errors. Second, the maximum number of indirect errors that can occur concurrently is limited to the correction capability of on-die ECC. HARP’s key idea is to first identify all bits at risk of direct errors using existing profiling techniques with the help of small modifications to the on-die ECC mechanism. Then, a secondary ECC within the memory controller with correction capability equal to or greater than that of on-die ECC can safely identify bits at risk of indirect errors, if and when they fail. We evaluate HARP in simulation relative to two state-of-the-art baseline error profiling algorithms. We show that HARP achieves full coverage of all at-risk bits faster (e.g., 99th-percentile coverage 20.6%/36.4%/52.9%/62.1% faster, on average, given 2/3/4/5 raw bit errors per ECC word) than the baseline algorithms, which sometimes fail to achieve full coverage. We perform a case study of how each profiler impacts the system’s overall bit error rate (BER) when using a repair mechanism to tolerate DRAM data-retention errors. We show that HARP identifies all errors faster than the best-performing baseline algorithm (e.g., by 3.7×).
Characterizing and Mitigating Soft Errors in GPU DRAM
Authors: Sullivan, Michael B. and Saxena, Nirmal and O’Connor, Mike and Lee, Donghyuk and Racunas, Paul and Hukerikar, Saurabh and Tsai, Timothy and Hari, Siva Kumar Sastry and Keckler, Stephen W.
Keywords: No keywords
Abstract
GPUs are used in high-reliability systems, including high-performance computers and autonomous vehicles. Because GPUs employ a high-bandwidth, wide interface to DRAM and fetch each memory access from a single DRAM device, implementing full-device correction through ECC is expensive and impractical. This challenge is compounded by worsening relative rates of multi-bit DRAM errors and increasing GPU memory capacities. This paper first presents high-energy neutron beam testing results for the HBM2 memory on a compute-class GPU. These results uncovered unexpected intermittent errors that we determine to be caused by cell damage from the high-intensity beam. As these errors are an artifact of the testing apparatus, we provide best-practice guidance on how to identify and filter them from the results of beam testing campaigns. Second, we use the soft error beam testing results to inform the design and evaluation of system-level error protection mechanisms by reporting the relative error rates and error patterns from soft errors in GPU DRAM. We observe locality in the multi-bit errors, which we attribute to the underlying structure of the HBM2 memory. Based on these error patterns, we propose several novel ECC schemes to decrease the silent data corruption risk by up to five orders of magnitude relative to SEC-DED ECC, while also reducing the number of uncorrectable errors by up to 7.87×.
Turnpike: Lightweight Soft Error Resilience for In-Order Cores
Authors: Zeng, Jianping and Kim, Hongjune and Lee, Jaejin and Jung, Changhee
Keywords: No keywords
Abstract
Acoustic-sensor-based soft error resilience is particularly promising, since it can verify the absence of soft errors and eliminate silent data corruptions at a low hardware cost. However, the state-of-the-art work incurs a significant performance overhead for in-order cores due to frequent structural/data hazards during the verification. To address the problem, this paper presents Turnpike, a compiler/architecture co-design scheme that can achieve lightweight yet guaranteed soft error resilience for in-order cores. The key idea is that many of the data computed in the core can bypass the soft error verification without compromising the resilience. Along with simple microarchitectural support for realizing the idea, Turnpike leverages compiler optimizations to further reduce the performance overhead. Experimental results with 36 benchmarks demonstrate that Turnpike only incurs a 0-14% run-time overhead on average while the state-of-the-art incurs a 29-84% overhead when the worst-case latency of the sensor-based error detection is 10-50 cycles.
Effective Processor Verification with Logic Fuzzer Enhanced Co-simulation
Authors: Kabylkas, Nursultan and Thorn, Tommy and Srinath, Shreesha and Xekalakis, Polychronis and Renau, Jose
Keywords: microprocessor verification, enhanced simulation, co-simulation, RISC-V
Abstract
The study on verification trends in the semiconductor industry shows that the design complexity is increasing, fewer companies achieve first silicon success and need more spins before production, companies hire more verification engineers, and 53% of the whole hardware-design-cycle is spent on the design verification [18]. The cost of a respin is high, and more than 40% of the cases that contribute to it are post-fabrication functional bug exposures [16]. The study also shows that 65% of verification engineers’ time is spent on debug, test creation, and simulation [17]. This paper presents a set of tools for RISC-V processor verification engineers that help to expose more bugs before production and increase the productivity of time spent on debugging, test creation and simulation. We present Logic Fuzzer (LF), a novel tool that expands the verification space exploration without the creation of additional verification tests. The LF randomizes the states or control signals of the design-under-test at the places that do not affect functionality. It brings the processor execution outside its normal flow to increase the number of microarchitectural states exercised by the tests. We also present Dromajo, the state of the art processor verification framework for RISC-V cores. Dromajo is an RV64GC emulator that was designed specifically for co-simulation purposes. It can boot Linux, handle external stimuli, such as interrupts and debug requests on the fly, and can be integrated into existing testbench infrastructure with minimal effort. We evaluate the effectiveness of the tools on three RISC-V cores: CVA6, BlackParrot, and BOOM. Dromajo by itself found a total of nine bugs. The enhancement of Dromajo with the Logic Fuzzer increases the exposed bug count to thirteen without creating additional verification tests.
Synthesizing Formal Models of Hardware from RTL for Efficient Verification of Memory Model Implementations
Authors: Hsiao, Yao and Mulligan, Dominic P. and Nikoleris, Nikos and Petri, Gustavo and Trippel, Caroline
Keywords: verification, shared memory, memory consistency, concurrency
Abstract
Modern hardware complexity makes it challenging to determine if a given microarchitecture adheres to a particular memory consistency model (MCM). This observation inspired the Check tools, which formally check that a specific microarchitecture correctly implements an MCM with respect to a suite of litmus test programs. Unfortunately, despite their effectiveness and efficiency, the Check tools must be supplied a microarchitecture in the guise of a manually constructed axiomatic specification, called a μspec model. To facilitate MCM verification—and enable the Check tools to consume processor RTL directly—we introduce a methodology and associated tool, rtl2μspec, for automatically synthesizing μspec models from processor designs written in Verilog or SystemVerilog, with the help of modest user-provided design metadata. As a case study, we use rtl2μspec to facilitate the Check-based verification of the four-core RISC-V V-scale (multi-V-scale) processor’s MCM implementation. We show that rtl2μspec can synthesize a complete, and proven correct by construction, μspec model from the SystemVerilog design of the multi-V-scale processor in 6.84 minutes. Subsequent Check-based MCM verification of the synthesized μspec model takes less than one second per litmus test.
Session details: Session 6B: GPGPU
Authors: Jog, Adwait
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492792
Ohm-GPU: Integrating New Optical Network and Heterogeneous Memory into GPU Multi-Processors
Authors: Zhang, Jie and Jung, Myoungsoo
Keywords: optical network, graphics processing unit (GPU), Parallel processing, Optane DC PMM, GDDR DRAM
Abstract
Traditional graphics processing units (GPUs) suffer from low memory capacity and demand high memory bandwidth. To address these challenges, we propose Ohm-GPU, a new optical network based heterogeneous memory design for GPUs. Specifically, Ohm-GPU can expand the memory capacity by combining a set of high-density 3D XPoint and DRAM modules as heterogeneous memory. To prevent memory channels from throttling throughput of GPU memory system, Ohm-GPU replaces the electrical lanes in the traditional memory channel with a high-performance optical network. However, the hybrid memory can introduce frequent data migrations between DRAM and 3D XPoint, which can unfortunately occupy the memory channel and increase the optical network traffic. To prevent the intensive data migrations from blocking normal memory services, Ohm-GPU revises the existing memory controller and designs a new optical network infrastructure, which enables the memory channel to serve the data migrations and memory requests, in parallel. Our evaluation results reveal that Ohm-GPU can improve the performance by 181% and 27%, compared to a DRAM-based GPU memory system and the baseline optical network based heterogeneous memory system, respectively.
Intersection Prediction for Accelerated GPU Ray Tracing
Authors: Liu, Lufei and Chang, Wesley and Demoullin, Francois and Chou, Yuan Hsi and Saed, Mohammadreza and Pankratz, David and Nowicki, Tyler and Aamodt, Tor M.
Keywords: ray tracing, hardware accelerator, graphics, GPU
Abstract
Ray tracing has been used for years in motion picture to generate photorealistic images while faster raster-based shading techniques have been preferred for video games to meet real-time requirements. However, recent Graphics Processing Units (GPUs) incorporate hardware accelerator units designed for ray tracing. These accelerator units target the process of traversing hierarchical tree data structures used to test for ray-object intersections. Distinct rays following similar paths through these structures execute many redundant ray-box intersection tests. We propose a ray intersection predictor that speculatively elides redundant operations during this process and proceeds directly to test primitives that the ray is likely to intersect. A key aspect of our predictor strategy involves identifying hash functions that preserve enough spatial information to identify redundant traversals. We explore how to integrate our ray prediction strategy into existing GPU pipelines along with improving the predictor effectiveness by predicting nodes higher in the tree as well as regrouping and scheduling traversal operations in a low cost, judicious manner. On a mobile class GPU with a ray tracing accelerator unit, we find the addition of a 5.5KB predictor per streaming multiprocessor improves performance for ambient occlusion workloads by a geometric mean of 26%.
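The following toy Python sketch illustrates the general shape of such a predictor: hash a quantized ray, look up the primitive a similar ray hit before, verify it, and fall back to full traversal on a mispredict. The hash function, table size, and training policy here are assumptions for illustration, not the paper's design.

```python
# Toy sketch of a hash-based ray intersection predictor (not the paper's
# microarchitecture). The predictor elides the tree walk when a spatially
# similar ray already told us which primitive to test.
def ray_hash(origin, direction, grid=4.0, bits=12):
    q = tuple(int(c * grid) for c in origin) + tuple(int(c * grid) for c in direction)
    return hash(q) & ((1 << bits) - 1)

predictor = {}                          # hash -> predicted primitive id

def trace(origin, direction, full_traversal, test_primitive):
    key = ray_hash(origin, direction)
    pred = predictor.get(key)
    if pred is not None and test_primitive(origin, direction, pred):
        return pred                     # prediction verified: tree walk elided
    hit = full_traversal(origin, direction)   # mispredict/miss: normal BVH walk
    if hit is not None:
        predictor[key] = hit            # train for subsequent similar rays
    return hit
```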
Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads
Authors: Avalos Baddouh, Cesar and Khairy, Mahmoud and Green, Roland N. and Payer, Mathias and Rogers, Timothy G.
Keywords: Workload sampling, Simulation methodology, GPU
Abstract
Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-level simulation is orders of magnitude slower than native silicon, so the only solution is to reduce the amount of work simulated while accurately representing the program. Existing solutions to simulate GPU programs either scale the input size, simulate the first several billion instructions, or simulate a portion of both the GPU and the workload. These solutions lack validation against scaled systems, produce unrealistic contention conditions and frequently miss critical code sections. Existing CPU sampling mechanisms, like SimPoint, reduce per-thread workload and are ill-suited to GPU programs where reducing the number of threads is critical. Sampling solutions in the GPU space lack silicon validation, require per-workload parameter tuning, and do not scale. A tractable solution, validated on contemporary scaled workloads, is needed to provide credible simulation results. By studying scaled workloads with centuries-long simulation times, we uncover practical and algorithmic limitations of existing solutions and propose Principal Kernel Analysis: a hierarchical program sampling methodology that concisely represents GPU programs by selecting representative kernel portions using a scalable profiling methodology, tractable clustering algorithm and detection of intra-kernel IPC stability. We validate Principal Kernel Analysis across 147 workloads and three GPU generations using the Accel-Sim simulator, demonstrating a better performance/error tradeoff than prior work and that century-long MLPerf simulations are reduced to hours with an average cycle error of 27% versus silicon.
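As a rough illustration of the sampling idea (cluster kernel invocations by a cheap profile signature, simulate one representative per cluster, and project whole-program cycles), here is a minimal Python sketch. The two profile features, the synthetic invocations, the clustering choice, and the cost model are invented; they are not the paper's actual methodology.

```python
# Hedged sketch of hierarchical kernel sampling: cluster, pick representatives,
# weight by cluster size. Not Principal Kernel Analysis itself.
import numpy as np

rng = np.random.default_rng(1)
# invented per-invocation profiles: [instruction count, memory intensity]
profiles = np.vstack([rng.normal(c, 0.05, size=(50, 2))
                      for c in ([1, 0.2], [5, 0.8], [2, 0.5], [8, 0.1])])

def kmeans(x, k=4, iters=20):
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([x[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def simulate(profile):          # stand-in for detailed simulation of one kernel
    return 1000.0 * profile.sum()

labels = kmeans(profiles)
total_cycles = sum(simulate(profiles[np.flatnonzero(labels == j)[0]]) *
                   np.sum(labels == j) for j in np.unique(labels))
print(f"projected cycles: {total_cycles:.0f}")
```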
AccelWattch: A Power Modeling Framework for Modern GPUs
Authors: Kandiah, Vijay and Peverelle, Scott and Khairy, Mahmoud and Pan, Junrui and Manjunath, Amogh and Rogers, Timothy G. and Aamodt, Tor M. and Hardavellas, Nikos
Keywords: Power Modeling and Simulation, GPGPU/GPU Computing
Abstract
Graphics Processing Units (GPUs) are rapidly dominating the accelerator space, as illustrated by their wide-spread adoption in the data analytics and machine learning markets. At the same time, performance per watt has emerged as a crucial evaluation metric together with peak performance. As such, GPU architects require robust tools that will enable them to model both the performance and the power consumption of modern GPUs. However, while GPU performance modeling has progressed in great strides, power modeling has lagged behind. To mitigate this problem we propose AccelWattch, a configurable GPU power model that resolves two long-standing needs: the lack of a detailed and accurate cycle-level power model for modern GPU architectures, and the inability to capture their constant and static power with existing tools. AccelWattch can be driven by emulation and trace-driven environments, hardware counters, or a mix of the two, models both PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS. We integrate AccelWattch with GPGPU-Sim and Accel-Sim to facilitate its widespread use. We validate AccelWattch on a NVIDIA Volta GPU, and show that it achieves strong correlation against hardware power measurements. Finally, we demonstrate that AccelWattch can enable reliable design space exploration: by directly applying AccelWattch tuned for Volta on GPU configurations resembling NVIDIA Pascal and Turing GPUs, we obtain accurate power models for these architectures.
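As a rough illustration of counter-driven power modeling of the general form the abstract describes (per-event dynamic energies plus constant and static terms, with voltage scaling for DVFS), the sketch below uses invented component names, coefficients, and a made-up reference operating point; it is not AccelWattch's calibrated model.

```python
# Hedged sketch of a counter-driven GPU power model (illustrative only).
BASE_V, BASE_F = 1.0, 1.38e9                     # reference voltage (V), frequency (Hz)
COEFF_PJ = {"fp32_op": 2.1, "sfu_op": 9.0, "l2_access": 28.0, "dram_access": 180.0}

def avg_power_watts(counters, cycles, v=BASE_V, f=BASE_F,
                    static_w=18.0, const_w=12.0):
    # dynamic energy: activity counters times per-event energies (picojoules)
    dyn_j = sum(COEFF_PJ[name] * count for name, count in counters.items()) * 1e-12
    dyn_j *= (v / BASE_V) ** 2                   # switching energy scales with V^2
    elapsed_s = cycles / f
    # static power tracks voltage; constant power is always-on infrastructure
    return static_w * (v / BASE_V) + const_w + dyn_j / elapsed_s

print(avg_power_watts({"fp32_op": 4e9, "sfu_op": 1e7,
                       "l2_access": 9e7, "dram_access": 2e7}, cycles=5e7))
```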
Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics
Authors: Tine, Blaise and Yalamarthy, Krishna Praveen and Elsabbagh, Fares and Kim, Hyesoon
Keywords: reconfigurable computing, memory systems, computer graphics
Abstract
The importance of open-source hardware and software has been increasing. However, despite GPUs being one of the more popular accelerators across various applications, there is very little open-source GPU infrastructure in the public domain. We argue that one of the reasons for the lack of open-source infrastructure for GPUs is rooted in the complexity of their ISA and software stacks. In this work, we first propose an ISA extension to RISC-V that supports GPGPUs and graphics. The main goal of the ISA extension proposal is to minimize the ISA changes so that the corresponding changes to the open-source ecosystem are also minimal, which makes for a sustainable development ecosystem. To demonstrate the feasibility of the minimally extended RISC-V ISA, we implemented the complete software and hardware stacks of Vortex on FPGA. Vortex is a PCIe-based soft GPU that supports OpenCL and OpenGL. Vortex can be used in a variety of applications, including machine learning, graph analytics, and graphics rendering. Vortex can scale up to 32 cores on an Altera Stratix 10 FPGA, delivering a peak performance of 25.6 GFlops at 200 MHz.
Session details: Session 7A: Microarchitecture I
Authors: Trancoso, Pedro
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492793
Enabling Branch-Mispredict Level Parallelism by Selectively Flushing Instructions
Authors: Eyerman, Stijn and Heirman, Wim and Van Den Steen, Sam and Hur, Ibrahim
Keywords: No keywords
Abstract
Conventionally, branch mispredictions are resolved by flushing wrongly speculated instructions from the reorder buffer and refetching instructions along the correct path. However, a large part of the misspeculated instructions could have reconverged with the correct path and executed correctly. Yet, they are flushed to ensure in-order commit. This inefficiency has been recognized in prior work, which proposes either complex additions to a core to reuse the correctly executed instructions, or less intrusive solutions that only reuse part of the converged instructions. We propose a hardware-software cooperative mechanism to recover correctly executed instructions, avoiding the need to refetch and re-execute them. It combines relatively limited additions to the core architecture with a high reuse of reconverged instructions. Adding the software hints to enable our mechanism is a similar effort as parallelizing an application, which is already necessary to extract high performance from current multicore processors. We evaluate the technique on emerging graph applications and sorting, applications that are known to perform poorly on conventional CPUs, and report an average 29% increase in performance.
PDede: Partitioned, Deduplicated, Delta Branch Target Buffer
Authors: Soundararajan, Niranjan K and Braun, Peter and Khan, Tanvir Ahmed and Kasikci, Baris and Litz, Heiner and Subramoney, Sreenivas
Keywords: Superscalar cores, Performance, Branch Target Buffer
Abstract
Due to large instruction footprints, contemporary data center applications suffer from frequent frontend stalls. Despite being a significant contributor to these stalls, the Branch Target Buffer (BTB) has received less attention compared to other frontend structures such as the instruction cache. While prior works have looked at enhancing the BTB through more efficient replacement policies and prefetching policies, a thorough analysis into optimizing the BTB’s storage efficiency is missing. In this work, we analyze BTB accesses for a large number (100+) of frontend bound applications to understand their branch target characteristics. This analysis provides three significant observations about the nature of branch targets: (1) a significant number of branch instructions have the same branch target, (2) a significant number of branch targets share the same page address, and (3) a significant percentage of branch instructions and their targets are located on the same page. Furthermore, we observe that while applications’ address spaces are sparsely populated, they exhibit spatial locality within and across pages. We refer to these multi-page addresses as regions and we show that applications traverse a significantly smaller number of regions than pages. Based on these insights, we propose PDede, an efficient re-design of the BTB micro-architecture that improves storage efficiency by removing redundancy among branches and their targets. PDede introduces three techniques, (a) BTB Partitioning, (b) Branch Target Deduplication, and (c) Delta Branch Target Encoding to reduce BTB miss induced frontend stalls. We evaluate PDede across 100+ applications, spanning several usage scenarios, and show that it provides an average 14.4% (up to 76%) IPC speedup by reducing BTB misses by 54.7% on average (and up to 99.8%).
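The following conceptual Python sketch illustrates the storage idea behind deduplication and delta encoding (not the actual microarchitecture): same-page targets are kept as short deltas, and shared cross-page targets are stored once and referenced by pointer. Field widths, set indexing, and the eviction policy are omitted or invented.

```python
# Conceptual sketch of deduplicated, delta-encoded BTB storage (illustrative).
PAGE_BITS = 12
PAGE_MASK = ~((1 << PAGE_BITS) - 1)

class TinyBTB:
    def __init__(self):
        self.targets = []          # deduplicated full target addresses
        self.entries = {}          # branch PC -> ("delta", d) or ("ptr", idx)

    def insert(self, branch_pc, target):
        if (branch_pc >> PAGE_BITS) == (target >> PAGE_BITS):
            # same page: only the low-order offset needs to be stored
            self.entries[branch_pc] = ("delta", target - (branch_pc & PAGE_MASK))
        else:
            if target not in self.targets:
                self.targets.append(target)        # deduplicate shared targets
            self.entries[branch_pc] = ("ptr", self.targets.index(target))

    def lookup(self, branch_pc):
        kind, val = self.entries[branch_pc]
        if kind == "delta":
            return (branch_pc & PAGE_MASK) + val
        return self.targets[val]

btb = TinyBTB()
btb.insert(0x40_1000, 0x40_1abc)   # same page: stored as a 12-bit delta
btb.insert(0x40_2004, 0x7f_0000)   # cross page: full target stored once
btb.insert(0x40_3008, 0x7f_0000)   # second branch reuses the deduplicated target
assert btb.lookup(0x40_1000) == 0x40_1abc and btb.lookup(0x40_3008) == 0x7f_0000
```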
Leveraging Targeted Value Prediction to Unlock New Hardware Strength Reduction Potential
Authors: Perais, Arthur
Keywords: value prediction, speculation, performance, Microarchitecture
Abstract
Value Prediction (VP) is a microarchitectural technique that speculatively breaks data dependencies to increase the available Instruction Level Parallelism (ILP) in general purpose processors. Despite recent proposals, VP remains expensive and has intricate interactions with several stages of the classical superscalar pipeline. In this paper, we revisit and simplify VP by leveraging the irregular distribution of the values produced during the execution of common programs. First, we demonstrate that a reasonable fraction of the performance uplift brought by a full VP infrastructure can be obtained by predicting only a few “usual suspects” values. Furthermore, we show that doing so allows us to greatly simplify VP operation as well as reduce the value predictor footprint. Lastly, we show that these Minimal and Targeted VP infrastructures conceptually enable Speculative Strength Reduction (SpSR), a rename-time optimization whereby instructions can disappear at rename in the presence of specific operand values.
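A minimal Python sketch of the "usual suspects" idea follows: instead of a full value predictor, track per-PC confidence for a handful of common values and predict only when confidence saturates. The candidate values, counter widths, and thresholds are arbitrary choices for illustration, not the paper's design.

```python
# Toy "usual suspects" value predictor (illustrative only).
from collections import defaultdict

USUAL_SUSPECTS = (0, 1)            # assumed common values; chosen arbitrarily
CONF_MAX, CONF_PREDICT = 7, 7

conf = defaultdict(lambda: {v: 0 for v in USUAL_SUSPECTS})   # pc -> value -> counter

def predict(pc):
    for v, c in conf[pc].items():
        if c >= CONF_PREDICT:
            return v               # speculate: break the data dependency
    return None                    # no prediction for irregular values

def train(pc, actual):
    for v in USUAL_SUSPECTS:
        if actual == v:
            conf[pc][v] = min(CONF_MAX, conf[pc][v] + 1)
        else:
            conf[pc][v] = 0        # any other outcome resets confidence

for outcome in [0] * 10 + [3] + [0] * 10:   # a load that almost always returns 0
    predict(0x1234)
    train(0x1234, outcome)
print(predict(0x1234))             # -> 0 once confidence has re-saturated
```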
Branch Runahead: An Alternative to Branch Prediction for Impossible to Predict Branches
Authors: Pruett, Stephen and Patt, Yale
Keywords: Pre-computation, Control Independence, Branch Prediction
Abstract
High performance microprocessors require high levels of instruction supply. Branch prediction has been the most important driver of this for nearly 30 years. Unfortunately, modern predictors are increasingly bottlenecked by hard-to-predict data-dependent branches that fundamentally cannot be predicted via a history based approach. Pre-computation of branch instructions has been suggested as a solution, but such schemes require a careful trade-off between timeliness and complexity. This paper introduces Branch Runahead: a low-cost, hardware-only solution that achieves high accuracy while only performing lightweight pre-computation. The result: a reduction in branch MPKI of 47.5% and an average improvement in IPC of 16.9%.
Twig: Profile-Guided BTB Prefetching for Data Center Applications
Authors: Khan, Tanvir Ahmed and Brown, Nathan and Sriraman, Akshitha and Soundararajan, Niranjan K and Kumar, Rakesh and Devietti, Joseph and Subramoney, Sreenivas and Pokam, Gilles A and Litz, Heiner and Kasikci, Baris
Keywords: frontend stalls, data center, branch target buffer, Prefetching
Abstract
Modern data center applications have deep software stacks, with instruction footprints that are orders of magnitude larger than typical instruction cache (I-cache) sizes. To efficiently prefetch instructions into the I-cache despite large application footprints, modern server-class processors implement a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP). In this work, we first characterize the limitations of a decoupled frontend processor with FDIP and find that FDIP suffers from significant Branch Target Buffer (BTB) misses. We also find that existing techniques (e.g., stream prefetchers and predecoders) are unable to mitigate these misses, as they rely on an incomplete understanding of a program’s branching behavior. To address the shortcomings of existing BTB prefetching techniques, we propose Twig, a novel profile-guided BTB prefetching mechanism. Twig analyzes a production binary’s execution profile to identify critical BTB misses and inject BTB prefetch instructions into code. Additionally, Twig coalesces multiple non-contiguous BTB prefetches to improve the BTB’s locality. Twig exposes these techniques via new BTB prefetch instructions. Since Twig prefetches BTB entries without modifying the underlying BTB organization, it is easy to adopt in modern processors. We study Twig’s behavior across nine widely-used data center applications, and demonstrate that it achieves an average 20.86% (up to 145%) performance speedup over a baseline 8K-entry BTB, outperforming the state-of-the-art BTB prefetch mechanism by 19.82% (on average).
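In the spirit of the profile-driven flow described above, the sketch below picks the branches that miss most in the BTB, chooses an earlier "trigger" block from the profile to host an injected prefetch, and coalesces prefetches whose targets share a page. The profile format, trigger choice, and coalescing rule are simplified assumptions, not Twig's actual algorithm.

```python
# Hedged sketch of planning profile-guided, coalesced BTB prefetches.
from collections import Counter, defaultdict

PAGE = 1 << 12

def plan_btb_prefetches(miss_profile, top_n=2):
    # miss_profile: list of (trigger_block, missing_branch_pc, branch_target)
    hottest = Counter(m[1] for m in miss_profile).most_common(top_n)
    hot_pcs = {pc for pc, _ in hottest}
    per_trigger = defaultdict(set)
    for trigger, pc, target in miss_profile:
        if pc in hot_pcs:
            per_trigger[trigger].add((pc, target))
    plan = {}
    for trigger, entries in per_trigger.items():
        by_page = defaultdict(list)
        for pc, target in entries:
            by_page[target // PAGE].append((pc, target))   # coalesce by page
        plan[trigger] = [sorted(group) for group in by_page.values()]
    return plan            # trigger block -> coalesced prefetch groups to inject

profile = [(0x100, 0x4a00, 0x9000), (0x100, 0x4a00, 0x9000),
           (0x100, 0x4b10, 0x9040), (0x200, 0x7c20, 0x12000)]
print(plan_btb_prefetches(profile))
```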
Session details: Session 7B: Accelerators III
Authors: Wills, Lisa Wu
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492794
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference
Authors: Tambe, Thierry and Hooper, Coleman and Pentecost, Lillian and Jia, Tianyu and Yang, En-Yu and Donato, Marco and Sanh, Victor and Whatmough, Paul and Rush, Alexander M. and Brooks, David and Wei, Gu-Yeon
Keywords: software and hardware co-design, natural language processing, latency-aware, embedded non-volatile memories
Abstract
Transformer-based language models such as BERT provide significant accuracy improvement to a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimizations for multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system generates up to 7×
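The following Python sketch illustrates the two building blocks named above in isolation: an entropy-based early exit test over per-layer exit-head probabilities, and choosing the slowest operating point that still meets a latency target. The threshold, layer latencies, and the voltage/frequency table are illustrative assumptions, not EdgeBERT's values.

```python
# Hedged sketch of entropy-based early exit plus sentence-level DVFS selection.
import math

DVFS_TABLE = [(0.55, 0.33), (0.75, 0.66), (1.0, 1.0)]   # (relative V, relative f)

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def run_with_early_exit(exit_head_probs, threshold=0.4):
    # exit_head_probs: per-layer class probabilities from the exit heads
    for layer, probs in enumerate(exit_head_probs, start=1):
        if entropy(probs) < threshold:
            return layer           # confident enough: stop inference here
    return len(exit_head_probs)

def pick_dvfs(predicted_exit_layer, per_layer_latency, target_latency):
    # slowest (lowest-voltage) operating point that still meets the deadline
    for v, f in DVFS_TABLE:
        if predicted_exit_layer * per_layer_latency / f <= target_latency:
            return v, f
    return DVFS_TABLE[-1]

print(run_with_early_exit([[0.5, 0.5], [0.8, 0.2], [0.97, 0.03]]))   # -> 3
print(pick_dvfs(predicted_exit_layer=3, per_layer_latency=1.0, target_latency=6.0))
```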
HiMA: A Fast and Scalable History-based Memory Access Engine for Differentiable Neural Computer
Authors: Tao, Yaoyu and Zhang, Zhengya
Keywords: memory-augmented neural networks, memory access engine, differentiable neural computer
Abstract
Memory-augmented neural networks (MANNs) provide better inference performance in many tasks with the help of an external memory. The recently developed differentiable neural computer (DNC) is a MANN that has been shown to outperform in representing complicated data structures and learning long-term dependencies. DNC’s higher performance is derived from new history-based attention mechanisms in addition to the previously used content-based attention mechanisms. History-based mechanisms require a variety of new compute primitives and state memories, which are not supported by existing neural network (NN) or MANN accelerators. We present HiMA, a tiled, history-based memory access engine with distributed memories in tiles. HiMA incorporates a multi-mode network-on-chip (NoC) to reduce the communication latency and improve scalability. An optimal submatrix-wise memory partition strategy is applied to reduce the amount of NoC traffic; and a two-stage usage sort method leverages distributed tiles to improve computation speed. To make HiMA fundamentally scalable, we create a distributed version of DNC called DNC-D to allow almost all memory operations to be applied to local memories with trainable weighted summation to produce the global memory output. Two approximation techniques, usage skimming and softmax approximation, are proposed to further enhance hardware efficiency. HiMA prototypes are created in RTL and synthesized in a 40nm technology. By simulations, HiMA running DNC and DNC-D demonstrates 6.47×
FPRaker: A Processing Element For Accelerating Neural Network Training
Authors: Awad, Omar Mohamed and Mahmoud, Mostafa and Edo, Isak and Zadeh, Ali Hadi and Bannon, Ciaran and Jayarajan, Anand and Pekhimenko, Gennady and Moshovos, Andreas
Keywords: No keywords
Abstract
We present FPRaker, a processing element for composing training accelerators. FPRaker processes several floating-point multiply-accumulation operations concurrently and accumulates their result into a higher precision accumulator. FPRaker boosts performance and energy efficiency during training by taking advantage of the values that naturally appear during training. It processes the significand of the operands of each multiply-accumulate as a series of signed powers of two. The conversion to this form is done on-the-fly. This exposes ineffectual work that can be skipped: values when encoded have few terms and some of them can be discarded as they would fall outside the range of the accumulator given the limited precision of floating-point. FPRaker also takes advantage of spatial correlation in values across channels and uses delta-encoding off-chip to reduce memory footprint and bandwidth. We demonstrate that FPRaker can be used to compose an accelerator for training and that it can improve performance and energy efficiency compared to using optimized bit-parallel floating-point units under iso-compute area constraints. We also demonstrate that FPRaker delivers additional benefits when training incorporates pruning and quantization. Finally, we show that FPRaker naturally amplifies performance with training methods that use a different precision per layer.
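The toy Python sketch below illustrates the idea of processing a value as a short series of signed powers of two and skipping terms that fall below the accumulator's precision. It is an algorithmic illustration only; the greedy recoding and the accumulator LSB are arbitrary choices, not FPRaker's pipeline.

```python
# Hedged sketch: signed power-of-two recoding and skipping of ineffectual terms.
import math

def signed_power_terms(x, max_terms=8):
    """Greedily recode x as a sum of +/- 2^k terms."""
    terms, residual = [], x
    for _ in range(max_terms):
        if residual == 0.0:
            break
        k = round(math.log2(abs(residual)))
        term = math.copysign(2.0 ** k, residual)
        terms.append(term)
        residual -= term
    return terms

def accumulate(terms, acc_lsb=2.0 ** -3):
    # terms below the accumulator's LSB cannot change the limited-precision
    # result, so they are ineffectual and can be skipped
    return sum(t for t in terms if abs(t) >= acc_lsb)

terms = signed_power_terms(0.8125)          # -> [1.0, -0.25, 0.0625]
print(terms, sum(terms), accumulate(terms)) # exact sum vs. truncated accumulation
```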
RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance
Authors: Gupta, Udit and Hsia, Samuel and Zhang, Jeff and Wilkening, Mark and Pombra, Javin and Lee, Hsien-Hsin Sean and Wei, Gu-Yeon and Wu, Carole-Jean and Brooks, David
Keywords: personalized recommendation, hardware accelerator, deep learning, datacenter
Abstract
Deep learning recommendation systems must provide high quality, personalized content under strict tail-latency targets and high system loads. This paper presents RecPipe, a system to jointly optimize recommendation quality and inference performance. Central to RecPipe is decomposing recommendation models into multi-stage pipelines to maintain quality while reducing compute complexity and exposing distinct parallelism opportunities. RecPipe implements an inference scheduler to map multi-stage recommendation engines onto commodity, heterogeneous platforms (e.g., CPUs, GPUs). While the hardware-aware scheduling improves ranking efficiency, the commodity platforms suffer from many limitations requiring specialized hardware. Thus, we design RecPipeAccel (RPAccel), a custom accelerator that jointly optimizes quality, tail-latency, and system throughput. RPAccel is designed specifically to exploit the distinct design space opened via RecPipe. In particular, RPAccel processes queries in sub-batches to pipeline recommendation stages, implements dual static and dynamic embedding caches, a set of top-k filtering units, and a reconfigurable systolic array. Compared to previously proposed specialized recommendation accelerators and at iso-quality, we demonstrate that RPAccel improves latency and throughput by 3×.
Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving
Authors: Wan, Qiyu and Xia, Haojun and Zhang, Xingyao and Wang, Lening and Song, Shuaiwen Leon and Fu, Xin
Keywords: random number generation, energy efficiency, Bayesian neural networks accelerator
Abstract
Bayesian Neural Networks (BNNs) that possess a property of uncertainty estimation have been increasingly adopted in a wide range of safety-critical AI applications which demand reliable and robust decision making, e.g., self-driving, rescue robots, medical image diagnosis. The training procedure of a probabilistic BNN model involves training an ensemble of sampled DNN models, which induces orders of magnitude larger volume of data movement than training a single DNN model. In this paper, we reveal that the root cause for BNN training inefficiency originates from the massive off-chip data transfer by Gaussian Random Variables (GRVs). To tackle this challenge, we propose a novel design that eliminates all the off-chip data transfer by GRVs through the reversed shifting of Linear Feedback Shift Registers (LFSRs) without incurring any training accuracy loss. To efficiently support our LFSR reversion strategy at the hardware level, we explore the design space of the current DNN accelerators and identify the optimal computation mapping scheme to best accommodate our strategy. By leveraging this finding, we design and prototype the first highly efficient BNN training accelerator, named Shift-BNN, that is low-cost and scalable. Extensive evaluation on five representative BNN models demonstrates that Shift-BNN achieves an average of 4.9×
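The LFSR-reversal trick can be illustrated with a small Python sketch: pseudo-random states consumed in the forward pass are regenerated during the backward pass by stepping the LFSR in reverse, so they never have to be written to off-chip memory. The 16-bit register, its tap positions, and the usage pattern below are illustrative, not Shift-BNN's exact generator.

```python
# Toy reversible LFSR: forward states can be replayed exactly in reverse.
MASK = 0xFFFF

def lfsr_forward(s):
    # 16-bit Fibonacci LFSR with taps at bits 0, 2, 3, and 5 (illustrative)
    fb = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
    return ((s >> 1) | (fb << 15)) & MASK

def lfsr_backward(s):
    # invert one forward step: recover the bit that was shifted out
    b0 = ((s >> 15) ^ (s >> 1) ^ (s >> 2) ^ (s >> 4)) & 1
    return ((s << 1) & MASK) | b0

seed = 0xACE1
states = [seed]
for _ in range(1000):                       # forward pass: consume random states
    states.append(lfsr_forward(states[-1]))

s = states[-1]                              # backward pass: replay them in reverse
for expected in reversed(states[:-1]):
    s = lfsr_backward(s)
    assert s == expected                    # identical values, nothing stored
print("reversed OK, final state:", hex(s))
```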
Session details: Session 8A: Superconducting & Quantum
Authors: Shi, Yunong
Keywords: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492795
Exploiting Different Levels of Parallelism in the Quantum Control Microarchitecture for Superconducting Qubits
Authors: Zhang, Mengyu and Xie, Lei and Zhang, Zhenxing and Yu, Qiaonian and Xi, Guanglei and Zhang, Hualiang and Liu, Fuming and Zheng, Yarui and Zheng, Yicong and Zhang, Shengyu
Keywords: Quantum Instruction Set Architecture, Quantum Control Microarchitecture, Parallelism, NISQ
Abstract
As current Noisy Intermediate Scale Quantum (NISQ) devices suffer from decoherence errors, any delay in the instruction execution of quantum control microarchitecture can lead to the loss of quantum information and incorrect computation results. Hence, it is crucial for the control microarchitecture to issue quantum operations to the Quantum Processing Unit (QPU) in time. As in classical microarchitecture, parallelism in quantum programs needs to be exploited for speedup. However, three challenges emerge in the quantum scenario: 1) the quantum feedback control can introduce significant pipeline stall latency; 2) timing control is required for all quantum operations; 3) QPU requires a deterministic operation supply to prevent the accumulation of quantum errors. In this paper, we propose a novel control microarchitecture design to exploit Circuit Level Parallelism (CLP) and Quantum Operation Level Parallelism (QOLP). Firstly, we develop a Multiprocessor architecture to exploit CLP, which supports dynamic scheduling of different sub-circuits. This architecture can handle parallel feedback control and minimize the potential overhead that disrupts the timing control. Secondly, we propose a Quantum Superscalar approach that exploits QOLP by efficiently executing massive quantum instructions in parallel. Both methods issue quantum operations to QPU deterministically. In the benchmark test of a Shor syndrome measurement, a six-core implementation of our proposal achieves up to 2.59×
SMART: A Heterogeneous Scratchpad Memory Architecture for Superconductor SFQ-based Systolic CNN Accelerators
Authors: Zokaee, Farzaneh and Jiang, Lei
Keywords: single-flux-quantum, scratchpad memory, CNN accelerator
Abstract
Ultra-fast & low-power superconductor single-flux-quantum (SFQ)-based CNN systolic accelerators are built to enhance the CNN inference throughput. However, shift-register (SHIFT)-based scratchpad memory (SPM) arrays prevent a SFQ CNN accelerator from exceeding 40% of its peak throughput, due to the lack of random access capability. This paper first documents our study of a variety of cryogenic memory technologies, including Vortex Transition Memory (VTM), Josephson-CMOS SRAM, MRAM, and Superconducting Nanowire Memory, during which we found that none of the aforementioned technologies made a SFQ CNN accelerator achieve high throughput, small area, and low power simultaneously. Second, we present a heterogeneous SPM architecture, SMART, composed of SHIFT arrays and a random access array to improve the inference throughput of a SFQ CNN systolic accelerator. Third, we propose a fast, low-power and dense pipelined random access CMOS-SFQ array by building SFQ passive-transmission-line-based H-Trees that connect CMOS sub-banks. Finally, we create an ILP-based compiler to deploy CNN models on SMART. Experimental results show that, with the same chip area overhead, compared to the latest SHIFT-based SFQ CNN accelerator, SMART improves the inference throughput by 3.9×.
AutoBraid: A Framework for Enabling Efficient Surface Code Communication in Quantum Computing
作者: Hua, Fei and Chen, Yanhao and Jin, Yuwei and Zhang, Chi and Hayes, Ari and Zhang, Youtao and Zhang, Eddy Z.
关键词: No keywords
Abstract
Quantum computers can solve problems that are intractable using the most powerful classical computer. However, qubits are fickle and error prone. It is necessary to actively correct errors in the execution of a quantum circuit. Quantum error correction (QEC) codes are developed to enable fault-tolerant quantum computing. With QEC, one logical circuit is converted into an encoded circuit. Most studies on quantum circuit compilation focus on NISQ devices which have 10-100 qubits and are not fault-tolerant. In this paper, we focus on the compilation for fault-tolerant quantum hardware. In particular, we focus on optimizing communication parallelism for the surface code based QEC. The execution of surface code circuits involves non-trivial geometric manipulation of a large lattice of entangled physical qubits. A two-qubit gate in surface code is implemented as a virtual “pipe” in space-time called a braiding path. The braiding paths should be carefully routed to avoid congestion. Communication between qubits is considered the major bottleneck as it involves scheduling and searching for simultaneous paths between qubits. We provide a framework for efficiently scheduling braiding paths. We discover that for quantum programs with a local parallelism pattern, our framework guarantees an optimal solution, while the previous greedy-heuristic-based solution cannot. Moreover, we propose an extension to the local parallelism analysis framework to address the communication bottleneck. Our framework achieves orders of magnitude improvement after addressing the communication bottleneck.
JigSaw: Boosting Fidelity of NISQ Programs via Measurement Subsetting
作者: Das, Poulami and Tannu, Swamit and Qureshi, Moinuddin
关键词: Quantum Computing, NISQ Computing, Error Mitigation
Abstract
Near-term quantum computers contain noisy devices, which makes it difficult to infer the correct answer even if a program is run for thousands of trials. On current machines, qubit measurements tend to be the most error-prone operations (with an average error-rate of 4%) and often limit the size of quantum programs that can be run reliably on these systems. As quantum programs create and manipulate correlated states, all the program qubits are measured in each trial and thus, the severity of measurement errors increases with the program size. The fidelity of quantum programs can be improved by reducing the number of measurement operations. We present JigSaw, a framework that reduces the impact of measurement errors by running a program in two modes. First, running the entire program and measuring all the qubits for half of the trials to produce a global (albeit noisy) histogram. Second, running additional copies of the program and measuring only a subset of qubits in each copy, for the remaining trials, to produce localized (higher fidelity) histograms over the measured qubits. JigSaw then employs a Bayesian post-processing step, whereby the histograms produced by the subset measurements are used to update the global histogram. Our evaluations using three different IBM quantum computers with 27 and 65 qubits show that JigSaw improves the success rate on average by 3.6x and up-to 8.4x. Our analysis shows that the storage and time complexity of JigSaw scales linearly with the number of qubits and trials, making JigSaw applicable to programs with hundreds of qubits.
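A minimal sketch of the subsetting-plus-update idea, assuming a simple proportional reweighting rule (the actual Bayesian post-processing, trial counts, and subset selection are the paper's; the histograms and subset below are invented for illustration): each full-length bitstring in the noisy global histogram is reweighted so that its marginal over the subset-measured qubits matches the higher-fidelity subset histogram.

from collections import defaultdict

def marginalize(hist, qubit_idx):
    # Project a histogram over full bitstrings onto the given qubit positions.
    out = defaultdict(float)
    for bits, p in hist.items():
        out["".join(bits[i] for i in qubit_idx)] += p
    return dict(out)

def jigsaw_like_update(global_hist, subset_hist, qubit_idx):
    # Reweight each full bitstring so its marginal over `qubit_idx` follows the
    # higher-fidelity subset histogram, then renormalize.
    marg = marginalize(global_hist, qubit_idx)
    out = {}
    for bits, p in global_hist.items():
        key = "".join(bits[i] for i in qubit_idx)
        if marg.get(key, 0.0) > 0.0:
            out[bits] = p * subset_hist.get(key, 0.0) / marg[key]
    norm = sum(out.values()) or 1.0
    return {b: p / norm for b, p in out.items()}

# Toy 3-qubit program whose ideal output is 000/111; qubits 0 and 1 were also
# measured alone in the subset trials.
noisy_global = {"000": 0.40, "111": 0.38, "010": 0.12, "101": 0.10}
subset_q01 = {"00": 0.49, "11": 0.48, "01": 0.02, "10": 0.01}
print(jigsaw_like_update(noisy_global, subset_q01, qubit_idx=(0, 1)))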
ADAPT: Mitigating Idling Errors in Qubits via Adaptive Dynamical Decoupling
作者: Das, Poulami and Tannu, Swamit and Dangwal, Siddharth and Qureshi, Moinuddin
关键词: Quantum computing, NISQ, Idling errors, Dynamical decoupling
Abstract
The fidelity of applications on near-term quantum computers is limited by hardware errors. In addition to errors that occur during gate and measurement operations, a qubit is susceptible to idling errors, which occur when the qubit is idle and not actively undergoing any operations. To mitigate idling errors, prior works in the quantum devices community have proposed Dynamical Decoupling (DD), that reduces stray noise on idle qubits by continuously executing a specific sequence of single-qubit operations that effectively behave as an identity gate. Unfortunately, existing DD protocols have been primarily studied for individual qubits and their efficacy at the application-level is not yet fully understood. Our experiments show that naively enabling DD for every idle qubit does not necessarily improve fidelity. While DD reduces the idling error-rates for some qubits, it increases the overall error-rate for others due to the additional operations of the DD protocol. Furthermore, idling errors are program-specific and the set of qubits that benefit from DD changes with each program. To enable robust use of DD, we propose Adaptive Dynamical Decoupling (ADAPT), a software framework that estimates the efficacy of DD for each qubit combination and judiciously applies DD only to the subset of qubits that provide the most benefit. ADAPT employs a Decoy Circuit, which is structurally similar to the original program but with a known solution, to identify the DD sequence that maximizes the fidelity. To avoid the exponential search of all possible DD combinations, ADAPT employs a localized algorithm that has linear complexity in the number of qubits. Our experiments on IBM quantum machines (with 16-27 qubits) show that ADAPT improves the application fidelity by 1.86x on average and up-to 5.73x compared to no DD and by 1.2x compared to DD on all qubits.
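A minimal sketch of the localized, linear-complexity selection (the decoy_fidelity callback stands in for running a decoy circuit with a known solution on hardware; the toy gain model and numbers are invented for illustration): sweep the qubits once and keep DD enabled on a qubit only if the decoy circuit's fidelity improves.

def adapt_like_select(num_qubits, decoy_fidelity):
    # Greedy, localized search: linear in the number of qubits instead of
    # exponential in the number of DD on/off combinations.
    dd_set = set()
    best = decoy_fidelity(dd_set)
    for q in range(num_qubits):
        trial = dd_set | {q}
        f = decoy_fidelity(trial)
        if f > best:                      # keep DD on qubit q only if it helps
            dd_set, best = trial, f
    return dd_set, best

def toy_fidelity(dd_set):
    # Toy model: DD helps qubits 0 and 3, hurts qubit 2, is neutral elsewhere.
    gain = {0: 0.05, 3: 0.08, 2: -0.04}
    return 0.60 + sum(gain.get(q, 0.0) for q in dd_set)

print(adapt_like_select(5, toy_fidelity))   # -> ({0, 3}, ~0.73)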
Session details: Session 8B: Sparse Processing
作者: Nowatzki, Tony
关键词: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492796
Distilling Bit-level Sparsity Parallelism for General Purpose Deep Learning Acceleration
作者: Lu, Hang and Chang, Liang and Li, Chenglong and Zhu, Zixuan and Lu, Shengjian and Liu, Yanhuan and Zhang, Mingzhe
关键词: neural network, bit-level sparsity, accelerator
Abstract
Along with the rapid evolution of deep neural networks, their ever-increasing complexity imposes formidable computation intensity on the hardware accelerator. In this paper, we propose a novel computing philosophy called “bit interleaving” and the associated accelerator design called “Bitlet” to maximally exploit bit-level sparsity. Unlike existing bit-serial/parallel accelerators, Bitlet leverages the abundant “sparsity parallelism” in the parameters to accelerate inference. Bitlet is versatile, supporting diverse precisions on a single platform, including floating-point 32 and fixed-point from 1b to 24b. This versatility makes Bitlet feasible for both efficient inference and training. Empirical studies on 12 domain-specific deep learning applications highlight the following results: (1) up to 81× …
Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture
作者: Lu, Liqiang and Jin, Yicheng and Bi, Hangrui and Luo, Zizhang and Li, Peng and Wang, Tao and Liang, Yun
关键词: systolic array, sparse, reconfigurable architecture, hardware-software co-design, attention, Transformer
Abstract
In recent years, attention-based models have achieved impressive performance in natural language processing and computer vision applications by effectively capturing contextual knowledge from the entire sequence. However, the attention mechanism inherently contains a large number of redundant connections, imposing a heavy computational burden on model deployment. To this end, sparse attention has emerged as an attractive approach to reduce the computation and memory footprint, which involves the sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM) at the same time, thus requiring the hardware to eliminate zero-valued operations effectively. Existing techniques based on irregular sparse patterns or regular but coarse-grained patterns lead to low hardware efficiency or less computation saving. This paper proposes Sanger, a framework that harvests sparsity in the attention mechanism through synergistic hardware and software co-design. The software part prunes the attention matrix into a dynamic structured pattern, and the hardware part features a reconfigurable architecture that exploits such patterns. Specifically, we dynamically sparsify vanilla attention based on a quantized prediction of the attention matrix. Then, the sparse mask is re-arranged into structured blocks that are more amenable to hardware implementation. The hardware design of Sanger features a score-stationary dataflow that keeps sparse scores stationary in the PE to avoid decoding overhead. Using this dataflow and a reconfigurable systolic array design, we can unify the computation of SDDMM and SpMM operations. Typically, the PEs can be configured during runtime to support different data access and partial sum accumulation schemes. Experiments on BERT show that Sanger can prune the model to 0.08 - 0.27 sparsity without accuracy loss, achieving 4.64X, 22.7X, 2.39X, and 1.47X speedup compared to V100 GPU, AMD Ryzen Threadripper 3970X CPU, as well as the state-of-the-art attention accelerators A3 and SpAtten.
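A minimal NumPy sketch of the software half of the co-design, under illustrative assumptions (4-bit prediction, a fixed probability threshold, a dense mask, and no block re-arrangement; the score-stationary dataflow and reconfigurable systolic array are not modeled): a low-precision prediction of the attention matrix yields a dynamic mask, and exact score computation (SDDMM), softmax, and the value multiplication (SpMM) are then restricted to the unmasked entries.

import numpy as np

def quantize(x, bits=4):
    # Crude symmetric quantization, used only for the cheap mask prediction.
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def sparse_attention_sketch(Q, K, V, bits=4, threshold=0.02):
    d = Q.shape[-1]
    # 1) Cheap prediction of the attention matrix with quantized operands.
    pred = (quantize(Q, bits) @ quantize(K, bits).T) / np.sqrt(d)
    pred = np.exp(pred - pred.max(axis=-1, keepdims=True))
    pred /= pred.sum(axis=-1, keepdims=True)
    mask = pred >= threshold
    # 2) Exact scores only where the mask is set (computed densely here for
    #    brevity; the accelerator evaluates only the unmasked positions).
    scores = np.where(mask, (Q @ K.T) / np.sqrt(d), -np.inf)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = np.where(mask, attn, 0.0)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ V, mask

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
out, mask = sparse_attention_sketch(Q, K, V)
print(out.shape, f"mask density = {mask.mean():.2f}")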
ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition
作者: Li, Shiyu and Hanson, Edward and Qian, Xuehai and Li, Hai “Helen” and Chen, Yiran
关键词: Sparse Accelerators, Neural Network Compression, Kernel Decomposition, Convolutional Neural Networks
Abstract
The ever-growing parameter size and computation cost of Convolutional Neural Network (CNN) models hinder their deployment onto resource-constrained platforms. Network pruning techniques are proposed to remove the redundancy in CNN parameters and produce a sparse model. Sparse-aware accelerators are also proposed to reduce the computation cost and memory bandwidth requirements of inference by leveraging the model sparsity. The irregularity of sparse patterns, however, limits the efficiency of those designs. Researchers have proposed to address this issue by creating a regular sparsity pattern through hardware-aware pruning algorithms. However, the pruning rate of these solutions is largely limited by the enforced sparsity patterns. This limitation motivates us to explore other compression methods beyond pruning. We find that kernel decomposition, with its two decoupled computation stages, can take the processing of the sparse pattern off the critical path of inference and achieve a high compression ratio without enforcing a particular sparsity pattern. To exploit these advantages, we propose ESCALATE, an algorithm-hardware co-design approach based on kernel decomposition. At the algorithm level, ESCALATE reorganizes the two computation stages of the decomposed convolution to enable stream processing of the intermediate feature map. We propose a hybrid quantization to exploit the different reuse frequencies of each part of the decomposed weights. At the architecture level, ESCALATE introduces a novel ‘Basis-First’ dataflow and its corresponding microarchitecture design to maximize the benefits brought by the decomposed convolution. We evaluate ESCALATE with four representative CNN models on both the CIFAR-10 and ImageNet datasets and compare it against previous sparse accelerators and pruning algorithms. Results show that ESCALATE can achieve up to 325× …
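A minimal NumPy sketch of why the decomposition decouples sparsity handling from the main computation, assuming each filter is a linear combination of a few shared basis kernels (basis count, sizes, and the pruning threshold are invented; the hybrid quantization and 'Basis-First' dataflow are not modeled): stage 1 convolves the input with the basis once, and stage 2 combines the basis responses with per-filter coefficients, which is the part that can be made sparse without enforcing a pattern.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(x, w):
    # Naive 'valid' correlation: x is HxW, w is a stack of kxk kernels.
    patches = sliding_window_view(x, w.shape[-2:])          # (H', W', k, k)
    return np.einsum("hwij,bij->bhw", patches, w)

def decomposed_conv(x, basis, coeffs):
    basis_resp = conv2d(x, basis)                            # stage 1: shared basis, computed once
    return np.einsum("fb,bhw->fhw", coeffs, basis_resp)      # stage 2: sparse per-filter coefficients

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 16))
basis = rng.standard_normal((4, 3, 3))           # 4 shared basis kernels
coeffs = rng.standard_normal((8, 4))             # 8 filters as combinations of the basis
coeffs[np.abs(coeffs) < 0.7] = 0.0               # prune coefficients, not kernels

full_filters = np.einsum("fb,bij->fij", coeffs, basis)
print(np.allclose(conv2d(x, full_filters), decomposed_conv(x, basis, coeffs)))   # True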
SparseAdapt: Runtime Control for Sparse Linear Algebra on a Reconfigurable Accelerator
作者: Pal, Subhankar and Amarnath, Aporva and Feng, Siying and O’Boyle, Michael and Dreslinski, Ronald and Dubach, Christophe
关键词: sparse linear algebra, reconfigurable accelerators, predictive models, machine learning, energy-efficient computing
Abstract
Dynamic adaptation is a post-silicon optimization technique that adapts the hardware to workload phases. However, current adaptive approaches are oblivious to implicit phases that arise from operating on irregular data, such as sparse linear algebra operations. Implicit phases are short-lived and do not exhibit consistent behavior throughout execution. This calls for a high-accuracy, low overhead runtime mechanism for adaptation at a fine granularity. Moreover, adopting such techniques for reconfigurable manycore hardware, such as coarse-grained reconfigurable architectures (CGRAs), adds complexity due to synchronization and resource contention. We propose a lightweight machine learning-based adaptive framework called SparseAdapt. It enables low-overhead control of configuration parameters to tailor the hardware to both implicit (data-driven) and explicit (code-driven) phase changes. SparseAdapt is implemented within the runtime of a recently-proposed CGRA called Transmuter, which has been shown to deliver high performance for irregular sparse operations. SparseAdapt can adapt configuration parameters such as resource sharing, cache capacities, prefetcher aggressiveness, and dynamic voltage-frequency scaling (DVFS). Moreover, it can operate under the constraints of either (i) high energy-efficiency (maximal GFLOPS/W), or (ii) high power-performance (maximal GFLOPS³/W). We evaluate SparseAdapt with sparse matrix-matrix and matrix-vector multiplication (SpMSpM and SpMSpV) routines across a suite of uniform random, power-law and real-world matrices, in addition to end-to-end evaluation on two graph algorithms. SparseAdapt achieves similar performance on SpMSpM as the largest static configuration, with 5.3× …
Capstan: A Vector RDA for Sparsity
作者: Rucker, Alexander and Vilim, Matthew and Zhao, Tian and Zhang, Yaqi and Prabhakar, Raghu and Olukotun, Kunle
关键词: vectorization, sparsity, sparse iteration, reconfigurable dataflow accelerator, parallel patterns, RDA, CGRA
Abstract
This paper proposes Capstan: a scalable, parallel-patterns-based, reconfigurable dataflow accelerator (RDA) for sparse and dense tensor applications. Instead of designing for one application, we start with common sparse data formats, each of which supports multiple applications. Using a declarative programming model, Capstan supports application-independent sparse iteration and memory primitives that can be mapped to vectorized, high-performance hardware. We optimize random-access sparse memories with configurable out-of-order execution to increase SRAM random-access throughput from 32% to 80%. For a variety of sparse applications, Capstan with DDR4 memory is 18× …
Session details: Session 9A: Graph Processing
作者: Lin, Yingyan
关键词: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492797
Improving Streaming Graph Processing Performance using Input Knowledge
作者: Basak, Abanti and Qu, Zheng and Lin, Jilan and Alameldeen, Alaa R. and Chishti, Zeshan and Ding, Yufei and Xie, Yuan
关键词: Streaming graphs, Graph analytics
Abstract
Streaming graphs are ubiquitous in today’s big data era. Prior work has improved the performance of streaming graph workloads without taking input characteristics into account. In this work, we demonstrate that input knowledge-driven software and hardware co-design is critical to optimize the performance of streaming graph processing. To improve graph update efficiency, we first characterize the performance trade-offs of input-oblivious batch reordering. Guided by our findings, we propose input-aware batch reordering to adaptively reorder input batches based on their degree distributions. To complement adaptive batch reordering, we propose updating graphs dynamically, based on their input characteristics, either in software (via update search coalescing) or in hardware (via acceleration support). To improve graph computation efficiency, we present input-aware work aggregation which adaptively modulates the computation granularity based on inter-batch locality characteristics. Evaluated across 260 workloads, our input-aware techniques provide on average 4.55× …
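A minimal sketch of the input-aware reordering decision (the skew metric, threshold, and sort key are invented stand-ins for the paper's degree-distribution-guided policy): a batch of edge updates is grouped by destination vertex only when its degree distribution is skewed enough for reordering to pay off.

from collections import Counter

def reorder_batch(edges, skew_threshold=4.0):
    # edges: list of (src, dst) updates in arrival order.
    deg = Counter(dst for _, dst in edges)
    mean_deg = len(edges) / max(len(deg), 1)
    skew = max(deg.values()) / mean_deg              # crude skew metric (illustrative)
    if skew < skew_threshold:
        return edges                                  # low skew: reordering would only add overhead
    # High skew: group updates targeting the same (hot) vertex so that searches
    # and insertions into the same adjacency list can be coalesced.
    return sorted(edges, key=lambda e: (deg[e[1]], e[1]), reverse=True)

batch = [(1, 7), (2, 7), (3, 9), (4, 7), (5, 2), (6, 7), (8, 7), (0, 3)]
print(reorder_batch(batch))                           # threshold not met: order kept
print(reorder_batch(batch, skew_threshold=2.0))       # hot vertex 7: its updates come first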
I-GCN: A Graph Convolutional Network Accelerator with Runtime Locality Enhancement through Islandization
作者: Geng, Tong and Wu, Chunshu and Zhang, Yongan and Tan, Cheng and Xie, Chenhao and You, Haoran and Herbordt, Martin and Lin, Yingyan and Li, Ang
关键词: Machine Learning, High-Performance Computing, Hardware Accelerator, Graph Neural Network, Data Locality
Abstract
Graph Convolutional Networks (GCNs) have drawn tremendous attention in the past three years. Compared with other deep learning modalities, high-performance hardware acceleration of GCNs is as critical but even more challenging. The hurdles arise from the poor data locality and redundant computation due to the large size, high sparsity, and irregular non-zero distribution of real-world graphs. In this paper we propose a novel hardware accelerator for GCN inference, called I-GCN, that significantly improves data locality and reduces unnecessary computation. The mechanism is a new online graph restructuring algorithm we refer to as islandization. The proposed algorithm finds clusters of nodes with strong internal but weak external connections. The islandization process yields two major benefits. First, by processing islands rather than individual nodes, there is better on-chip data reuse and fewer off-chip memory accesses. Second, there is less redundant computation as aggregation for common/shared neighbors in an island can be reused. The parallel search, identification, and leverage of graph islands are all handled purely in hardware at runtime working in an incremental pipeline. This is done without any preprocessing of the graph data or adjustment of the GCN model structure. Experimental results show that I-GCN can significantly reduce off-chip accesses and prune 38% of aggregation operations, leading to performance speedups over CPUs, GPUs, and prior-art GCN accelerators of 5549× …
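A small software sketch of the islandization intuition (the bounded BFS clustering and the shared-neighbor check are invented simplifications; I-GCN performs the search incrementally in hardware, which is not modeled): group nodes into small, strongly connected islands so that aggregation involving neighbors shared inside an island can be computed once and reused.

from collections import deque

def find_islands(adj, max_size=3):
    # Greedy bounded BFS: each island is a small group of tightly connected nodes.
    seen, islands = set(), []
    for seed in adj:
        if seed in seen:
            continue
        island, frontier = [], deque([seed])
        while frontier and len(island) < max_size:
            v = frontier.popleft()
            if v in seen:
                continue
            seen.add(v)
            island.append(v)
            frontier.extend(u for u in adj[v] if u not in seen)
        islands.append(island)
    return islands

adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3, 5], 5: [4]}
for isl in find_islands(adj):
    # Aggregation for neighbors shared by every node in the island would be
    # computed once for the island instead of once per node.
    shared = set.intersection(*(set(adj[v]) for v in isl)) if len(isl) > 1 else set()
    print(isl, "shared neighbors:", shared)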
Fifer: Practical Acceleration of Irregular Applications on Reconfigurable Architectures
作者: Nguyen, Quan M. and Sanchez, Daniel
关键词: reconfigurable architectures, pipeline parallelism, CGRAs
Abstract
Coarse-grain reconfigurable arrays (CGRAs) can achieve much higher performance and efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. Unfortunately, CGRAs have so far only been effective on applications with regular compute patterns. However, many important workloads like graph analytics, sparse linear algebra, and databases, are irregular applications with unpredictable access patterns and control flow. Since CGRAs map computation statically to a spatial fabric of functional units, irregular memory accesses and control flow cause frequent stalls and load imbalance. We present Fifer, an architecture and compilation technique that makes irregular applications efficient on CGRAs. Fifer first decouples irregular applications into a feed-forward network of pipeline stages. Each resulting stage is regular and can efficiently use the CGRA fabric. However, irregularity causes stages to have widely varying loads, resulting in high load imbalance if they execute spatially in a conventional CGRA. Fifer solves this by introducing dynamic temporal pipelining: it time-multiplexes multiple stages onto the same CGRA, and dynamically schedules stages to avoid load imbalance. Fifer makes time-multiplexing fast and cheap to quickly respond to load imbalance while retaining the efficiency and simplicity of a CGRA design. We show that Fifer improves performance by gmean 2.8× …
Point-X: A Spatial-Locality-Aware Architecture for Energy-Efficient Graph-Based Point-Cloud Deep Learning
作者: Zhang, Jie-Fang and Zhang, Zhengya
关键词: spatial locality, neural network, graph traversal, graph convolution, edge convolution, Point cloud
Abstract
Deep learning on point clouds has attracted increasing attention in the fields of 3D computer vision and robotics. In particular, graph-based point-cloud deep neural networks (DNNs) have demonstrated promising performance in 3D object classification and scene segmentation tasks. However, the scattered and irregular graph-structured data in a graph-based point-cloud DNN cannot be computed efficiently by existing SIMD architectures and accelerators. We present Point-X, an energy-efficient accelerator architecture that extracts and exploits the spatial locality in point cloud data for efficient processing. Point-X uses a clustering method to extract fine-grained and coarse-grained spatial locality from the input point cloud. The clustering maps the point cloud into distributed compute tiles to maximize intra-tile computational parallelism and minimize inter-tile data movement. Point-X employs a chain network-on-chip (NoC) to further reduce the NoC traffic and achieve up to 3.2× …
JetStream: Graph Analytics on Streaming Data with Event-Driven Hardware Accelerator
作者: Rahman, Shafiur and Afarin, Mahbod and Abu-Ghazaleh, Nael and Gupta, Rajiv
关键词: streaming graphs, incremental algorithms, accelerators
Abstract
Graph Processing is at the core of many critical emerging workloads operating on unstructured data, including social network analysis, bioinformatics, and many others. Many applications operate on graphs that are constantly changing, i.e., new nodes and edges are added or removed over time. In this paper, we present JetStream, a hardware accelerator for evaluating queries over streaming graphs and capable of handling additions, deletions, and updates of edges. JetStream extends a recently proposed event-based accelerator for graph workloads to support streaming updates. It handles both accumulative and monotonic graph algorithms via an event-driven computation model that limits accesses to a smaller subset of the graph vertices, efficiently reuses the prior query results to eliminate redundancy, and optimizes the memory access pattern for enhanced memory bandwidth utilization. To the best of our knowledge, JetStream is the first graph accelerator that supports streaming graphs, reducing the computation time by 90% compared with cold-start computation using an existing accelerator. In addition, JetStream achieves about 18× …
Session details: Session 9B: Virtual Memory & Prefetching
作者: Jevdjic, Djordje
关键词: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492798
Trident: Harnessing Architectural Resources for All Page Sizes in x86 Processors
作者: Ram, Venkat Sri Sai and Panwar, Ashish and Basu, Arkaprava
关键词: page table walks, large pages, Virtual memory, TLB
Abstract
Intel and AMD processors have long supported more than one large page sizes – 1GB and 2MB, to reduce address translation overheads for applications with large memory footprints. However, previous works on large pages have primarily focused on 2MB pages, partly due to a lack of evidence on the usefulness of 1GB pages to real-world applications. Consequently, micro-architectural resources devoted to 1GB pages have gone underutilized for a decade. We quantitatively demonstrate where 1GB pages can be valuable, especially when employed in conjunction with 2MB pages. Unfortunately, the lack of application-transparent dynamic allocation of 1GB pages is to blame for the under-utilization of 1GB pages on today’s systems. Toward this, we design and implement Trident in Linux to fully harness micro-architectural resources devoted for all page sizes in the current x86 hardware by transparently allocating 1GB, 2MB, and 4KB pages as suitable at runtime. Trident speeds up eight memory-intensive applications by 18%, on average, over Linux’s use of 2MB pages. We then propose Tridentpv, an extension to Trident that virtualizes 1GB pages via copy-less promotion and compaction in the guest OS. Overall, this paper shows that adequate software enablement brings practical relevance to even GB-sized pages, and motivates micro-architects to continue enhancing hardware support for all large page sizes.
Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning
作者: Bera, Rahul and Kanellopoulos, Konstantinos and Nori, Anant and Shahroodi, Taha and Subramoney, Sreenivas and Mutlu, Onur
关键词: No keywords
Abstract
Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniques either completely neglect a prefetcher’s undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and system-aware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms two state-of-the-art prefetchers (MLOP and Bingo) by 3.4% and 3.8% in single-core, 7.7% and 9.6% in twelve-core, and 16.9% and 20.2% in bandwidth-constrained core configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads. The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.
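A minimal tabular sketch of the reinforcement-learning formulation (the state features, action list, reward values, and plain Q-learning update are invented stand-ins; Pythia's actual features, rewards, and implementation are defined in the paper and its released source): the agent observes program context, picks a prefetch delta, and is rewarded less generously, or penalized more, when memory bandwidth is under pressure.

import random
from collections import defaultdict

class RLPrefetcherSketch:
    ACTIONS = (-2, -1, 0, 1, 2, 4)            # candidate prefetch deltas (0 = no prefetch)

    def __init__(self, alpha=0.3, gamma=0.5, eps=0.1):
        self.q = defaultdict(float)            # Q[(state, action)]
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def state(self, pc, addr):
        return (pc & 0xFF, addr & 0x3F)        # toy hash of program context features

    def choose(self, s):
        if random.random() < self.eps:
            return random.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.q[(s, a)])

    def reward(self, useful, bandwidth_busy):
        if useful:
            return 10 if not bandwidth_busy else 6    # useful prefetch, discounted under pressure
        return -4 if bandwidth_busy else -1           # wasted prefetch hurts more when busy

    def update(self, s, a, r, s_next):
        best_next = max(self.q[(s_next, a2)] for a2 in self.ACTIONS)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

In use, each demand access would call choose() on the current context, issue a prefetch at the returned delta, and later call update() once the prefetch's usefulness and the bandwidth occupancy at that time are known.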
Morrigan: A Composite Instruction TLB Prefetcher
作者: Vavouliotis, Georgios and Alvarez, Lluc and Grot, Boris and Jiménez
关键词: virtual memory, translation lookaside buffer, markov prefetching, address translation, TLB prefetching, TLB management
Abstract
The effort to reduce address translation overheads has typically targeted data accesses since they constitute the overwhelming portion of the second-level TLB (STLB) misses in desktop and HPC applications. The address translation cost of instruction accesses has been relatively neglected due to historically small instruction footprints. However, state-of-the-art datacenter and server applications feature massive instruction footprints owing to deep software stacks, resulting in high STLB miss rates for instruction accesses. This paper demonstrates that instruction address translation is a performance bottleneck in server workloads. In response, we propose Morrigan, a microarchitectural instruction STLB prefetcher whose design is based on new insights regarding instruction STLB misses. At the core of Morrigan there is an ensemble of table-based Markov prefetchers that build and store variable length Markov chains out of the instruction STLB miss stream. Morrigan further employs a sequential prefetcher and a scheme that exploits page table locality to maximize miss coverage. An important contribution of the work is showing that access frequency is more important than access recency when choosing replacement candidates. Based on this insight, Morrigan introduces a new replacement policy that identifies victims in the Markov prefetchers using a frequency stack while adapting to phase-change behavior. On a set of 45 industrial server workloads, Morrigan eliminates 69% of the memory references in demand page walks triggered by instruction STLB misses and improves geometric mean performance by 7.6%.
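A minimal sketch of one table-based Markov prefetcher with frequency-based victim selection, matching the observation above that frequency matters more than recency (the table geometry, prefetch degree, and miss stream are invented; the ensemble of variable-length chains, the sequential prefetcher, and the page-table-locality scheme are not modeled).

class MarkovTLBPrefetcherSketch:
    def __init__(self, max_successors=4, degree=2):
        self.table = {}                       # missed page -> {successor page: frequency}
        self.max_successors = max_successors
        self.degree = degree
        self.prev_miss = None

    def on_miss(self, page):
        # Learn: record `page` as a successor of the previous miss.
        if self.prev_miss is not None:
            succ = self.table.setdefault(self.prev_miss, {})
            if page not in succ and len(succ) >= self.max_successors:
                del succ[min(succ, key=succ.get)]       # evict the least-frequent successor
            succ[page] = succ.get(page, 0) + 1
        self.prev_miss = page
        # Predict: prefetch translations for the most frequent successors of this miss.
        succ = self.table.get(page, {})
        return sorted(succ, key=succ.get, reverse=True)[:self.degree]

pf = MarkovTLBPrefetcherSketch()
for p in [10, 11, 42, 10, 11, 42, 10, 11]:
    print(f"miss on page {p} -> prefetch {pf.on_miss(p)}")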
Improving Address Translation in Multi-GPUs via Sharing and Spilling aware TLB Design
作者: Li, Bingyao and Yin, Jieming and Zhang, Youtao and Tang, Xulong
关键词: multi-application, multi-GPU, TLB
Abstract
In recent years, the ever-growing application complexity and input dataset sizes have driven the popularity of multi-GPU systems as a desirable computing platform for many application domains. While employing multiple GPUs intuitively exposes substantial parallelism for application acceleration, the delivered performance rarely scales with the number of GPUs. One of the major challenges is address translation efficiency. Many prior works focus on CPUs or single-GPU execution scenarios, while address translation in multi-GPU systems has received little attention. In this paper, we conduct a comprehensive investigation of address translation efficiency in both “single-application-multi-GPU” and “multi-application-multi-GPU” execution paradigms. Based on our observations, we propose a new TLB hierarchy design, called least-TLB, tailored for multi-GPU systems that effectively improves TLB performance with minimal hardware overheads. Experimental results on 9 single-application workloads and 10 multi-application workloads indicate that the proposed least-TLB improves performance, on average, by 23.5% and 16.3%, respectively.
Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources
作者: Kotra, Jagadish B. and LeBeane, Michael and Kandemir, Mahmut T. and Loh, Gabriel H.
关键词: Virtual Memory, Reconfigurable Systems, Irregular Applications, CPU+GPU Systems
Abstract
Many GPU applications issue irregular memory accesses to a very large memory footprint. We confirm observations from prior work that these irregular access patterns are severely bottlenecked by insufficient Translation Lookaside Buffer (TLB) reach, resulting in expensive page table walks. In this work, we investigate mechanisms to improve TLB reach without increasing the page size or the size of the TLB itself. Our work is based around the observation that a GPU’s instruction cache (I-cache) and Local Data Share (LDS) scratchpad memory are under-utilized in many applications, including those that suffer from poor TLB reach. We leverage this to opportunistically utilize idle capacity and port bandwidth from the GPU’s I-cache and LDS structures for address translations. We explore various potential architectural designs for each structure to optimize performance and minimize complexity. Both structures are organized as a victim cache between the L1 and L2 TLBs to boost translation reach. We find that our designs can increase performance on average by 30.1% without impacting the performance of applications that do not require additional reach.
Session details: Session 10A: Security & Privacy III
作者: Naghibijouybari, Hoda
关键词: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492799
A Deeper Look into RowHammer’s Sensitivities: Experimental Analysis of Real DRAM Chips and Implications on Future Attacks and Defenses
作者: Orosa, Lois and Yaglikci, Abdullah Giray and Luo, Haocong and Olgun, Ataberk and Park, Jisung and Hassan, Hasan and Patel, Minesh and Kim, Jeremie S. and Mutlu, Onur
关键词: Testing, Temperature, Security, Safety, RowHammer, Reliability, Memory, DRAM, Characterization
Abstract
RowHammer is a circuit-level DRAM vulnerability where repeatedly accessing (i.e., hammering) a DRAM row can cause bit flips in physically nearby rows. The RowHammer vulnerability worsens as DRAM cell size and cell-to-cell spacing shrink. Recent studies demonstrate that modern DRAM chips, including chips previously marketed as RowHammer-safe, are even more vulnerable to RowHammer than older chips, such that the hammer count required to cause a bit flip has dropped by more than 10X in the last decade. Therefore, it is essential to develop a better understanding of and in-depth insights into the RowHammer vulnerability of modern DRAM chips to more effectively secure current and future systems. Our goal in this paper is to provide insights into fundamental properties of the RowHammer vulnerability that are not yet rigorously studied by prior works, but can potentially be i) exploited to develop more effective RowHammer attacks or ii) leveraged to design more effective and efficient defense mechanisms. To this end, we present an experimental characterization using 248 DDR4 and 24 DDR3 modern DRAM chips from four major DRAM manufacturers demonstrating how the RowHammer effects vary with three fundamental properties: 1) DRAM chip temperature, 2) aggressor row active time, and 3) victim DRAM cell’s physical location. Among our 16 new observations, we highlight that a RowHammer bit flip 1) is very likely to occur in a bounded range, specific to each DRAM cell (e.g., 5.4% of the vulnerable DRAM cells exhibit errors only within a narrow, cell-specific temperature range), 2) is more likely to occur if the aggressor row is active for a longer time (e.g., RowHammer vulnerability increases by 36% if we keep a DRAM row active for 15 column accesses), and 3) is more likely to occur in certain physical regions of the DRAM module under attack (e.g., 5% of the rows are 2x more vulnerable than the remaining 95% of the rows). Our study has important practical implications for future RowHammer attacks and defenses. We describe and analyze the implications of our new findings by proposing three future RowHammer attack and five future RowHammer defense improvements.
Uncovering In-DRAM RowHammer Protection Mechanisms:A New Methodology, Custom RowHammer Patterns, and Implications
作者: Hassan, Hasan and Tugrul, Yahya Can and Kim, Jeremie S. and van der Veen, Victor and Razavi, Kaveh and Mutlu, Onur
关键词: Testing, Security, RowHammer, Reliability, DRAM
Abstract
The RowHammer vulnerability in DRAM is a critical threat to system security. To protect against RowHammer, vendors commit to security-through-obscurity: modern DRAM chips rely on undocumented, proprietary, on-die mitigations, commonly known as Target Row Refresh (TRR). At a high level, TRR detects and refreshes potential RowHammer-victim rows, but its exact implementations are not openly disclosed. Security guarantees of TRR mechanisms cannot be easily studied due to their proprietary nature. To assess the security guarantees of recent DRAM chips, we present Uncovering TRR (U-TRR), an experimental methodology to analyze in-DRAM TRR implementations. U-TRR is based on the new observation that data retention failures in DRAM enable a side channel that leaks information on how TRR refreshes potential victim rows. U-TRR allows us to (i) understand how logical DRAM rows are laid out physically in silicon; (ii) study undocumented on-die TRR mechanisms; and (iii) combine (i) and (ii) to evaluate the RowHammer security guarantees of modern DRAM chips. We show how U-TRR allows us to craft RowHammer access patterns that successfully circumvent the TRR mechanisms employed in 45 DRAM modules of the three major DRAM vendors. We find that the DRAM modules we analyze are vulnerable to RowHammer, having bit flips in up to 99.9% of all DRAM rows.
Soteria: Towards Resilient Integrity-Protected and Encrypted Non-Volatile Memories
作者: Zubair, Kazi Abu and Gurumurthi, Sudhanva and Sridharan, Vilas and Awad, Amro
关键词: Non-Volatile Memory, Memory Security, Memory Reliability
Abstract
Although emerging Non-Volatile Memories (NVMs) are expected to be adopted in future memory and storage systems, their non-volatility brings complications in designing processors wherein security is an essential requirement. One of these complications is maintaining the correctness of the security metadata for encryption and integrity verification. Because the security metadata is stored in the NVM, it is susceptible to reliability threats posed by the underlying memory technology. This is undesirable because the secure operation of the system highly depends on the correctness of the security metadata stored in the memory. We observe that the error sensitivity of security metadata is higher than that of general data and requires special attention: a single uncorrectable error in a top Merkle tree node can leave a large portion of memory data unverifiable. To solve this, we propose Soteria, a scheme that provides higher error-tolerance for security metadata by lazily duplicating it. Soteria decouples the existing memory reliability scheme from security metadata reliability and achieves security, performance, and high reliability within the same system with only minor memory controller changes. Our proposed scheme significantly improves the reliability of the secure NVM system while causing only about 1% system slowdown on average.
Bonsai Merkle Forests: Efficiently Achieving Crash Consistency in Secure Persistent Memory
作者: Freij, Alexander and Zhou, Huiyang and Solihin, Yan
关键词: security, persistency, non-volatile cache, integrity tree update, integrity forest
Abstract
Due to its durability, the security of persistent memory (PM) needs to be ensured. Recent works have identified the requirements for correctly architecting secure PM to achieve crash recoverability. A key performance bottleneck, however, lies in the integrity tree update, which needs to be consistent with the memory persistency model and incurs a very high performance overhead. In this paper, we aim to drastically reduce this performance overhead. First, we propose to leverage a small on-chip non-volatile metadata cache (nvMC) for keeping a small portion of the integrity tree. We show that nvMC cannot be managed like a regular cache due to violating crash recoverability, and hence derive a set of invariants to be satisfied for the nvMC to work properly. Then, we propose the idea of Bonsai Merkle Forests (BMF), which splits an integrity tree into multiple trees, leading to a forest, with the tree roots maintained in the nvMC. We propose and analyze different ways of BMF management. Our experimental results show that our proposed BMF schemes drastically reduce the performance overhead of BMT root updates, from 426% to just 3.5%.
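A minimal sketch of the forest idea (binary tree shape, truncated SHA-256, and sizes are invented; persistency ordering and the nvMC invariants are not modeled): splitting one integrity tree into sub-trees whose roots are assumed to live in the on-chip non-volatile metadata cache means a write only propagates hashes up to its sub-tree root rather than all the way to a single global root.

import hashlib

def h(*xs):
    m = hashlib.sha256()
    for x in xs:
        m.update(str(x).encode())
    return m.hexdigest()[:16]

class MerkleForestSketch:
    def __init__(self, n_leaves=8, subtree_leaves=4):
        self.leaves = [0] * n_leaves          # e.g., per-block counters/MACs
        self.subtree_leaves = subtree_leaves

    def subtree_root(self, tree_idx):
        base = tree_idx * self.subtree_leaves
        level = [h(v) for v in self.leaves[base:base + self.subtree_leaves]]
        hops = 0
        while len(level) > 1:
            level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
            hops += 1
        return level[0], hops

    def update(self, leaf_idx, value):
        self.leaves[leaf_idx] = value
        # The returned root would be written to the on-chip nvMC; hops counts the
        # hash levels on the write's critical path.
        return self.subtree_root(leaf_idx // self.subtree_leaves)

f = MerkleForestSketch(n_leaves=8, subtree_leaves=4)
print(f.update(5, 123))     # climbs log2(4) = 2 levels instead of log2(8) = 3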
Dolos: Improving the Performance of Persistent Applications in ADR-Supported Secure Memory
作者: Han, Xijing and Tuck, James and Awad, Amro
关键词: Persistent memory, Merkle Tree, Memory Security, MAC, Encryption
Abstract
The performance of persistent applications is severely hurt by current secure processor architectures. Persistent applications use long-latency flush instructions and memory fences to make sure that writes to persistent data reach the persistency domain in a way that is crash consistent. Recently introduced features like Intel’s Asynchronous DRAM Refresh (ADR) make the on-chip Write Pending Queue (WPQ) part of the persistency domain and help reduce the penalty of persisting data since data only needs to reach the on-chip WPQ to be considered persistent. However, when persistent applications run on secure processors, for the sake of securing memory many cycles are added to the critical path of their write operations before they ever reach the persistent WPQ, preventing them from fully exploiting the performance advantages of the persistent WPQ. Our goal in this work is to make it feasible for secure persistent applications to benefit more from the on-chip persistency domain. We propose Dolos, an architecture that prioritizes persisting data without sacrificing security in order to gain a significant performance boost for persistent applications. Dolos achieves this goal by an additional minor security unit, Mi-SU, that utilizes a much faster secure process that protects only the WPQ. Thus, the secure operation latency in the critical path of persist operations is reduced and hence persistent transactions can complete earlier. Dolos retains a conventional major security unit for protecting memory that occurs off the critical path after inserting secured data into the WPQ. To evaluate our design, we implemented our architecture in the GEM5 simulator, and analyzed the performance of 6 benchmarks from the WHISPER suite. Dolos improves their performance by 1.66x on average.
Session details: Session 10B: Microarchitecture II
作者: Peng, Lu
关键词: No keywords
Abstract
No abstract available.
DOI: 10.1145/3492800
The Laplace Microarchitecture for Tracking Data Uncertainty and Its Implementation in a RISC-V Processor
作者: Tsoutsouras, Vasileios and Kaparounakis, Orestis and Bilgin, Bilgesu and Samarakoon, Chatura and Meech, James and Heck, Jan and Stanley-Marbell, Phillip
关键词: uncertainty tracking, distributional representations, arithmetic on distributions, RISC-V
Abstract
We present Laplace, a microarchitecture for tracking machine representations of probability distributions paired with architectural state. We present two new methods for in-processor distribution representations which are approximations of probability distributions just as floating-point number representations are approximations of real-valued numbers. Laplace executes unmodified RISC-V binaries and can track uncertainty through them. We present two sets of ISA extensions to provide a mechanism to initialize distributional information in the microarchitecture and to allow applications to query statistics of the distributional information without exposing the uncertainty representations above the ISA. We evaluate the accuracy and performance of Laplace using a suite of 21 benchmarks spanning domains ranging from variational quantum algorithms and sensor data processing to materials properties modeling. Monte Carlo simulation on the benchmarks requires 2076× …
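A minimal sketch of "arithmetic on distributions" using a particle (sample-based) representation; this only illustrates the concept, since the paper's two in-processor representations are more compact and are driven through ISA extensions for initialization and statistics queries rather than a software class.

import random

class Uncertain:
    # A value paired with a particle-based approximation of its distribution.
    # Arithmetic propagates the particles, so statistics can be queried downstream.
    N = 256

    def __init__(self, particles):
        self.particles = list(particles)

    @classmethod
    def normal(cls, mu, sigma):
        return cls(random.gauss(mu, sigma) for _ in range(cls.N))

    def __add__(self, other):
        return Uncertain(a + b for a, b in zip(self.particles, other.particles))

    def __mul__(self, other):
        return Uncertain(a * b for a, b in zip(self.particles, other.particles))

    def mean(self):
        return sum(self.particles) / len(self.particles)

    def prob_greater(self, x):
        return sum(p > x for p in self.particles) / len(self.particles)

# Toy sensor example: distance = speed * time, both measured with noise.
speed = Uncertain.normal(3.0, 0.2)
duration = Uncertain.normal(10.0, 0.5)
distance = speed * duration
print(f"mean = {distance.mean():.2f}, P(distance > 32) = {distance.prob_greater(32.0):.2f}")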
Post-Fabrication Microarchitecture
作者: Kumar, Chanchal and Seshadri, Anirudh and Chaudhary, Aayush and Bhawalkar, Shubham and Singh, Rohit and Rotenberg, Eric
关键词: superscalar processor, reconfigurable logic, prefetching, pre-execution, instruction-level parallelism (ILP), field-programmable gate array (FPGA), branch prediction
Abstract
Microarchitectural enhancements that improve performance generally, across many workloads, are favored in superscalar processor design. Targeting general performance is necessary but it also constrains some microarchitecture innovation. We explore relieving this constraint, via a new paradigm called Post-Fabrication Microarchitecture (PFM). A high-performance superscalar core is coupled with a reconfigurable logic fabric, RF. A programmable interface, or Agent, allows for RF to observe and microarchitecturally intervene at key pipeline stages of the superscalar core. New microarchitectural components, specific to applications, are synthesized on-demand to RF. All instructions still flow through the superscalar pipeline, as usual, but their execution is streamlined (better instructions per cycle (IPC)) through microarchitectural intervention by RF. Our research shows that one can achieve large speedups of individual applications, by analyzing their bottlenecks and providing customized microarchitectural solutions to target these bottlenecks. Examples of PFM use-cases explored in this paper include custom branch predictors and data prefetchers.
PCCS: Processor-Centric Contention-aware Slowdown Model for Heterogeneous System-on-Chips
作者: Xu, Yuanchao and Belviranli, Mehmet Esat and Shen, Xipeng and Vetter, Jeffrey
关键词: System-on-Chips, Performance Models, Accelerator Architectures
Abstract
Many slowdown models have been proposed to characterize memory interference of workloads co-running on heterogeneous System-on-Chips (SoCs), but they are mostly intended for post-silicon usage. How to effectively account for memory interference at the SoC design stage remains an open problem. This paper presents a new approach to this problem, consisting of a novel processor-centric slowdown modeling methodology and a new three-region interference-conscious slowdown model. The modeling process needs no measurement of co-running of various combinations of applications, but the produced slowdown models can be used to estimate the co-run slowdowns of arbitrary workloads on various SoC designs that embed a newer generation of accelerators, such as deep learning accelerators (DLA), in addition to CPUs and GPUs. The new method reduces the average prediction errors of the state-of-the-art model from 30.3% to 8.7% on GPU, from 13.4% to 3.7% on CPU, and from 20.6% to 5.6% on DLA, and demonstrates much improved efficacy in guiding SoC designs.
ITSLF: Inter-Thread Store-to-Load Forwardingin Simultaneous Multithreading
作者: Feliu, Josué
关键词: store-to-load forwarding, multiple-copy atomicity, memory consistency, Simultaneous multithreading
Abstract
In this paper, we argue that, for a class of fine-grain, synchronization-intensive, parallel workloads, it is advantageous to consolidate synchronization and communication as much as possible among the threads of simultaneous multithreading (SMT) cores. While, today, the shared L1 is the closest coherent level where synchronization and communication between SMT threads can take place, we observe that there is an even closer shared level, entirely inside a single core. This level comprises the load queues (LQ) and store queues (SQ) / store buffers (SB) of the SMT threads and to the best of our knowledge it has never been used as such. The reason is that if we allow communication of different SMT threads via their LQs and SQs/SBs, i.e., inter-thread store-to-load forwarding (ITSLF), we violate write atomicity with respect to the outside world, beyond the acceptable model of read-own-write-early multiple-copy atomicity (rMCA). The key insight of our work is that we can accelerate synchronization and communication among SMT threads with inter-thread store-to-load forwarding, without affecting the memory model—in particular without violating rMCA. We demonstrate how we can achieve this entirely through speculative interactions between LQs and SQs/SBs of different threads, while ensuring deadlock-free execution. Without changing the architectural model, the ISA, or the software, and without adding extra hardware in the form of a specialized accelerator, our insight enables a new design point for a standard architecture. We demonstrate that with ITSLF, workloads scale better on a single 8-way SMT core (with the resources of a single-threaded core) than on a baseline SMT (with or without optimizations), or on 8 single-threaded cores.
ENMC: Extreme Near-Memory Classification via Approximate Screening
作者: Liu, Liu and Lin, Jilan and Qu, Zheng and Ding, Yufei and Xie, Yuan
关键词: Near-memory processing, Extreme classification
Abstract
Extreme classification (XC) is an essential component of large-scale deep learning systems in a wide range of application domains, including image recognition, language modeling, and recommendation. As classification categories keep scaling in real-world applications, the classifier’s parameters can reach several thousand gigabytes, far exceeding the on-chip memory capacity. With the advent of near-memory processing (NMP) architectures, offloading the XC component onto NMP units could alleviate this memory-intensive problem. However, a naive NMP design with limited area and power budget cannot afford the computational complexity of full classification. To tackle the problem, we first propose a novel screening method to reduce the computation and memory consumption by efficiently approximating the classification output and identifying a small portion of key candidates that require accurate results. Then, we design a new extreme-classification-tailored NMP architecture, namely ENMC, to support both screening and candidates-only classification. Overall, our approximate screening method achieves 7.3× …
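A minimal NumPy sketch of approximate screening under invented assumptions (4-bit weight quantization for the cheap pass, 16 candidates, random data; the NMP hardware mapping is not modeled): a low-precision sweep over all classes selects a small candidate set, and exact scores are computed only for those candidates.

import numpy as np

def approx_screen_classify(x, W, k=16, bits=4):
    scale = np.max(np.abs(W)) / (2 ** (bits - 1) - 1)
    W_q = np.round(W / scale) * scale                 # crude low-precision weights
    approx_scores = W_q @ x                           # cheap pass over every class
    candidates = np.argpartition(approx_scores, -k)[-k:]
    exact = W[candidates] @ x                         # accurate pass, tiny fraction of classes
    return candidates[np.argmax(exact)]

rng = np.random.default_rng(7)
num_classes, dim = 20_000, 256                        # "extreme" number of categories
W = rng.standard_normal((num_classes, dim)).astype(np.float32)
x = rng.standard_normal(dim).astype(np.float32)
pred = approx_screen_classify(x, W)
print(pred, bool(pred == np.argmax(W @ x)))           # usually agrees with the exact argmax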