Data compression accelerator on IBM POWER9 and z15 processors
Authors: Abali, Bulent and Blaner, Bart and Reilly, John and Klein, Matthias and Mishra, Ashutosh and Agricola, Craig B. and Sendir, Bedri and Buyuktosunoglu, Alper and Jacobi, Christian and Starke, William J. and Myneni, Haren and Wang, Charlie
Keywords: No keywords
Abstract
Lossless data compression is highly desirable in enterprise and cloud environments for storage and memory cost savings and improved I/O and network utilization. While the value provided by compression is recognized, its application in practice is often limited because it is a processor-intensive operation, resulting in low throughput and high elapsed time for compression-intensive workloads. The IBM POWER9 and IBM z15 systems overcome the shortcomings of existing approaches by including a novel on-chip integrated data compression accelerator. The accelerator reduces the processor cycles, I/O traffic, and memory and storage footprint of many applications at practically zero hardware cost. The accelerator also eliminates the cost and I/O slots that would have been necessary with FPGA/ASIC-based compression adapters. On the POWER9 chip, a single accelerator uses less than 0.5% of the processor chip area, yet provides a 388x speedup over the zlib compression software running on a general-purpose core and a 13x speedup over the entire chip of cores. On a POWER9 system, the accelerators provide an end-to-end 23% speedup to an Apache Spark TPC-DS workload compared to the software baseline. The z15 chip doubles the compression rate of POWER9, resulting in even higher speedup factors over compression software running on general-purpose cores. On a maximally configured z15 system topology, the on-chip compression accelerators provide up to 280 GB/s of data compression throughput, the highest in the industry. Overall, the on-chip accelerators significantly advance the state of the art in terms of area, throughput, latency, compression ratio, reduced processor utilization, power/energy efficiency, and integration into the system stack. This paper describes the architecture and novel elements of the POWER9 and z15 compression/decompression accelerators, with emphasis on the trade-offs that made the on-chip implementation possible.
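As a point of reference for the software baseline mentioned above, the sketch below times single-threaded zlib compression in Python. It is only an illustrative measurement harness (the payload, compression level, and metric are arbitrary assumptions), not the benchmark setup used in the paper.

```python
# Illustrative single-core zlib baseline measurement (not the paper's setup):
# the 388x figure above is relative to zlib software compression like this.
import time
import zlib

def compress_throughput(data: bytes, level: int = 6) -> float:
    """Return single-pass compression throughput in MB/s."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"compression ratio: {len(data) / len(compressed):.2f}x")
    return len(data) / elapsed / 1e6

if __name__ == "__main__":
    payload = b"enterprise log record 0123456789 " * 3_000_000  # ~100 MB, compressible
    print(f"{compress_throughput(payload):.1f} MB/s on one core")
```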
DOI: 10.1109/ISCA45697.2020.00012
High-performance deep-learning coprocessor integrated into x86 SoC with server-class CPUs
Authors: Henry, Glenn and Palangpour, Parviz and Thomson, Michael and Gardner, J Scott and Arden, Bryce and Donahue, Jim and Houck, Kimble and Johnson, Jonathan and O’Brien, Kyle and Petersen, Scott and Seroussi, Benjamin and Walker, Tyler
Keywords: No keywords
Abstract
Demand for high-performance deep learning (DL) inference in software applications is growing rapidly. DL workloads run on myriad platforms, including general-purpose processors (CPUs), systems-on-chip (SoCs) with accelerators, graphics processing units (GPUs), and neural processing unit (NPU) add-in cards. DL software engineers typically must choose between relatively slow general hardware (e.g., CPUs, SoCs) or relatively expensive, large, power-hungry hardware (e.g., GPUs, NPUs). This paper describes Centaur Technology's Ncore, the industry's first high-performance DL coprocessor technology integrated into an x86 SoC with server-class CPUs. Ncore's 4096-byte-wide SIMD architecture supports INT8, UINT8, INT16, and BF16 datatypes, with 20 tera-operations-per-second compute capability. Ncore shares the SoC ring bus for low-latency communication and work sharing with eight 64-bit x86 cores, offering flexible support for new and evolving models. The x86 SoC platform can further scale out performance via multiple sockets, systems, or third-party PCIe accelerators. Ncore's software stack automatically converts quantized models for Ncore consumption and leverages existing DL frameworks. In MLPerf's Inference v0.5 closed-division benchmarks, Ncore achieves 1218 IPS throughput and 1.05 ms latency on ResNet-50-v1.5 and achieves the lowest latency of all MobileNet-V1 submissions (329 μs). Ncore yields a 23x speedup over the per-core throughput of other x86 vendors, while freeing its own x86 cores for other work. Ncore is the only integrated solution among the memory-intensive neural machine translation (NMT) submissions.
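The abstract notes that Ncore's software stack converts quantized models for the accelerator. The snippet below is a generic symmetric INT8 quantization example for illustration only; the scaling scheme and tensor are assumptions, not Centaur's actual converter.

```python
# Minimal symmetric INT8 quantization of a weight tensor (illustrative only;
# not Centaur's model converter). Values map to [-127, 127] with one scale.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs quantization error:", np.max(np.abs(w - dequantize(q, s))))
```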
DOI: 10.1109/ISCA45697.2020.00013
The IBM z15 high frequency mainframe branch predictor
Authors: Adiga, Narasimha and Bonanno, James and Collura, Adam and Heizmann, Matthias and Prasky, Brian R. and Saporito, Anthony
Keywords: No keywords
Abstract
The design of the modern, enterprise-class IBM z15 branch predictor is described. Implemented as a multi-level look-ahead structure, the branch predictor is capable of predicting branch direction and target addresses, augmented with multiple auxiliary direction, target, and power predictors. Predictions are made asynchronously and later integrated into the processor pipeline. The design is optimized for the unique workloads executed on these enterprise-class systems, including compute-intensive workloads and workloads with large instruction and data footprints. This paper highlights the major operations and functions of the IBM z15 branch predictor, including its pipeline, prediction structures, and verification methodology. Explanations of how the design matured to its current state are also provided.
DOI: 10.1109/ISCA45697.2020.00014
Evolution of the samsung exynos CPU microarchitecture
Authors: Grayson, Brian and Rupley, Jeff and Zuraski, Gerald and Quinnell, Eric and Jiménez, Daniel A.
Keywords: superscalar, prefetching, microprocessor, branch prediction
Abstract
The Samsung Exynos family of cores comprises high-performance “big” processors developed at the Samsung Austin Research & Design Center (SARC) starting in late 2011. This paper discusses selected aspects of the microarchitecture of these cores, specifically perceptron-based branch prediction, Spectre v2 security enhancements, micro-operation cache algorithms, prefetcher advancements, and memory latency optimizations. Each microarchitecture item evolved over time, both as part of continuous yearly improvement and in reaction to changing mobile workloads.
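Since the abstract highlights perceptron-based branch prediction, the following is a textbook perceptron predictor sketch in the style described in the literature; the table size, history length, and training threshold are arbitrary choices, not the Exynos parameters.

```python
# Textbook perceptron branch predictor (illustrative; not the Exynos design).
# One weight vector per PC hash; prediction is the sign of the dot product
# with the global history, where taken = +1 and not-taken = -1.
HIST_LEN, TABLE_SIZE = 16, 1024
THETA = int(1.93 * HIST_LEN + 14)          # classic training threshold formula

weights = [[0] * (HIST_LEN + 1) for _ in range(TABLE_SIZE)]
history = [1] * HIST_LEN                    # +1 = taken, -1 = not taken

def predict(pc: int):
    w = weights[pc % TABLE_SIZE]
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y >= 0, y

def train(pc: int, taken: bool, y: int):
    w = weights[pc % TABLE_SIZE]
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= THETA:  # train on mispredict or weak output
        w[0] += t
        for i, hi in enumerate(history):
            w[i + 1] += t * hi
    history.pop(0)
    history.append(t)

pred, y = predict(0x400a10)
train(0x400a10, taken=True, y=y)
```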
DOI: 10.1109/ISCA45697.2020.00015
Xuantie-910: a commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension
Authors: Chen, Chen and Xiang, Xiaoyan and Liu, Chang and Shang, Yunhai and Guo, Ren and Liu, Dongqi and Lu, Yimin and Hao, Ziyi and Luo, Jiahui and Chen, Zhijian and Li, Chunqiang and Pu, Yu and Meng, Jianyi and Yan, Xiaolang and Xie, Yuan and Qi, Xiaoning
Keywords: vector, out of order, multi-core, memory architectures, extension, cache, RISC-V
Abstract
The open-source RISC-V ISA has been quickly gaining momentum. This paper presents Xuantie-910, an industry-leading 64-bit high-performance embedded RISC-V processor from the Alibaba T-Head division. It is fully based on the RV64GCV instruction set and features custom extensions for arithmetic operations, bit manipulation, loads and stores, and TLB and cache operations. It also implements the 0.7.1 stable release of the RISC-V vector extension specification for high-efficiency vector processing. Xuantie-910 supports multi-core, multi-cluster SMP with cache coherence. Each cluster contains 1 to 4 cores capable of booting the Linux operating system. Each core uses a state-of-the-art 12-stage deep-pipeline, out-of-order, multi-issue superscalar architecture, achieving a maximum clock frequency of 2.5 GHz under the typical process, voltage, and temperature condition in a TSMC 12 nm FinFET process technology. Each core with the vector execution unit occupies an area of 0.8 mm² (excluding the L2 cache). The toolchain is enhanced significantly to support the vector extension and the custom extensions. Through hardware and toolchain co-optimization, to date Xuantie-910 delivers the highest performance (in terms of IPC, speed, and power efficiency) for a number of industrial control-flow and data-computing benchmarks, when compared with its predecessors in the RISC-V family. The Xuantie-910 FPGA implementation has been deployed in the data centers of Alibaba Cloud for application-specific acceleration (e.g., blockchain transactions). ASIC deployment in low-cost SoC applications, such as IoT endpoints and edge computing, is planned to facilitate Alibaba's end-to-end and cloud-to-edge computing infrastructure.
DOI: 10.1109/ISCA45697.2020.00016
Divide and conquer frontend bottleneck
Authors: Ansari, Ali and Lotfi-Kamran, Pejman and Sarbazi-Azad, Hamid
Keywords: instruction and BTB prefetching, frontend bottleneck, divide and conquer
Abstract
The frontend stalls caused by instruction and BTB misses are a significant source of performance degradation in server processors. Prefetchers are commonly employed to mitigate the frontend bottleneck. However, next-line prefetchers, which are available in server processors, are incapable of eliminating a considerable number of L1 instruction misses. Temporal instruction prefetchers, on the other hand, effectively remove most of the instruction and BTB misses but impose significant area overhead. Recently, the old idea of BTB-directed instruction prefetching has been revived to address the limitations of temporal instruction prefetchers. While this approach leads to prefetchers with low area overhead, it requires significant changes to the frontend of a processor. Moreover, as this approach relies on the BTB content for prefetching, BTB misses stall the prefetcher and likely lead to costly instruction misses. Especially as instruction misses are usually more expensive than BTB misses, the dependence of instruction prefetching on the BTB content is harmful to workloads with very large instruction footprints. Moreover, BTB-directed instruction prefetchers, as proposed in prior work, cannot be applied to variable-length ISAs. In this work, we showcase the harmful effects of making instruction prefetchers depend on the BTB content. Moreover, we divide the frontend bottleneck into three categories and use a divide-and-conquer approach to propose simple and effective solutions for each one. Sequential misses are covered by an accurate and timely sequential prefetcher named SN4L, a lightweight discontinuity prefetcher named Dis eliminates discontinuity misses, and BTB misses are reduced by pre-decoding the prefetched blocks. We also discuss how our proposal can be used for variable-length ISAs with low storage overhead. Our proposal, SN4L+Dis+BTB, imposes the same area overhead as the state-of-the-art BTB-directed prefetcher and, at the same time, outperforms it by 5% on average and up to 16%.
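For illustration, a generic next-4-line sequential prefetcher is sketched below; it only conveys the idea of covering sequential misses and is not the paper's SN4L design (block size and degree are assumed).

```python
# Generic next-4-line instruction prefetcher sketch (a simplification of the
# "sequential" idea; not the paper's SN4L). On an access to cache block B,
# the next DEGREE sequential blocks are queued for prefetch.
BLOCK_SIZE = 64
DEGREE = 4

def sequential_prefetch(access_addr: int):
    block = access_addr // BLOCK_SIZE
    return [(block + i) * BLOCK_SIZE for i in range(1, DEGREE + 1)]

print([hex(a) for a in sequential_prefetch(0x40_1000)])
```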
DOI: 10.1109/ISCA45697.2020.00017
Focused value prediction
Authors: Bandishte, Sumeet and Gaur, Jayesh and Sperber, Zeev and Rappoport, Lihu and Yoaz, Adi and Subramoney, Sreenivas
Keywords: No keywords
Abstract
Value prediction was proposed to speculatively break true data dependencies, thereby allowing out-of-order (OOO) processors to achieve higher instruction-level parallelism (ILP) and gain performance. State-of-the-art value predictors try to maximize the number of instructions that can be value predicted, with the belief that higher coverage will unlock more ILP and increase performance. Unfortunately, this comes at increased complexity, with implementations that require multiple different types of value predictors working in tandem, incurring substantial area and power cost. In this paper we argue for lower-coverage but focused value prediction. Instead of aggressively increasing the coverage of value prediction at the cost of higher area and power, we refocus value prediction as a mechanism to achieve early execution of instructions that frequently create performance bottlenecks in the OOO processor. Since we do not aim for high coverage, our implementation is lightweight, needing just 1.2 KB of storage. Simulation results on 60 diverse workloads show that we deliver a 3.3% performance gain over a baseline similar to the Intel Skylake processor. This performance gain increases substantially to 8.6% when we simulate a futuristic up-scaled version of Skylake. In contrast, for the same storage, state-of-the-art value predictors deliver much lower speedups of 1.7% and 4.7%, respectively. Notably, our proposal matches the performance of these predictors even when they are given nearly eight times the storage and have 60% more prediction coverage than our solution.
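A minimal last-value predictor with a saturating confidence counter, shown below, illustrates the kind of mechanism value predictors build on; it is a generic textbook sketch with arbitrary thresholds, not the paper's 1.2 KB focused predictor.

```python
# Generic last-value predictor with a saturating confidence counter
# (illustrative; not the paper's focused design).
CONF_MAX, CONF_PREDICT = 7, 6

table = {}  # pc -> (last_value, confidence)

def predict(pc: int):
    v, conf = table.get(pc, (None, 0))
    return v if conf >= CONF_PREDICT else None   # None = do not speculate

def update(pc: int, actual: int):
    v, conf = table.get(pc, (None, 0))
    if v == actual:
        table[pc] = (v, min(conf + 1, CONF_MAX))
    else:
        table[pc] = (actual, 0)                  # reset confidence on a wrong value

for produced in (10, 10, 10, 10, 10, 10, 10, 12):
    guess = predict(0x1234)
    update(0x1234, produced)
```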
DOI: 10.1109/ISCA45697.2020.00018
Auto-predication of critical branches
Authors: Chauhan, Adarsh and Gaur, Jayesh and Sperber, Zeev and Sala, Franck and Rappoport, Lihu and Yoaz, Adi and Subramoney, Sreenivas
Keywords: run-time throttling, microarchitecture, dynamic predication, control flow convergence
Abstract
Advancements in branch predictors have allowed modern processors to aggressively speculate and gain significant performance with every generation of increasing out-of-order depth and width. Unfortunately, there are branches that are still hard-to-predict (H2P) and mis-speculation on these branches is severely limiting the performance scalability of future processors. One potential solution to mitigate this problem is to predicate branches by substituting control dependencies with data dependencies. Predication is very costly for performance as it inhibits instruction level parallelism. To overcome this limitation, prior works selectively applied predication at run-time on H2P branches that have low confidence of branch prediction. However, these schemes do not fully comprehend the delicate trade-offs involved in suppressing speculation and can suffer from performance degradation on certain workloads. Additionally, they need significant changes not just to the hardware but also to the compiler and the instruction set architecture, rendering their implementation complex and challenging. In this paper, by analyzing the fundamental trade-offs between branch prediction and predication, we propose Auto-Predication of Critical Branches (ACB) — an end-to-end hardware-based solution that intelligently disables speculation only on branches that are critical for performance. Unlike existing approaches, ACB uses a sophisticated performance monitoring mechanism to gauge the effectiveness of dynamic predication, and hence does not suffer from performance inversions. Our simulation results show that, with just 386 bytes of additional hardware and no software support, ACB delivers 8% performance gain over a baseline similar to the Skylake processor. We also show that ACB reduces pipeline flushes because of mis-speculations by 22%, thus effectively helping both power and performance.
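The sketch below illustrates the underlying if-conversion idea and a simple break-even condition for when suppressing speculation pays off; the costs are assumed for illustration and do not reflect ACB's actual monitoring mechanism.

```python
# If-conversion replaces a branch with a data-dependent select, trading a
# possible pipeline flush for a longer dependence chain. The break-even
# condition here uses assumed, illustrative costs (not ACB's heuristics).
def branchy(cond, a, b):
    if cond:                   # may mispredict and flush the pipeline
        return a
    return b

def predicated(cond, a, b):
    # both sides are nominally "executed"; the result is selected by the
    # condition, analogous to a conditional-move in hardware
    return a if cond else b

def worth_predicating(misp_rate, flush_penalty=15, serialization_cost=4):
    """Predicate only when the expected flush cost exceeds the added latency."""
    return misp_rate * flush_penalty > serialization_cost

assert branchy(True, 1, 2) == predicated(True, 1, 2)
print(worth_predicating(0.10))   # False: keep predicting this branch
print(worth_predicating(0.40))   # True: predicate this critical branch
```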
DOI: 10.1109/ISCA45697.2020.00019
Slipstream processors revisited: exploiting branch sets
Authors: Srinivasan, Vinesh and Chowdhury, Rangeen Basu Roy and Rotenberg, Eric
Keywords: prefetching, pre-execution, helper threads, hard-to-predict branch, delinquent load, control independence, branch prediction
Abstract
Delinquent branches and loads remain key performance limiters in some applications. One approach to mitigate them is pre-execution. Broadly, there are two classes of pre-execution: one class repeatedly forks small helper threads, each targeting an individual dynamic instance of a delinquent branch or load; the other class begins with two redundant threads in a leader-follower arrangement, and speculatively reduces the leading thread. The objective of this paper is to design a new pre-execution microarchitecture that meets four criteria: (i) retains the simpler coordination of a leader-follower microarchitecture, (ii) is fully automated with just hardware, (iii) targets both branches and loads, and (iv) is effective. We review prior pre-execution proposals and show that none of them meet all four criteria. We develop Slipstream 2.0 to meet all four criteria. The key innovation in the space of leader-follower architectures is to remove the forward control-flow slices of delinquent branches and loads from the leading thread. This innovation overcomes key limitations in the only other hardware-only leader-follower prior works: Slipstream and Dual Core Execution (DCE). Slipstream removes backward slices of confident branches to pre-execute unconfident branches, which is ineffective in phases dominated by unconfident branches when branch pre-execution is most needed. DCE is very effective at tolerating cache-missed loads, unless their dependent branches are mispredicted. Removing forward control-flow slices of delinquent branches and delinquent loads enables two firsts, respectively: (1) leader-follower-style branch pre-execution without relying on confident instruction removal, and (2) tolerance of cache-missed loads that feed mispredicted branches. For SPEC 2006/2017 SimPoints wherein Slipstream 2.0 is auto-enabled, it achieves geomean speedups of 67%, 60%, and 12% over the baseline (one core), Slipstream, and DCE, respectively.
DOI: 10.1109/ISCA45697.2020.00020
Bouquet of instruction pointers: instruction pointer classifier-based spatial hardware prefetching
Authors: Pakalapati, Samuel and Panda, Biswabandan
Keywords: hardware prefetching, caching
Abstract
Hardware prefetching is one of the common off-chip DRAM latency-hiding techniques. Though hardware prefetchers are ubiquitous in commercial machines and prefetching techniques are well studied in the computer architecture community, the "memory wall" problem still exists after decades of microarchitecture research and is considered an essential problem to solve. In this paper, we make a case for breaking the memory wall through data prefetching at the L1 cache. We propose a bouquet of hardware prefetchers that can handle a variety of access patterns driven by the control flow of an application. We name our proposal Instruction Pointer Classifier based spatial Prefetching (IPCP). We propose IPCP in two flavors: (i) an L1 spatial data prefetcher that classifies instruction pointers at the L1 cache level and issues prefetch requests based on the classification, and (ii) a multi-level IPCP where the IPCP at the L1 communicates the classification information to the L2 IPCP so that it can kick-start prefetching based on the classification done at the L1. Overall, IPCP is a simple, lightweight, and modular framework for L1 and multi-level spatial prefetching. IPCP at the L1 and L2 incurs a storage overhead of 740 bytes and 155 bytes, respectively. Our empirical results show that, for memory-intensive single-threaded SPEC CPU 2017 benchmarks, compared to a baseline system with no prefetching, IPCP provides an average performance improvement of 45.1%. For the entire SPEC CPU 2017 suite, it provides an improvement of 22%. In the case of multi-core systems, IPCP provides an improvement of 23.4% (evaluated over more than 1000 mixes). IPCP outperforms already high-performing state-of-the-art prefetchers such as SPP with PPF and Bingo while demanding 30x to 50x less storage.
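A minimal per-IP constant-stride classifier, sketched below, illustrates the kind of instruction-pointer classification IPCP builds on; the table organization, confidence threshold, and prefetch degree are assumptions, not the paper's 740-byte design.

```python
# Minimal per-IP constant-stride classifier and prefetcher (illustrative;
# not the paper's IPCP tables or its full set of classes).
DEGREE = 2
ip_table = {}  # ip -> {"last": addr, "stride": s, "conf": n}

def access(ip: int, addr: int):
    e = ip_table.setdefault(ip, {"last": addr, "stride": 0, "conf": 0})
    stride = addr - e["last"]
    if stride != 0 and stride == e["stride"]:
        e["conf"] = min(e["conf"] + 1, 3)
    else:
        e["stride"], e["conf"] = stride, 0
    e["last"] = addr
    if e["conf"] >= 2:   # classified as constant-stride: issue prefetches
        return [addr + e["stride"] * i for i in range(1, DEGREE + 1)]
    return []

for a in (0x1000, 0x1040, 0x1080, 0x10c0):
    print(access(0xdead, a))
```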
DOI: 10.1109/ISCA45697.2020.00021
MuonTrap: preventing cross-domain spectre-like attacks by capturing speculative state
Authors: Ainsworth, Sam and Jones, Timothy M.
Keywords: speculative execution, spectre, microarchitecture, hardware security
Abstract
The disclosure of the Spectre speculative-execution attacks in January 2018 left a severe vulnerability that systems are still struggling to patch. The solutions that currently exist tend to have incomplete coverage, perform badly, or have highly undesirable performance edge cases. MuonTrap allows processors to continue to speculate, avoiding significant reductions in performance, without impacting security. We instead prevent the propagation of any state based on speculative execution by placing the results of speculative cache accesses into a small, fast L0 filter cache that is non-inclusive, non-exclusive with the rest of the cache hierarchy. This isolates speculative state from all parts of the system that cannot be quickly cleared on any change in threat domain. MuonTrap uses these speculative filter caches, which are cleared on context and protection-domain switches, along with a series of extensions to the cache coherence protocol and prefetcher. This renders systems immune to cross-domain information leakage via Spectre and a host of similar attacks based on speculative execution, with low performance impact and few changes to the CPU design.
DOI: 10.1109/ISCA45697.2020.00022
Think fast: a tensor streaming processor (TSP) for accelerating deep learning workloads
Authors: Abts, Dennis and Ross, Jonathan and Sparling, Jonathan and Wong-VanHaren, Mark and Baker, Max and Hawkins, Tom and Bell, Andrew and Thompson, John and Kahsai, Temesghen and Kimmell, Garrin and Hwang, Jennifer and Leslie-Hurd, Rebekah and Bye, Michael and Creswick, E. R. and Boyd, Matthew and Venigalla, Mahitha and Laforge, Evan and Purdy, Jon and Kamath, Purushotham and Maheshwari, Dinesh and Beidler, Michael and Rosseel, Geert and Ahmad, Omar and Gagarin, Gleb and Czekalski, Richard and Rane, Ashay and Parmar, Sahil and Werner, Jeff and Sproch, Jim and Macias, Adrian and Kurtz, Brian
Keywords: No keywords
Abstract
In this paper, we introduce the Tensor Streaming Processor (TSP) architecture, a functionally-sliced microarchitecture with memory units interleaved with vector and matrix deep-learning functional units in order to take advantage of the dataflow locality of deep-learning operations. The TSP is built on two key observations: (1) machine-learning workloads exhibit abundant data parallelism, which can be readily mapped to tensors in hardware, and (2) a simple and deterministic processor with a producer-consumer stream programming model enables precise reasoning about and control of hardware components, achieving good performance and power efficiency. The TSP is designed to exploit parallelism inherent in machine-learning workloads, including instruction-level parallelism, memory concurrency, and data and model parallelism, while guaranteeing determinism by eliminating all reactive elements in the hardware (e.g., arbiters and caches). Early ResNet50 image classification results demonstrate 20.4K processed images per second (IPS) with a batch-size of one—a 4x improvement compared to other modern GPUs and accelerators [44]. Our first ASIC implementation of the TSP architecture yields a computational density of more than 1 TeraOp/s per square mm of silicon for its 25x29 mm 14 nm chip operating at a nominal clock frequency of 900 MHz. The TSP demonstrates a novel hardware-software approach to achieve fast, yet predictable, performance on machine-learning workloads within a desired power envelope.
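The quoted density and die size imply a chip-level throughput figure; the arithmetic below is only that implication of the abstract's numbers, not an additional result from the paper.

```latex
25\,\mathrm{mm} \times 29\,\mathrm{mm} = 725\,\mathrm{mm}^2,
\qquad
725\,\mathrm{mm}^2 \times \left(> 1\ \tfrac{\mathrm{TeraOp/s}}{\mathrm{mm}^2}\right)
\;\Rightarrow\; \text{more than } 725\ \mathrm{TeraOp/s}\ \text{per chip}
```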
DOI: 10.1109/ISCA45697.2020.00023
T4: compiling sequential code for effective speculative parallelization in hardware
Authors: Ying, Victor A. and Jeffrey, Mark C. and Sanchez, Daniel
Keywords: No keywords
Abstract
Multicores are now ubiquitous, but programmers still write sequential code. Speculative parallelization is an enticing approach to parallelize code while retaining the ease of sequential programming, making parallelism pervasive. However, prior speculative parallelizing compilers and architectures achieved limited speedups due to high costs of recovering from misspeculation and hardware scalability bottlenecks. We present T4, a parallelizing compiler that successfully leverages recent hardware features for speculative execution, which present new opportunities and challenges for automatic parallelization. T4 transforms sequential programs into trees of tiny timestamped tasks. T4 introduces novel compiler techniques to expose parallelism aggressively across the entire program, breaking applications into tiny tasks of tens of instructions each. Task trees unfold their branches in parallel to enable high task-spawn throughput while exploiting selective aborts to recover from misspeculation cheaply. T4 exploits parallelism across function calls, loops, and loop nests; performs new transformations to reduce task spawn costs and avoid false sharing; and exploits data locality among fine-grain tasks. As a result, T4 scales several hard-to-parallelize SPEC CPU2006 benchmarks to tens of cores, on which prior work attained little or no speedup.
DOI: 10.1109/ISCA45697.2020.00024
Efficiently supporting dynamic task parallelism on heterogeneous cache-coherent systems
Authors: Wang, Moyang and Ta, Tuan and Cheng, Lin and Batten, Christopher
Keywords: No keywords
Abstract
Manycore processors, with tens to hundreds of tiny cores but no hardware-based cache coherence, can offer tremendous peak throughput on highly parallel programs while being complexity- and energy-efficient. Manycore processors can be combined with a few high-performance big cores for executing operating systems, legacy code, and serial regions. These systems use heterogeneous cache coherence (HCC), with hardware-based cache coherence between big cores and software-centric cache coherence between tiny cores. Unfortunately, programming these heterogeneous cache-coherent systems to enable collaborative execution is challenging, especially when considering dynamic task parallelism. This paper seeks to address this challenge using a combination of light-weight software and hardware techniques. We provide a detailed description of how to implement a work-stealing runtime to enable dynamic task parallelism on heterogeneous cache-coherent systems. We also propose direct task stealing (DTS), a new technique based on user-level interrupts to bypass the memory system and thus improve the performance and energy efficiency of work stealing. Our results demonstrate that executing dynamic task-parallel applications on a 64-core system (4 big, 60 tiny) with complexity-effective HCC and DTS can achieve: 7x speedup over a single big core; 1.4x speedup over an area-equivalent eight big-core system with hardware-based cache coherence; and 21% better performance and similar energy efficiency compared to a 64-core system (4 big, 60 tiny) with full-system hardware-based cache coherence.
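A textbook work-stealing deque is sketched below to illustrate the runtime structure being discussed; it uses a coarse lock for simplicity and is not the paper's HCC-aware runtime or its DTS mechanism.

```python
# Textbook work-stealing deque (illustrative; not the paper's runtime or DTS).
# The owning worker pushes/pops at the bottom; idle workers steal from the top.
import collections
import threading

class WorkStealingDeque:
    def __init__(self):
        self._dq = collections.deque()
        self._lock = threading.Lock()    # real runtimes use lock-free deques

    def push(self, task):                # owner only
        with self._lock:
            self._dq.append(task)

    def pop(self):                       # owner only (LIFO end)
        with self._lock:
            return self._dq.pop() if self._dq else None

    def steal(self):                     # thieves (FIFO end)
        with self._lock:
            return self._dq.popleft() if self._dq else None

dq = WorkStealingDeque()
dq.push("task_a")
dq.push("task_b")
print(dq.pop(), dq.steal())  # owner gets task_b, a thief gets task_a
```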
DOI: 10.1109/ISCA45697.2020.00025
Flick: fast and lightweight ISA-crossing call for heterogeneous-ISA environments
Authors: Cho, Shenghsun and Chen, Han and Madaminov, Sergey and Ferdman, Michael and Milder, Peter
Keywords: virtual memory, thread migration, multi-core, heterogeneous-ISA, FPGA
Abstract
Heterogeneous-ISA multi-core systems have performance and power consumption benefits. Today, numerous system components, such as NVRAMs and Smart NICs, already have built-in processor cores with ISAs different from those of the host CPUs, making many modern systems heterogeneous-ISA multi-core systems. Unfortunately, programming and using such systems efficiently is difficult and requires extensive support from the host operating systems. Existing programming solutions are complex, require dramatic changes to the systems, and often incur significant performance overheads. To address this challenge, we propose Flick: Fast and Lightweight ISA-Crossing Call, for migrating threads in heterogeneous-ISA multi-core systems. By leveraging hardware virtual memory support and standard operating system mechanisms, a software thread can transparently migrate between cores with different ISAs. We prototype a heterogeneous-ISA multi-core system using FPGAs with off-the-shelf hardware and software to evaluate Flick. Experiments with microbenchmarks and a BFS application show that Flick requires only minor changes to the existing OS and software, and incurs only an 18 μs round-trip overhead for migrating a thread through PCIe, which is at least 23x faster than prior work.
DOI: 10.1109/ISCA45697.2020.00026
The NeBuLa RPC-optimized architecture
Authors: Sutherland, Mark and Gupta, Siddharth and Falsafi, Babak and Marathe, Virendra and Pnevmatikatos, Dionisios and Daglis, Alexandros
Keywords: queuing theory, network protocols, memory hierarchy, client/server and multitier systems
Abstract
Large-scale online services are commonly structured as a network of software tiers, which communicate over the datacenter network using RPCs. Ongoing trends towards software decomposition have led to the prevalence of tiers receiving and generating RPCs with runtimes of only a few microseconds. With such small software runtimes, even the smallest latency overheads in RPC handling have a significant relative performance impact. In particular, we find that growing network bandwidth introduces queuing effects within a server’s memory hierarchy, considerably hurting the response latency of fine-grained RPCs. In this work we introduce NeBuLa, an architecture optimized to accelerate the most challenging microsecond-scale RPCs, by leveraging two novel mechanisms to drastically improve server throughput under strict tail latency goals. First, NeBuLa reduces detrimental queuing at the memory controllers via hardware support for efficient in-LLC network buffer management. Second, NeBuLa’s network interface steers incoming RPCs into the CPU cores’ L1 caches, improving RPC startup latency. Our evaluation shows that NeBuLa boosts the throughput of a state-of-the-art key-value store by 1.25-2.19x compared to existing proposals, while maintaining strict tail latency goals.
DOI: 10.1109/ISCA45697.2020.00027
Printed microprocessors
Authors: Bleier, Nathaniel and Mubarik, Muhammad Husnain and Rasheed, Farhan and Aghassi-Hagmann, Jasmin and Tahoori, Mehdi B and Kumar, Rakesh
Keywords: No keywords
Abstract
Printed electronics holds the promise of meeting the cost and conformality needs of emerging disposable and ultra-low-cost-margin applications. Recent printed circuit technologies also have low supply voltages and can, therefore, be battery-powered. In this paper, we explore the design space of microprocessors implemented in such printing technologies - these printed microprocessors will be needed for battery-powered applications with requirements of low cost, conformality, and programmability. To enable this design space exploration, we first present standard cell libraries for the EGFET and CNT-TFT printed technologies - to the best of our knowledge, these are the first synthesis- and physical-design-ready standard cell libraries for any low-voltage printing technology. We then present an area, power, and delay characterization of several off-the-shelf low-gate-count microprocessors (Z80, light8080, ZPU, and openMSP430) in EGFET and CNT-TFT technologies. Our characterization shows that several printing applications can be feasibly targeted by battery-powered printed microprocessors. However, our results also show the need to significantly reduce the area and power of such printed microprocessors. We perform a design space exploration of printed microprocessor architectures over multiple parameters - data widths, pipeline depth, etc. We show that the best cores outperform pre-existing cores by at least one order of magnitude in terms of power and area. Finally, we show that printing-specific architectural and low-level optimizations further improve the area and power characteristics of low-voltage battery-compatible printed microprocessors. A program-specific ISA, for example, improves power and area by up to 4.18x and 1.93x, respectively. A crosspoint-based instruction ROM outperforms a RAM-based design by 5.77x, 16.8x, and 2.42x in terms of power, area, and delay, respectively.
DOI: 10.1109/ISCA45697.2020.00028
SysScale: exploiting multi-domain dynamic voltage and frequency scaling for energy efficient mobile processors
Authors: Haj-Yahya, Jawad and Alser, Mohammed and Kim, Jeremie and Yağlıkçı, A. Giray
Keywords: No keywords
Abstract
There are three domains in a modern thermally-constrained mobile system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC typically allocates a fixed power budget, corresponding to worst-case performance demands, to the IO and memory domains even if they are underutilized. The resulting unfair allocation of the power budget across domains can cause two major issues: 1) the IO and memory domains can operate at a higher frequency and voltage than necessary, increasing power consumption, and 2) the unused power budget of the IO and memory domains cannot be used to increase the throughput of the compute domain, hampering performance. To avoid these issues, it is crucial to dynamically orchestrate the distribution of the SoC power budget across the three domains based on their actual performance demands. We propose SysScale, a new multi-domain power management technique to improve the energy efficiency of mobile SoCs. SysScale is based on three key ideas. First, SysScale introduces an accurate algorithm to predict the performance (e.g., bandwidth and latency) demands of the three SoC domains. Second, SysScale uses a new DVFS (dynamic voltage and frequency scaling) mechanism to distribute the SoC power to each domain according to the predicted performance demands. This mechanism is designed to minimize the significant latency overheads associated with applying DVFS across multiple domains. Third, in addition to using a global DVFS mechanism, SysScale uses domain-specialized techniques to optimize the energy efficiency of each domain at different operating points. We implement SysScale on an Intel Skylake microprocessor for mobile devices and evaluate it using a wide variety of SPEC CPU2006, graphics (3DMark), and battery-life workloads (e.g., video playback). On a 2-core Skylake, SysScale improves the performance of SPEC CPU2006 and 3DMark workloads by up to 16% and 8.9% (9.2% and 7.9% on average), respectively. For battery-life workloads, which typically have fixed performance demands, SysScale reduces the average power consumption by up to 10.7% (8.5% on average), while meeting performance demands.
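A toy sketch of the budget-redistribution idea follows: predict per-domain demand and hand unused IO/memory budget to the compute domain. The budget, floors, and demands are invented numbers, and this is not Intel's SysScale algorithm or its DVFS interface.

```python
# Toy sketch of redistributing a fixed SoC power budget across domains based
# on predicted demand (illustrative only; not SysScale's actual mechanism).
TOTAL_BUDGET_W = 15.0
FLOOR_W = {"io": 0.5, "memory": 1.0}   # assumed minimum operating points

def distribute(predicted_demand_w: dict) -> dict:
    alloc = {}
    for dom in ("io", "memory"):
        alloc[dom] = max(FLOOR_W[dom], predicted_demand_w.get(dom, 0.0))
    # whatever the IO and memory domains do not need goes to compute
    alloc["compute"] = TOTAL_BUDGET_W - alloc["io"] - alloc["memory"]
    return alloc

print(distribute({"io": 0.6, "memory": 1.8, "compute": 20.0}))
# -> compute receives 12.6 W instead of a fixed worst-case split
```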
DOI: 10.1109/ISCA45697.2020.00029
Déjà View: spatio-temporal compute reuse for energy-efficient 360° VR video processing
Authors: Zhao, Shulin and Zhang, Haibo and Bhuyan, Sandeepa and Mishra, Cyan Subhra and Ying, Ziyu and Kandemir, Mahmut T. and Sivasubramaniam, Anand and Das, Chita R.
Keywords: virtual reality, edge computing, IoT, 360° video processing
Abstract
The emergence of virtual reality (VR) and augmented reality (AR) has revolutionized our lives by enabling 360° artificial sensory stimulation across diverse domains, including, but not limited to, sports, media, healthcare, and gaming. Unlike conventional planar video processing, where memory access is the main bottleneck, in 360° VR videos the compute is the primary bottleneck and contributes more than 50% of the energy consumption in battery-operated VR headsets. Thus, improving the computational efficiency of the VR video processing pipeline is critical. While prior efforts have attempted to address this problem through acceleration using a GPU or FPGA, none of them has analyzed the 360° VR pipeline to examine whether there is any scope to optimize the computation with known techniques such as memoization. Thus, in this paper, we analyze the VR computation pipeline and observe that there is significant scope to skip computations by leveraging the temporal and spatial locality in head orientation and eye correlations, respectively, resulting in computation reduction and energy efficiency. The proposed Déjà View design exploits this compute-reuse opportunity in the 360° VR video processing pipeline.
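The compute-reuse idea can be illustrated with generic memoization keyed on a quantized head orientation, as sketched below; the granularity and the stand-in kernel are assumptions, not the paper's pipeline stages or reuse policy.

```python
# Generic memoization keyed on a quantized head orientation (illustrative of
# the compute-reuse idea only; not the paper's design). Nearby orientations
# map to the same key and reuse a cached result instead of recomputing.
import math

QUANT_DEG = 2.0          # assumed reuse granularity in degrees
cache = {}

def expensive_projection(yaw, pitch):       # stand-in for the real kernel
    return (math.cos(math.radians(yaw)), math.sin(math.radians(pitch)))

def render_viewport(yaw_deg: float, pitch_deg: float):
    key = (round(yaw_deg / QUANT_DEG), round(pitch_deg / QUANT_DEG))
    if key in cache:
        return cache[key]                   # reuse: skip the expensive stage
    result = expensive_projection(yaw_deg, pitch_deg)
    cache[key] = result
    return result

render_viewport(10.1, 0.4)
render_viewport(10.6, 0.2)                  # hits the cached entry
```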
DOI: 10.1109/ISCA45697.2020.00030
Genesis: a hardware acceleration framework for genomic data analysis
Authors: Ham, Tae Jun and Bruns-Smith, David and Sweeney, Brendan and Lee, Yejin and Seo, Seong Hoon and Song, U Gyeong and Oh, Young H. and Asanovic, Krste and Lee, Jae W. and Wills, Lisa Wu
Keywords: hardware accelerator, genomic data analysis, genome sequencing, SQL, FPGA
Abstract
In this paper, we describe our vision to accelerate algorithms in the domain of genomic data analysis by proposing a framework called Genesis (genome analysis) that contains an interface and an implementation of a system that processes genomic data efficiently. This framework can be deployed in the cloud and exploit the FPGAs-as-a-service paradigm to provide cost-efficient secondary DNA analysis. We propose conceptualizing genomic reads and associated read attributes as a very large relational database and using extended SQL as a domain-specific language to construct queries that form various data manipulation operations. To accelerate such queries, we design a Genesis hardware library which consists of primitive hardware modules that can be composed to construct a dataflow architecture specialized for those queries. As a proof of concept for the Genesis framework, we present the architecture and the hardware implementation of several genomic analysis stages in the secondary analysis pipeline corresponding to the best-known software analysis toolkit, the GATK4 workflow proposed by the Broad Institute. We walk through the construction of genomic data analysis operations using a sequence of SQL-style queries and show how Genesis hardware library modules can be utilized to construct the hardware pipelines designed to accelerate such queries. We exploit parallelism and data reuse by utilizing a dataflow architecture along with the use of on-chip scratchpads as well as non-blocking APIs to manage the accelerators, allowing concurrent execution of the accelerator and the host. Our accelerated system deployed on a cloud FPGA performs up to 19.3x better than GATK4 running on a commodity multi-core Xeon server and obtains up to 15x better cost savings. We believe that if a software algorithm can be mapped onto a hardware library to utilize the underlying accelerator(s) through an already-standardized software interface such as SQL, while allowing the efficient mapping of that interface to primitive hardware modules as we have demonstrated here, it will expedite the acceleration of domain-specific algorithms and allow the easy adaptation of algorithm changes.
DOI: 10.1109/ISCA45697.2020.00031
DSAGEN: synthesizing programmable spatial accelerators
Authors: Weng, Jian and Liu, Sihao and Dadu, Vidushi and Wang, Zhengrong and Shah, Preyas and Nowatzki, Tony
Keywords: No keywords
Abstract
Domain-specific hardware accelerators can provide orders of magnitude speedup and energy efficiency over general-purpose processors. However, they require extensive manual effort in hardware design and software stack development. Automated ASIC generation (e.g., HLS) can be insufficient, because the hardware becomes inflexible. An ideal accelerator generation framework would be automatable, enable deep specialization to the domain, and maintain a uniform programming interface. Our insight is that many prior accelerator architectures can be approximated by composing a small number of hardware primitives, specifically those from spatial architectures. With careful design, a compiler can understand how to use the available primitives, with modular and composable transformations, to take advantage of the features of a given program. This suggests a paradigm where accelerators can be generated by searching within such a rich accelerator design space, guided by the affinity of input programs for hardware primitives and their interactions. We use this approach to develop the DSAGEN framework, which automates the hardware/software co-design process for reconfigurable accelerators. For several existing accelerators, our evaluation demonstrates that the compiler can achieve 89% of the performance of manually tuned versions. For automated design space exploration, we target multiple sets of workloads that prior accelerators are designed for; the generated hardware achieves a mean 1.3x perf^2/mm^2 improvement over prior programmable accelerators.
DOI: 10.1109/ISCA45697.2020.00032
Bonsai: high-performance adaptive merge tree sorting
Authors: Samardzic, Nikola and Qiao, Weikang and Aggarwal, Vaibhav and Chang, Mau-Chung Frank and Cong, Jason
Keywords: performance modeling, merge sort, memory hierarchy, FPGA
Abstract
Sorting is a key computational kernel in many big data applications. Most sorting implementations focus on a specific input size, record width, and hardware configuration. This has created a wide array of sorters that are optimized only to a narrow application domain. In this work we show that merge trees can be implemented on FPGAs to offer state-of-the-art performance over many problem sizes. We introduce a novel merge tree architecture and develop Bonsai, an adaptive sorting solution that takes into consideration the off-chip memory bandwidth and the amount of on-chip resources to optimize sorting time. FPGA programmability allows us to leverage Bonsai to quickly implement the optimal merge tree configuration for any problem size and memory hierarchy. Using Bonsai, we develop a state-of-the-art sorter which specifically targets DRAM-scale sorting on AWS EC2 F1 instances. For 4-32 GB array sizes, our implementation has a minimum of 2.3x, 1.3x, and 1.2x and up to 2.5x, 3.7x, and 1.3x speedup over the best designs on CPUs, FPGAs, and GPUs, respectively. Our design exhibits 3.3x better bandwidth efficiency compared to the best previous sorting implementations. Finally, we demonstrate that Bonsai can tune our design over a wide range of problem sizes (megabyte to terabyte) and memory hierarchies including DDR DRAMs, high-bandwidth memories (HBMs), and solid-state disks (SSDs).
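The software analogue of the hardware merge tree is a multi-way merge of sorted runs; the snippet below shows that operation with the standard library, purely to illustrate what Bonsai's FPGA datapath streams in parallel, not how it is implemented.

```python
# Software view of the multi-way merge that a hardware merge tree performs in
# parallel (illustrative; Bonsai's tree is an FPGA datapath, not this code).
import heapq

sorted_runs = [
    [1, 4, 9, 12],
    [2, 3, 10, 20],
    [5, 6, 7, 8],
    [0, 11, 13, 14],
]

# heapq.merge lazily merges k sorted runs in O(n log k) comparisons,
# the same work a k-leaf merge tree streams through its levels.
merged = list(heapq.merge(*sorted_runs))
print(merged[:8])  # [0, 1, 2, 3, 4, 5, 6, 7]
```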
DOI: 10.1109/ISCA45697.2020.00033
SOFF: an OpenCL high-level synthesis framework for FPGAs
Authors: Jo, Gangwon and Kim, Heehoon and Lee, Jeesoo and Lee, Jaejin
Keywords: pipeline processing, parallel programming, high level synthesis, accelerator architectures, FPGAs
Abstract
Recently, OpenCL has been emerging as a programming model for energy-efficient FPGA accelerators. However, the state-of-the-art OpenCL frameworks for FPGAs suffer from poor performance and usability. This paper proposes a high-level synthesis framework of OpenCL for FPGAs, called SOFF. It automatically synthesizes a datapath to execute many OpenCL kernel threads in a pipelined manner. It also synthesizes an efficient memory subsystem for the datapath based on the characteristics of OpenCL kernels. Unlike previous high-level synthesis techniques, we propose a formal way to handle variable-latency instructions, complex control flows, OpenCL barriers, and atomic operations that appear in real-world OpenCL kernels. SOFF is the first OpenCL framework that correctly compiles and executes all applications in the SPEC ACCEL benchmark suite except three applications that require more FPGA resources than are available. In addition, SOFF achieves a speedup of 1.33x over the Intel FPGA SDK for OpenCL without any explicit user annotation or source code modification.
DOI: 10.1109/ISCA45697.2020.00034
Gorgon: accelerating machine learning from relational data
Authors: Vilim, Matthew and Rucker, Alexander and Zhang, Yaqi and Liu, Sophia and Olukotun, Kunle
Keywords: plasticine, machine learning, database, accelerator, Gorgon, CGRA
Abstract
Accelerator deployment in data centers remains limited despite domain-specific architectures' promise of higher performance. Rapidly-changing applications and high NRE cost make deploying fixed-function accelerators at scale untenable. More flexible than DSAs, FPGAs are gaining traction but remain hampered by cumbersome programming models, long synthesis times, and slow clocks. Coarse-grained reconfigurable architectures (CGRAs) are a compelling alternative and offer efficiency while retaining programmability—by providing general-purpose hardware and communication patterns, a single CGRA targets multiple application domains. One emerging application is in-database machine learning: a high-performance, low-friction interface for analytics on large databases. We co-locate database and machine learning processing in a unified reconfigurable data analytics accelerator, Gorgon, which flexibly shares resources between DB and ML without compromising performance or incurring excessive overheads in either domain. We distill and integrate database parallel patterns into an existing ML-focused CGRA, increasing area by less than 4% while outperforming a multicore software baseline by 1500x. We also explore the performance impact of unifying DB and ML in a single accelerator, showing up to 4x speedup over split accelerators.
DOI: 10.1109/ISCA45697.2020.00035
A specialized architecture for object serialization with applications to big data analytics
Authors: Jang, Jaeyoung and Jung, Sung Jun and Jeong, Sunmin and Heo, Jun and Shin, Hoon and Ham, Tae Jun and Lee, Jae W.
Keywords: object serialization, hardware-software co-design, domain-specific architecture, data analytics, apache spark
Abstract
Object serialization and deserialization (S/D) is an essential feature for efficient communication between distributed computing nodes with potentially non-uniform execution environments. S/D operations are widely used in big data analytics frameworks for remote procedure calls and massive data transfers such as shuffles. However, frequent S/D operations incur significant performance and energy overheads as they must traverse and process a large object graph. Prior approaches improve S/D throughput by effectively hiding disk or network I/O latency with computation, increasing the compression ratio, and/or application-specific customization. However, inherent dependencies in the existing (de)serialization formats and algorithms eventually become the major performance bottleneck. Thus, we propose Cereal, a specialized hardware accelerator for memory object serialization. By co-designing the serialization format with the hardware architecture, Cereal effectively utilizes the abundant parallelism in the S/D process to deliver high throughput. Cereal also employs an efficient object packing scheme to compress metadata such as object reference offsets and a space-efficient bitmap representation for the object layout. Our evaluation of Cereal using both a cycle-level simulator and synthesizable Chisel RTL demonstrates that Cereal delivers 43.4x higher average S/D throughput than 88 other S/D libraries on the Java Serialization Benchmark Suite. For six Spark applications, Cereal achieves 7.97x and 4.81x speedups on average for S/D operations over the Java built-in serializer and Kryo, respectively, while reducing S/D energy by 227.75x and 136.28x.
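The software S/D cost that Cereal targets comes from walking an object graph; the snippet below illustrates that baseline with Python's pickle on an arbitrary object graph (not the paper's benchmark suite, serializers, or Cereal's format).

```python
# Software serialization baseline: pickling an object graph walks every node,
# which is the per-object traversal cost that a serialization accelerator
# aims to remove (illustrative; Cereal's wire format is different).
import pickle
import time

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

root = Node(0, [Node(i, [Node(i * 10 + j) for j in range(10)]) for i in range(1000)])

start = time.perf_counter()
blob = pickle.dumps(root, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.perf_counter() - start
print(f"serialized {len(blob)} bytes in {elapsed * 1e3:.2f} ms")
restored = pickle.loads(blob)   # deserialization rebuilds the whole graph
```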
DOI: 10.1109/ISCA45697.2020.00036
CryoCore: a fast and dense processor architecture for cryogenic computing
Authors: Byun, Ilkwon and Min, Dongmoon and Lee, Gyu-hyeon and Na, Seongmin and Kim, Jangwoo
Keywords: simulation, modeling, cryogenic processor, cryogenic computing
Abstract
Cryogenic computing can achieve high performance and power efficiency by dramatically reducing the device's leakage power and wire resistance at low temperatures. Recent advances towards cryogenic computing focus on developing cryogenic-optimal cache and memory devices to overcome the memory capacity, latency, and power walls. However, little research has been conducted to develop a cryogenic-optimal core architecture despite its high potential for performance, power, and area efficiency. Once a cryogenic-optimal core becomes available, it will also take full advantage of the cryogenic-optimal cache and memory devices, which leads to a cryogenic-optimal computer. In this paper, we first develop CryoCore-Model (CC-Model), a cryogenic processor modeling framework which can accurately estimate the maximum clock frequency of processor models running at 77K. Next, driven by the modeling tool, we design CryoCore, a 77K-optimal core microarchitecture to maximize the core's performance and area efficiency while minimizing the cooling cost. The key idea of CryoCore is to architect a core in a way that reduces the size and number of cooling-unfriendly microarchitecture units and maximizes the potential of voltage and frequency scaling at 77K. Finally, we propose two half-sized but differently voltage-scaled CryoCore designs aiming for either maximum performance or maximum power efficiency. With both the conventional design and our designs integrated with cryogenic memories, our high-performance CryoCore design achieves 41% higher single-thread performance for the same power budget and 2x higher multi-thread performance for the same die area. Our low-power CryoCore design reduces the power cost by 38% without sacrificing single-thread performance.
DOI: 10.1109/ISCA45697.2020.00037
SpinalFlow: an architecture and dataflow tailored for spiking neural networks
Authors: Narayanan, Surya and Taht, Karl and Balasubramonian, Rajeev and Giacomin, Edouard and Gaillardon, Pierre-Emmanuel
Keywords: accelerators, SNNs, CNNs
Abstract
Spiking neural networks (SNNs) are expected to be part of the future AI portfolio, with heavy investment from industry and government, e.g., IBM TrueNorth, Intel Loihi. While Artificial Neural Network (ANN) architectures have taken large strides, few works have targeted SNN hardware efficiency. Our analysis of SNN baselines shows that at modest spike rates, SNN implementations exhibit significantly lower efficiency than accelerators for ANNs. This is primarily because SNN dataflows must consider neuron potentials for several ticks, introducing a new data structure and a new dimension to the reuse pattern. We introduce a novel SNN architecture, SpinalFlow, that processes a compressed, time-stamped, sorted sequence of input spikes. It adopts an ordering of computations such that the outputs of a network layer are also compressed, time-stamped, and sorted. All relevant computations for a neuron are performed in consecutive steps to eliminate neuron potential storage overheads. Thus, with better data reuse, we advance the energy efficiency of SNN accelerators by an order of magnitude. Even though the temporal aspect in SNNs prevents the exploitation of some reuse patterns that are more easily exploited in ANNs, at 4-bit input resolution and 90% input sparsity, SpinalFlow reduces average energy by 1.8x, compared to a 4-bit Eyeriss baseline. These improvements are seen for a range of networks and sparsity/resolution levels; SpinalFlow consumes 5x less energy and 5.4x less time than an 8-bit version of Eyeriss. We thus show that, depending on the level of observed sparsity, SNN architectures can be competitive with ANN architectures in terms of latency and energy for inference, thus lowering the barrier for practical deployment in scenarios demanding real-time learning.
DOI: 10.1109/ISCA45697.2020.00038
NEBULA: a neuromorphic spin-based ultra-low power architecture for SNNs and ANNs
Authors: Singh, Sonali and Sarma, Anup and Jao, Nicholas and Pattnaik, Ashutosh and Lu, Sen and Yang, Kezhou and Sengupta, Abhronil and Narayanan, Vijaykrishnan and Das, Chita R.
Keywords: neural nets, memory technologies, low power design, domain-specific architectures
Abstract
Brain-inspired cognitive computing has so far followed two major approaches - one uses multi-layered artificial neural networks (ANNs) to perform pattern-recognition-related tasks, whereas the other uses spiking neural networks (SNNs) to emulate biological neurons in an attempt to be as efficient and fault-tolerant as the brain. While there has been considerable progress in the former area due to a combination of effective training algorithms and acceleration platforms, the latter is still in its infancy due to the lack of both. SNNs have a distinct advantage over their ANN counterparts in that they are capable of operating in an event-driven manner, thus consuming very low power. Several recent efforts have proposed various SNN hardware design alternatives; however, these designs still incur considerable energy overheads. In this context, this paper proposes a comprehensive design spanning the device, circuit, architecture, and algorithm levels to build an ultra-low-power architecture for SNN and ANN inference. For this, we use spintronics-based magnetic tunnel junction (MTJ) devices that have been shown to function as both neuro-synaptic crossbars as well as thresholding neurons and can operate at ultra-low voltage and current levels. Using this MTJ-based neuron model and synaptic connections, we design a low-power chip that has the flexibility to be deployed for inference of SNNs, ANNs, as well as SNN-ANN hybrid networks - a distinct advantage compared to prior works. We demonstrate the competitive performance and energy efficiency of the SNNs as well as hybrid models on a suite of workloads. Our evaluations show that the proposed design, NEBULA, is up to 7.9x more energy efficient than a state-of-the-art design, ISAAC, in the ANN mode. In the SNN mode, our design is about 45x more energy-efficient than a contemporary SNN architecture, INXS. A power comparison between the NEBULA ANN and SNN modes indicates that the latter is at least 6.25x more power-efficient for the observed benchmarks.
DOI: 10.1109/ISCA45697.2020.00039
uGEMM: unary computing architecture for GEMM applications
Authors: Wu, Di and Li, Jingjie and Yin, Ruokai and Hsiao, Hsuan and Kim, Younghyun and Miguel, Joshua San
Keywords: No keywords
Abstract
General matrix multiplication (GEMM) is universal in various applications, such as signal processing, machine learning, and computer vision. Conventional GEMM hardware architectures based on binary computing exhibit low area and energy efficiency as they scale, due to the spatial nature of number representation and computing. Unary computing, on the other hand, can be performed with extremely simple processing units, often just a single logic gate. However, no efficient architectures for unary GEMM currently exist. In this paper, we present uGEMM, an area- and energy-efficient unary GEMM architecture enabled by novel arithmetic units. The proposed design relaxes previously-imposed constraints on input bit streams—low correlation and long stream length—and achieves superior area and energy efficiency over existing unary systems. Furthermore, uGEMM's output bit streams exhibit higher accuracy and faster convergence, enabling dynamic energy-accuracy scaling on resource-constrained systems.
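For context, classic unary (stochastic) multiplication reduces to an AND gate over bit streams, as sketched below; this naive version needs long, uncorrelated streams, which are exactly the constraints the paper says its arithmetic units relax. The stream length and values are arbitrary.

```python
# Classic unary (stochastic) multiplication: a value in [0, 1] is encoded as
# the probability of a 1 in its bit stream, and an AND gate multiplies two
# independent streams (illustrative baseline; not uGEMM's arithmetic units).
import random

def to_stream(value: float, length: int):
    return [1 if random.random() < value else 0 for _ in range(length)]

def stream_value(bits):
    return sum(bits) / len(bits)

a, b, n = 0.5, 0.6, 1 << 14
product_stream = [x & y for x, y in zip(to_stream(a, n), to_stream(b, n))]
print(stream_value(product_stream))   # close to 0.30 for long streams
```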
DOI: 10.1109/ISCA45697.2020.00040
Hardware-software co-design for brain-computer interfaces
Authors: Karageorgos, Ioannis and Sriram, Karthik and Veselý, Ján
Keywords: No keywords
Abstract
Brain-computer interfaces (BCIs) offer avenues to treat neurological disorders, shed light on brain function, and interface the brain with the digital world. Their wider adoption rests, however, on achieving adequate real-time performance, meeting stringent power constraints, and adhering to FDA-mandated safety requirements for chronic implantation. BCIs have, to date, been designed as custom ASICs for specific diseases or for specific tasks in specific brain regions. General-purpose architectures that can be used to treat multiple diseases and enable various computational tasks are needed for wider BCI adoption, but the conventional wisdom is that such systems cannot meet necessary performance and power constraints. We present HALO (Hardware Architecture for LOw-power BCIs), a general-purpose architecture for implantable BCIs. HALO enables tasks such as treatment of disorders (e.g., epilepsy, movement disorders), and records/processes data for studies that advance our understanding of the brain. We use electrophysiological data from the motor cortex of a non-human primate to determine how to decompose HALO's computational capabilities into hardware building blocks. We simplify, prune, and share these building blocks to judiciously use available hardware resources while enabling many modes of brain-computer interaction. The result is a configurable heterogeneous array of hardware processing elements (PEs). The PEs are configured by a low-power RISC-V micro-controller into signal processing pipelines that meet the target performance and power constraints necessary to deploy HALO widely and safely.
DOI: 10.1109/ISCA45697.2020.00041
Heat to power: thermal energy harvesting and recycling for warm water-cooled datacenters
Authors: Zhu, Xinhui and Jiang, Weixiang and Liu, Fangming and Zhang, Qixia and Pan, Li and Chen, Qiong and Jia, Ziyang
Keywords: warm water cooling, thermoelectric generator, thermal energy harvesting, energy recycling, datacenter energy
Abstract
Warm water cooling has been regarded as a promising method to improve the energy efficiency of water-cooled datacenters. In warm water-cooling systems, hot spots are a common problem, for which hybrid cooling architectures integrating thermoelectric coolers (TECs) have emerged as a new remedy. Equipped with this architecture, the inlet water temperature can be raised higher, which provides more opportunities for heat recycling. However, currently, the heat absorbed from the server components is ejected directly into the water without being recycled, which wastes energy. In order to further improve energy efficiency, we propose Heat to Power (H2P), an economical and energy-recycling warm water cooling architecture, where thermoelectric generators (TEGs) harvest thermal energy from the "used" warm water and generate electricity for reuse in datacenters. Specifically, we propose several efficient optimization methods, including an economical water circulation design, fine-grained adjustment of the cooling settings, and dynamic workload scheduling to increase the power generated by the TEGs. We evaluate H2P based on a real hardware prototype and cluster traces from Google and Alibaba. Experimental results show that TEGs equipped with our optimization methods generate on average 4.349 W, 4.203 W, and 3.979 W of electricity per CPU (4.177 W overall) under the drastic, irregular, and common workload traces, respectively. The power reusing efficiency (PRE) can reach 12.8%-16.2% (14.23% on average), and the total cost of ownership (TCO) of datacenters can be reduced by up to 0.57%.
DOI: 10.1109/ISCA45697.2020.00042
GraphABCD: scaling out graph analytics with asynchronous block coordinate descent
作者: Yang, Yifan and Li, Zhaoshi and Deng, Yangdong and Liu, Zhiwei and Yin, Shouyi and Wei, Shaojun and Liu, Leibo
关键词: heterogeneous architecture, hardware accelerator, graph analytics, FPGA
Abstract
It is of vital importance to efficiently process large graphs for many data-intensive applications. As a result, a large collection of graph analytic frameworks has been proposed to improve the per-iteration performance on a single kind of computation resource. However, heavy coordination and synchronization overhead make it hard to scale out graph analytic frameworks from a single platform to heterogeneous platforms. Furthermore, increasing the convergence rate, i.e., reducing the number of iterations, which is equally vital for improving the overall performance of iterative graph algorithms, receives much less attention. In this paper, we introduce the Block Coordinate Descent (BCD) view of graph algorithms and propose an asynchronous heterogeneous graph analytic framework, GraphABCD, using the BCD view. The BCD view offers key insights and trade-offs on achieving a high convergence rate for iterative graph algorithms. GraphABCD features fast convergence under the algorithm design options suggested by BCD. GraphABCD offers algorithmic and architectural support for asynchronous execution, without undermining its fast convergence properties. With minimum synchronization overhead, GraphABCD is able to scale out to heterogeneous and distributed accelerators efficiently. To demonstrate GraphABCD, we prototype the whole system on the Intel HARPv2 CPU-FPGA heterogeneous platform. Evaluations on HARPv2 show that GraphABCD achieves geometric-mean speedups of 4.8x in convergence rate and 2.0x in execution time over GraphMat, a state-of-the-art framework.
DOI: 10.1109/ISCA45697.2020.00043
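To make the Block Coordinate Descent view concrete, here is a minimal sketch assuming PageRank as the iterative algorithm: vertices are partitioned into blocks, and each block update reads the freshest values of all other vertices (Gauss-Seidel style), which is the asynchronous execution the abstract contrasts with bulk-synchronous iteration. This illustrates the BCD view only, not GraphABCD's FPGA implementation.

```python
# Illustrative sketch (not GraphABCD's implementation): PageRank viewed as
# block coordinate descent. Blocks are updated in place, so later blocks see
# the freshest values, which typically needs fewer sweeps than bulk-
# synchronous (Jacobi-style) iteration.
import random

def pagerank_bcd(out_edges, num_blocks=4, d=0.85, tol=1e-8, max_sweeps=100):
    n = len(out_edges)
    rank = [1.0 / n] * n
    in_edges = [[] for _ in range(n)]
    for u, nbrs in enumerate(out_edges):
        for v in nbrs:
            in_edges[v].append(u)
    blocks = [list(range(b, n, num_blocks)) for b in range(num_blocks)]
    for _ in range(max_sweeps):
        delta = 0.0
        for block in random.sample(blocks, len(blocks)):   # arbitrary block order
            for v in block:
                new = (1 - d) / n + d * sum(
                    rank[u] / len(out_edges[u]) for u in in_edges[v])
                delta = max(delta, abs(new - rank[v]))
                rank[v] = new                # in-place update: later blocks see it
        if delta < tol:
            break
    return rank

# toy 4-vertex graph given as adjacency (out-edge) lists
print(pagerank_bcd([[1], [2], [0, 3], [0]]))
```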
GaaS-X: graph analytics accelerator supporting sparse data representation using crossbar architectures
作者: Challapalle, Nagadastagiri and Rampalli, Sahithi and Song, Linghao and Chandramoorthy, Nandhini and Swaminathan, Karthik and Sampson, John and Chen, Yiran and Narayanan, Vijaykrishnan
关键词: sparsity, processing-in-memory, graph processing, crossbar memory, SpMV
Abstract
Graph analytics applications are ubiquitous in this era of a connected world. These applications have very low compute-to-byte-transferred ratios and exhibit poor locality, which limits their computational efficiency on general-purpose computing systems. Conventional hardware accelerators employ custom dataflow and memory hierarchy organization to overcome these challenges. Processing-in-memory (PIM) accelerators leverage massively parallel compute-capable memory arrays to perform in-situ operations on graph data or employ custom compute elements near the memory to leverage larger internal bandwidths. In this work, we present GaaS-X, a graph analytics accelerator that inherently supports sparse graph data representations using an in-situ compute-enabled crossbar memory architecture. We alleviate the overheads of redundant writes, sparse-to-dense conversions, and redundant computations on the invalid edges that are present in state-of-the-art crossbar-based PIM accelerators. GaaS-X achieves 7.7x and 2.4x performance improvements and 22x and 5.7x energy savings, respectively, over two state-of-the-art crossbar accelerators, and offers orders-of-magnitude improvements over GPU and CPU solutions.
DOI: 10.1109/ISCA45697.2020.00044
MLPerf inference benchmark
作者: Reddi, Vijay Janapa and Cheng, Christine and Kanter, David and Mattson, Peter and Schmuelling, Guenther and Wu, Carole-Jean and Anderson, Brian and Breughe, Maximilien and Charlebois, Mark and Chou, William and Chukka, Ramesh and Coleman, Cody and Davis, Sam and Deng, Pan and Diamos, Greg and Duke, Jared and Fick, Dave and Gardner, J. Scott and Hubara, Itay and Idgunji, Sachin and Jablin, Thomas B. and Jiao, Jeff and John, Tom St. and Kanwar, Pankaj and Lee, David and Liao, Jeffery and Lokhmotov, Anton and Massa, Francisco and Meng, Peng and Micikevicius, Paulius and Osborne, Colin and Pekhimenko, Gennady and Rajan, Arun Tejusve Raghunath and Sequeira, Dilip and Sirasao, Ashish and Sun, Fei and Tang, Hanlin and Thomson, Michael and Wei, Frank and Wu, Ephrem and Xu, Lingjie and Yamada, Koichi and Yu, Bing and Yuan, George and Zhong, Aaron and Zhang, Peizhao and Zhou, Yuchen
关键词: machine learning, inference, benchmarking
Abstract
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark’s flexibility and adaptability.
DOI: 10.1109/ISCA45697.2020.00045
Mocktails: capturing the memory behaviour of proprietary mobile architectures
作者: Badr, Mario and Delconte, Carlo and Edo, Isak and Jagtap, Radhika and Andreozzi, Matteo and Jerger, Natalie Enright
关键词: systems-on-chip, simulation, memory systems
Abstract
Computation demands on mobile and edge devices are increasing dramatically. Mobile devices, such as smart phones, incorporate a large number of dedicated accelerators and fixed-function hardware blocks to deliver the required performance and power efficiency. Due to the heterogeneous nature of these devices, they feature vastly larger design spaces than traditional systems featuring only a CPU. Currently, academia struggles to fully evaluate such heterogeneous systems on chip due to the limited access and availability of proprietary workloads. To address these challenges, we propose Mocktails: a methodology to synthetically recreate the varying spatio-temporal memory access behaviour of proprietary heterogeneous compute devices. We focus on capturing the interspersed address streams of the workload and the burstiness of the injection process for proprietary compute devices commonly found in mobile systems. We evaluate Mocktails in simulation with proprietary memory traces of IP blocks. Mocktails accurately recreates the dynamic behaviour of memory access scheduling for memory controller metrics including read row hits (at most 7.3% error) and write row hits (at most 2.8% error). Architects can use Mocktails in their simulations as a substitute for a proprietary compute device, making the tool a useful conduit between industry and academia.
DOI: 10.1109/ISCA45697.2020.00046
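The flavor of the Mocktails methodology, recreating spatio-temporal memory behaviour from a statistical profile rather than replaying a proprietary trace, can be sketched as below. The stride histogram, burst lengths, and read/write mix are placeholder values, not Mocktails' actual model.

```python
# Illustrative sketch only: generate a synthetic memory-request trace from a
# small statistical profile (stride histogram + bursty inter-arrival times),
# in the spirit of recreating spatio-temporal behaviour without access to the
# proprietary workload. All profile values here are made-up placeholders.
import random

def synth_trace(n_requests, base_addr=0x8000_0000, line=64):
    # hypothetical per-IP-block profile: (stride_in_cache_lines, probability)
    stride_hist = [(1, 0.6), (4, 0.2), (0, 0.1), (32, 0.1)]
    addr, t, trace = base_addr, 0, []
    while len(trace) < n_requests:
        burst = random.randint(1, 8)                 # bursty injection process
        for _ in range(min(burst, n_requests - len(trace))):
            r, acc = random.random(), 0.0
            for stride, p in stride_hist:            # sample a spatial stride
                acc += p
                if r <= acc:
                    addr += stride * line
                    break
            rw = 'R' if random.random() < 0.7 else 'W'
            trace.append((t, rw, addr))
            t += 1                                   # back-to-back within a burst
        t += random.randint(10, 200)                 # idle gap between bursts
    return trace

for rec in synth_trace(8):
    print(rec)
```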
Accel-sim: an extensible simulation framework for validated GPU modeling
作者: Khairy, Mahmoud and Shen, Zhesheng and Aamodt, Tor M. and Rogers, Timothy G.
关键词: modeling and simulation, GPGPU
Abstract
In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, research in the architecture space must ensure that assumptions about contemporary processor design remain true. To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA’s native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim’s performance model to increase its level of detail, configurability and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process. We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points over a wide range of 80 workloads, consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprising 11,440 kernel instances, from the machine learning benchmark suite Deepbench. Deepbench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs. Finally, to highlight the effects of falling behind industry, this paper presents two case studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.
DOI: 10.1109/ISCA45697.2020.00047
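A minimal sketch of the kind of counter-by-counter validation the Accel-Sim abstract mentions: per-kernel cycle counts from hardware profiling and from simulation compared via mean absolute percentage error and correlation. The numbers are placeholders, and this is not Accel-Sim's actual correlation tooling.

```python
# Minimal sketch of simulator-vs-hardware correlation: compare per-kernel
# cycle counts via mean absolute percentage error and Pearson correlation.
# The arrays below are placeholder values.
import numpy as np

hw_cycles  = np.array([1.2e6, 3.4e6, 8.1e5, 2.2e6, 5.5e6])   # measured on GPU
sim_cycles = np.array([1.3e6, 3.1e6, 9.0e5, 2.4e6, 5.2e6])   # reported by simulator

mape = np.mean(np.abs(sim_cycles - hw_cycles) / hw_cycles) * 100
corr = np.corrcoef(hw_cycles, sim_cycles)[0, 1]
print(f"cycle error: {mape:.1f}%  correlation: {corr:.3f}")
```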
HyperTRIO: hyper-tenant translation of I/O addresses
作者: Lavrov, Alexey and Wentzlaff, David
关键词: virtualization, address translation, I/O subsystem
Abstract
Hardware resource sharing has proven to be an efficient way to increase resource utilization, save energy, and decrease operational cost. Modern-day servers accommodate hundreds of Virtual Machines (VMs) running concurrently, and lightweight software abstractions like containers enable the consolidation of an even larger number of independent tenants per server. The increasing number of hardware accelerators along with growing interconnection bandwidth creates a new class of devices available for sharing. To fully utilize the potential of these devices, the I/O architecture needs to be carefully designed for both processors and devices. This paper presents the design and analysis of scalable Hyper-tenant TRanslation of I/O addresses (HyperTRIO) for shared devices. HyperTRIO provides isolation and performance guarantees at low hardware cost by supporting multiple in-flight address translations, partitioning translation caches, and utilizing both inter- and intra-tenant access patterns for translation prefetching. This work also constructs a Hyper-tenant Simulator of I/O address accesses (HyperSIO) for 1000-tenant systems, which we have open-sourced. This work characterizes tenant access patterns and uses these insights to address the identified challenges. Overall, the HyperTRIO design enables the system to utilize the full available I/O bandwidth in a hyper-tenant environment.
DOI: 10.1109/ISCA45697.2020.00048
BabelFish: fusing address translations for containers
作者: Skarlatos, Dimitrios and Darbaz, Umur and Gopireddy, Bhargava and Kim, Nam Sung and Torrellas, Josep
关键词: virtual memory, page tables, containers, address translation, TLB
Abstract
Cloud computing has begun a transformation from using virtual machines to containers. Containers are attractive because many of them can share a single kernel while adding minimal performance overhead. Cloud providers leverage the lean nature of containers to run hundreds of them on a few cores. Furthermore, containers enable the serverless paradigm, which leads to the creation of short-lived processes. In this work, we identify that containerized environments create page translations that are extensively replicated across containers in the TLB and in page tables. The result is high TLB pressure and redundant kernel work during page table management. To remedy this situation, this paper proposes BabelFish, a novel architecture to share page translations across containers in the TLB and in page tables. We evaluate BabelFish with simulations of an 8-core processor running a set of Docker containers in an environment with conservative container co-location. On average, under BabelFish, 53% of the translations in containerized workloads and 93% of the translations in serverless workloads are shared. As a result, BabelFish reduces the mean and tail latency of containerized data-serving workloads by 11% and 18%, respectively. It also lowers the execution time of containerized compute workloads by 11%. Finally, it reduces serverless function bring-up time by 8% and execution time by 10%-55%.
DOI: 10.1109/ISCA45697.2020.00049
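A toy software model of the sharing idea behind BabelFish, assuming a TLB whose entries for pages common to a group of containers carry a group tag rather than a per-process tag; the field names and grouping policy are illustrative and do not reproduce BabelFish's hardware design.

```python
# Toy model (not BabelFish's hardware): a TLB whose entries for pages shared
# across a group of containers are tagged with a group identifier, so one
# cached translation serves every container in the group. Field names and the
# grouping policy are illustrative assumptions.
class SharedTLB:
    def __init__(self):
        self.entries = {}            # key -> physical frame number

    def fill(self, vpn, pfn, pid, group, shared):
        key = ('G', group, vpn) if shared else ('P', pid, vpn)
        self.entries[key] = pfn

    def lookup(self, vpn, pid, group):
        # try the private entry first, then the group-shared entry
        for key in (('P', pid, vpn), ('G', group, vpn)):
            if key in self.entries:
                return self.entries[key]
        return None                   # TLB miss -> walk the page tables

tlb = SharedTLB()
tlb.fill(vpn=0x42, pfn=0x900, pid=1, group=7, shared=True)   # e.g. a shared libc page
print(tlb.lookup(0x42, pid=1, group=7))   # hit
print(tlb.lookup(0x42, pid=2, group=7))   # hit from another container in the group
print(tlb.lookup(0x42, pid=3, group=8))   # miss: different group
```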
Enhancing and exploiting contiguity for fast memory virtualization
作者: Alverti, Chloe and Psomadakis, Stratos and Karakostas, Vasileios and Gandhi, Jayneel and Nikas, Konstantinos and Goumas, Georgios and Koziris, Nectarios
关键词: virtualization, virtual memory, memory management, address translation
Abstract
We propose synergistic software and hardware mechanisms that alleviate the address translation overhead, focusing particularly on virtualized execution. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging applies to the hypervisor and guest OS memory manager independently, as well as to native systems. Moreover, CA paging benefits any address translation scheme that leverages contiguous mappings. On the hardware side, we propose SpOT, a simple micro-architectural mechanism to hide TLB miss latency by exploiting the regularity of large contiguous mappings to predict address translations in both native and virtualized systems. We implement and emulate the proposed techniques for the x86-64 architecture in Linux and KVM, and evaluate them across a variety of memory-intensive workloads. Our results show that: (i) CA paging is highly effective at creating vast contiguous mappings, even when memory is fragmented, and (ii) SpOT exploits the created contiguity and reduces address translation overhead of nested paging from ~16.5% to ~0.9%.
DOI: 10.1109/ISCA45697.2020.00050
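A minimal sketch of contiguity-aware allocation under simplified assumptions: on each demand fault the allocator prefers the free frame that keeps the new page physically adjacent to its already-mapped virtual neighbours, so contiguous mappings grow out of ordinary demand paging. The paper's Linux/KVM implementation and its interaction with SpOT are not modeled here.

```python
# Toy sketch of contiguity-aware paging (not the paper's Linux/KVM code):
# on a demand fault for virtual page vpn, prefer the free physical frame that
# extends an existing contiguous virtual->physical run.
def ca_allocate(vpn, page_table, free_frames):
    # if a neighbouring virtual page is mapped, try the frame that keeps
    # (pfn - vpn) constant, i.e. extends the contiguous run
    for neighbour, offset in ((vpn - 1, +1), (vpn + 1, -1)):
        if neighbour in page_table:
            want = page_table[neighbour] + offset
            if want in free_frames:
                free_frames.remove(want)
                page_table[vpn] = want
                return want
    pfn = free_frames.pop(0)                     # fall back to any free frame
    page_table[vpn] = pfn
    return pfn

page_table, free_frames = {}, list(range(100, 120))
for vpn in [10, 50, 11, 12, 13]:                 # faults arrive in this order
    ca_allocate(vpn, page_table, free_frames)
print(page_table)   # 11, 12 and 13 extend a contiguous run despite the fault on 50
```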
Architecting noisy intermediate-scale trapped ion quantum computers
作者: Murali, Prakash and Debroy, Dripto M. and Brown, Kenneth R. and Martonosi, Margaret
关键词: No keywords
Abstract
Trapped ions (TI) are a leading candidate for building Noisy Intermediate-Scale Quantum (NISQ) hardware. TI qubits have fundamental advantages over other technologies such as superconducting qubits, including high qubit quality, coherence and connectivity. However, current TI systems are small in size, with 5-20 qubits and typically use a single trap architecture which has fundamental scalability limitations. To progress towards the next major milestone of 50-100 qubit TI devices, a modular architecture termed the Quantum Charge Coupled Device (QCCD) has been proposed. In a QCCD-based TI device, small traps are connected through ion shuttling. While the basic hardware components for such devices have been demonstrated, building a 50-100 qubit system is challenging because of a wide range of design possibilities for trap sizing, communication topology and gate implementations and the need to match diverse application resource requirements. Towards realizing QCCD-based TI systems with 50-100 qubits, we perform an extensive application-driven architectural study evaluating the key design choices of trap sizing, communication topology and operation implementation methods. To enable our study, we built a design toolflow which takes a QCCD architecture’s parameters as input, along with a set of applications and realistic hardware performance models. Our toolflow maps the applications onto the target device and simulates their execution to compute metrics such as application run time, reliability and device noise rates. Using six applications and several hardware design points, we show that trap sizing and communication topology choices can impact application reliability by up to three orders of magnitude. Microarchitectural gate implementation choices influence reliability by another order of magnitude. From these studies, we provide concrete recommendations to tune these choices to achieve highly reliable and performant application executions. With industry and academic efforts underway to build TI devices with 50-100 qubits, our insights have the potential to influence QC hardware in the near-future and accelerate the progress towards practical QC systems.
DOI: 10.1109/ISCA45697.2020.00051
AccQOC: accelerating quantum optimal control based pulse generation
作者: Cheng, Jinglei and Deng, Haoqing and Qian, Xuehai
关键词: quantum optimal control, quantum computing, pre-compilation
Abstract
In the last decades, we have witnessed the rapid growth of Quantum Computing. In the current Noisy Intermediate-Scale Quantum (NISQ) era, the capability of a quantum machine is limited by the decoherence time, gate fidelity and the number of qubits. Current quantum computing applications are far from the real “quantum supremacy” due to the fragile physical qubits, which can only be entangled for a few microseconds. Recent works use quantum optimal control to reduce the latency of quantum circuits, thereby effectively increasing quantum volume. However, the key challenge of this technique is the large overhead due to long compilation time. In this paper, we propose AccQOC, a comprehensive static/dynamic hybrid workflow to transform gate groups (equivalent to matrices) to pulses using QOC (Quantum Optimal Control) with a reasonable compilation time budget. AccQOC is composed of static pre-compilation and accelerated dynamic compilation. After the quantum program is mapped to the quantum circuit with our heuristic mapping algorithm considering crosstalk, we leverage static pre-compilation to generate pulses for the frequently used groups to eliminate the dynamic compilation time for them. The pulse is generated using QOC with binary search to determine the latency. For a new program, we use the same policy to generate groups, thus avoiding overhead for the “covered” groups. The dynamic compilation deals with “un-covered” groups using accelerated pulse generation. The key insight is that the pulse of a group can be generated faster based on the generated pulse of a similar group. We propose to reduce the compilation time by generating an ordered sequence of groups in which the sum of similarity among consecutive groups in the sequence is minimized. We find this sequence by constructing a similarity graph (SG), a complete graph in which each vertex is a gate group and the weight of an edge is the similarity between the two groups it connects, and then constructing a Minimum Spanning Tree (MST) of the SG. With AccQOC, we reach a balance between compilation time and overall latency. The results show that accelerated compilation based on the MST achieves a 9.88x compilation speedup compared to the standard compilation of each group while maintaining an average 2.43x latency reduction compared with gate-based compilation.
DOI: 10.1109/ISCA45697.2020.00052
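The ordering step in AccQOC can be illustrated with a small sketch: build a complete similarity graph over gate groups, take a minimum spanning tree, and compile groups in a tree order so that each pulse is seeded from its most similar, already-compiled neighbour. The distance metric and the "gate groups" below are stand-ins, and the QOC solver itself is omitted.

```python
# Hedged sketch of the MST ordering idea only (the similarity metric and the
# QOC solver are stand-ins): edge weight = dissimilarity between two groups;
# Prim's algorithm yields a tree order in which each new group is compiled
# starting from the pulse of its most similar, already-compiled parent.
import heapq
import numpy as np

def mst_compile_order(groups):
    n = len(groups)
    dist = lambda a, b: float(np.linalg.norm(groups[a] - groups[b]))
    in_tree, order, parent = {0}, [0], {0: None}
    heap = [(dist(0, v), 0, v) for v in range(1, n)]
    heapq.heapify(heap)
    while len(in_tree) < n:
        w, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue
        in_tree.add(v)
        parent[v] = u                 # seed v's pulse from u's finished pulse
        order.append(v)
        for x in range(n):
            if x not in in_tree:
                heapq.heappush(heap, (dist(v, x), v, x))
    return order, parent

rng = np.random.default_rng(0)
groups = [rng.random((4, 4)) for _ in range(6)]   # placeholder "gate group" matrices
order, parent = mst_compile_order(groups)
print(order)    # compile groups in this order
print(parent)   # parent[v] supplies the warm-start pulse for group v
```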
NISQ+: boosting quantum computing power by approximating quantum error correction
作者: Holmes, Adam and Jokar, Mohammad Reza and Pasandi, Ghasem and Ding, Yongshan and Pedram, Massoud and Chong, Frederic T.
关键词: surface code, quantum error correction, quantum computing
Abstract
Quantum computers are growing in size, and design decisions are being made now that attempt to squeeze more computation out of these machines. In this spirit, we design a method to boost the computational power of near-term quantum computers by adapting protocols used in quantum error correction to implement “Approximate Quantum Error Correction (AQEC).” By approximating fully-fledged error correction mechanisms, we can increase the compute volume (qubits x gates, or “Simple Quantum Volume (SQV)”) of near-term machines. The crux of our design is a fast hardware decoder that can approximately decode detected error syndromes rapidly. Specifically, we demonstrate a proof-of-concept that approximate error decoding can be accomplished online in near-term quantum systems by designing and implementing a novel algorithm in superconducting Single Flux Quantum (SFQ) logic technology. This avoids a critical decoding backlog, hidden in all offline decoding schemes, that leads to idle time exponential in the number of T gates in a program [58]. Our design utilizes one SFQ processing module per physical quantum bit. Employing state-of-the-art SFQ synthesis tools, we show that the circuit area, power, and latency are within the constraints of typical, contemporary quantum system designs. Under a pure dephasing error model, the proposed accelerator and AQEC solution is able to expand SQV by factors between 3,402 and 11,163 on expected near-term machines. The decoder achieves a 5% accuracy threshold as well as pseudo-thresholds of approximately 5%, 4.75%, 4.5%, and 3.5% physical error rates for code distances 3, 5, 7, and 9, respectively. Decoding solutions are achieved in a maximum of ~20 nanoseconds on the largest code distances studied. By avoiding the exponential idle time in offline decoders, we achieve a 10x reduction in required code distances to achieve the same logical performance as alternative designs.
DOI: 10.1109/ISCA45697.2020.00053
SQUARE: strategic quantum ancilla reuse for modular quantum programs via cost-effective uncomputation
作者: Ding, Yongshan and Wu, Xin-Chuan and Holmes, Adam and Wiseth, Ash and Franklin, Diana and Martonosi, Margaret and Chong, Frederic T.
关键词: reversible logic synthesis, quantum computing, compiler optimization
Abstract
Compiling high-level quantum programs to machines that are size constrained (i.e. limited number of quantum bits) and time constrained (i.e. limited number of quantum operations) is challenging. In this paper, we present SQUARE (Strategic QUantum Ancilla REuse), a compilation infrastructure that tackles allocation and reclamation of scratch qubits (called ancilla) in modular quantum programs. At its core, SQUARE strategically performs uncomputation to create opportunities for qubit reuse. Current Noisy Intermediate-Scale Quantum (NISQ) computers and forward-looking Fault-Tolerant (FT) quantum computers have fundamentally different constraints such as data locality, instruction parallelism, and communication overhead. Our heuristic-based ancilla-reuse algorithm balances these considerations and fits computations into resource-constrained NISQ or FT quantum machines, throttling parallelism when necessary. To precisely capture the workload of a program, we propose an improved metric, the “active quantum volume,” and use this metric to evaluate the effectiveness of our algorithm. Our results show that SQUARE improves the average success rate of NISQ applications by 1.47X. Surprisingly, the additional gates for uncomputation create ancilla with better locality, and result in substantially fewer swap gates and less gate noise overall. SQUARE also achieves an average reduction of 1.5X (and up to 9.6X) in active quantum volume for FT machines.
DOI: 10.1109/ISCA45697.2020.00054
Hoop: efficient hardware-assisted out-of-place update for non-volatile memory
作者: Cai, Miao and Coats, Chance C. and Huang, Jian
关键词: out-of-place update, non-volatile memory, memory persistency, logging
Abstract
Byte-addressable non-volatile memory (NVM) is a promising technology that provides near-DRAM performance with scalable memory capacity. However, it requires atomic data durability to ensure memory persistency. Therefore, many techniques, including logging and shadow paging, have been proposed. However, most of them either introduce extra write traffic to NVM or suffer from significant performance overhead on the critical path of program execution, or even both. In this paper, we propose a transparent and efficient hardware-assisted out-of-place update (HOOP) mechanism that supports atomic data durability without incurring much extra write traffic or performance overhead. The key idea is to write the updated data to a new place in NVM, while retaining the old data until the updated data becomes durable. To support this, we develop a lightweight indirection layer in the memory controller to enable efficient address translation and adaptive garbage collection for NVM. We evaluate HOOP with a variety of popular data structures and data-intensive applications, including key-value stores and databases. Our evaluation shows that HOOP achieves low critical-path latency with small write amplification, which is close to that of a native system without persistence support. Compared with state-of-the-art crash-consistency techniques, it improves application performance by up to 1.7x, while reducing the write amplification by up to 2.1x. HOOP also demonstrates scalable data recovery capability on multi-core systems.
DOI: 10.1109/ISCA45697.2020.00055
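A toy model of the out-of-place update idea, assuming a Python dictionary in place of the paper's memory-controller indirection layer: transactional writes land in fresh locations, commit atomically publishes the new mappings, and the stale copies become garbage to be reclaimed later.

```python
# Toy model (not HOOP's memory-controller design) of hardware-assisted
# out-of-place update: writes go to fresh locations, an indirection table
# redirects reads, and commit publishes the new mappings so the old data
# stays intact until the update is durable.
class OutOfPlaceNVM:
    def __init__(self, size):
        # addresses 0..size//2-1 are "home" locations; the upper half holds new versions
        self.data = [0] * size
        self.remap = {}          # committed: home address -> current location
        self.pending = {}        # uncommitted: home address -> new location
        self.free = list(range(size // 2, size))

    def read(self, addr):
        loc = self.pending.get(addr, self.remap.get(addr, addr))
        return self.data[loc]

    def write(self, addr, value):            # out-of-place: old copy untouched
        loc = self.free.pop()
        self.data[loc] = value
        self.pending[addr] = loc

    def commit(self):                        # publish new mappings atomically
        for addr, loc in self.pending.items():
            old = self.remap.get(addr, addr)
            if old != addr:
                self.free.append(old)        # stale copy becomes garbage
            self.remap[addr] = loc
        self.pending.clear()

nvm = OutOfPlaceNVM(64)
nvm.write(3, 42)
print(nvm.read(3))   # the updater sees 42 before commit
nvm.commit()         # durable; a crash before commit would have kept the old value
```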
Lelantus: fine-granularity copy-on-write operations for secure non-volatile memories
作者: Zhou, Jian and Awad, Amro and Wang, Jun
关键词: No keywords
Abstract
Bulk operations, such as Copy-on-Write (CoW), have been heavily used in most operating systems. In particular, CoW brings in significant savings in memory space and improvement in performance. CoW mainly relies on the fact that many allocated virtual pages are not written immediately (if ever written). Thus, assigning them to a shared physical page can eliminate much of the copy/initialization overheads in addition to improving the memory space efficiency. By prohibiting writes to the shared page, and merely copying the page content to a new physical page at the first write, CoW achieves significant performance and memory space advantages. Unfortunately, with the limited write bandwidth and slow writes of emerging Non-Volatile Memories (NVMs), such bulk writes can throttle the memory system. Moreover, it can add significant delays on the first write access to each page due to the need to copy or initialize a new page. Ideally, we need to enable CoW at fine-granularity, and hence only the updated cache blocks within the page need to be copied. To do this, we propose Lelantus, a novel approach that leverages secure memory metadata to allow fine-granularity CoW operations. Lelantus relies on a novel hardware-software co-design to allow tracking updated blocks of copied pages and hence delay the copy of the rest of the blocks until written. The impact of Lelantus becomes more significant when huge pages are deployed, e.g., 2MB or 1GB, as expected with emerging NVMs.
DOI: 10.1109/ISCA45697.2020.00056
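The fine-granularity CoW idea can be sketched as follows, with a plain per-block bitmap standing in for the secure-memory metadata Lelantus actually leverages: only blocks that are written get copied, and unwritten blocks keep being read from the shared source page.

```python
# Toy sketch of block-level copy-on-write (not Lelantus' hardware design):
# a per-block "written" bitmap decides whether a read is served from the new
# page or still from the shared source page; only written blocks are copied.
BLOCKS_PER_PAGE = 64          # e.g. 4 KiB page / 64 B cache blocks

class CowPage:
    def __init__(self, source):
        self.source = source                       # shared physical page
        self.blocks = [None] * BLOCKS_PER_PAGE     # copies of written blocks only
        self.written = [False] * BLOCKS_PER_PAGE

    def read(self, block_idx):
        if self.written[block_idx]:
            return self.blocks[block_idx]
        return self.source[block_idx]              # never copied, read shared data

    def write(self, block_idx, value):
        self.written[block_idx] = True             # copy/initialize just this block
        self.blocks[block_idx] = value

shared = ['S%d' % i for i in range(BLOCKS_PER_PAGE)]
page = CowPage(shared)
page.write(5, 'new')
print(page.read(5), page.read(6))   # 'new' from the copy, 'S6' still from the source
```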
MorLog: morphable hardware logging for atomic persistence in non-volatile main memory
作者: Wei, Xueliang and Feng, Dan and Tong, Wei and Liu, Jingning and Ye, Liuqing
关键词: persistence, non-volatile memory, hardware logging
Abstract
Byte-addressable non-volatile memory (NVM) is emerging as an alternative for main memory. Non-volatile main memory (NVMM) systems are required to support atomic persistence and deal with the high overhead of programming NVM cells. To this end, recent studies propose hardware logging and data encoding designs for NVMM systems. However, prior hardware logging designs incur either extra ordering constraints or redundant log data. Moreover, existing data encoding designs are unaware of the characteristics of log data, resulting in writing unnecessary log bits. In this paper, we propose a morphable hardware logging design (MorLog) that only logs the data necessary for recovery and dynamically selects the encoding methods with the least write overhead. We observe that (1) only the oldest undo and the newest redo data in each transaction are necessary for recovery, and (2) the log data for clean bits are clean. The first observation motivates our morphable logging mechanism. This mechanism logs both undo and redo data for the first update to the data in a transaction, and then logs only redo data. Undo data are eagerly written to NVMM to ensure atomicity, while redo data are buffered in a volatile log buffer and L1 caches so that only the newest redo data are written to NVMM. The second observation motivates our selective log data encoding mechanism. This mechanism simultaneously encodes log data with different methods and writes the encoded log data with the least write cost to NVMM. We devise a differential log data compression method to exploit the characteristics of log data. This method directly discards clean bits from the log data and compresses the remaining dirty bits. Our evaluation shows that MorLog improves performance by 72.5%, reduces NVMM write traffic by 41.1%, and decreases NVMM write energy by 49.9% compared with the state-of-the-art design.
DOI: 10.1109/ISCA45697.2020.00057
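The logging policy MorLog's abstract describes (oldest undo, newest redo per address) can be sketched in a few lines; the NVM layout, encoding methods, and volatile redo buffering are deliberately simplified away.

```python
# Sketch of the logging policy described above: the first update to an
# address in a transaction logs both its old value (undo) and new value
# (redo); later updates only refresh the redo entry, so the log holds exactly
# the oldest undo and the newest redo per address.
class MorphableLog:
    def __init__(self, memory):
        self.memory = memory
        self.undo = {}    # addr -> oldest pre-transaction value (written eagerly)
        self.redo = {}    # addr -> newest value (buffered, flushed at commit)

    def tx_write(self, addr, value):
        if addr not in self.undo:
            self.undo[addr] = self.memory.get(addr)   # only the first update
        self.redo[addr] = value                       # always the newest value
        self.memory[addr] = value

    def commit(self):
        self.undo.clear(); self.redo.clear()          # logs no longer needed

    def abort(self):
        self.memory.update(self.undo)                 # roll back with oldest undo
        self.undo.clear(); self.redo.clear()

mem = {0x10: 1}
log = MorphableLog(mem)
log.tx_write(0x10, 2)
log.tx_write(0x10, 3)        # undo still records 1, redo now records 3
log.abort()
print(mem[0x10])             # 1
```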
Tvarak: software-managed hardware offload for redundancy in direct-access NVM storage
作者: Kateja, Rajat and Beckmann, Nathan and Ganger, Gregory R.
关键词: redundancy, non-volatile memory, direct access
Abstract
Production storage systems complement device-level ECC (which covers media errors) with system-checksums and cross-device parity. This system-level redundancy enables systems to detect and recover from data corruption due to device firmware bugs (e.g., reading data from the wrong physical location). Direct access to NVM penalizes software-only implementations of system-level redundancy, forcing a choice between lack of data protection or significant performance penalties. We propose to offload the update and verification of system-level redundancy to Tvarak, a new hardware controller co-located with the last-level cache. Tvarak enables efficient protection of data from such bugs in memory controller and NVM DIMM firmware. Simulation-based evaluation with seven data-intensive applications shows that Tvarak is efficient. For example, Tvarak reduces Redis set-only performance by only 3%, compared to 50% reduction for a state-of-the-art software-only approach.
DOI: 10.1109/ISCA45697.2020.00058
Revisiting RowHammer: an experimental analysis of modern DRAM devices and mitigation techniques
作者: Kim, Jeremie S. and Patel, Minesh and Yağlıkçı, A. Giray and Hassan, Hasan and Azizi, Roknoddin and Orosa, Lois and Mutlu, Onur
关键词: No keywords
Abstract
RowHammer is a circuit-level DRAM vulnerability, first rigorously analyzed and introduced in 2014, where repeatedly accessing data in a DRAM row can cause bit flips in nearby rows. The RowHammer vulnerability has since garnered significant interest in both computer architecture and computer security research communities because it stems from physical circuit-level interference effects that worsen with continued DRAM density scaling. As DRAM manufacturers primarily depend on density scaling to increase DRAM capacity, future DRAM chips will likely be more vulnerable to RowHammer than those of the past. Many RowHammer mitigation mechanisms have been proposed by both industry and academia, but it is unclear whether these mechanisms will remain viable solutions for future devices, as their overheads increase with DRAM’s vulnerability to RowHammer. In order to shed more light on how RowHammer affects modern and future devices at the circuit-level, we first present an experimental characterization of RowHammer on 1580 DRAM chips (408x DDR3, 652x DDR4, and 520x LPDDR4) from 300 DRAM modules (60x DDR3, 110x DDR4, and 130x LPDDR4) with RowHammer protection mechanisms disabled, spanning multiple different technology nodes from across each of the three major DRAM manufacturers. Our studies definitively show that newer DRAM chips are more vulnerable to RowHammer: as device feature size reduces, the number of activations needed to induce a RowHammer bit flip also reduces, to as few as 9.6k (4.8k to two rows each) in the most vulnerable chip we tested. We evaluate five state-of-the-art RowHammer mitigation mechanisms using cycle-accurate simulation in the context of real data taken from our chips to study how the mitigation mechanisms scale with chip vulnerability. We find that existing mechanisms either are not scalable or suffer from prohibitively large performance overheads in projected future devices given our observed trends of RowHammer vulnerability. Thus, it is critical to research more effective solutions to RowHammer.
DOI: 10.1109/ISCA45697.2020.00059
Relaxed persist ordering using strand persistency
作者: Gogte, Vaibhav and Wang, William and Diestelhorst, Stephan and Chen, Peter M. and Narayanasamy, Satish and Wenisch, Thomas F.
关键词: strand persistency, persistent memories, memory persistency, language memory models, failure atomicity
Abstract
Emerging persistent memory (PM) technologies promise the performance of DRAM with the durability of Flash. Several language-level persistency models have emerged recently to aid programming recoverable data structures in PM. Unfortunately, these persistency models are built upon hardware primitives that impose stricter ordering constraints on PM operations than the persistency models require. Alternative solutions use fixed and inflexible hardware logging techniques to relax ordering constraints on PM operations, but do not readily apply to general synchronization primitives employed by language-level persistency models. Instead, we propose StrandWeaver, a hardware strand persistency model, to minimally constrain ordering on PM operations. StrandWeaver manages PM order within a strand, a logically independent sequence of operations within a thread. PM operations that lie on separate strands are unordered and may drain concurrently to PM. StrandWeaver implements primitives under strand persistency to allow programmers to improve concurrency and relax ordering constraints on updates as they drain to PM. Furthermore, we design mechanisms that map persistency semantics in high-level language persistency models to the primitives implemented by StrandWeaver. We demonstrate that StrandWeaver can enable greater concurrency of PM operations than existing ISA-level ordering mechanisms, improving performance by up to 1.97x (1.45x avg.).
DOI: 10.1109/ISCA45697.2020.00060
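A toy model of strand persistency semantics, not of the StrandWeaver hardware: persists on the same strand must drain in program order, while persists on different strands may interleave freely. The snippet enumerates the drain orders allowed for two small strands.

```python
# Toy model of strand persistency semantics: same-strand persists drain in
# order, different-strand persists are unordered. Enumerate the allowed drain
# interleavings for two small strands.
from itertools import permutations

def allowed_drain_orders(strands):
    ops = [(sid, i) for sid, s in enumerate(strands) for i in range(len(s))]
    orders = set()
    for perm in permutations(ops):
        # keep an interleaving only if each strand's ops appear in program order
        ok = all([o for o in perm if o[0] == sid] ==
                 [(sid, i) for i in range(len(strands[sid]))]
                 for sid in range(len(strands)))
        if ok:
            orders.add(tuple(strands[sid][i] for sid, i in perm))
    return orders

strand0 = ['persist A', 'persist B']   # same strand: A must drain before B
strand1 = ['persist C']                # separate strand: unordered w.r.t. A and B
for order in sorted(allowed_drain_orders([strand0, strand1])):
    print(order)   # 3 interleavings: C may drain before, between, or after A and B
```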
CLR-DRAM: a low-cost DRAM architecture enabling dynamic capacity-latency trade-off
作者: Luo, Haocong and Shahroodi, Taha and Hassan, Hasan and Patel, Minesh and Yağlıkçı, A. Giray and Orosa, Lois and Park, Jisung and Mutlu, Onur
关键词: No keywords
Abstract
DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-off is fixed at design time, previous works cannot achieve maximum performance under very different and dynamic workload demands. This paper proposes Capacity-Latency-Reconfigurable DRAM (CLR-DRAM), a new DRAM architecture that enables a dynamic capacity-latency trade-off at low cost. CLR-DRAM allows dynamic reconfiguration of any DRAM row to switch between two operating modes: 1) max-capacity mode, where every DRAM cell operates individually to achieve approximately the same storage density as a density-optimized commodity DRAM chip and 2) high-performance mode, where two adjacent DRAM cells in a DRAM row and their sense amplifiers are coupled to operate as a single low-latency logical cell driven by a single logical sense amplifier. We implement CLR-DRAM by adding isolation transistors in each DRAM subarray. Our evaluations show that CLR-DRAM improves system performance by 18.6% and reduces DRAM energy consumption by 29.7% on average with four-core multiprogrammed workloads. We believe that CLR-DRAM opens new research directions for a system to adapt to the diverse and dynamically changing memory capacity and access latency demands of workloads.
DOI: 10.1109/ISCA45697.2020.00061
Hardware-based domain virtualization for intra-process isolation of persistent memory objects
作者: Xu, Yuanchao and Ye, ChenCheng and Solihin, Yan and Shen, Xipeng
关键词: persistent memory objects, memory protection keys, intra-process isolation
Abstract
Persistent memory has appealing properties in serving as main memory. While file access is protected by system calls, an attached persistent memory object (PMO) is one load/store away from accidental (or malicious) reads or writes, which may arise from the use of just one buggy library. Recent progress in intra-process isolation could potentially protect PMOs by enabling a process to partition sensitive data and code into isolated components. However, existing intra-process isolation mechanisms (e.g., Intel MPK) support only up to 16 domains, forming a major barrier for PMO protection. Although recent efforts try to virtualize MPK to circumvent this limit, they suffer from large overheads. This paper presents two novel architectural mechanisms that provide 11-52x higher efficiency while offering the first known domain-based protection for PMOs.
DOI: 10.1109/ISCA45697.2020.00062
Check-in: in-storage checkpointing for key-value store system leveraging flash-based SSDs
作者: Yoon, Joohyeong and Jeong, Won Seob and Ro, Won Woo
关键词: solid state drives, key-value store, in-storage processing, flash memory, checkpointing
Abstract
A persistent key-value store supports journaling and checkpointing to maintain data consistency and to prevent data loss. However, conventional data consistency mechanisms are not suitable for the efficient management of flash memories in SSDs because they write the same data twice and induce redundant flash operations. As a result, query processing is delayed by heavy traffic during checkpointing. Checkpointing inherently involves many write operations, and a write operation is costly in time and energy on SSDs; worse, it can introduce the write amplification problem and shorten the lifetime of the flash memory. In this paper, we propose an in-storage checkpointing mechanism, named Check-In, based on the cooperation between the storage engine of a host and the flash translation layer (FTL) of an SSD. Compared to the existing mechanism, our proposed mechanism reduces the tail latency due to checkpointing by 92.1% and reduces the number of duplicate writes by 94.3%. Overall, the average throughput and latency are improved by 8.1% and 10.2%, respectively.
DOI: 10.1109/ISCA45697.2020.00063
Speculative data-oblivious execution: mobilizing safe prediction for safe and efficient speculative execution
作者: Yu, Jiyong and Mantri, Namrata and Torrellas, Josep and Morrison, Adam and Fletcher, Christopher W.
关键词: speculative execution attacks, information flow, hardware
Abstract
Speculative execution attacks are an enormous security threat. In these attacks, malicious speculative execution reads and exfiltrates potentially arbitrary program data through microarchitectural covert channels. Correspondingly, prior work has shown how to comprehensively block such attacks by delaying the execution of covert channel-creating instructions until their operands are a function of non-speculative data. This paper’s premise is that it is safe to execute these potentially dangerous instructions early, improving performance, as long as their execution does not require operand-dependent hardware resource usage, i.e., is data oblivious. While secure, this idea can easily reduce, not improve, performance. Intuitively, data obliviousness implies doing the worst case work all the time. Our key idea to get net speedup is that it is safe to predict what will be, and to subsequently perform, the work needed to satisfy the common case, as long as the prediction itself does not leak privacy. We call the complete scheme, which predicts the form of data-oblivious execution, Speculative Data-Oblivious Execution (SDO). We build SDO on top of a recent comprehensive and state-of-the-art protection called STT. Extending security arguments from STT, we show how the predictions do not reveal private information, enabling safe and efficient speculative execution. We evaluate the combined scheme, STT+SDO, on a set of SPEC17 workloads and find that it improves the performance of stand-alone STT by an average of 36.3% to 55.1%, depending on the microarchitecture and attack model, and without changing STT’s security guarantees.
DOI: 10.1109/ISCA45697.2020.00064
Packet chasing: spying on network packets over a cache side-channel
作者: Taram, Mohammadkazem and Venkat, Ashish and Tullsen, Dean
关键词: side-channel attacks, packet processing, microarchitecture security, cache attacks, DDIO
Abstract
This paper presents Packet Chasing, an attack on the network that does not require access to the network, and works regardless of the privilege level of the process receiving the packets. A spy process can easily probe and discover the exact cache location of each buffer used by the network driver. Even more useful, it can discover the exact sequence in which those buffers are used to receive packets. This then enables packet frequency and packet sizes to be monitored through cache side channels. This allows both covert channels between a sender and a remote spy with no access to the network, as well as direct attacks that can identify, among other things, the web page access patterns of a victim on the network. In addition to identifying the potential attack, this work proposes a software-based short-term mitigation as well as a light-weight, adaptive, cache partitioning mitigation that blocks the interference of I/O and CPU requests in the last-level cache.
DOI: 10.1109/ISCA45697.2020.00065
Compact leakage-free support for integrity and reliability
作者: Taassori, Meysam and Balasubramonian, Rajeev and Chhabra, Siddhartha and Alameldeen, Alaa R. and Peddireddy, Manjula and Agarwal, Rajat and Stutsman, Ryan
关键词: memory systems, integrity verification
Abstract
The memory system is vulnerable to a number of security breaches, e.g., an attacker can interfere with program execution by disrupting values stored in memory. Modern Intel® Software Guard Extensions (SGX) systems already support integrity trees to detect such malicious behavior. However, in spite of recent innovations, the bandwidth overhead of integrity+replay protection is non-trivial; state-of-the-art solutions like Synergy introduce average slowdowns of 2.3x for memory-intensive benchmarks. Prior work also implements a tree that is shared by multiple applications, thus introducing a potential side channel. In this work, we build on the Synergy and SGX baselines, and introduce three new techniques. First, we isolate each application by implementing a separate integrity tree and metadata cache for each application; this improves metadata cache efficiency, improving performance by 39%, while eliminating the potential side channel. Second, we reduce the footprint of the metadata. Synergy uses a combination of integrity and error-correction metadata to provide low-overhead support for both. We share error-correction metadata across multiple blocks, thus lowering its footprint (by 16x) while forgoing error correction only in rare corner cases. However, we discover that shared error-correction metadata, even with caching, does not improve performance. Third, we observe that thanks to its lower footprint, the error-correction metadata can be embedded into the integrity tree. This reduces the metadata blocks that must be accessed to support both integrity verification and chipkill reliability. The proposed Isolated Tree with Embedded Shared Parity (ITESP) yields an overall performance improvement of 64%, relative to baseline Synergy.
DOI: 10.1109/ISCA45697.2020.00066
A bus authentication and anti-probing architecture extending hardware trusted computing base off CPU chips and beyond
作者: Xu, Zhenyu and Mauldin, Thomas and Yao, Zheyi and Pei, Shuyi and Wei, Tao and Yang, Qing
关键词: tampering, secure computer architecture, probing, physical attacks, authentication, PUF
Abstract
Tamper-proof hardware designs present a great challenge to computer architects. Most existing research limits the hardware trusted computing base (TCB) to a CPU chip, and anything off the CPU chip is vulnerable to probing and tampering. This paper introduces a new hardware design that provides strong defenses against physical attacks on the interconnecting buses between chips in a computer system, thereby extending the hardware TCB beyond CPU chips. The new approach is referred to as DIVOT: Detecting Impedance Variations Of Transmission-lines (Tx-lines). Every Tx-line in a computer system, such as a bus or interconnection wire, has a unique, intrinsic, and fingerprint-like property: its Impedance Inhomogeneity Pattern (IIP), i.e., the impedance distribution over distance. Such unpredictable, uncontrollable, and non-reproducible IIP fingerprints can be used to authenticate a Tx-line to ensure the confidentiality and integrity of data being transmitted. In addition, physical probes perturb the electromagnetic (EM) field around a Tx-line, leading to an altered IIP. As a result, runtime monitoring of IIPs can also be used to actively detect physical probing, snooping, and wire-tapping on buses. While the physics behind the IIP is known, the major technical breakthrough of DIVOT is the new integrated time domain reflectometer, iTDR, that is capable of carrying out in-situ and runtime monitoring of a Tx-line without interfering with normal data transfers. The iTDR is based on two innovations: analog-to-probability conversion (APC) and probability density modulation (PDM). The iTDR performs runtime IIP measurements non-invasively and is CMOS-compatible, allowing it to be integrated with any interface logic connected to a bus. DIVOT is a generic, scalable, cost-effective, and low-overhead security solution for any computer system, from servers to embedded computers in smart mobile devices and IoT devices. To demonstrate the proposed architecture, a working prototype of DIVOT has been built on an FPGA as a proof of concept. Experimental results clearly show the feasibility and performance of DIVOT for both hardware authentication and tamper-proofing applications. More specifically, the probability of correctly identifying a bus is close to 1, with an equal error rate (EER) of less than 0.06% at room temperature. We present an example design that incorporates DIVOT into an off-chip memory bus to protect against physical attacks including probing/snooping, tampering, and cold boot attacks.
DOI: 10.1109/ISCA45697.2020.00067
CHEx86: context-sensitive enforcement of memory safety via microcode-enabled capabilities
作者: Sharifi, Rasool and Venkat, Ashish
关键词: microcode, memory safety, capabilities
Abstract
This work introduces the CHEx86 processor architecture for securing applications, including legacy binaries, against a wide array of security exploits that target temporal and spatial memory safety vulnerabilities such as out-of-bounds accesses, use-after-free, double-free, and uninitialized reads, by instrumenting the code at the microcode level, completely under the hood, with only limited access to source-level symbol information. In addition, this work presents a novel scheme for speculatively tracking pointer arithmetic and pointer movement, including the detection of pointer aliases in memory, at the machine-code level using a configurable set of automatically constructed rules. This architecture outperforms AddressSanitizer, a state-of-the-art software-based mitigation, by 59%, while eliminating the porting, deployment, and verification costs that are invariably associated with recompilation.
DOI: 10.1109/ISCA45697.2020.00068
Nested enclave: supporting fine-grained hierarchical isolation with SGX
作者: Park, Joongun and Kang, Naegyeong and Kim, Taehoon and Kwon, Youngjin and Huh, Jaehyuk
关键词: trusted execution environments, trusted computing, multi-level security, SGX enclave
Abstract
Although hardware-based trusted execution environments (TEEs) have evolved to provide strong isolation with efficient hardware support, their current monolithic model poses challenges in representing common software structures with modules produced by potentially untrusted 3rd parties. For better mapping of such modular software designs to trusted execution environments, it is necessary to extend the current monolithic model to a hierarchical one, which provides multiple inner TEEs within a TEE. For such hierarchical compartmentalization within a TEE, this paper proposes a novel hierarchical TEE called nested enclave, which extends the enclave support of Intel SGX. Inspired by the multi-level security model, nested enclave provides multiple inner enclaves sharing the same outer enclave. Inner enclaves can access the context of the outer enclave, but they are protected from the outer enclave and non-enclave execution. Peer inner enclaves are isolated from each other while accessing the execution environment of the shared outer enclave. Both the inner and outer enclaves are protected from vulnerable privileged software and physical attacks. Such fine-grained nested enclaves allow secure multi-tiered environments using software modules from untrusted 3rd parties: security-sensitive modules run in the inner enclave with the higher security level, while the 3rd-party modules run in the outer enclave. It can be further extended to provide a separate inner module for each user to process privacy-sensitive data while sharing the same library, with efficient hardware-protected communication channels. This study investigates three case scenarios implemented with an emulated nested enclave support, proving the feasibility and security improvement of the nested enclave model.
DOI: 10.1109/ISCA45697.2020.00069
RecNMP: accelerating personalized recommendation with near-memory processing
作者: Ke, Liu and Gupta, Udit and Cho, Benjamin Youngjae and Brooks, David and Chandra, Vikas and Diril, Utku and Firoozshahian, Amin and Hazelwood, Kim and Jia, Bill and Lee, Hsien-Hsin S. and Li, Meng and Maher, Bert and Mudigere, Dheevatsa and Naumov, Maxim and Schatz, Martin and Smelyanskiy, Mikhail and Wang, Xiaodong and Reagen, Brandon and Wu, Carole-Jean and Hempstead, Mark and Zhang, Xuan
关键词: No keywords
Abstract
Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, providing up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
DOI: 10.1109/ISCA45697.2020.00070
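The memory-bound operator that dominates these recommendation models is an embedding gather-and-reduce (SparseLengthsSum style). The numpy sketch below shows the access pattern with placeholder table sizes; RecNMP's contribution is to perform this gather-reduce near the DRAM ranks rather than on the host.

```python
# Sketch of the embedding gather-reduce (SparseLengthsSum style) that
# dominates recommendation inference: for each lookup, fetch a handful of
# rows scattered across a large table and sum them. Sizes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
table = rng.standard_normal((100_000, 64), dtype=np.float32)   # embedding table

def sparse_lengths_sum(table, indices, lengths):
    out, pos = [], 0
    for L in lengths:                       # one pooled vector per lookup
        rows = indices[pos:pos + L]         # irregular, random-looking gather
        out.append(table[rows].sum(axis=0))
        pos += L
    return np.stack(out)

indices = rng.integers(0, table.shape[0], size=80)
lengths = [10] * 8                          # 8 lookups, pooling 10 rows each
print(sparse_lengths_sum(table, indices, lengths).shape)   # (8, 64)
```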
iPIM: programmable in-memory image processing accelerator using near-bank architecture
作者: Gu, Peng and Xie, Xinfeng and Ding, Yufei and Chen, Guoyang and Zhang, Weifeng and Niu, Dimin and Xie, Yuan
关键词: process-in-memory, image processing, accelerator
Abstract
Image processing is becoming an increasingly important domain for many applications on workstations and in data centers that require accelerators for high performance and energy efficiency. The GPU, the state-of-the-art accelerator for image processing, suffers from a memory bandwidth bottleneck. To tackle this bottleneck, near-bank architecture provides a promising solution due to its enormous bank-internal bandwidth and low-energy memory access. However, previous work lacks hardware programmability, while image processing workloads contain numerous heterogeneous pipeline stages with diverse computation and memory access patterns. Enabling a programmable near-bank architecture with low hardware overhead remains challenging. This work proposes iPIM, the first programmable in-memory image processing accelerator using near-bank architecture. We first design a decoupled control-execution architecture to provide lightweight programmability support. Second, we propose the SIMB (Single-Instruction-Multiple-Bank) ISA to enable flexible control flow and data access. Third, we present an end-to-end compilation flow based on Halide that supports a wide range of image processing applications and maps them to our SIMB ISA. We further develop iPIM-aware compiler optimizations, including register allocation, instruction reordering, and memory order enforcement, to improve performance. We evaluate a set of representative image processing applications on iPIM and demonstrate that on average iPIM obtains 11.02x acceleration and 79.49% energy saving over an NVIDIA Tesla V100 GPU. Further analysis shows that our compiler optimizations contribute a 3.19x speedup over the unoptimized baseline.
DOI: 10.1109/ISCA45697.2020.00071
Near data acceleration with concurrent host access
作者: Cho, Benjamin Y. and Kwon, Yongkee and Lym, Sangkug and Erez, Mattan
关键词: No keywords
Abstract
Near-data accelerators (NDAs) that are integrated with the main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and NDAs in a way that permits both regular memory access by some applications and accelerating others with an NDA, avoids copying data, enables collaborative processing, and simultaneously offers high performance for both host and NDA. We identify and solve new challenges in this context: mitigating row-locality interference from host to NDAs, reducing read/write-turnaround overhead caused by fine-grain interleaving of host and NDA requests, architecting a memory layout that supports the locality required for NDAs and sophisticated address interleaving for host performance, and supporting both packetized and traditional memory interfaces. We demonstrate our approach in a simulated system that consists of a multi-core CPU and NDA-enabled DDR4 memory modules. We show that our mechanisms enable effective and efficient concurrent access using a set of microbenchmarks, then demonstrate the potential of the system for the important stochastic variance-reduced gradient (SVRG) algorithm.
DOI: 10.1109/ISCA45697.2020.00072
TIMELY: pushing data movements and interfaces in PIM accelerators towards local and in time domain
作者: Li, Weitao and Xu, Pengfei and Zhao, Yang and Li, Haitong and Xie, Yuan and Lin, Yingyan
关键词: resistive-random-access-memory (ReRAM), processing in memory, neural networks, analog processing
Abstract
Resistive-random-access-memory (ReRAM) based processing-in-memory (R2PIM) accelerators show promise in bridging the gap between Internet-of-Things devices’ constrained resources and Convolutional/Deep Neural Networks’ (CNNs/DNNs’) prohibitive energy cost. Specifically, R2PIM accelerators enhance energy efficiency by eliminating the cost of weight movements and improving the computational density through ReRAM’s high density. However, the energy efficiency is still limited by the dominant energy cost of input and partial sum (Psum) movements and the cost of digital-to-analog (D/A) and analog-to-digital (A/D) interfaces. In this work, we identify three energy-saving opportunities in R2PIM accelerators: analog data locality, time-domain interfacing, and input access reduction, and propose an innovative R2PIM accelerator called TIMELY, with three key contributions: (1) TIMELY adopts analog local buffers (ALBs) within ReRAM crossbars to greatly enhance the data locality, minimizing the energy overheads of both input and Psum movements; (2) TIMELY largely reduces the energy of each single D/A (and A/D) conversion and the total number of conversions by using time-domain interfaces (TDIs) and the employed ALBs, respectively; (3) we develop an only-once input read (O2IR) mapping method to further decrease the energy of input accesses and the number of D/A conversions. The evaluation with more than 10 CNN/DNN models and various chip configurations shows that TIMELY outperforms the baseline R2PIM accelerator, PRIME, by one order of magnitude in energy efficiency while maintaining better computational density (up to 31.2x) and throughput (up to 736.6x). Furthermore, comprehensive studies are performed to evaluate the effectiveness of the proposed ALB, TDI, and O2IR in terms of energy savings and area reduction.
DOI: 10.1109/ISCA45697.2020.00073
Hyper-AP: enhancing associative processing through a full-stack optimization
Authors: Zha, Yue and Li, Jing
Keywords: No keywords
Abstract
Associative processing (AP) is a promising PIM paradigm that overcomes the von Neumann bottleneck (memory wall) by virtue of a radically different execution model. By decomposing arbitrary computations into a sequence of primitive memory operations (i.e., search and write), AP’s execution model supports concurrent SIMD computations in-situ in the memory array to eliminate the need for data movement. This execution model also provides native support for flexible data types and only requires a minimal modification to the existing memory design (low hardware complexity). Despite these advantages, the execution model of AP has two limitations that substantially increase the execution time, i.e., 1) it can only search a single pattern in one search operation and 2) it needs to perform a write operation after each search operation. In this paper, we propose the Highly Performant Associative Processor (Hyper-AP) to fully address the aforementioned limitations. The core of Hyper-AP is an enhanced execution model that reduces the number of search and write operations needed for computations, thereby reducing the execution time. This execution model is generic and improves the performance for both CMOS-based and RRAM-based AP, but it is more beneficial for the RRAM-based AP due to the substantially reduced write operations. We then provide complete architecture and micro-architecture with several optimizations to efficiently implement Hyper-AP. In order to reduce the programming complexity, we also develop a compilation framework so that users can write C-like programs with several constraints to run applications on Hyper-AP. Several optimizations have been applied in the compilation process to exploit the unique properties of Hyper-AP. Our experimental results show that, compared with the recent work IMP, Hyper-AP achieves up to 54x/4.4x better power-/area-efficiency for various representative arithmetic operations. For the evaluated benchmarks, Hyper-AP achieves 3.3x speedup and 23.8x energy reduction on average compared with IMP. Our evaluation also confirms that the proposed execution model is more beneficial for the RRAM-based AP than its CMOS-based counterpart.
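The baseline AP execution model referred to above reduces every computation to parallel search and masked-write passes over memory rows. The toy Python model below illustrates that model (and why the pass count matters) with a bitwise AND composed of one search/write pass per truth-table case; the field layout and the example are illustrative and are not Hyper-AP's enhanced model.

```python
# Toy associative-processor model: every operation is a content search over
# all rows for one pattern, followed by a masked write to the matching rows
# (the two primitives whose counts Hyper-AP's execution model reduces).
# Illustrative only; the field layout and the AND example are our own.

def ap_search(memory, pattern):
    """Return tag bits: rows whose named fields all match `pattern`."""
    return [all(row[f] == v for f, v in pattern.items()) for row in memory]

def ap_write(memory, tags, value):
    """Write `value` fields into every tagged row (in parallel in hardware)."""
    for row, tag in zip(memory, tags):
        if tag:
            row.update(value)

def ap_bitwise_and(memory):
    """c = a AND b, one truth-table entry per search/write pass."""
    ap_write(memory, ap_search(memory, {"a": 1, "b": 1}), {"c": 1})
    ap_write(memory, ap_search(memory, {"a": 0}), {"c": 0})
    ap_write(memory, ap_search(memory, {"b": 0}), {"c": 0})

if __name__ == "__main__":
    mem = [{"a": a, "b": b, "c": 0} for a in (0, 1) for b in (0, 1)]
    ap_bitwise_and(mem)
    print([(r["a"], r["b"], r["c"]) for r in mem])
    # [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
```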
DOI: 10.1109/ISCA45697.2020.00074
JPEG-ACT: accelerating deep learning via transform-based lossy compression
Authors: Evans, R. David and Liu, Lufei and Aamodt, Tor M.
Keywords: hardware acceleration, compression, GPU, CNN training
Abstract
A reduction in the time it takes to train machine learning models can be translated into improvements in accuracy. An important factor that increases training time in deep neural networks (DNNs) is the need to store large amounts of temporary data during the back-propagation algorithm. To enable training very large models, this temporary data can be offloaded from limited-size GPU memory to CPU memory, but this data movement incurs large performance overheads. We observe that in one important class of DNNs, convolutional neural networks (CNNs), there is spatial correlation in these temporary values. We propose JPEG for ACTivations (JPEG-ACT), a lossy activation offload accelerator for training CNNs that works by discarding redundant spatial information. JPEG-ACT adapts the well-known JPEG algorithm from 2D image compression to activation compression. We show how to optimize the JPEG algorithm so as to ensure convergence and maintain accuracy during training. JPEG-ACT achieves 2.4x higher training performance compared to prior offload accelerators, and 1.6x compared to prior activation compression methods. An efficient hardware implementation allows JPEG-ACT to consume less than 1% of the power and area of a modern GPU.
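For reference, the JPEG building blocks the accelerator adapts are a blockwise 2D DCT followed by quantization. The NumPy sketch below applies those two steps (and their inverse) to a single 8x8 activation tile; the flat quantization step is a placeholder rather than the convergence-tuned tables or hardware pipeline described in the paper.

```python
# Transform-based lossy compression of an 8x8 activation tile, JPEG-style:
# forward DCT, uniform quantization, dequantization, inverse DCT.  The
# quantization step below is a flat placeholder, not the tuned tables the
# paper derives to preserve training convergence.
import numpy as np

N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] /= np.sqrt(2.0)          # orthonormal DCT-II basis (C @ C.T == I)

def compress_tile(tile, q_step=4.0):
    coeffs = C @ tile @ C.T       # 2D DCT: concentrate energy in low frequencies
    return np.round(coeffs / q_step).astype(np.int16)    # lossy quantization

def decompress_tile(q_coeffs, q_step=4.0):
    return C.T @ (q_coeffs * q_step) @ C                  # dequantize + inverse DCT

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Spatially smooth fake activation tile (the correlation JPEG-ACT exploits).
    act = np.cumsum(np.cumsum(rng.standard_normal((N, N)), axis=0), axis=1)
    rec = decompress_tile(compress_tile(act))
    print("max abs reconstruction error:", float(np.max(np.abs(act - rec))))
```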
DOI: 10.1109/ISCA45697.2020.00075
TransForm: formally specifying transistency models and synthesizing enhanced litmus tests
Authors: Hossain, Naorin and Trippel, Caroline and Martonosi, Margaret
Keywords: synthesis, memory transistency, memory consistency, enhanced litmus tests, axiomatic modeling
Abstract
Memory consistency models (MCMs) specify the legal ordering and visibility of shared memory accesses in a parallel program. Traditionally, instruction set architecture (ISA) MCMs assume that relevant program-visible memory ordering behaviors only result from shared memory interactions that take place between user-level program instructions. This assumption fails to account for virtual memory (VM) implementations that may result in additional shared memory interactions between user-level program instructions and both 1) system-level operations (e.g., address remappings and translation lookaside buffer invalidations initiated by system calls) and 2) hardware-level operations (e.g., hardware page table walks and dirty bit updates) during a user-level program’s execution. These additional shared memory interactions can impact the observable memory ordering behaviors of user-level programs. Thus, memory transistency models (MTMs) have been coined as a superset of MCMs to additionally articulate VM-aware consistency rules. However, no prior work has enabled formal MTM specifications, nor methods to support their automated analysis. To fill the above gap, this paper presents the TransForm framework. First, TransForm features an axiomatic vocabulary for formally specifying MTMs. Second, TransForm includes a synthesis engine to support the automated generation of litmus tests enhanced with MTM features (i.e., enhanced litmus tests, or ELTs) when supplied with a TransForm MTM specification. As a case study, we formally define an estimated MTM for Intel x86 processors, called x86t_elt, that is based on observations made by an ELT-based evaluation of an Intel x86 MTM implementation from prior work and available public documentation [23, 29]. Given x86t_elt and a synthesis bound (on program size) as input, TransForm’s synthesis engine successfully produces a complete set of ELTs (within a 9-instruction bound) including relevant hand-curated ELTs from prior work, plus 100 more.
DOI: 10.1109/ISCA45697.2020.00076
HieraGen: automated generation of concurrent, hierarchical cache coherence protocols
Authors: Oswald, Nicolai and Nagarajan, Vijay and Sorin, Daniel J.
Keywords: No keywords
Abstract
We present HieraGen, a new tool for automatically generating hierarchical cache coherence protocols. HieraGen’s inputs are the simple, atomic, stable state protocols for each level of the hierarchy. HieraGen’s output is a highly concurrent hierarchical protocol, in the form of the finite state machines for all of the cache and directory controllers. HieraGen thus reduces the complexity that architects face, by offloading the challenging tasks of composing protocols and managing concurrency. Experiments show that HieraGen can automatically generate correct-by-construction hierarchical protocols in the MOESI family with dozens of states and hundreds of transitions. We have verified all of the generated protocols for safety and deadlock freedom using a model checker.
DOI: 10.1109/ISCA45697.2020.00077
Tailored page sizes
Authors: Guvenilir, Faruk and Patt, Yale N.
Keywords: No keywords
Abstract
Main memory capacity continues to soar, resulting in TLB misses becoming an increasingly significant performance bottleneck. Current coarse-grained page sizes, the solution from Intel, ARM, and others, have not helped enough. We propose Tailored Page Sizes (TPS), a mechanism that allows pages of size 2^n for all n greater than a default minimum. For x86, the default minimum page size is 2^12 (4KB). TPS uses one page table entry (PTE) for each large contiguous virtual memory space mapped to an equivalent-sized large contiguous physical frame. To make this work in a clean, seamless way, we suggest small changes to the ISA, the microarchitecture, and the O/S allocation operation. The result: TPS can eliminate approximately 98% of page walk memory accesses and 97% of all L1 TLB misses across a variety of SPEC17 and big data memory intensive benchmarks.
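A minimal sketch of the idea, under the usual assumption that a 2^n page must be naturally aligned in both address spaces: a region can be covered by one tailored PTE only if its length is a power of two of at least 4KB and both the virtual and physical bases are aligned to that length. The helper below is illustrative, not the paper's ISA or OS mechanism.

```python
# One tailored PTE can map a contiguous region only if the region length is
# a power of two (>= the 4 KiB default minimum) and both the virtual and
# physical bases are aligned to that length; the page size then equals the
# region length.  Function and constant names are illustrative.
MIN_PAGE = 1 << 12  # 4 KiB default minimum page size

def tps_page_size(vbase, pbase, length):
    """Return the tailored page size (== length) if one PTE suffices, else None."""
    is_pow2 = length >= MIN_PAGE and (length & (length - 1)) == 0
    aligned = vbase % length == 0 and pbase % length == 0
    return length if is_pow2 and aligned else None

if __name__ == "__main__":
    print(tps_page_size(0x40000000, 0x80000000, 1 << 30))  # 1 GiB page: 1073741824
    print(tps_page_size(0x40000000, 0x80200000, 1 << 30))  # None: physical base misaligned
```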
DOI: 10.1109/ISCA45697.2020.00078
Perforated page: supporting fragmented memory allocation for large pages
Authors: Park, Chang Hyun and Cha, Sanghoon and Kim, Bokyeong and Kwon, Youngjin and Black-Schaffer, David and Huh, Jaehyuk
Keywords: virtualization, virtual memory, translation lookaside buffer, memory management, address translation
Abstract
The availability of large pages has dramatically improved the efficiency of address translation for applications that use large contiguous regions of memory. However, large pages can be difficult to allocate due to fragmented memory, non-movable pages, or the need to split a large page into regular pages when part of the large page is forced to have a different permission status from the rest of the page. Furthermore, they can also be expensive due to memory bloating caused by sparse accesses to application data. In this work, we enable the allocation of large 2MB pages even in the presence of fragmented physical memory via perforated pages. Perforated pages permit the OS to punch 4KB page-sized holes in the physical address range allocated to a large page and re-map them to other addresses as needed. This not only enables the system to benefit from large pages in the presence of fragmentation, but also allows for different permissions to exist within a large page, enhancing sharing flexibility. In addition, it allows unused parts of a large page to be used elsewhere, mitigating memory bloating. To minimize changes to the system, perforated pages reuse the 4KB-level page table entries to store the hole locations and translate holes into regular 4KB pages. For performance, the proposed technique caches the translations for hole pages in the TLBs and tracks holes via cached bitmaps in the L2 TLB. By enabling large pages in the presence of physical memory fragmentation, perforated pages increase the applicability and resulting benefits of large pages with only minor changes to the hardware and OS. In this work, we evaluate the effectiveness of perforated pages with timing simulations under diverse and realistic fragmentation scenarios. Our results show that even with fragmented memory, perforated pages accomplish 93.2% to 99.9% of the performance achievable by ideal memory allocation, and 2.0% to 11.5% better performance over the conventional system running with fragmented memory.
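A small sketch of the translation path the abstract implies: a 2MB mapping carries a bitmap of punched-out 4KB slots; an access whose slot is a hole is remapped through a 4KB-level entry, while every other offset uses the large page's base frame. The class and field names are ours, and the TLB caching details are omitted.

```python
# Address translation with a perforated 2 MB page: a per-large-page bitmap
# marks 4 KB holes; holed offsets are remapped through 4 KB entries, all
# other offsets use the large page's base frame.  Field names are ours.
PAGE_4K, PAGE_2M = 1 << 12, 1 << 21
SLOTS = PAGE_2M // PAGE_4K    # 512 4 KB slots per 2 MB page

class PerforatedPage:
    def __init__(self, base_2m):
        self.base_2m = base_2m          # physical base of the 2 MB frame
        self.hole_bitmap = 0            # bit i set => slot i is a hole
        self.hole_map = {}              # slot index -> 4 KB physical frame

    def punch_hole(self, slot, frame_4k):
        self.hole_bitmap |= 1 << slot
        self.hole_map[slot] = frame_4k

    def translate(self, offset):
        slot = offset // PAGE_4K
        if self.hole_bitmap & (1 << slot):                   # bitmap hit: a hole
            return self.hole_map[slot] + offset % PAGE_4K    # 4 KB-level entry
        return self.base_2m + offset                         # normal large-page path

if __name__ == "__main__":
    pp = PerforatedPage(base_2m=0x40000000)
    pp.punch_hole(slot=3, frame_4k=0x7fff0000)   # slot 3's memory is used elsewhere
    print(hex(pp.translate(0x1000)))     # 0x40001000 (large-page path)
    print(hex(pp.translate(0x3004)))     # 0x7fff0004 (remapped hole)
```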
DOI: 10.1109/ISCA45697.2020.00079
Buddy compression: enabling larger memory for deep learning and HPC workloads on GPUs
Authors: Choukse, Esha and Sullivan, Michael B. and O’Connor, Mike and Erez, Mattan and Pool, Jeff and Nellans, David and Keckler, Stephen W.
Keywords: No keywords
Abstract
GPUs accelerate high-throughput applications, which require orders-of-magnitude higher memory bandwidth than traditional CPU-only systems. However, the capacity of such high-bandwidth memory tends to be relatively small. Buddy Compression is an architecture that makes novel use of compression to utilize a larger buddy-memory from the host or disaggregated memory, effectively increasing the memory capacity of the GPU. Buddy Compression splits each compressed 128B memory-entry between the high-bandwidth GPU memory and a slower-but-larger buddy memory such that compressible memory-entries are accessed completely from GPU memory, while incompressible entries source some of their data from off-GPU memory. With Buddy Compression, compressibility changes never result in expensive page movement or re-allocation. Buddy Compression achieves on average 1.9x effective GPU memory expansion for representative HPC applications and 1.5x for deep learning training, performing within 2% of an unrealistic system with no memory limit. This makes Buddy Compression attractive for performance-conscious developers that require additional GPU memory capacity.
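A sketch of the split placement described above, with assumed parameters: each 128B memory entry is given a fixed slice of GPU memory (32B here, an arbitrary choice), entries that compress below the slice are served entirely from GPU memory, and the remainder of poorly compressing entries spills to the buddy memory. zlib stands in for the hardware compressor purely for illustration.

```python
# Buddy-style split of a 128 B memory entry between fast GPU memory and a
# larger, slower buddy memory.  The 32 B on-GPU slice and the use of zlib
# are illustrative placeholders, not the hardware compressor or the ratios
# evaluated in the paper.
import zlib

ENTRY = 128      # bytes per memory entry
GPU_SLICE = 32   # bytes reserved per entry in GPU memory (assumed target)

def store_entry(data: bytes):
    assert len(data) == ENTRY
    comp = zlib.compress(data)
    if len(comp) <= GPU_SLICE:
        return comp, b""                        # fully GPU-resident access
    return comp[:GPU_SLICE], comp[GPU_SLICE:]   # remainder spills to buddy memory

def load_entry(gpu_part: bytes, buddy_part: bytes) -> bytes:
    return zlib.decompress(gpu_part + buddy_part)

if __name__ == "__main__":
    compressible = bytes(128)          # all zeros: compresses well
    incompressible = bytes(range(128)) # compresses poorly
    for entry in (compressible, incompressible):
        g, b = store_entry(entry)
        print("gpu bytes:", len(g), "buddy bytes:", len(b),
              "roundtrip ok:", load_entry(g, b) == entry)
```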
DOI: 10.1109/ISCA45697.2020.00080
A multi-neural network acceleration architecture
Authors: Baek, Eunjin and Kwon, Dongup and Kim, Jangwoo
Keywords: No keywords
Abstract
Cost-effective multi-tenant neural network execution is becoming one of the most important design goals for modern neural network accelerators. For example, as emerging AI services consist of many heterogeneous neural network executions, a cloud provider wants to serve a large number of clients using a single AI accelerator for improving its cost effectiveness. Therefore, an ideal next-generation neural network accelerator should support simultaneous multi-neural network execution, while fully utilizing its hardware resources. However, existing accelerators, which are optimized for a single neural network execution, can suffer from severe resource underutilization when running multiple neural networks, mainly due to the load imbalance between computation and memory-access tasks from different neural networks. In this paper, we propose AI-MultiTasking (AI-MT), a novel accelerator architecture which enables a cost-effective, high-performance multi-neural network execution. The key idea of AI-MT is to fully utilize the accelerator’s computation resources and memory bandwidth by matching compute- and memory-intensive tasks from different networks and executing them in parallel. However, it is highly challenging to find and schedule the best load-matching tasks from different neural networks during runtime, without significantly increasing the size of on-chip memory. To overcome the challenges, AI-MT first creates fine-grain tasks at compile time by dividing each layer into multiple identical sub-layers. During runtime, AI-MT dynamically applies three sub-layer scheduling methods: memory block prefetching and compute block merging for the best resource load matching, and memory block eviction for the minimum on-chip memory footprint. Our evaluations using MLPerf benchmarks show that AI-MT achieves up to 1.57x speedup over the baseline scheduling method.
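A toy illustration of the load-matching intuition: pair the most memory-bound pending sub-layer with the most compute-bound one from a different network, so that compute units and memory bandwidth are occupied at the same time. The task list, cost numbers, and greedy pairing rule are ours, not AI-MT's runtime scheduler.

```python
# Toy load-matched co-scheduling: at each step, issue the most memory-bound
# pending sub-layer together with the most compute-bound one from another
# network, so compute units and memory bandwidth stay busy simultaneously.
# Costs and the greedy pairing rule are illustrative.

def co_schedule(tasks):
    """tasks: list of (network_id, name, compute_cost, memory_cost)."""
    # Rank by memory-boundedness = memory_cost / compute_cost (descending).
    pending = sorted(tasks, key=lambda t: t[3] / t[2], reverse=True)
    schedule = []
    while pending:
        mem_task = pending.pop(0)                        # most memory-bound task
        partner = next((t for t in reversed(pending)     # most compute-bound task
                        if t[0] != mem_task[0]), None)    # from a different network
        if partner:
            pending.remove(partner)
        schedule.append((mem_task[1], partner[1] if partner else None))
    return schedule

if __name__ == "__main__":
    tasks = [("A", "A.fc1", 1, 8), ("A", "A.conv1", 8, 1),
             ("B", "B.fc1", 1, 9), ("B", "B.conv1", 9, 1)]
    for pair in co_schedule(tasks):
        print(pair)   # ('B.fc1', 'A.conv1') then ('A.fc1', 'B.conv1')
```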
DOI: 10.1109/ISCA45697.2020.00081
SmartExchange: trading higher-cost memory storage/access for lower-cost computation
Authors: Zhao, Yang and Chen, Xiaohan and Wang, Yue and Li, Chaojian and You, Haoran and Fu, Yonggan and Xie, Yuan and Wang, Zhangyang and Lin, Yingyan
Keywords: weight decomposition, quantization, pruning, neural network inference accelerator, neural network compression
Abstract
We present SmartExchange, an algorithm-hardware co-design framework to trade higher-cost memory storage/access for lower-cost computation, for energy-efficient inference of deep neural networks (DNNs). We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all power-of-2. To the best of our knowledge, this algorithm is the first formulation that integrates three mainstream model compression ideas: sparsification or pruning, decomposition, and quantization, into one unified framework. The resulting sparse and readily-quantized DNN thus enjoys greatly reduced energy consumption in data movement as well as weight storage. On top of that, we further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance. Extensive experiments show that 1) on the algorithm level, SmartExchange outperforms state-of-the-art compression techniques, including merely sparsification or pruning, decomposition, and quantization, in various ablation studies based on nine models and four datasets; and 2) on the hardware level, SmartExchange can boost the energy efficiency by up to 6.7x and reduce the latency by up to 19.2x over four state-of-the-art DNN accelerators, when benchmarked on seven DNN models (including four standard DNNs, two compact DNN models, and one segmentation model) and three datasets.
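A NumPy sketch of the weight structure described above: each layer's weights are represented as a sparse, power-of-two coefficient matrix times a small dense basis, so reconstruction needs only shifts and adds and far fewer stored parameters. The shapes, sparsity level, and exponent range below are illustrative choices, not the paper's training procedure.

```python
# Rebuild a layer weight matrix W ~= C @ B from a small dense basis B and a
# sparse coefficient matrix C whose nonzero entries are powers of two.  In
# hardware each multiply in the reconstruction becomes a shift; here NumPy
# simply scales by 2^k for illustration.  Shapes and sparsity are assumed.
import numpy as np

rng = np.random.default_rng(3)
out_ch, in_ch, basis = 64, 64, 8

B = rng.standard_normal((basis, in_ch))            # small dense basis matrix
exponents = rng.integers(-3, 3, (out_ch, basis))   # coefficient magnitudes 2^k
pruned = rng.random((out_ch, basis)) < 0.5         # ~50% of coefficients pruned
C = np.where(pruned, 0.0,
             np.sign(rng.standard_normal((out_ch, basis))) * np.exp2(exponents))

W = C @ B                                          # reconstructed layer weights
stored = B.size + np.count_nonzero(C)              # what actually needs storing
print("dense params:", out_ch * in_ch, "stored params:", int(stored))
```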
DOI: 10.1109/ISCA45697.2020.00082
Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations
Authors: Hwang, Ranggi and Kim, Taehun and Kwon, Youngeun and Rhu, Minsoo
Keywords: processor architecture, neural network, machine learning, deep learning, accelerator, FPGA
Abstract
Personalized recommendations are the backbone machine learning (ML) algorithm that powers several important application domains (e.g., ads and e-commerce) serviced from cloud datacenters. Sparse embedding layers are a crucial building block in designing recommendations, yet little attention has been paid to properly accelerating this important ML algorithm. This paper first provides a detailed workload characterization of personalized recommendations and identifies two significant performance limiters: memory-intensive embedding layers and compute-intensive multi-layer perceptron (MLP) layers. We then present Centaur, a chiplet-based hybrid sparse-dense accelerator that addresses both the memory throughput challenges of embedding layers and the compute limitations of MLP layers. We implement and demonstrate our proposal on an Intel HARPv2, a package-integrated CPU+FPGA device, which shows a 1.7x to 17.2x performance speedup and a 1.7x to 19.5x energy-efficiency improvement over conventional approaches.
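A functional sketch of the two phases the characterization identifies: a memory-bound embedding stage (gathers and pooling over large tables) followed by a compute-bound MLP stage on the concatenated features. Table sizes, layer widths, and the pooling choice are placeholders; the chiplet mapping itself is not modeled.

```python
# The two halves of a recommendation model that Centaur maps to different
# engines: memory-intensive embedding gather + pooling (sparse), then
# compute-intensive MLP layers (dense).  All sizes here are placeholders.
import numpy as np

rng = np.random.default_rng(4)
tables = [rng.standard_normal((10_000, 32)) for _ in range(4)]   # embedding tables
W1, W2 = rng.standard_normal((4 * 32, 64)), rng.standard_normal((64, 1))

def embedding_stage(sparse_ids):
    # One gather + sum-pool per table: few FLOPs, many random DRAM reads.
    return np.concatenate([t[ids].sum(axis=0) for t, ids in zip(tables, sparse_ids)])

def mlp_stage(x):
    # Dense GEMMs: high arithmetic intensity, little memory traffic.
    return 1.0 / (1.0 + np.exp(-(np.maximum(x @ W1, 0.0) @ W2)))

if __name__ == "__main__":
    ids_per_table = [rng.integers(0, 10_000, 20) for _ in range(4)]
    print("click probability:", float(mlp_stage(embedding_stage(ids_per_table))[0]))
```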
DOI: 10.1109/ISCA45697.2020.00083
DeepRecSys: a system for optimizing end-to-end at-scale neural recommendation inference
Authors: Gupta, Udit and Hsia, Samuel and Saraph, Vikram and Wang, Xiaodong and Reagen, Brandon and Wei, Gu-Yeon and Lee, Hsien-Hsin S. and Brooks, David and Wu, Carole-Jean
Keywords: No keywords
Abstract
Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant share of cloud infrastructure’s compute demand. Thus, improving the execution efficiency of recommendation directly translates into infrastructure capacity savings. In this paper, we propose DeepRecSched, a recommendation inference scheduler that maximizes latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, model architectures, and underlying hardware systems. By carefully optimizing task versus data-level parallelism, DeepRecSched improves system throughput on server class CPUs by 2x across eight industry-representative models. Next, we deploy and evaluate this optimization in an at-scale production datacenter which reduces end-to-end tail latency across a wide variety of recommendation models by 30%. Finally, DeepRecSched demonstrates the role and impact of specialized AI hardware in optimizing system level performance (QPS) and power efficiency (QPS/watt) of recommendation inference. In order to enable the design space exploration of customized recommendation systems shown in this paper, we design and validate an end-to-end modeling infrastructure, DeepRecInfra. DeepRecInfra enables studies over a variety of recommendation use cases, taking into account at-scale effects, such as query arrival patterns and recommendation query sizes, observed from a production datacenter, as well as industry-representative models and tail latency targets.
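A toy sketch of the latency-bounded throughput trade-off the scheduler navigates: sweep how a query is split across cores (data-level parallelism) versus per-core batch size, and keep the highest-throughput point that still meets the tail-latency target. The latency model here is a synthetic placeholder; DeepRecSched derives these characteristics from measured query sizes, arrival patterns, and hardware.

```python
# Sketch of picking a latency-bounded-throughput operating point: sweep how
# a query is split between request-level parallelism and per-core batch
# size, and keep the best QPS whose tail latency meets the target.  The
# latency model below is a synthetic placeholder, not measured data.
def tail_latency_ms(batch, cores):
    fixed, per_item = 1.0, 0.05          # synthetic fan-out + per-item costs
    return fixed * cores ** 0.5 + per_item * batch

def best_config(query_size, n_cores, sla_ms):
    best = None
    for cores in range(1, n_cores + 1):
        batch = -(-query_size // cores)          # items per core (ceiling division)
        lat = tail_latency_ms(batch, cores)
        if lat <= sla_ms:
            qps = 1000.0 / lat                   # one query in flight (simplified)
            if best is None or qps > best[0]:
                best = (qps, cores, batch, lat)
    return best   # (QPS, cores used, per-core batch, tail latency)

if __name__ == "__main__":
    print(best_config(query_size=256, n_cores=16, sla_ms=10.0))
```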
DOI: 10.1109/ISCA45697.2020.00084
An in-network architecture for accelerating shared-memory multiprocessor collectives
Authors: Klenk, Benjamin and Jiang, Nan and Thorson, Greg and Dennison, Larry
Keywords: No keywords
Abstract
The slowdown of single-chip performance scaling combined with the growing demands of computing ever larger problems efficiently has led to a renewed interest in distributed architectures and specialized hardware. Dedicated accelerators for common or critical operations are becoming cost-effective additions to processors, peripherals, and networks. In this paper, we focus on one such operation, the All-Reduce, which is both a common and critical feature of neural network training. All-Reduce is impossible to fully parallelize and difficult to amortize, so it benefits greatly from hardware acceleration. We are proposing an accelerator-centric, shared-memory network that improves All-Reduce performance through in-network reductions, as well as accelerating other collectives like Multicast. We propose switch designs to support in-network computation, including two reduction methods that offer trade-offs in implementation complexity and performance. Additionally, we propose network endpoint modifications to further improve collectives. We present simulation results for a 16 GPU system showing that our collective acceleration design improves the All-Reduce operation by up to 2x for large messages and up to 18x for small messages when compared with a state-of-the-art software algorithm, leading to up to 1.4x faster DL training times for networks like Transformer. We demonstrate that this design is scalable to large systems and present results for up to 128 GPUs.
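For reference, the collective being accelerated: an All-Reduce leaves every participant holding the element-wise reduction of all inputs. The sketch below is a plain software recursive-doubling All-Reduce over simulated endpoints; in the proposed design the reduction work instead happens inside the switches.

```python
# Software recursive-doubling All-Reduce over p simulated endpoints (p a
# power of two): after log2(p) exchange steps every endpoint holds the sum
# of all input vectors.  This is the collective that the in-network design
# moves into the switches.
import numpy as np

def all_reduce(buffers):
    p = len(buffers)
    assert p & (p - 1) == 0, "power-of-two endpoint count for this sketch"
    bufs = [b.copy() for b in buffers]
    step = 1
    while step < p:
        new = [b.copy() for b in bufs]
        for rank in range(p):
            new[rank] += bufs[rank ^ step]   # exchange-and-add with the partner
        bufs, step = new, step * 2
    return bufs

if __name__ == "__main__":
    grads = [np.full(4, float(r)) for r in range(8)]   # per-GPU gradient shards
    out = all_reduce(grads)
    print(out[0], out[-1])   # both equal the sum 0+1+...+7 = 28 per element
```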
DOI: 10.1109/ISCA45697.2020.00085
DRQ: dynamic region-based quantization for deep neural network acceleration
Authors: Song, Zhuoran and Fu, Bangqi and Wu, Feiyang and Jiang, Zhaoming and Jiang, Li and Jing, Naifeng and Liang, Xiaoyao
Keywords: No keywords
Abstract
Quantization is an effective technique for Deep Neural Network (DNN) inference acceleration. However, conventional quantization techniques are either applied at network or layer level that may fail to exploit fine-grained quantization for further speedup, or only applied on kernel weights without paying attention to the feature map dynamics that may lead to lower NN accuracy. In this paper, we propose a dynamic region-based quantization, namely DRQ, which can change the precision of a DNN model dynamically based on the sensitive regions in the feature map to achieve greater acceleration while preserving better NN accuracy. We propose an algorithm to identify the sensitive regions and an architecture that utilizes a variable-speed mixed-precision convolution array to enable the algorithm with better performance and energy efficiency. Our experiments on a wide variety of networks show that compared to a coarse-grained quantization accelerator like “Eyeriss”, DRQ can achieve 92% performance gain and 72% energy reduction with less than 1% accuracy loss. Compared to the state-of-the-art mixed-precision quantization accelerator “OLAccel”, DRQ can also achieve 21% performance gain and 33% energy reduction with 3% prediction accuracy improvement, which is quite impressive for inference.
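A NumPy sketch of the region-based idea: split a feature map into small regions, treat regions whose mean magnitude exceeds a threshold as sensitive, and quantize sensitive regions at higher precision than the rest. The region size, threshold, and 8/4-bit choice are illustrative assumptions, not the paper's sensitivity detection algorithm or accelerator.

```python
# Region-based mixed-precision quantization of a feature map: regions whose
# mean magnitude exceeds a threshold stay at 8 bits, the rest drop to 4 bits.
# Region size, threshold, and bit widths are illustrative choices.
import numpy as np

def quantize(x, bits):
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) or 1.0
    return np.round(x / scale) * scale

def drq_like(fmap, region=4, threshold=0.5):
    out = np.empty_like(fmap)
    H, W = fmap.shape
    for i in range(0, H, region):
        for j in range(0, W, region):
            tile = fmap[i:i + region, j:j + region]
            bits = 8 if np.abs(tile).mean() > threshold else 4   # sensitive region?
            out[i:i + region, j:j + region] = quantize(tile, bits)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    fmap = rng.standard_normal((16, 16)) * (rng.random((16, 16)) < 0.2)
    err = np.abs(drq_like(fmap) - fmap)
    print("mean abs quantization error:", float(err.mean()))
```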
DOI: 10.1109/ISCA45697.2020.00086
Independent forward progress of work-groups
Authors: Duțu, Alexandru
Keywords: No keywords
Abstract
GPUs have evolved from providing highly-constrained programmability for a single kernel to using pre-emption to ensure independent forward progress for concurrently executing kernels. However, modern GPUs do not ensure independent forward progress for kernels that use fine-grain synchronization to coordinate inter-work-group execution. Enabling independent forward progress among work-groups (WGs) is challenging as pre-empted kernels may be rescheduled with fewer hardware resources. This can lead to oversubscribed execution scenarios that deadlock current hardware even for correctly written code. Prior work addresses this problem by requiring programmers to specify resource requirements and assuming static resource allocation, which adds scheduling constraints and reduces portability. We propose a family of novel hardware approaches, trading off hardware complexity for performance, that provide independent forward progress in the presence of fine-grain inter-WG synchronization and dynamic resource allocation. Additionally, we propose new waiting atomic instructions compatible with proposed C++20 extensions. Our final design, Autonomous WorkGroups (AWG), uses hints from regular and waiting atomics to cooperatively schedule WGs within a kernel, improving efficiency and virtualizing hardware resources. In non-oversubscribed scenarios, AWG outperforms a busy-waiting baseline (which deadlocks in oversubscribed scenarios) by 12x on average for benchmarks that use different mutexes and barriers for fine-grained, WG-granularity synchronization. Furthermore, AWG outperforms other solutions that do not deadlock in the oversubscribed case, such as fixed-interval round-robin context switching or naively extending monitor/mwait to GPUs, by 2.6x and 2.2x, respectively.
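A toy model of the oversubscription problem and of why yielding at the wait restores forward progress: with eight work-groups, two resident slots, and one kernel-wide barrier, spin-waiting residents never release their slots, so the barrier is never satisfied; releasing the slot while waiting (the cooperative behavior the paper's waiting atomics enable in hardware) lets every work-group arrive. All names and the scheduling loop are ours.

```python
# Why busy-waiting at a kernel-wide barrier deadlocks when work-groups are
# oversubscribed, and why yielding the hardware slot while waiting restores
# forward progress.  8 work-groups, 2 resident slots, one barrier.  Toy model.
def run(num_wgs=8, slots=2, yield_on_wait=True, max_steps=1000):
    arrived, waiting, finished = set(), [], 0
    resident = list(range(slots))                 # WGs currently occupying slots
    next_wg = slots
    for _ in range(max_steps):
        if finished == num_wgs:
            return "all work-groups completed"
        for wg in list(resident):
            arrived.add(wg)
            if len(arrived) == num_wgs:           # barrier finally satisfied
                finished += len(resident) + len(waiting)
                resident, waiting = [], []
                break
            if yield_on_wait:                     # release the slot while waiting
                resident.remove(wg)
                waiting.append(wg)
                if next_wg < num_wgs:
                    resident.append(next_wg)
                    next_wg += 1
            # else: spin in place, keeping the slot occupied forever
    return "deadlock: barrier never satisfied"

if __name__ == "__main__":
    print("busy-wait:", run(yield_on_wait=False))
    print("yield    :", run(yield_on_wait=True))
```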
DOI: 10.1109/ISCA45697.2020.00087
ScoRD: a scoped race detector for GPUs
Authors: Kamath, Aditya K. and George, Alvin A. and Basu, Arkaprava
Keywords: software debugging, parallel programming, graphics processing units
Abstract
GPUs have emerged as a key computing platform for an ever-growing range of applications. Unlike traditional bulk-synchronous GPU programs, many emerging GPU-accelerated applications, such as graph processing, have irregular interaction among the concurrent threads. Consequently, they need complex synchronization. To enable both high performance and adequate synchronization, GPU vendors have introduced scoped synchronization operations that allow a programmer to synchronize within a subset of concurrent threads (a.k.a., scope) that she deems adequate. Scoped synchronization avoids the performance overhead of synchronization across thousands of GPU threads while ensuring correctness when used appropriately. This flexibility, however, could be a new source of incorrect synchronization where a race can occur due to insufficient scope of the synchronization operation, and not due to missing synchronization as in a typical race. We introduce ScoRD, a race detector that enables hardware support for efficiently detecting global memory races in a GPU program, including those that arise due to insufficient scopes of synchronization operations. We show that ScoRD can detect a variety of races with a modest performance overhead (on average, 35%). In the process of this study, we also created a benchmark suite consisting of seven applications and three categories of microbenchmarks that use scoped synchronization operations.
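A minimal software illustration of the bug class ScoRD targets: two accesses from different thread blocks conflict on the same global address, but the synchronization guarding them carries only cta (block) scope, so the conflict is a race of insufficient scope rather than of missing synchronization. The checker below is a deliberately simplified model, not the paper's hardware mechanism.

```python
# Minimal scope-aware check: a conflicting pair of accesses from different
# thread blocks is considered ordered only if both sides synchronize with
# sufficient scope (device/gpu or wider, not block/cta).  This sketch is an
# illustration of the bug class, not ScoRD's hardware detector.
SCOPE_RANK = {"cta": 0, "gpu": 1, "sys": 2}

def races(accesses):
    """accesses: list of dicts with addr, block, is_write, sync_scope."""
    found = []
    for i, a in enumerate(accesses):
        for b in accesses[i + 1:]:
            conflict = (a["addr"] == b["addr"] and a["block"] != b["block"]
                        and (a["is_write"] or b["is_write"]))
            # Cross-block ordering needs at least gpu scope on both sides.
            ordered = min(SCOPE_RANK[a["sync_scope"]],
                          SCOPE_RANK[b["sync_scope"]]) >= SCOPE_RANK["gpu"]
            if conflict and not ordered:
                found.append((a, b))
    return found

if __name__ == "__main__":
    log = [
        {"addr": 0x100, "block": 0, "is_write": True,  "sync_scope": "cta"},
        {"addr": 0x100, "block": 1, "is_write": False, "sync_scope": "gpu"},
    ]
    print(len(races(log)), "scoped race(s) detected")   # 1: cta scope is too narrow
```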
DOI: 10.1109/ISCA45697.2020.00088
The virtual block interface: a flexible alternative to the conventional virtual memory framework
Authors: Hajinazar, Nastaran and Patel, Pratyush and Patel, Minesh and Kanellopoulos, Konstantinos and Ghose, Saugata and Ausavarungnirun, Rachata and Oliveira, Geraldo F. and Appavoo, Jonathan and Seshadri, Vivek and Mutlu, Onur
Keywords: No keywords
Abstract
Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework, the Virtual Block Interface (VBI). We design VBI based on the key idea that delegating memory management duties to hardware can reduce the overheads and software complexity associated with virtual memory. VBI introduces a set of variable-sized virtual blocks (VBs) to applications. Each VB is a contiguous region of the globally-visible VBI address space, and an application can allocate each semantically meaningful unit of information (e.g., a data structure) in a separate VB. VBI decouples access protection from memory allocation and address translation. While the OS controls which programs have access to which VBs, dedicated hardware in the memory controller manages the physical memory allocation and address translation of the VBs. This approach enables several architectural optimizations to (1) efficiently and flexibly cater to different and increasingly diverse system configurations, and (2) eliminate key inefficiencies of conventional virtual memory. We demonstrate the benefits of VBI with two important use cases: (1) reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and (2) two heterogeneous main memory architectures, where VBI increases the effectiveness of managing fast memory regions. For both cases, VBI significantly improves performance over conventional virtual memory.
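A sketch of the addressing split the abstract describes: a VBI address names a virtual block and an offset, the OS-visible state covers only permissions, and the base/size state that a memory controller would manage performs the actual relocation. The class, fields, and checks below are illustrative assumptions.

```python
# Virtual-block-style addressing sketch: an address is (VB id, offset); the
# OS-side check covers only permissions, while base/size state of the kind a
# memory controller would manage does the actual relocation.  All structure
# and field names are illustrative.
class VirtualBlock:
    def __init__(self, vb_id, size, phys_base, perms):
        self.vb_id, self.size = vb_id, size
        self.phys_base = phys_base      # managed below the OS (memory controller)
        self.perms = perms              # managed by the OS

def vbi_access(vb_table, vb_id, offset, op):
    vb = vb_table[vb_id]
    if op not in vb.perms:
        raise PermissionError(f"VB {vb_id}: {op} not permitted")
    if offset >= vb.size:
        raise IndexError(f"VB {vb_id}: offset beyond block size")
    return vb.phys_base + offset        # translation done without OS involvement

if __name__ == "__main__":
    table = {1: VirtualBlock(1, size=1 << 20, phys_base=0x8000_0000, perms={"r", "w"}),
             2: VirtualBlock(2, size=1 << 12, phys_base=0x9000_0000, perms={"r"})}
    print(hex(vbi_access(table, 1, 0x10, "w")))
    try:
        vbi_access(table, 2, 0x10, "w")
    except PermissionError as e:
        print(e)
```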
DOI: 10.1109/ISCA45697.2020.00089
ZnG: architecting GPU multi-processors with new flash for scalable data analysis
Authors: Zhang, Jie and Jung, Myoungsoo
Keywords: heterogeneous system, data movement, Z-NAND, SSD, MMU, L2 cache, GPU, DRAM
Abstract
We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address performance penalties imposed by an SSD. Specifically, ZnG replaces all GPU internal DRAMs with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating SSD firmware in the GPU’s MMU to reap the benefits of hardware acceleration. Although flash arrays within the SSD can deliver high accumulated bandwidth, only a small fraction of such bandwidth can be utilized by the GPU’s memory requests due to the mismatch in their access granularity. To address this, ZnG employs a large L2 cache and flash registers to buffer the memory requests. Our evaluation results indicate that ZnG can achieve 7.5x higher performance than prior work.
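A small model of the granularity mismatch mentioned above: GPU memory requests are on the order of 128B while flash reads return multi-KB pages, so a buffer in front of the flash (standing in here for ZnG's enlarged L2 cache and flash registers) lets many small requests be served per flash page access. The sizes and FIFO policy are illustrative.

```python
# Granularity mismatch between 128 B GPU memory requests and 4 KB flash
# reads: a small page buffer turns many sector-sized requests into one
# flash page access.  Sizes and the FIFO eviction policy are illustrative.
from collections import OrderedDict

REQ, FLASH_PAGE, BUF_PAGES = 128, 4096, 8

def serve(requests):
    buf, flash_reads = OrderedDict(), 0
    for addr in requests:
        page = addr // FLASH_PAGE
        if page not in buf:
            flash_reads += 1
            buf[page] = True
            if len(buf) > BUF_PAGES:
                buf.popitem(last=False)       # evict the oldest buffered page
    return flash_reads

if __name__ == "__main__":
    seq = [i * REQ for i in range(512)]       # 512 sequential 128 B requests
    print("flash reads (buffered):", serve(seq))   # 16 page reads
    print("flash reads (no reuse):", len(seq))     # 512 without buffering
```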
DOI: 10.1109/ISCA45697.2020.00090
Commutative data reordering: a new technique to reduce data movement energy on sparse inference workloads
Authors: Feinberg, Ben and Heyman, Benjamin C. and Mikhailenko, Darya and Wong, Ryan and Ho, An C. and Ipek, Engin
Keywords: optimization, memory architecture
Abstract
Data movement is a significant and growing consumer of energy in modern systems, from specialized low-power accelerators to GPUs with power budgets in the hundreds of watts. Given the importance of the problem, prior work has proposed designing interconnects on which the energy cost of transmitting a 0 is significantly lower than that of transmitting a 1. With such an interconnect, data movement energy is reduced by encoding the transmitted data such that the number of 1s is minimized. Although promising, these data encoding proposals do not take full advantage of application level semantics. As an example of a neglected optimization opportunity, consider the case of a dot product computation as part of a neural network inference task. The order in which the neural network weights are fetched and processed does not affect correctness, and can be optimized to further reduce data movement energy. This paper presents commutative data reordering (CDR), a hardware-software approach that leverages the commutative property in linear algebra to strategically select the order in which weight matrix coefficients are fetched from memory. To find a low-energy transmission order, weight ordering is modeled as an instance of one of two well-studied problems, the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. This reduction makes it possible to leverage the vast body of work on efficient approximation methods to find a good transmission order. CDR exploits the indirection inherent to sparse matrix formats such that no additional metadata is required to specify the selected order. The hardware modifications required to support CDR are minimal, and incur an area penalty of less than 0.01% when implemented on top of a mobile-class GPU. When applied to 7 neural network inference tasks running on a GPU-based system, CDR reduces average DRAM IO energy by 53.1% and 22.2% over the data bus invert encoding scheme used by LPDDR4 and the recently proposed Base+XOR encoding, respectively. These savings are attained with no changes to the mobile system software and no runtime performance penalty.
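A sketch of the reordering idea: because a dot product is commutative, the sparse weight words can be fetched in any order, so a cheap TSP approximation (greedy nearest neighbor here) can pick an order that lowers the cost of consecutive transfers. The 16-bit words, random values, and Hamming-distance cost model are illustrative stand-ins for the interconnect energy model and solvers used in the paper.

```python
# Commutative reordering sketch: weight words for a dot product can be
# fetched in any order, so a greedy nearest-neighbor tour (a cheap TSP
# approximation) picks an order that reduces bit differences between
# consecutively transmitted words.  Word width, random values, and the
# Hamming-distance cost are illustrative stand-ins.
import random

def transitions(order, words):
    return sum(bin(words[a] ^ words[b]).count("1")
               for a, b in zip(order, order[1:]))

def nearest_neighbor_order(words):
    remaining = set(range(1, len(words)))
    order = [0]
    while remaining:
        cur = order[-1]
        nxt = min(remaining, key=lambda i: bin(words[cur] ^ words[i]).count("1"))
        remaining.remove(nxt)
        order.append(nxt)
    return order

if __name__ == "__main__":
    random.seed(6)
    words = [random.getrandbits(16) for _ in range(64)]   # sparse-matrix values
    base = transitions(list(range(len(words))), words)    # original fetch order
    opt = transitions(nearest_neighbor_order(words), words)
    print(f"bit transitions on the bus: {base} -> {opt}")
```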
DOI: 10.1109/ISCA45697.2020.00091
Echo: compiler-based GPU memory footprint reduction for LSTM RNN training
Authors: Zheng, Bojian and Vijaykumar, Nandita and Pekhimenko, Gennady
Keywords: LSTM RNN, GPU memory footprint reduction, DNN training
Abstract
Long Short-Term Memory Recurrent Neural Networks (LSTM RNNs) are a popular class of machine learning models for analyzing sequential data. Their training on modern GPUs, however, is limited by the GPU memory capacity. Our profiling results of the LSTM RNN-based Neural Machine Translation (NMT) model reveal that feature maps of the attention and RNN layers form the memory bottleneck, and runtime is unevenly distributed across different layers when training on GPUs. Based on these two observations, we propose to recompute the feature maps of the attention and RNN layers rather than stashing them persistently in the GPU memory. While the idea of feature map recomputation has been considered before, existing solutions fail to deliver satisfactory footprint reduction, as they do not address two key challenges. For each feature map recomputation to be efficient, its effect on (1) the total memory footprint, and (2) the total execution time has to be carefully estimated. To this end, we propose Echo, a new compiler-based optimization scheme that addresses the first challenge with a practical mechanism that estimates the memory benefits of recomputation over the entire computation graph, and the second challenge by non-conservatively estimating the recomputation runtime overhead leveraging layer specifics. Echo reduces the GPU memory footprint automatically and transparently without any changes required to the training source code, and is effective for models beyond LSTM RNNs. We evaluate Echo on numerous state-of-the-art machine learning workloads, including NMT, DeepSpeech2, Transformer, and ResNet, on real systems with modern GPUs and observe footprint reduction ratios of 1.89x on average and 3.13x maximum. Such reduction can be converted into faster training with a larger batch size, savings in GPU energy consumption (e.g., training with one GPU as fast as with four), and/or an increase in the maximum number of layers under the same GPU memory budget. Echo is open-sourced as a part of the MXNet 2.0 framework.
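A greedy sketch of the stash-versus-recompute decision Echo has to estimate: dropping a layer's stashed feature map saves its footprint but adds recomputation time in the backward pass, so layers are chosen by savings per unit of recompute cost until the footprint fits the budget. The layer table and the greedy rule are illustrative, not Echo's graph-level compiler analysis.

```python
# Greedy sketch of the stash-vs-recompute decision: pick layers whose feature
# maps give the most memory savings per unit of recomputation time until the
# footprint fits the GPU memory budget.  The layer table and the greedy rule
# are illustrative placeholders for Echo's compiler estimates.
def choose_recompute(layers, budget_mb):
    """layers: list of (name, feature_map_mb, recompute_ms)."""
    footprint = sum(mb for _, mb, _ in layers)
    chosen, overhead_ms = [], 0.0
    for name, mb, recompute_ms in sorted(layers, key=lambda l: l[1] / l[2],
                                         reverse=True):
        if footprint <= budget_mb:
            break
        chosen.append(name)
        footprint -= mb           # feature map no longer stashed
        overhead_ms += recompute_ms
    return chosen, footprint, overhead_ms

if __name__ == "__main__":
    layers = [("attention", 3000, 12.0), ("rnn", 2500, 20.0),
              ("embedding", 400, 2.0), ("softmax", 600, 15.0)]
    print(choose_recompute(layers, budget_mb=4000))
    # (['attention'], 3500, 12.0): recomputing attention alone meets the budget
```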
DOI: 10.1109/ISCA45697.2020.00092
A case for hardware-based demand paging
Authors: Lee, Gyusun and Jin, Wenjing and Song, Wonsuk and Gong, Jeonghun and Bae, Jonghyun and Ham, Tae Jun and Lee, Jae W. and Jeong, Jinkyu
Keywords: virtual memory, page fault, operating systems, hardware extension, demand paging, CPU architecture
Abstract
The virtual memory system is pervasive in today’s computer systems, and demand paging is the key enabling mechanism for it. At a page miss, the CPU raises an exception, and the page fault handler is responsible for fetching the requested page from the disk. The OS typically performs a context switch to run other threads as traditional disk access is slow. However, with the widespread adoption of high-performance storage devices, such as low-latency solid-state drives (SSDs), the traditional OS-based demand paging is no longer effective because a considerable portion of the demand paging latency is now spent inside the OS kernel. Thus, this paper makes a case for hardware-based demand paging that mostly eliminates OS involvement in page miss handling to provide a near-disk-access-time latency for demand paging. To this end, two architectural extensions are proposed: an LBA-augmented page table that moves I/O stack operations to the control plane, and a Storage Management Unit that enables the CPU to directly issue I/O commands without OS intervention in most cases. OS support is also proposed to detach tasks for memory resource management from the critical path. The evaluation results using both a cycle-level simulator and a real x86 machine with an ultra-low-latency SSD show that the proposed scheme reduces the demand paging latency by 37.0%, and hence improves the performance of the FIO random read benchmark by up to 57.1% and a NoSQL server by up to 27.3% with real-world workloads. As a side effect of eliminating OS intervention, the IPC of the user-level code is also increased by up to 7.0%.
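A sketch of the control-plane/data-plane split proposed above: the OS records a page's backing block address (LBA) in its page table entry ahead of time, so a later page miss can be served by issuing the storage read directly and installing the page without entering the kernel's I/O stack. The structures below are an illustrative emulation, not the proposed hardware.

```python
# Demand-paging sketch with the I/O information moved to the control plane:
# the OS records a page's backing LBA in its page-table entry up front, so a
# page miss can be served by reading storage directly (here, a simulated SSD
# dict) without invoking the OS I/O stack.  All structures are illustrative.
PAGE = 4096

class PTE:
    def __init__(self, lba):
        self.present, self.frame, self.lba = False, None, lba   # LBA-augmented PTE

def access(page_table, ssd, memory, vaddr):
    vpn, offset = divmod(vaddr, PAGE)
    pte = page_table[vpn]
    if not pte.present:                         # page miss handled "in hardware"
        frame = len(memory)                     # trivially allocate the next frame
        memory.append(bytearray(ssd[pte.lba]))  # direct storage read, no OS fault path
        pte.present, pte.frame = True, frame
    return memory[pte.frame][offset]

if __name__ == "__main__":
    ssd = {77: bytes(range(256)) * 16}          # LBA 77 backs one 4 KB page
    page_table = {0: PTE(lba=77)}
    memory = []
    print(access(page_table, ssd, memory, 0x23))   # first touch: fetched from SSD
    print(access(page_table, ssd, memory, 0x23))   # now resident: regular access
```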
DOI: 10.1109/ISCA45697.2020.00093