Ten lessons from three generations shaped Google’s TPUv4i
Authors: Jouppi, Norman P. and Yoon, Doe Hyun and Ashcraft, Matthew and Gottscho, Mark and Jablin, Thomas B. and Kurian, George and Laudon, James and Li, Sheng and Ma, Peter and Ma, Xiaoyu and Norrie, Thomas and Patil, Nishant and Prasad, Sushma and Young, Cliff and Zhou, Zongwei and Patterson, David
Keywords: No keywords
Abstract
Google has deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSA); target total cost of ownership vs initial cost; support multi-tenancy; deep neural networks (DNN) grow 1.5X annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size; and backwards ML compatibility helps deploy DNNs quickly. These lessons molded TPUv4i, an inference DSA deployed since 2020.
DOI: 10.1109/ISCA52012.2021.00010
Sparsity-aware and re-configurable NPU architecture for samsung flagship mobile SoC
Authors: Jang, Jun-Woo and Lee, Sehwan and Kim, Dongyoung and Park, Hyunsun and Ardestani, Ali Shafiee and Choi, Yeongjae and Kim, Channoh and Kim, Yoojin and Yu, Hyeongseok and Abdel-Aziz, Hamzah and Park, Jun-Seok and Lee, Heonsoo and Lee, Dongwoo and Kim, Myeong Woo and Jung, Hanwoong and Nam, Heewoo and Lim, Dongguen and Lee, Seungwon and Song, Joon-Ho and Kwon, Suknam and Hassoun, Joseph and Lim, SukHwan and Choi, Changkyu
Keywords: sparsity, re-configurable, neural processing unit, neural network, mixed-precision, accelerator
Abstract
Of late, deep neural networks have become ubiquitous in mobile applications. As mobile devices generally require immediate response while maintaining user privacy, the demand for on-device machine learning technology is on the increase. Nevertheless, mobile devices suffer from restricted hardware resources, whereas deep neural networks involve considerable computation and communication. Therefore, the implementation of a neural-network specialized hardware accelerator, generally called neural processing unit (NPU), has started to gain attention for the mobile application processor (AP). However, NPUs for commercial mobile APs face two requirements that are difficult to satisfy simultaneously: execution of a wide range of applications and efficient performance. In this paper, we propose a flexible but efficient NPU architecture for a Samsung flagship mobile system-on-chip (SoC). To implement an efficient NPU, we design an energy-efficient inner-product engine that utilizes the input feature map sparsity. We propose a re-configurable MAC array to enhance the flexibility of the proposed NPU, dynamic internal memory port assignment to maximize on-chip memory bandwidth utilization, and efficient architecture to support mixed-precision arithmetic. We implement the proposed NPU using the Samsung 5nm library. Our silicon measurement experiments demonstrate that the proposed NPU achieves 290.7 FPS and 13.6 TOPS/W, when executing an 8-bit quantized Inception-v3 model [1] with a single NPU core. In addition, we analyze the proposed zero-skipping architecture in detail. Finally, we present the findings and lessons learned when implementing the commercial mobile NPU and interesting avenues for future work.
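To make the zero-skipping idea above concrete, here is a minimal software sketch (not the paper's RTL) of a sparsity-aware inner product: only non-zero input-feature-map activations trigger multiply-accumulates, which is where the cycle and energy savings come from. All names and sizes below are illustrative.

```python
import numpy as np

def sparse_inner_product(ifm, weights):
    """Accumulate only over non-zero input activations (zero-skipping).

    ifm     : 1-D array of input feature-map activations (often sparse after ReLU)
    weights : 1-D array of the corresponding filter weights
    """
    acc = 0
    for idx in np.flatnonzero(ifm):   # skip work for zero activations
        acc += ifm[idx] * weights[idx]
    return acc

# Toy usage: ~70% of activations are zero, so ~70% of MACs are skipped.
rng = np.random.default_rng(0)
ifm = rng.integers(-8, 8, size=64) * (rng.random(64) > 0.7)
w = rng.integers(-8, 8, size=64)
assert sparse_inner_product(ifm, w) == int(np.dot(ifm, w))
```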
DOI: 10.1109/ISCA52012.2021.00011
Energy efficiency boost in the AI-infused POWER10 processor
Authors: Thompto, Brian W. and Nguyen, Dung Q. and Moreira, José
Keywords: pre-silicon modeling, microprocessor design methodology, energy efficiency, POWER10, AI acceleration
Abstract
We present the novel micro-architectural features of the POWER10 design, supported by an innovative pre-silicon methodology. The resulting projected energy efficiency boost over POWER9 is 2.6x at core level (for SPECint) and up to 3x at socket level. In addition, a new feature supporting inline AI acceleration was added to the POWER ISA and incorporated into the POWER10 processor core design. The resulting boost in SIMD/AI socket performance is projected to be up to 10x for FP32 and 21x for INT8 models of ResNet-50 and BERT-Large. In this paper, we describe the novel methodology deployed and used not only to obtain these efficiency boosts for traditional workloads, but also to infuse AI/ML/HPC capability directly into the POWER10 core.
DOI: 10.1109/ISCA52012.2021.00012
Hardware architecture and software stack for PIM based on commercial DRAM technology
Authors: Lee, Sukhan and Kang, Shin-haeng and Lee, Jaehoon and Kim, Hyeonsu and Lee, Eojin and Seo, Seungwoo and Yoon, Hosang and Lee, Seungwon and Lim, Kyounghwan and Shin, Hyunsung and Kim, Jinhyun and O, Seongil and Iyer, Anand and Wang, David and Sohn, Kyomin and Kim, Nam Sung
Keywords: processing in memory, neural network, accelerator, DRAM
Abstract
Emerging applications such as deep neural networks demand high off-chip memory bandwidth. However, under stringent physical constraints of chip packages and system boards, it becomes very expensive to further increase the bandwidth of off-chip memory. Besides, transferring data across the memory hierarchy constitutes a large fraction of total energy consumption of systems, and the fraction has steadily increased with the stagnant technology scaling and poor data reuse characteristics of such emerging applications. To cost-effectively increase the bandwidth and energy efficiency, researchers began to reconsider the past processing-in-memory (PIM) architectures and advance them further, especially exploiting recent integration technologies such as 2.5D/3D stacking. Despite these recent advances, no major memory manufacturer has developed even proof-of-concept silicon yet, not to mention a product. This is because the past PIM architectures often require changes in host processors and/or application code which memory manufacturers cannot easily govern. In this paper, we propose an innovative yet practical PIM architecture that tackles the aforementioned challenges. To demonstrate its practicality and effectiveness at the system level, we implement it with a 20nm DRAM technology, integrate it with an unmodified commercial processor, develop the necessary software stack, and run existing applications without changing their source code. Our evaluation at the system level shows that our PIM improves the performance of memory-bound neural network kernels and applications by 11.2X and 3.5X, respectively. Atop the performance improvement, PIM also reduces the energy per bit transfer by 3.5X, and the overall energy efficiency of the system running the applications by 3.2X.
DOI: 10.1109/ISCA52012.2021.00013
Pioneering chiplet technology and design for the AMD EPYC™ and Ryzen™ processor families
Authors: Naffziger, Samuel and Beck, Noah and Burd, Thomas and Lepak, Kevin and Loh, Gabriel H. and Subramony, Mahesh and White, Sean
Keywords: processors, moore’s law, modular, industry, chiplets
Abstract
For decades, Moore’s Law has delivered the ability to integrate an exponentially increasing number of devices in the same silicon area at a roughly constant cost. This has enabled tremendous levels of integration, where the capabilities of computer systems that previously occupied entire rooms can now fit on a single integrated circuit. In recent times, the steady drum beat of Moore’s Law has started to slow down. Whereas device density historically doubled every 18–24 months, the rate of recent silicon process advancements has declined. While improvements in device scaling continue, albeit at a reduced pace, the industry is simultaneously observing increases in manufacturing costs. In response, the industry is now seeing a trend toward reversing direction on the traditional march toward more integration. Instead, multiple industry and academic groups are advocating that systems on chips (SoCs) be “disintegrated” into multiple smaller “chiplets.” This paper details the technology challenges that motivated AMD to use chiplets, the technical solutions we developed for our products, and how we expanded the use of chiplets from individual processors to multiple product families.
DOI: 10.1109/ISCA52012.2021.00014
Zero inclusion victim: isolating core caches from inclusive last-level cache evictions
Authors: Chaudhuri, Mainak
Keywords: inclusive cache hierarchy, inclusion victim, back-invalidation
Abstract
The most widely used last-level cache (LLC) architecture in microprocessors has been the inclusive LLC design. The popularity of the inclusive design stems from the bandwidth optimization and simplification it offers to the implementation of the cache coherence protocols. However, inclusive LLCs have always been associated with the curse of inclusion victims. An inclusion victim is a block that must be forcefully replaced from the inner levels of the cache hierarchy when the copy of the block is replaced from the inclusive LLC. This tight coupling between the LLC victims and the inner-level cache contents leads to three major drawbacks. First, live inclusion victims can lead to severe performance degradation depending on the LLC replacement policies. Second, a process can victimize the blocks of another process in an LLC shared by multiple cores, and this can be exploited to leak information through well-known eviction-based timing side-channels. An inclusive LLC makes these channels much less noisy due to the presence of inclusion victims, which allow the malicious processes to control the contents of the per-core private caches through LLC evictions. Third, to reduce the impact of the aforementioned two drawbacks, the inner-level caches, particularly the mid-level cache in a three-level inclusive cache hierarchy, must be kept small even if a larger mid-level cache could have been beneficial in the absence of inclusion victims. We observe that inclusion victims are not fundamental to the inclusion property, but arise due to the way the contents of an inclusive LLC are managed. Motivated by this observation, we introduce a fundamentally new inclusive LLC design named the Zero Inclusion Victim (ZIV) LLC that guarantees freedom from inclusion victims while retaining all advantages of an inclusive LLC. This is the first inclusive LLC design proposal to offer such a guarantee, thereby completely isolating the core caches from LLC evictions. We observe that the root cause of inclusion victims is the constraint that an LLC victim must be chosen from the set pointed to by the set indexing function. The ZIV LLC relaxes this constraint only when necessary by efficiently and minimally enabling a global victim selection scheme in the inclusive LLC to avoid generation of inclusion victims. Detailed simulations conducted with a chip-multiprocessor model using multi-programmed and multi-threaded workloads show that the ZIV LLC gracefully supports large mid-level caches (e.g., half the size of the LLC) and delivers performance close to a non-inclusive LLC for different classes of LLC replacement policies. We also show that the ZIV LLC comfortably outperforms the existing related proposals and its performance lead grows with increasing mid-level cache capacity.
DOI: 10.1109/ISCA52012.2021.00015
Exploiting page table locality for agile TLB prefetching
Authors: Vavouliotis, Georgios and Alvarez, Lluc and Karakostas, Vasileios and Nikas, Konstantinos and Koziris, Nectarios and Jiménez
Keywords: No keywords
Abstract
Frequent Translation Lookaside Buffer (TLB) misses incur high performance and energy costs due to page walks required for fetching the corresponding address translations. Prefetching page table entries (PTEs) ahead of demand TLB accesses can mitigate the address translation performance bottleneck, but each prefetch requires traversing the page table, triggering additional accesses to the memory hierarchy. Therefore, TLB prefetching is a costly technique that may undermine performance when the prefetches are not accurate. In this paper we exploit the locality in the last level of the page table to reduce the cost and enhance the effectiveness of TLB prefetching by fetching cache-line adjacent PTEs “for free”. We propose Sampling-Based Free TLB Prefetching (SBFP), a dynamic scheme that predicts the usefulness of these “free” PTEs and prefetches only the ones most likely to prevent TLB misses. We demonstrate that combining SBFP with novel and state-of-the-art TLB prefetchers significantly improves miss coverage and reduces most memory accesses due to page walks. Moreover, we propose Agile TLB Prefetcher (ATP), a novel composite TLB prefetcher particularly designed to maximize the benefits of SBFP. ATP efficiently combines three low-cost TLB prefetchers and disables TLB prefetching for those execution phases that do not benefit from it. Unlike state-of-the-art TLB prefetchers that correlate patterns with only one feature (e.g., strides, PC, distances), ATP correlates patterns with multiple features and dynamically enables the most appropriate TLB prefetcher per TLB miss. To alleviate the address translation performance bottleneck, we propose a unified solution that combines ATP and SBFP. Across an extensive set of industrial workloads provided by Qualcomm, ATP coupled with SBFP improves geometric speedup by 16.2%, and eliminates on average 37% of the memory references due to page walks. Considering the SPEC CPU 2006 and SPEC CPU 2017 benchmark suites, ATP with SBFP increases geometric speedup by 11.1%, and eliminates page walk memory references by 26%. Applied to big data workloads (GAP suite, XSBench), ATP with SBFP yields a geometric speedup of 11.8% while reducing page walk memory references by 5%. Over the best state-of-the-art TLB prefetcher for each benchmark suite, ATP with SBFP achieves speedups of 8.7%, 3.4%, and 4.2% for the Qualcomm, SPEC, and GAP+XSBench workloads, respectively.
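The “free PTE” idea lends itself to a small software model. The sketch below is invented here (it is not the paper's hardware) and assumes x86-64-style 8-byte PTEs packed eight to a cache line: a demand page walk exposes the seven neighboring PTEs at no extra memory traffic, a small sampler observes which line-relative offsets later turn out to be useful, and only offsets whose usefulness counter crosses a threshold are retained as prefetches. Structure sizes and the threshold are arbitrary.

```python
from collections import deque

PTES_PER_LINE = 8          # 8-byte PTEs in a 64-byte cache line (x86-64 radix tables)
THRESHOLD = 4              # retain an offset once its usefulness counter reaches this

class SBFPModel:
    """Toy model of sampling-based 'free' PTE prefetching (illustrative only)."""
    def __init__(self, sampler_size=64):
        self.useful = {d: 0 for d in range(-(PTES_PER_LINE - 1), PTES_PER_LINE)}
        self.sampler = deque(maxlen=sampler_size)   # recently seen free PTEs: (vpn, offset)
        self.prefetched = set()

    def on_page_walk(self, vpn):
        """A demand page walk brings in the whole PTE cache line 'for free'."""
        base = vpn - (vpn % PTES_PER_LINE)
        for neighbor in range(base, base + PTES_PER_LINE):
            if neighbor == vpn:
                continue
            d = neighbor - vpn
            if self.useful[d] >= THRESHOLD:
                self.prefetched.add(neighbor)        # predicted useful: keep it
            else:
                self.sampler.append((neighbor, d))   # just observe it for training

    def on_tlb_miss(self, vpn):
        if vpn in self.prefetched:
            return True                              # miss avoided by a free prefetch
        for svpn, d in self.sampler:                 # would a free PTE have helped?
            if svpn == vpn:
                self.useful[d] += 1                  # train the per-offset counter
                break
        return False

m = SBFPModel()
m.on_page_walk(vpn=100)   # exposes PTEs 96..103; the neighbors go to the sampler
m.on_tlb_miss(vpn=101)    # trains offset +1 toward being prefetched in the future
```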
DOI: 10.1109/ISCA52012.2021.00016
A cost-effective entangling prefetcher for instructions
Authors: Ros, Alberto and Jimborean, Alexandra
Keywords: latency, instruction prefetching, entangling, correlation, caches
Abstract
Prefetching instructions in the instruction cache is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is essential, as bringing instructions too early increases the risk of the instructions being evicted from the cache before their use, and requesting them too late can lead to the instructions arriving after they are demanded. Coverage is important to reduce the number of instruction cache misses and accuracy to ensure that the prefetcher does not pollute the cache or interact negatively with the other hardware mechanisms. This paper presents the Entangling Prefetcher for Instructions that entangles instructions to maximize timeliness. The prefetcher works by finding which instruction should trigger the prefetch for a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully adjusted to account for both coverage and accuracy. Our evaluation shows that with 40KB of storage, Entangling can increase performance up to 23%, outperforming state-of-the-art prefetchers.
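A toy software model of the latency-aware entangling idea described above (structure sizes and names are invented, and the real prefetcher's bookkeeping is far richer): on an instruction-cache miss, the model walks back through recently fetched lines to find a “head” old enough to hide the miss latency, and entangles the missing line with it so that future fetches of the head trigger a timely prefetch.

```python
from collections import deque

class EntanglingSketch:
    """Toy model of latency-aware 'entangling' of instruction cache lines."""
    def __init__(self, history=256):
        self.history = deque(maxlen=history)   # (cycle, line_addr) of recent fetches
        self.entangled = {}                    # head line -> set of tail lines to prefetch

    def on_fetch(self, cycle, line, prefetch):
        self.history.append((cycle, line))
        for tail in self.entangled.get(line, ()):
            prefetch(tail)                     # issued early enough to hide the miss

    def on_icache_miss(self, cycle, tail_line, miss_latency):
        # Pick as head the youngest fetch that is at least `miss_latency` cycles old,
        # so the prefetch it triggers can complete before `tail_line` is needed again.
        for c, head_line in reversed(self.history):
            if cycle - c >= miss_latency:
                self.entangled.setdefault(head_line, set()).add(tail_line)
                break

pf = EntanglingSketch()
issued = []
pf.on_fetch(cycle=100, line=0x40, prefetch=issued.append)
pf.on_icache_miss(cycle=400, tail_line=0x80, miss_latency=200)   # entangle 0x40 -> 0x80
pf.on_fetch(cycle=500, line=0x40, prefetch=issued.append)        # now 0x80 is prefetched
assert issued == [0x80]
```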
DOI: 10.1109/ISCA52012.2021.00017
Don’t forget the I/O when allocating your LLC
Authors: Yuan, Yifan and Alian, Mohammad and Wang, Yipeng and Wang, Ren and Kurakin, Ilia and Tai, Charlie and Kim, Nam Sung
Keywords: performance isolation, cache partitioning, DDIO
Abstract
In modern server CPUs, the last-level cache (LLC) is a critical hardware resource that exerts significant influence on the performance of workloads, and how the LLC is managed is key to performance isolation and QoS in multi-tenant clouds. In this paper, we argue that in addition to CPU cores, high-speed I/O is also important for LLC management. This is because of an Intel architectural innovation - Data Direct I/O (DDIO) - that directly injects the inbound I/O traffic into (part of) the LLC instead of the main memory. We summarize two problems caused by DDIO and show that (1) the default DDIO configuration may not always achieve optimal performance, (2) DDIO can decrease the performance of non-I/O workloads that share LLC with it by as much as 32%. We then present IAT, the first LLC management mechanism that treats I/O as a first-class citizen. IAT monitors and analyzes the performance of the core/LLC/DDIO using the CPU’s hardware performance counters and adaptively adjusts the number of LLC ways for DDIO or the tenants that demand more LLC capacity. In addition, IAT dynamically chooses the tenants that share its LLC resource with DDIO to minimize the performance interference by both the tenants and the I/O. Our experiments with multiple microbenchmarks and real-world applications demonstrate that with minimal overhead, IAT can effectively and stably reduce the performance degradation caused by DDIO.
DOI: 10.1109/ISCA52012.2021.00018
PF-DRAM: a precharge-free DRAM structure
Authors: Rohbani, Nezam and Darabi, Sina and Sarbazi-Azad, Hamid
Keywords: sense amplifier, power consumption, memory access latency, DRAM
Abstract
Although DRAM capacity and bandwidth have increased sharply through advances in technology and standards, its latency and energy per access have remained almost constant in recent generations. The main portion of DRAM power/energy is dissipated by Read, Write, and Refresh operations, all initiated by a Precharge phase. The Precharge phase not only consumes a large amount of energy, but also increases the delay of closing a row in a memory block to open another one. As the row-hit rate drops in recent workloads, especially in multi-core systems, the precharge rate increases, which exacerbates DRAM power dissipation and access latency. This work proposes a novel DRAM structure, called Precharge-Free DRAM (PF-DRAM), that eliminates the Precharge phase of DRAM. PF-DRAM uses the charge on bitlines from the previous Activation phase as the starting point for the next Activation. The difference between PF-DRAM and conventional DRAM structure is limited to precharge and equalizer circuitry and simple modifications in the sense amplifier, which are all limited to the subarray level. PF-DRAM is compatible with the mainstream JEDEC memory standards like DDRx and HBM, with minimal modifications to the memory controller. Furthermore, almost all of the previously proposed power/energy reduction techniques in DRAM are still applicable to PF-DRAM for further improvement. Our experimental results on an 8 GB memory system running SPEC CPU2017 and PARSEC 2.1 workloads show an average of 35.3% memory power consumption reduction (up to 54.2%) achieved by the system using PF-DRAM with respect to the system using conventional DRAM. Moreover, the overall performance is improved by 8.6% on average (up to 24.3%). According to our analysis, all such improvements are achieved at less than 9% area overhead.
DOI: 10.1109/ISCA52012.2021.00019
Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers
Authors: Muthukrishnan, Harini and Nellans, David and Lustig, Daniel and Fessler, Jeffrey A. and Wenisch, Thomas F.
Keywords: strong scaling, multi-GPU, heterogeneous systems, data movement, GPU memory management, GPGPU
Abstract
Despite continuing research into inter-GPU communication mechanisms, extracting performance from multi-GPU systems remains a significant challenge. Inter-GPU communication via bulk DMA-based transfers exposes data transfer latency on the GPU’s critical execution path because these large transfers are logically interleaved between compute kernels. Conversely, fine-grained peer-to-peer memory accesses during kernel execution lead to memory stalls that can exceed the GPUs’ ability to cover these operations via multi-threading. Worse yet, these sub-cacheline transfers are highly inefficient on current inter-GPU interconnects. To remedy these issues, we propose PROACT, a system enabling remote memory transfers with the programmability and pipeline advantages of peer-to-peer stores, while achieving interconnect efficiency that rivals bulk DMA transfers. Combining compile-time instrumentation with fine-grain tracking of data block readiness within each GPU, PROACT enables interconnect-friendly data transfers while hiding the transfer latency via pipelining during kernel execution. This work describes both hardware and software implementations of PROACT and demonstrates the effectiveness of a PROACT software prototype on three generations of GPU hardware and interconnects. Achieving near-ideal interconnect efficiency, PROACT realizes a mean speedup of 3.0X over single-GPU performance for 4-GPU systems, capturing 83% of available performance opportunity. On a 16-GPU NVIDIA DGX-2 system, we demonstrate an 11.0X average strong-scaling speedup over single-GPU performance, 5.3X better than a bulk DMA-based approach.
DOI: 10.1109/ISCA52012.2021.00020
RaPiD: AI accelerator for ultra-low precision training and inference
Authors: Venkataramani, Swagath and Srinivasan, Vijayalakshmi and Wang, Wei and Sen, Sanchari and Zhang, Jintao and Agrawal, Ankur and Kar, Monodeep and Jain, Shubham and Mannari, Alberto and Tran, Hoang and Li, Yulong and Ogawa, Eri and Ishizaki, Kazuaki and Inoue, Hiroshi and Schaal, Marcel and Serrano, Mauricio and Choi, Jungwook and Sun, Xiao and Wang, Naigang and Chen, Chia-Yu and Allain, Allison and Bonano, James and Cao, Nianzheng and Casatuta, Robert and Cohen, Matthew and Fleischer, Bruce and Guillorn, Michael and Haynie, Howard and Jung, Jinwook and Kang, Mingu and Kim, Kyu-hyoun and Koswatta, Siyu and Lee, Saekyu and Lutz, Martin and Mueller, Silvia and Oh, Jinwook and Ranjan, Ashish and Ren, Zhibin and Rider, Scot and Schelm, Kerstin and Scheuermann, Michael and Silberman, Joel and Yang, Jie and Zalani, Vidhi and Zhang, Xin and Zhou, Ching and Ziegler, Matt and Shah, Vinay and Ohara, Moriyoshi and Lu, Pong-Fei and Curran, Brian and Shukla, Sunil and Chang, Leland and Gopalakrishnan, Kailash
Keywords: reduced precision, hardware acceleration, deep neural networks
Abstract
The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to widespread use of hardware accelerators in their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error-resilient nature of AI workloads presents a unique opportunity for performance/energy improvement through precision scaling. Motivated by the recent algorithmic advances in precision scaling for inference and training, we designed RAPID, a 4-core AI accelerator chip supporting a spectrum of precisions, namely, 16 and 8-bit floating-point and 4 and 2-bit fixed-point. The 36mm² RAPID chip fabricated in 7nm EUV technology delivers a peak 3.5 TFLOPS/W in HFP8 mode and 16.5 TOPS/W in INT4 mode at nominal voltage. Using a performance model calibrated to within 1% of the measurement results, we evaluated DNN inference using 4-bit fixed-point representation for a 4-core (1 RAPID chip) system and DNN training using 8-bit floating point representation for a 768 TFLOPs AI system comprising 4 32-core RAPID chips. Our results show INT4 inference for a batch size of 1 achieves 3–13.5 (average 7) TOPS/W and FP8 training for a mini-batch of 512 achieves a sustained 102–588 (average 203) TFLOPS across a wide range of applications.
DOI: 10.1109/ISCA52012.2021.00021
REDUCT: keep it close, keep it cool! efficient scaling of DNN inference on multi-core CPUs with near-cache compute
Authors: Nori, Anant V. and Bera, Rahul and Balachandran, Shankar and Rakshit, Joydeep and Omer, Om J. and Abuhatzera, Avishaii and Kuttanna, Belliappa and Subramoney, Sreenivas
Keywords: No keywords
Abstract
Deep Neural Networks (DNN) are used in a variety of applications and services. With the evolving nature of DNNs, the race to build optimal hardware (both in datacenter and edge) continues. General purpose multi-core CPUs offer unique attractive advantages for DNN inference at both datacenter [60] and edge [71]. Most of the CPU pipeline design complexity is targeted towards optimizing general-purpose single thread performance, and is overkill for relatively simpler, but still hugely important, data parallel DNN inference workloads. Addressing this disparity efficiently can enable both raw performance scaling and overall performance/Watt improvements for multi-core CPU DNN inference. We present REDUCT, where we build innovative solutions that bypass traditional CPU resources which impact DNN inference power and limit its performance. Fundamentally, REDUCT’s “Keep it close” policy enables consecutive pieces of work to be executed close to each other. REDUCT enables instruction delivery/decode close to execution and instruction execution close to data. Simple ISA extensions encode the fixed-iteration count loop-y workload behavior enabling an effective bypass of many power-hungry front-end stages of the wide Out-of-Order (OoO) CPU pipeline. Per core performance scales efficiently by distributing lightweight tensor compute near all caches in a multi-level cache hierarchy. This maximizes the cumulative utilization of the existing architectural bandwidth resources in the system and minimizes movement of data. Across a number of DNN models, REDUCT achieves a 2.3X increase in convolution performance/Watt with a 2X to 3.94X scaling in raw performance. Similarly, REDUCT achieves a 1.8X increase in inner-product performance/Watt with 2.8X scaling in performance. REDUCT performance/power scaling is achieved with no increase to cache capacity or bandwidth and a mere 2.63% increase in area. Crucially, REDUCT operates entirely within the CPU programming and memory model, simplifying software development, while achieving performance similar to or better than state-of-the-art Domain Specific Accelerators (DSA) for DNN inference, providing fresh design choices in the AI era.
DOI: 10.1109/ISCA52012.2021.00022
Communication algorithm-architecture co-design for distributed deep learning
Authors: Huang, Jiayi and Majumder, Pritam and Kim, Sungkeun and Muzahid, Abdullah and Yum, Ki Hwan and Kim, Eun Jung
Keywords: interconnection network, distributed deep learning, data-parallel training, all-reduce, algorithm-architecture co-design
Abstract
Large-scale distributed deep learning training has enabled the development of more complex deep neural network models to learn from larger datasets for sophisticated tasks. In particular, distributed stochastic gradient descent intensively invokes all-reduce operations for gradient updates, which dominate communication time during iterative training epochs. In this work, we identify the inefficiency in widely used all-reduce algorithms, and the opportunity of algorithm-architecture co-design. We propose the MULTITREE all-reduce algorithm with topology and resource utilization awareness for efficient and scalable all-reduce operations, which is applicable to different interconnect topologies. Moreover, we co-design the network interface to schedule and coordinate the all-reduce messages for contention-free communications, working in synergy with the algorithm. The flow control is also simplified to exploit the bulk data transfer of big gradient exchange. We evaluate the co-design using different all-reduce data sizes for a synthetic study, demonstrating its effectiveness on various interconnection network topologies, in addition to state-of-the-art deep neural networks for real workload experiments. The results show that MULTITREE achieves 2.3X and 1.56X communication speedup, as well as up to 81% and 30% training time reduction compared to ring all-reduce and state-of-the-art approaches, respectively.
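For readers unfamiliar with the primitive being optimized, the sketch below shows a plain reduce-then-broadcast tree all-reduce simulated in a single process. MULTITREE builds several such trees with topology and resource-utilization awareness and coordinates them in the network interface, none of which is modelled here; the worker count and gradient sizes are arbitrary.

```python
import numpy as np

def tree_allreduce(grads):
    """All-reduce by reducing up a binary tree and broadcasting back down (toy, in-process)."""
    n = len(grads)
    buf = [g.copy() for g in grads]
    # Reduce phase: at each step, worker i accumulates the partial sum of worker i + step.
    step = 1
    while step < n:
        for i in range(0, n - step, 2 * step):
            buf[i] += buf[i + step]
        step *= 2
    # Broadcast phase: the root's total flows back down the same tree.
    while step > 1:
        step //= 2
        for i in range(0, n - step, 2 * step):
            buf[i + step] = buf[i].copy()
    return buf

workers = [np.full(4, w, dtype=np.float64) for w in range(8)]
out = tree_allreduce(workers)
assert all(np.array_equal(o, np.full(4, sum(range(8)), dtype=np.float64)) for o in out)
```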
DOI: 10.1109/ISCA52012.2021.00023
Vector runahead
Authors: Naithani, Ajeya and Ainsworth, Sam and Jones, Timothy M. and Eeckhout, Lieven
Keywords: No keywords
Abstract
The memory wall places a significant limit on performance for many modern workloads. These applications feature complex chains of dependent, indirect memory accesses, which cannot be picked up by even the most advanced microarchitectural prefetchers. The result is that current out-of-order superscalar processors spend the majority of their time stalled. While it is possible to build special-purpose architectures to exploit the fundamental memory-level parallelism, a microarchitectural technique to automatically improve their performance in conventional processors has remained elusive. Runahead execution is a tempting proposition for hiding latency in program execution. However, to achieve high memory-level parallelism, a standard runahead execution skips ahead of cache misses. In modern workloads, this means it only prefetches the first cache-missing load in each dependent chain. We argue that this is not a fundamental limitation. If runahead were instead to stall on cache misses to generate dependent chain loads, then it could regain performance if it could stall on many at once. With this insight, we present Vector Runahead, a technique that prefetches entire load chains and speculatively reorders scalar operations from multiple loop iterations into vector format to bring in many independent loads at once. Vectorization of the runahead instruction stream increases the effective fetch/decode bandwidth with reduced resource requirements, to achieve high degrees of memory-level parallelism at a much faster rate. Across a variety of memory-latency-bound indirect workloads, Vector Runahead achieves a 1.79X performance speedup on a large out-of-order superscalar system, significantly improving on state-of-the-art techniques.
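The intuition can be illustrated in software with NumPy: a dependent, indirect chain exposes only one long-latency load at a time, whereas reordering a block of future iterations level by level turns each level of the chain into a wide set of independent gathers. This is only a software analogue of the idea; the paper does the reordering speculatively in the microarchitecture, not in the program, and the block size below is arbitrary.

```python
import numpy as np

# Indirect, dependent-chain access pattern typical of the workloads in the paper:
# each iteration walks A[i] -> B[...] -> C[...], so a scalar core sees one
# long-latency miss at a time and stalls on it.
def scalar_chain(A, B, C):
    total = 0
    for i in range(len(A)):
        total += C[B[A[i]]]        # three dependent loads per iteration
    return total

# Take a block of future iterations and issue each *level* of the chain as one wide
# gather, so the many loads within a level are independent and can miss in parallel.
def vectorized_chain(A, B, C, block=64):
    total = 0
    for start in range(0, len(A), block):
        idx = A[start:start + block]   # level 1: independent gathers
        idx = B[idx]                   # level 2: independent gathers
        total += int(C[idx].sum())     # level 3: independent gathers + reduce
    return total

A = np.random.default_rng(1).integers(0, 1000, size=4096)
B = np.random.default_rng(2).permutation(1000)
C = np.random.default_rng(3).integers(0, 10, size=1000)
assert scalar_chain(A, B, C) == vectorized_chain(A, B, C)
```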
DOI: 10.1109/ISCA52012.2021.00024
Unlimited vector extension with data streaming support
Authors: Domingos, Joao Mario and Neves, Nuno and Roma, Nuno and Tomás
Keywords: scalable SIMD/vector processing, processor microarchitecture, instruction set extension, high-performance computing, data streaming
Abstract
Unlimited vector extension (UVE) is a novel instruction set architecture extension that brings streaming and SIMD processing together for modern computing scenarios. It aims to overcome the shortcomings of state-of-the-art scalable vector extensions by adding data streaming as a way to simultaneously reduce the overheads associated with loop control and memory access indexing, as well as with memory access latency. This is achieved through a new set of instructions that pre-configure the loop memory access patterns. These attain accurate and timely data prefetching on predictable access patterns, such as in multidimensional arrays or in indirect memory access patterns. Each of the configured data streams is associated with a general-purpose vector register, which is then used to interface with the streams. In particular, iterating over a given stream is simply achieved by reading/writing to the corresponding input/output stream, as the data is instantly consumed/produced. To evaluate the proposed UVE, a proof-of-concept gem5 implementation was integrated into an out-of-order processor model, based on the ARM Cortex-A76, thus taking into consideration the typical speculative and out-of-order execution paradigms found in high-performance computing processors. The evaluation was carried out with a set of representative kernels, by assessing the number of executed instructions, its impact on the memory bus and its overall performance. Compared to other state-of-the-art solutions, such as the upcoming ARM Scalable Vector Extension (SVE), the obtained results show that the proposed extension attains average performance speedups of over 2.4x for the same processor configuration, including vector length.
DOI: 10.1109/ISCA52012.2021.00025
Speculative vectorisation with selective replay
Authors: Sun, Peng and Gabrielli, Giacomo and Jones, Timothy M.
Keywords: No keywords
Abstract
While industry continues to develop SIMD vector ISAs by providing new instructions and wider data-paths, modern SIMD architectures still rely on the programmer or compiler to transform code to vector form only when it is safe. Limitations in the power of a compiler’s memory alias analysis and the presence of infrequent memory data dependences mean that whole regions of code cannot be safely vectorised without risking changing the semantics of the application, restricting the available performance. We present a new SIMD architecture to address this issue, which relies on speculation to identify and catch memory-dependence violations that occur during vector execution. Once identified, only those SIMD lanes that have used erroneous data are replayed; other lanes, both older and younger, keep the results of their latest execution. We use the compiler to mark loops with possible cross-iteration dependences and safely vectorise them by executing on our architecture, termed selective-replay vectorisation (SRV). Evaluating on a range of general-purpose and HPC benchmarks gives an average loop speedup of 2.9X, and up to 5.3X in the best case, over already-vectorised code. This leads to a whole-program speedup of up to 1.19X (average 1.06X) over already-vectorised applications.
DOI: 10.1109/ISCA52012.2021.00026
ABC-DIMM: alleviating the bottleneck of communication in DIMM-based near-memory processing with inter-DIMM broadcast
Authors: Sun, Weiyi and Li, Zhaoshi and Yin, Shouyi and Wei, Shaojun and Liu, Leibo
Keywords: sparse applications, near-memory processing, inter-DIMM broadcast, broadcast-process framework
Abstract
Near-Memory Processing (NMP) systems that integrate accelerators within DIMM (Dual-Inline Memory Module) buffer chips potentially provide high performance with relatively low design and manufacturing costs. However, an inevitable communication bottleneck arises when considering the main memory bus among peer DIMMs and the host CPU. This communication bottleneck is rooted in the bus-based nature and the limited point-to-point communication pattern of the main memory system. The aggregated memory bandwidth of DIMM-based NMP scales with the number of DIMMs. When the number of DIMMs in a channel scales up, the per-DIMM point-to-point communication bandwidth scales down, whereas the computation resources and local memory bandwidth per DIMM stay the same. For many important sparse data-intensive workloads like graph applications and sparse tensor algebra, we identify that communication among DIMMs and the host CPU easily dominates their processing procedure in previous DIMM-based NMP systems, which severely bottlenecks their performance. To tackle this challenge, we propose that inter-DIMM broadcast should be implemented and utilized in the main memory system of DIMM-based NMP. On the hardware side, the main memory bus naturally scales out with broadcast, where the per-DIMM effective bandwidth of broadcast remains the same as the number of DIMMs grows. On the software side, many sparse applications can be implemented in a form such that broadcasts dominate their communication. Based on these ideas, we design ABC-DIMM, which Alleviates the Bottleneck of Communication in DIMM-based NMP, consisting of integral broadcast mechanisms and a Broadcast-Process programming framework, with minimized modifications to the commodity software-hardware stack. Our evaluation shows that ABC-DIMM offers an 8.33X geo-mean speedup over a 16-core CPU baseline, and outperforms two NMP baselines by 2.59X and 2.93X on average.
DOI: 10.1109/ISCA52012.2021.00027
Sieve: scalable in-situ DRAM-based accelerator designs for massively parallel k-mer matching
Authors: Wu, Lingxi and Sharifi, Rasool and Lenjani, Marzieh and Skadron, Kevin and Venkat, Ashish
Keywords: processing-in-memory, bioinformatics, accelerator
Abstract
The rapid influx of biosequence data, coupled with the stagnation of the processing power of modern computing systems, highlights the critical need for exploring high-performance accelerators that can meet the ever-increasing throughput demands of modern bioinformatics applications. This work argues that processing in memory (PIM) is an effective solution to enhance the performance of k-mer matching, a critical bottleneck stage in standard bioinformatics pipelines, which is characterized by random access patterns and low computational intensity. This work proposes three DRAM-based in-situ k-mer matching accelerator designs (one optimized for area, one optimized for throughput, and one that strikes a balance between hardware cost and performance), dubbed Sieve, that leverage a novel data mapping scheme to allow for simultaneous comparisons of millions of DNA base pairs, lightweight matching circuitry for fast pattern matching, and an early termination mechanism that prunes unnecessary DRAM row activation to reduce latency and save energy. Evaluation of Sieve using state-of-the-art workloads with real-world datasets shows that the most aggressive design provides an average of 326X/32X speedup and 74X/48X energy savings over multi-core-CPU/GPU baselines for k-mer matching.
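A software analogue of position-at-a-time k-mer matching with early termination may help here. The paper performs this in-situ in DRAM, roughly one row activation per base position; the toy code below only mimics the pruning behaviour, and the 31-base k-mers are illustrative.

```python
# Match a query k-mer against a set of reference k-mers one base position at a time,
# pruning candidates and terminating early once no candidate can still match.
def kmer_match(query, reference_kmers):
    candidates = set(range(len(reference_kmers)))
    for pos, base in enumerate(query):          # one "row activation" per position
        candidates = {i for i in candidates if reference_kmers[i][pos] == base}
        if not candidates:                      # early termination: no k-mer can match,
            break                               # so remaining positions are never read
    return candidates

ref = ["ACGTACGTACGTACGTACGTACGTACGTACG",
       "ACGTTCGTACGTACGTACGTACGTACGTACG",
       "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"]
print(kmer_match("ACGTACGTACGTACGTACGTACGTACGTACG", ref))   # {0}
print(kmer_match("GGGGGGG" + "A" * 24, ref))                # set(): stops after position 0
```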
DOI: 10.1109/ISCA52012.2021.00028
FORMS: fine-grained polarized ReRAM-based in-situ computation for mixed-signal DNN accelerator
Authors: Yuan, Geng and Behnam, Payman and Li, Zhengang and Shafiee, Ali and Lin, Sheng and Ma, Xiaolong and Liu, Hang and Qian, Xuehai and Bojnordi, Mahdi Nazm and Wang, Yanzhi and Ding, Caiwen
Keywords: No keywords
Abstract
Recent work demonstrated the promise of using resistive random access memory (ReRAM) as an emerging technology to perform inherently parallel analog domain in-situ matrix-vector multiplication—the intensive and key computation in deep neural networks (DNNs). One key problem is that the weights are signed values. However, in a ReRAM crossbar, weights are stored as the conductance of the crossbar cells, and the in-situ computation assumes all cells on each crossbar column are of the same sign. The current architectures either use two ReRAM crossbars for positive and negative weights (PRIME), or add an offset to weights so that all values become positive (ISAAC). Neither solution is ideal: they either double the cost of crossbars, or incur extra offset circuitry. To better address this problem, we propose FORMS, a fine-grained ReRAM-based DNN accelerator with algorithm/hardware co-design. Instead of trying to represent the positive/negative weights, our key design principle is to enforce exactly what is assumed in the in-situ computation—ensuring that all weights in the same column of a crossbar have the same sign. It naturally avoids the cost of an additional crossbar. Such polarized weights can be nicely generated using alternating direction method of multipliers (ADMM) regularized optimization during the DNN training, which can exactly enforce certain patterns in DNN weights. To achieve high accuracy, we divide the crossbar into logical sub-arrays and only enforce this property within the fine-grained sub-array columns. Crucially, the small sub-arrays provide a unique opportunity for input zero-skipping, which avoids significant unnecessary computation and reduces computation time. At the same time, it also makes the hardware much easier to implement and is less susceptible to non-idealities and noise than coarse-grained architectures. Putting it all together, with the same optimized DNN models, FORMS achieves 1.50×
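The polarization constraint itself is easy to state in code. The sketch below is a stand-alone projection step, not the paper's ADMM training flow: it forces every fine-grained sub-array column segment of a weight matrix to carry a single sign by keeping the sign with the larger energy and zeroing the rest. Segment height and matrix sizes are arbitrary.

```python
import numpy as np

def polarize_columns(W, subarray_rows=16):
    """Project a weight matrix onto the constraint that every sub-array column
    segment is single-signed (toy projection only; FORMS folds this constraint
    into ADMM-regularized training)."""
    W = W.copy()
    rows, _ = W.shape
    for r0 in range(0, rows, subarray_rows):           # fine-grained sub-array granularity
        seg = W[r0:r0 + subarray_rows, :]               # view into W
        pos = np.where(seg > 0, seg, 0.0)
        neg = np.where(seg < 0, seg, 0.0)
        # Per column, keep the sign with the larger energy and zero the other
        # (the nearest single-signed column in the least-squares sense).
        keep_pos = (pos ** 2).sum(axis=0) >= (neg ** 2).sum(axis=0)
        seg[:, keep_pos] = pos[:, keep_pos]
        seg[:, ~keep_pos] = neg[:, ~keep_pos]
    return W

W = np.random.default_rng(0).normal(size=(64, 8))
P = polarize_columns(W)
# Every 16-row column segment of P is now single-signed, as the in-situ
# crossbar computation assumes.
for r0 in range(0, 64, 16):
    seg = P[r0:r0 + 16, :]
    assert np.all((seg >= 0).all(axis=0) | (seg <= 0).all(axis=0))
```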
DOI: 10.1109/ISCA52012.2021.00029
BOSS: bandwidth-optimized search accelerator for storage-class memory
Authors: Heo, Jun and Lee, Seung Yul and Min, Sunhong and Park, Yeonhong and Jung, Sung Jun and Ham, Tae Jun and Lee, Jae W.
Keywords: storage class memory, near-data processing, inverted index, hardware accelerator, full-text search
Abstract
Search is one of the most popular and important web services. The inverted index is the standard data structure adopted by most full-text search engines. Recently, custom hardware accelerators for inverted index search have emerged to demonstrate much higher throughput than the conventional CPU or GPU. However, less attention has been paid to addressing the memory capacity pressure of the inverted index. The conventional DDRx DRAM memory system significantly increases the system cost to make a terabyte-scale main memory. Instead, a shared memory pool composed of storage-class memory (SCM) devices is a promising alternative for scaling memory capacity at a much lower cost. However, this SCM-based pooled memory poses new challenges caused by the limited bandwidth of both SCM devices and the shared interconnect to the host CPU. Thus, we propose BOSS, the first near-data processing (NDP) architecture for inverted index search on SCM-based pooled memory, which maintains high throughput of query processing in this bandwidth-constrained environment. BOSS mitigates the impact of the low bandwidth of SCM devices by employing early-termination search algorithms, reducing the footprint of intermediate data, and introducing a programmable decompression module that can select the best compression scheme for a given inverted index. Furthermore, BOSS includes a top-k selection module in hardware to substantially reduce the host-accelerator bandwidth consumption. Compared to Apache Lucene, a production-grade search engine library, running on 8 CPU cores, BOSS achieves a geomean speedup of 8.1x on various complex query types, while reducing the average energy consumption by 189x.
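As a point of reference for the top-k observation, here is a minimal in-memory inverted-index query with heap-based top-k selection. It is purely illustrative (BOSS runs this near the SCM devices over compressed postings with early-termination algorithms, none of which is modelled), but it shows why only k (score, document) pairs ever need to cross the host-accelerator link; the index contents are made up.

```python
import heapq

# term -> [(doc_id, score), ...]
index = {
    "cache":  [(1, 1.0), (3, 0.5), (7, 0.75)],
    "memory": [(1, 0.5), (2, 0.75), (7, 0.25)],
}

def search_and(terms, k=2):
    # Intersect posting lists and accumulate per-document scores.
    postings = [dict(index.get(t, [])) for t in terms]
    common = set.intersection(*(set(p) for p in postings)) if postings else set()
    scored = [(sum(p[d] for p in postings), d) for d in common]
    # Only the k best (score, doc) pairs ever need to leave the near-data accelerator.
    return heapq.nlargest(k, scored)

print(search_and(["cache", "memory"]))   # [(1.5, 1), (1.0, 7)]
```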
DOI: 10.1109/ISCA52012.2021.00030
Satori: efficient and fair resource partitioning by sacrificing short-term benefits for long-term gains
Authors: Roy, Rohan Basu and Patel, Tirthak and Tiwari, Devesh
Keywords: No keywords
Abstract
Multi-core architectures have enabled data centers to increasingly co-locate multiple jobs to improve resource utilization and lower the operational cost. Unfortunately, naively co-locating multiple jobs may lead to only a modest increase in system throughput. Worse, some users may observe proportionally higher performance degradation compared to other users co-located on the same physical multi-core system. SATORI is a novel strategy to partition multi-core architectural resources to achieve two conflicting goals simultaneously: increasing system throughput and achieving fairness among the co-located jobs.
DOI: 10.1109/ISCA52012.2021.00031
Confidential serverless made efficient with plug-in enclaves
Authors: Li, Mingyu and Xia, Yubin and Chen, Haibo
Keywords: serverless, intel SGX, confidential computing
Abstract
Serverless computing has become a fact of life on modern clouds. A serverless function may process sensitive data from clients. Protecting such a function against untrusted clouds using hardware enclaves is attractive for user privacy. In this work, we run existing serverless applications in SGX enclaves, and observe that the performance degradation can be as high as 5.6X to even 422.6X. Our investigation identifies that these slowdowns are related to architectural features, mainly from page-wise enclave initialization. Leveraging insights from our overhead analysis, we revisit SGX hardware design and make minimal modifications to its enclave model. We extend SGX with a new primitive—region-wise plugin enclaves that can be mapped into existing enclaves to reuse attested common states amongst functions. By remapping plugin enclaves, an enclave allows in-situ processing to avoid expensive data movement in a function chain. Experiments show that our design reduces the enclave function latency by 94.74–99.57%, and boosts the autoscaling throughput by 19–179X.
DOI: 10.1109/ISCA52012.2021.00032
Flex: high-availability datacenters with zero reserved power
Authors: Zhang, Chaojie and Kumbhare, Alok Gautam and Manousakis, Ioannis and Zhang, Deli and Misra, Pulkit A. and Assis, Rod and Woolcock, Kyle and Mahalingam, Nithish and Warrier, Brijesh and Gauthier, David and Kunnath, Lalu and Solomon, Steve and Morales, Osvaldo and Fontoura, Marcus and Bianchini, Ricardo
Keywords: workload availability, redundant power, power capping, datacenter power management
Abstract
Cloud providers, like Amazon and Microsoft, must guarantee high availability for a large fraction of their workloads. For this reason, they build datacenters with redundant infrastructures for power delivery and cooling. Typically, the redundant resources are reserved for use only during infrastructure failure or maintenance events, so that workload performance and availability do not suffer. Unfortunately, the reserved resources also produce lower power utilization and, consequently, require more datacenters to be built. To address these problems, in this paper we propose “zero-reserved-power” datacenters and the Flex system to ensure that workloads still receive their desired performance and availability. Flex leverages the existence of software-redundant workloads that can tolerate lower infrastructure availability, while imposing minimal (if any) performance degradation for those that require high infrastructure availability. Flex mainly comprises (1) a new offline workload placement policy that reduces stranded power while ensuring safety during failure or maintenance events, and (2) a distributed system that monitors for failures and quickly reduces the power draw while respecting the workloads’ requirements, when it detects a failure. Our evaluation shows that Flex produces less than 5% stranded power and increases the number of deployed servers by up to 33%, which translates to hundreds of millions of dollars in construction cost savings per datacenter site. We end the paper with lessons from our experience bringing Flex to production in Microsoft’s datacenters.
DOI: 10.1109/ISCA52012.2021.00033
BlockMaestro: enabling programmer-transparent task-based execution in GPU systems
Authors: Abdolrashidi, AmirAli and Esfeden, Hodjat Asghari and Jahanshahi, Ali and Singh, Kaustubh and Abu-Ghazaleh, Nael and Wong, Daniel
Keywords: thread block scheduling, just-in-time, data dependency, SIMD, GPGPU
Abstract
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand for GPU computational power. Emerging workloads contain hundreds or thousands of GPU kernel launches, which incur high overheads, and exhibit data-dependent behavior between kernels, which requires synchronization, leading to GPU under-utilization. Task-based execution models have been proposed to solve these issues, but they require significant programmer effort to port applications to proprietary task-based programming models in order to specify tasks and task dependencies. To address this need, we propose BlockMaestro, a software-hardware solution that combines command queue reordering, kernel-launch-time static analysis, and runtime hardware support to dynamically identify and resolve thread-block level data dependencies between kernels. Through static analysis of memory access patterns at kernel-launch-time, BlockMaestro can extract inter-kernel thread block-level data dependencies. BlockMaestro also introduces kernel pre-launching to reduce the kernel launch overheads experienced by multiple dependent kernels. Correctness is enforced by dynamically resolving thread block-level data dependency at runtime through hardware support. BlockMaestro achieves an average speedup of 51.76% (up to 2.92x) on data-dependent benchmarks, and requires minimal hardware overhead.
DOI: 10.1109/ISCA52012.2021.00034
Opening pandora’s box: a systematic study of new ways microarchitecture can leak private data
Authors: Vicarte, Jose Rodrigo Sanchez and Shome, Pradyumna and Nayak, Nandeeka and Trippel, Caroline and Morrison, Adam and Kohlbrenner, David and Fletcher, Christopher W.
Keywords: No keywords
Abstract
Microarchitectural attacks have plunged Computer Architecture into a security crisis. Yet, as the slowing of Moore’s law justifies the use of ever more exotic microarchitecture, it is likely we have only seen the tip of the iceberg. To better anticipate this security crisis, this paper performs a systematic security-centric analysis of the Computer Architecture literature. Our rationale is that when implementing current and future processors, microarchitects will (quite reasonably) look to previously-proposed ideas. Our study uncovers seven classes of microarchitectural optimization with novel security implications, proposes a conceptual framework through which to study them and demonstrates several proofs-of-concept to show their efficacy. The optimizations we study range from those that leak as much privacy as Spectre/Meltdown (but without exploiting speculative execution) to those that otherwise undermine security-critical programs in a variety of ways. Many have storied histories—ranging from industry patents to media/3rd party speculation regarding current implementation status to recent renewed interest in the academic community. This paper’s goal is to perform an early (hopefully not too late) analysis to inform their development moving forward.
DOI: 10.1109/ISCA52012.2021.00035
I see dead μops: leaking secrets via Intel/AMD micro-op caches
Authors: Ren, Xida and Moody, Logan and Taram, Mohammadkazem and Jordan, Matthew and Tullsen, Dean M. and Venkat, Ashish
Keywords: No keywords
Abstract
Modern Intel, AMD, and ARM processors translate complex instructions into simpler internal micro-ops that are then cached in a dedicated on-chip structure called the micro-op cache. This work presents an in-depth characterization study of the micro-op cache, reverse-engineering many undocumented features, and further describes attacks that exploit the micro-op cache as a timing channel to transmit secret information. In particular, this paper describes three attacks - (1) a same thread cross-domain attack that leaks secrets across the user-kernel boundary, (2) a cross-SMT thread attack that transmits secrets across two SMT threads via the micro-op cache, and (3) transient execution attacks that have the ability to leak an unauthorized secret accessed along a misspeculated path, even before the transient instruction is dispatched to execution, breaking several existing invisible speculation and fencing-based solutions that mitigate Spectre.
DOI: 10.1109/ISCA52012.2021.00036
TimeCache: using time to eliminate cache side channels when sharing software
Authors: Ojha, Divya and Dwarkadas, Sandhya
Keywords: No keywords
Abstract
Timing side channels have been used to extract cryptographic keys and sensitive documents even from trusted enclaves. Specifically, cache side channels created by reuse of shared code or data in the memory hierarchy have been exploited by several known attacks, e.g., evict+reload for recovering an RSA key and Spectre variants for leaking speculatively loaded data. In this paper, we present TimeCache, a cache design that incorporates knowledge of prior cache line access to eliminate cache side channels due to reuse of shared software (code and data). Our goal is to retain the benefits of a shared cache of allowing each process access to the entire cache and of cache occupancy by a single copy of shared software. We achieve our goal by implementing per-process cache line visibility so that processes do not benefit from cached data brought in by another process until they have incurred a corresponding miss penalty. Our design achieves low overhead by using a novel combination of timestamps and a hardware design to allow efficient parallel comparisons of the timestamps. The solution works at all cache levels without the need to limit the number of security domains, and defends against an attacker process running on the same core, on another hyperthread, or on another core. Our implementation in the gem5 simulator demonstrates that the system is able to defend against RSA key extraction. We evaluate performance using SPEC2006 and PARSEC and observe the overhead of TimeCache to be 1.13% on average. Delay due to first access misses adds the majority of the overhead, with the security context bookkeeping incurred at the time of a context switch contributing 0.02% of the 1.13%.
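A toy model of the policy (not the hardware): a process only gets hit latency on a shared line after it has itself paid a miss-like delay for that line once. The real design tracks this with per-process timestamps compared in parallel; the sketch below simply keeps an explicit "paid" set per line, and the latencies are arbitrary.

```python
MISS_LATENCY = 200   # cycles (arbitrary)
HIT_LATENCY = 4

class TimeCacheToy:
    """Toy model of per-process cache-line visibility (illustrative only)."""
    def __init__(self):
        self.lines = {}          # line addr -> set of process ids that have paid

    def access(self, pid, addr):
        paid = self.lines.get(addr)
        if paid is None:                 # true miss: fill the line
            self.lines[addr] = {pid}
            return MISS_LATENCY
        if pid in paid:                  # ordinary hit
            return HIT_LATENCY
        paid.add(pid)                    # line is cached, but this process has not
        return MISS_LATENCY              # paid yet: delay it so reuse is unobservable

c = TimeCacheToy()
assert c.access("victim", 0x1000) == MISS_LATENCY    # victim brings the line in
assert c.access("attacker", 0x1000) == MISS_LATENCY  # attacker cannot tell it was cached
assert c.access("attacker", 0x1000) == HIT_LATENCY   # later accesses are normal hits
```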
DOI: 10.1109/ISCA52012.2021.00037
Accelerated seeding for genome sequence alignment with enumerated radix trees
Authors: Subramaniyan, Arun and Wadden, Jack and Goliya, Kush and Ozog, Nathan and Wu, Xiao and Narayanasamy, Satish and Blaauw, David and Das, Reetuparna
Keywords: sequence alignment, genomics, computer architecture, bioinformatics
Abstract
Read alignment is a time-consuming step in genome sequencing analysis. The most widely used software for read alignment, BWA-MEM, and the recently published faster version BWA-MEM2 are based on the seed-and-extend paradigm for read alignment. The seeding step of read alignment is a major bottleneck contributing ~40% to the overall execution time of BWA-MEM2 when aligning whole human genome reads from the Platinum Genomes dataset. This is because both BWA-MEM and BWA-MEM2 use a compressed index structure called the FMD-Index, which results in high bandwidth requirements, primarily due to its character-by-character processing of reads. For instance, to seed each read (101 DNA base-pairs stored in 37.8 bytes), the FMD-Index solution in BWA-MEM2 requires ~68.5 KB of index data. We propose a novel indexing data structure named Enumerated Radix Tree (ERT) and design a custom seeding accelerator based on it. ERT improves bandwidth efficiency of BWA-MEM2 by 4.5X while guaranteeing 100% identical output to the original software, and still fitting in 64 GB DRAM. Overall, the proposed seeding accelerator implemented on AWS F1 FPGA (f1.4xlarge) improves seeding throughput of BWA-MEM2 by 3.3X. When combined with seed-extension accelerators, we observe a 2.1X improvement in overall read alignment throughput over BWA-MEM2. The software implementation of ERT is integrated into BWA-MEM2 (ert branch: https://github.com/bwa-mem2/bwa-mem2/tree/ert) and is open sourced for the benefit of the research community.
DOI: 10.1109/ISCA52012.2021.00038
Aurochs: an architecture for dataflow threads
Authors: Vilim, Matthew and Rucker, Alexander and Olukotun, Kunle
Keywords: plasticine, gorgon, dataflow accelerator, database, aurochs, RDA, CGRA
Abstract
Data analytics pipelines increasingly rely on databases to select, filter, and pre-process reams of data. These databases use data structures with irregular control flow like trees and hash tables, which map poorly to existing database accelerators, leaving architects with a choice between CPUs—with stagnant performance—or accelerators that handle this complexity by relying on simpler but asymptotically sub-optimal algorithms. To bridge this gap, we propose Aurochs: a reconfigurable dataflow accelerator (RDA) that matches a CPU asymptotically but outperforms it by over 100×
DOI: 10.1109/ISCA52012.2021.00039
PipeZK: accelerating zero-knowledge proof with a pipelined architecture
Authors: Zhang, Ye and Wang, Shuo and Zhang, Xian and Dong, Jiangbin and Mao, Xingzhong and Long, Fan and Wang, Cong and Zhou, Dong and Gao, Mingyu and Sun, Guangyu
Keywords: No keywords
Abstract
Zero-knowledge proof (ZKP) is a promising cryptographic protocol for both computation integrity and privacy. It can be used in many privacy-preserving applications including verifiable cloud outsourcing and blockchains. The major obstacle to using ZKP in practice is its time-consuming proof generation step, which consists of large-size polynomial computations and multi-scalar multiplications on elliptic curves. To efficiently and practically support ZKP in real-world applications, we propose PipeZK, a pipelined accelerator with two subsystems to handle the aforementioned two intensive compute tasks, respectively. The first subsystem uses a novel dataflow to decompose large kernels into smaller ones that execute on bandwidth-efficient hardware modules, with optimized off-chip memory accesses and on-chip compute resources. The second subsystem adopts a lightweight dynamic work dispatch mechanism to share the heavy processing units, with minimized resource underutilization and load imbalance. When evaluated in 28 nm, PipeZK can achieve 10x speedup on standard cryptographic benchmarks, and 5x on a widely-used cryptocurrency application, Zcash.
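To show the shape of one of the two kernels, the sketch below computes a multi-scalar multiplication (MSM) both naively and with a Pippenger-style bucket method, using integers modulo a prime as a stand-in additive group for elliptic-curve points. This is only the mathematical kernel; PipeZK's pipelined dataflow and its polynomial subsystem are not modelled, and the modulus and window width are arbitrary.

```python
import random

P = 2**61 - 1          # toy group modulus (stand-in for a curve group, by assumption)
WINDOW = 4             # bucket window width in bits

def msm_naive(scalars, points):
    """MSM is sum_i (k_i * P_i); here '*' is scalar multiplication in the toy group."""
    return sum(k * g for k, g in zip(scalars, points)) % P

def msm_buckets(scalars, points, bits=64):
    """Bucket (Pippenger-style) method: process scalars window by window, so each
    window needs only additions into 2**WINDOW - 1 buckets plus a small reduction."""
    acc = 0
    for w in reversed(range(0, bits, WINDOW)):
        acc = (acc * (1 << WINDOW)) % P                 # shift previous windows up
        buckets = [0] * (1 << WINDOW)
        for k, g in zip(scalars, points):
            buckets[(k >> w) & ((1 << WINDOW) - 1)] += g
        running, window_sum = 0, 0
        for b in reversed(range(1, 1 << WINDOW)):       # computes sum_b b * buckets[b]
            running = (running + buckets[b]) % P
            window_sum = (window_sum + running) % P
        acc = (acc + window_sum) % P
    return acc

random.seed(0)
ks = [random.getrandbits(64) for _ in range(100)]
gs = [random.randrange(P) for _ in range(100)]
assert msm_naive(ks, gs) == msm_buckets(ks, gs)
```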
DOI: 10.1109/ISCA52012.2021.00040
Taming the zoo: the unified GraphIt compiler framework for novel architectures
Authors: Brahmakshatriya, Ajay and Furst, Emily and Ying, Victor A. and Hsu, Claire and Hong, Changwan and Ruttenberg, Max and Zhang, Yunming and Jung, Dai Cheol and Richmond, Dustin and Taylor, Michael B. and Shun, Julian and Oskin, Mark and Sanchez, Daniel and Amarasinghe, Saman
Keywords: intermediate representations, graphs, domain-specific languages, compilers for novel architectures
Abstract
We live in a new Cambrian Explosion of hardware devices. The end of conventional processor scaling has driven research and industry practice to explore a new generation of approaches. The old DNA of architecture design, including vectors, threads, shared or private memories, coherence or message passing, dataflow or von Neumann execution, is hybridized together in new and exciting ways. Each new architecture exposes a unique hardware-level API. Performance and energy efficiency are critically dependent on how well programs can use these APIs. One approach is to implement custom libraries for each new hardware architecture and application domain. A more scalable approach is to utilize a portable compiler infrastructure tailored to the application domain that makes it easy to generate efficient code for a diverse set of architectures with minimal porting effort. We propose the Unified GraphIt Compiler framework (UGC), which does exactly this for graph applications. UGC achieves portability with reasonable effort by decoupling the architecture-independent algorithm from the architecture-specific schedules and backends. We introduce a new domain-specific intermediate representation, GraphIR, that is key to this decoupling. GraphIR encodes high-level algorithm and optimization information needed for hardware-specific code generation, making it easy to develop different backends (GraphVMs) for diverse architectures, including CPUs, GPUs, and next-generation hardware such as Swarm and the HammerBlade manycore. We also build scheduling language extensions that make it easy to expose optimization decisions like load balancing strategies, blocking for locality, and other data structure choices. We evaluate UGC on five algorithms and 10 input graphs on these 4 distinct architectures and show that UGC enables implementing optimizations that can provide up to 53X speedup over programmer-generated straightforward implementations.
DOI: 10.1109/ISCA52012.2021.00041
Supporting legacy libraries on non-volatile memory: a user-transparent approach
作者: Ye, Chencheng and Xu, Yuanchao and Shen, Xipeng and Liao, Xiaofei and Jin, Hai and Solihin, Yan
关键词: No keywords
Abstract
As mainstream computing is poised to embrace the advent of byte-addressable non-volatile memory (NVM), an important roadblock has remained largely unnoticed, support of legacy libraries on NVM. Libraries underpin modern software everywhere. As current NVM programming interfaces all designate special types and constructs for NVM objects and references, legacy libraries, being incompatible with these data types, will face major obstacles for working with future applications written for NVM. This paper introduces a simple approach to mitigating the issue. The novel approach centers around user-transparent persistent reference, a new concept that allows programmers to reference a persistent object in the same way as reference a normal (volatile) object. The paper presents the implementation of the concept, carefully examines its soundness, and describes compiler and simple architecture support for keeping performance overheads very low.
DOI: 10.1109/ISCA52012.2021.00042
Execution dependence extension (EDE): isa support for eliminating fences
作者: Shull, Thomas and Vougioukas, Ilias and Nikoleris, Nikos and Elsasser, Wendy and Torrellas, Josep
关键词: instruction ordering, fences, ISA extensions
Abstract
Fence instructions are a coarse-grained mechanism to enforce the order of instruction execution in an out-of-order pipeline. They are an overkill for cases when only one instruction must wait for the completion of one other instruction. For example, this is the case when performing undo logging in Non-Volatile Memory (NVM) systems: while the update of a variable needs to wait until the corresponding undo log entry is persisted, all other instructions can be reordered. Unfortunately, current ISAs do not provide a way to describe such an execution dependence between two instructions that have no register or memory dependences. As a result, programmers must place fences, which unnecessarily serialize many unrelated instructions.To remedy this limitation, we propose an ISA extension capable of describing these execution dependences. We call the proposal Execution Dependence Extension (EDE), and add it to Arm’s AArch64 ISA. We also present two hardware realizations of EDE that enforce execution dependences at different stages of the pipeline: one in the issue queue (IQ) and another in the write buffer (WB). We implement IQ and WB in a simulator and test them with several NVM applications. Overall, by using EDE with IQ and WB rather than fences, we attain average workload speedups of 18% and 26%, respectively.
DOI: 10.1109/ISCA52012.2021.00043
Hetero-ViTAL: a virtualization stack for heterogeneous FPGA clusters
作者: Zha, Yue and Li, Jing
关键词: No keywords
Abstract
With field-programmable gate arrays (FPGAs) being widely deployed into data centers, an efficient virtualization support is required to fully unleash the potential of cloud FPGAs. Nevertheless, existing FPGA virtualization solutions only support a homogeneous FPGA cluster comprising identical FPGA devices. Representative work such as ViTAL provides sufficient system support for scale-out acceleration and improves the overall resource utilization through a fine-grained spatial sharing. While these existing solutions (including ViTAL) can efficiently virtualize a homogeneous cluster, it is hard to extend them to virtualizing a heterogeneous cluster which comprises multiple types of FPGAs. We expect the future cloud FPGAs are likely to be more heterogeneous due to hardware rolling upgrade.In this paper, we rethink FPGA virtualization from ground up and propose HETERO-VITAL to virtualize heterogeneous FPGA clusters. We identify the conflicting requirements of runtime management and offline compilation when designing the abstraction for a heterogeneous cluster, which is also the fundamental reason why the single-level abstraction as proposed in ViTAL (and other prior works) cannot be trivially extended to the heterogeneous case. To decouple these conflicting requirements, we provide a two-level system abstraction in HETERO-VITAL. Specifically, the high-level abstraction is FPGA-agnostic and provides a simple and homogeneous view of the FPGA resources to simplify the runtime management. On the contrary, the low-level abstraction is FPGA-specific and exposes sufficient spatial resource constraints to the compilation framework to ensure the mapping quality. Rather than simply adding a layer on top of the single-level abstraction as proposed in ViTAL and other prior work, we judiciously determine how much hardware details should be exposed at each level to balance the management complexity, mapping quality and compilation cost. We then develop a compilation framework to map applications onto this two-level abstraction with several optimization techniques to further improve the mapping quality. We also provide a runtime management policy to alleviate the fragmentation issue, which becomes more severe in a heterogeneous cluster due to the distinct resource capacities of diverse FPGAs.We evaluate HETERO-VITAL on a custom-built FPGA cluster and demonstrate its effectiveness using machine learning and image processing applications. Results show that HETERO-VITAL reduces the average response time (a critical metric for QoS) by 79.2% for a heterogeneous cluster compared to the non-virtualized baseline. When virtualizing a homogeneous cluster, HETERO-VITAL also reduces the average response time by 42.0% compared with ViTAL due to a better system design.
DOI: 10.1109/ISCA52012.2021.00044
CODIC: a low-cost substrate for enabling custom in-DRAM functionalities and optimizations
作者: Orosa, Lois and Wang, Yaohua and Sadrosadati, Mohammad and Kim, Jeremie S. and Patel, Minesh and Puddu, Ivan and Luo, Haocong and Razavi, Kaveh and Gómez-Luna, Juan
关键词: No keywords
Abstract
DRAM is the dominant main memory technology used in modern computing systems. Computing systems implement a memory controller that interfaces with DRAM via DRAM commands. DRAM executes the given commands using internal components (e.g., access transistors, sense amplifiers) that are orchestrated by DRAM internal timings, which are fixed for each DRAM command. Unfortunately, the use of fixed internal timings limits the types of operations that DRAM can perform and hinders the implementation of new functionalities and custom mechanisms that improve DRAM reliability, performance and energy. To overcome these limitations, we propose enabling programmable DRAM internal timings for controlling in-DRAM components.To this end, we design CODIC, a new low-cost DRAM substrate that enables fine-grained control over four previously fixed internal DRAM timings that are key to many DRAM operations. We implement CODIC with only minimal changes to the DRAM chip and the DDRx interface. To demonstrate the potential of CODIC, we propose two new CODIC-based security mechanisms that outperform state-of-the-art mechanisms in several ways: (1) a new DRAM Physical Unclonable Function (PUF) that is more robust and has significantly higher throughput than state-of-the-art DRAM PUFs, and (2) the first cold boot attack prevention mechanism that does not introduce any performance or energy overheads at runtime.
DOI: 10.1109/ISCA52012.2021.00045
NVOverlay: enabling efficient and scalable high-frequency snapshotting to NVM
作者: Wang, Ziqi and Choo, Chul-Hwan and Kozuch, Michael A. and Mowry, Todd C. and Pekhimenko, Gennady and Seshadri, Vivek and Skarlatos, Dimitrios
关键词: snapshotting, shadow paging, non-volatile memory (NVM)
Abstract
The ability to capture frequent (per millisecond) persistent snapshots to NVM would enable a number of compelling use cases. Unfortunately, existing NVM snapshotting techniques suffer from a combination of persistence barrier stalls, write amplification to NVM, and/or lack of scalability beyond a single socket. In this paper, we present NVOverlay, which is a scalable and efficient technique for capturing frequent persistent snapshots to NVM such that they can be randomly accessed later. NVOverlay uses Coherent Snapshot Tracking to efficiently track changes to memory (since the previous snapshot) across multi-socket parallel systems, and it uses Multi-snapshot NVM Mapping to store these snapshots to NVM while avoiding excessive write amplification. Our experiments demonstrate that NVOverlay successfully hides the overhead of capturing these snapshots while reducing write amplification by 29%-47% compared with state-of-the-art logging-based snapshotting techniques.
DOI: 10.1109/ISCA52012.2021.00046
Rebooting virtual memory with midgard
作者: Gupta, Siddharth and Bhattacharyya, Atri and Oh, Yunho and Bhattacharjee, Abhishek and Falsafi, Babak and Payer, Mathias
关键词: virtual memory, virtual caches, servers, memory hierarchy, datacenters, address translation
Abstract
Computer systems designers are building cache hierarchies with higher capacity to capture the ever-increasing working sets of modern workloads. Cache hierarchies with higher capacity improve system performance but shift the performance bottleneck to address translation. We propose Midgard, an intermediate address space between the virtual and the physical address spaces, to mitigate address translation overheads without program-level changes.Midgard leverages the operating system concept of virtual memory areas (VMAs) to realize a single Midgard address space where VMAs of all processes can be uniquely mapped. The Midgard address space serves as the namespace for all data in a coherence domain and the cache hierarchy. Because real-world workloads use far fewer VMAs than pages to represent their virtual address space, virtual to Midgard translation is achieved with hardware structures that are much smaller than TLB hierarchies. Costlier Midgard to physical address translations are needed only on LLC misses, which become much less frequent with larger caches. As a consequence, Midgard shows that instead of amplifying address translation overheads, memory hierarchies with large caches can reduce address translation overheads.Our evaluation shows that Midgard achieves only 5% higher address translation overhead as compared to traditional TLB hierarchies for 4KB pages when using a 16MB aggregate LLC. Midgard also breaks even with traditional TLB hierarchies for 2MB pages when using a 256MB aggregate LLC. For cache hierarchies with higher capacity, Midgard’s address translation overhead drops to near zero as secondary and tertiary data working sets fit in the LLC, while traditional TLBs suffer even higher degrees of address translation overhead.
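The two-level lookup the abstract describes can be mimicked in a few lines of software: a small range table maps virtual addresses into the intermediate (Midgard) space, and the costlier page-granular translation runs only when the cache hierarchy misses. The tables, sizes, and names below are toy assumptions, not the paper's hardware design.

```python
from bisect import bisect_right

# one entry per VMA: (virtual base, size, Midgard base) -- toy values
vma_table = [(0x0000_0000, 0x0020_0000, 0x10_0000_0000),
             (0x4000_0000, 0x0010_0000, 0x20_0000_0000)]
vma_starts = [base for base, _, _ in vma_table]

PAGE = 4096
midgard_page_table = {}               # Midgard page number -> physical page number

def va_to_midgard(va):
    """First level: a range lookup over a handful of VMAs (cheap, done per access)."""
    i = bisect_right(vma_starts, va) - 1
    base, size, mbase = vma_table[i]
    assert base <= va < base + size, "access outside every VMA"
    return mbase + (va - base)

def midgard_to_pa(ma):
    """Second level: page-granular translation, paid only on an LLC miss."""
    mpn, offset = divmod(ma, PAGE)
    ppn = midgard_page_table.setdefault(mpn, len(midgard_page_table))  # toy allocator
    return ppn * PAGE + offset

def access(va, llc):
    ma = va_to_midgard(va)            # caches are indexed/tagged by Midgard addresses
    line = ma // 64
    if line in llc:
        return "LLC hit  @ MA 0x%x" % ma
    llc.add(line)
    return "LLC miss -> PA 0x%x" % midgard_to_pa(ma)

llc = set()
print(access(0x0000_1000, llc))       # miss: pays both translation levels
print(access(0x0000_1004, llc))       # hit: only the cheap VMA-level translation
```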
DOI: 10.1109/ISCA52012.2021.00047
Dvé: improving DRAM reliability and performance on-demand via coherent replication
作者: Patil, Adarsh and Nagarajan, Vijay and Balasubramonian, Rajeev and Oswald, Nicolai
关键词: memory systems, coherence, DRAM
Abstract
As technologies continue to shrink, memory system failure rates have increased, demanding support for stronger forms of reliability. In this work, we take inspiration from the two-tier approach that decouples correction from detection and explore a novel extrapolation. We propose Dvé, a coherent replication scheme that decouples error correction from detection to improve DRAM reliability and performance on demand.
DOI: 10.1109/ISCA52012.2021.00048
Enabling compute-communication overlap in distributed deep learning training platforms
作者: Rashidi, Saeed and Denton, Matthew and Sridharan, Srinivas and Srinivasan, Sudarshan and Suresh, Amoghavarsha and Nie, Jade and Krishna, Tushar
关键词: deep learning training, communication accelerator, collective communication, accelerator fabric
Abstract
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator’s compute and memory for both DL computations and communication.This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint’s compute and memory resources for DL compute, which in turn reduces the required memory BW by 3.5X on average to drive the same network BW compared to state-of-the-art baselines. For modern DL workloads and different network sizes, ACE, on average, increases the effective network bandwidth utilization by 1.44X (up to 2.67X), resulting in an average of 1.41X (up to 1.51X), 1.12X (up to 1.17X), and 1.13X (up to 1.19X) speedup in iteration time for ResNet-50, GNMT and DLRM when compared to the best baseline configuration, respectively.
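A back-of-the-envelope model helps show why offloading collectives to an engine like ACE matters: when the accelerator's compute units also drive the allreduce, communication serializes with the backward pass, whereas a dedicated engine lets the two overlap. The per-layer times below are invented purely for illustration.

```python
# per-layer backward-compute and allreduce times, in ms -- invented numbers
layers = [dict(bwd_ms=2.0, allreduce_ms=1.5) for _ in range(48)]

def iteration_time(offload_collectives: bool) -> float:
    t_compute, t_comm_done = 0.0, 0.0
    for layer in layers:
        t_compute += layer["bwd_ms"]                    # backward pass for this layer
        if offload_collectives:
            # ACE-style engine: the allreduce starts once the gradient is ready
            # and the engine is free, overlapping with later layers' compute.
            t_comm_done = max(t_comm_done, t_compute) + layer["allreduce_ms"]
        else:
            # compute-driven collectives: the cores stall to push the gradient
            t_compute += layer["allreduce_ms"]
            t_comm_done = t_compute
    return max(t_compute, t_comm_done)

print("compute-driven collectives: %.1f ms/iter" % iteration_time(False))
print("offloaded collectives     : %.1f ms/iter" % iteration_time(True))
```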
DOI: 10.1109/ISCA52012.2021.00049
CoSA: scheduling by constrained optimization for spatial accelerators
作者: Huang, Qijing and Kang, Minwoo and Dinh, Grace and Norell, Thomas and Kalaiah, Aravind and Demmel, James and Wawrzynek, John and Shao, Yakun Sophia
关键词: scheduling, neural networks, compiler optimizations, accelerator
Abstract
Recent advances in Deep Neural Networks (DNNs) have led to active development of specialized DNN accelerators, many of which feature a large number of processing elements laid out spatially, together with a multi-level memory hierarchy and flexible interconnect. While DNN accelerators can take advantage of data reuse and achieve high peak throughput, they also expose a large number of runtime parameters to the programmers who need to explicitly manage how computation is scheduled both spatially and temporally. In fact, different scheduling choices can lead to wide variations in performance and efficiency, motivating the need for a fast and efficient search strategy to navigate the vast scheduling space.To address this challenge, we present CoSA, a constrained-optimization-based approach for scheduling DNN accelerators. As opposed to existing approaches that either rely on designers’ heuristics or iterative methods to navigate the search space, CoSA expresses scheduling decisions as a constrained-optimization problem that can be deterministically solved using mathematical optimization techniques. Specifically, CoSA leverages the regularities in DNN operators and hardware to formulate the DNN scheduling space into a mixed-integer programming (MIP) problem with algorithmic and architectural constraints, which can be solved to automatically generate a highly efficient schedule in one shot. We demonstrate that CoSA-generated schedules significantly outperform state-of-the-art approaches by a geometric mean of up to 2.5X across a wide range of DNN networks while improving the time-to-solution by 90X.
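As a toy version of "scheduling as constrained optimization", the sketch below picks tile sizes for a small matmul subject to a buffer-capacity constraint while minimizing a crude traffic proxy. CoSA hands constraints of this kind to a MIP solver and solves them in one shot; this sketch merely enumerates the same feasible set, and the cost model and sizes are assumptions, not the paper's formulation.

```python
from itertools import product

M, N, K = 256, 256, 256               # toy GEMM shape
BUFFER_WORDS = 16 * 1024              # assumed on-chip buffer capacity

def divisors(x):
    return [d for d in range(1, x + 1) if x % d == 0]

best = None
for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
    # constraint: all three tiles must fit in the buffer at once
    footprint = tm * tk + tk * tn + tm * tn
    if footprint > BUFFER_WORDS:
        continue
    # objective (crude proxy): words moved from DRAM across all tile visits
    trips = (M // tm) * (N // tn) * (K // tk)
    traffic = trips * (tm * tk + tk * tn) + (M // tm) * (N // tn) * (tm * tn)
    if best is None or traffic < best[0]:
        best = (traffic, (tm, tn, tk))

print("best tiling (tm, tn, tk):", best[1], "-> ~%d words of DRAM traffic" % best[0])
```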
DOI: 10.1109/ISCA52012.2021.00050
η-LSTM: co-designing highly-efficient large LSTM training via exploiting memory-saving and architectural design opportunities
作者: Zhang, Xingyao and Xia, Haojun and Zhuang, Donglin and Sun, Hao and Fu, Xin and Taylor, Michael B. and Song, Shuaiwen Leon
关键词: recurrent neural network, neural nets, machine learning, accelerator
Abstract
Recently, the recurrent neural network, or its most popular type—the Long Short Term Memory (LSTM) network—has achieved great success in a broad spectrum of real-world application domains, such as autonomous driving, natural language processing, sentiment analysis, and epidemiology. Due to the complex features of the real-world tasks, current LSTM models become increasingly bigger and more complicated for enhancing the learning ability and prediction accuracy. However, through our in-depth characterization on the state-of-the-art general-purpose deep-learning accelerators, we observe that the LSTM training execution grows inefficient in terms of storage, performance, and energy consumption, under an increasing model size. With further algorithmic and architectural analysis, we identify the root cause for large LSTM training inefficiency: massive intermediate variables. To enable a highly-efficient LSTM training solution for the ever-growing model size, we exploit some unique memory-saving and performance improvement opportunities from the LSTM training procedure, and leverage them to propose the first cross-stack training solution, η-LSTM, for large LSTM models. η-LSTM comprises both software-level and hardware-level innovations that effectively lower the memory footprint upper-bound and excessive data movements during large LSTM training, while also drastically improving training performance and energy efficiency. Experimental results on six real-world large LSTM training benchmarks demonstrate that η-LSTM reduces the required memory footprint by an average of 57.5% (up to 75.8%) and brings down the data movements for weight matrices, activation data, and intermediate variables by 40.9%, 32.9%, and 80.0%, respectively. Furthermore, it outperforms the state-of-the-art GPU implementation for LSTM training by an average of 3.99X.
DOI: 10.1109/ISCA52012.2021.00051
FlexMiner: a pattern-aware accelerator for graph pattern mining
作者: Chen, Xuhao and Huang, Tianhao and Xu, Shuotao and Bourgeat, Thomas and Chung, Chanwoo and Arvind
关键词: software/hardware co-design, pattern aware, graph pattern mining, accelerator
Abstract
Graph pattern mining (GPM) is a class of algorithms widely used in many real-world applications in bio-medicine, e-commerce, security, social sciences, etc. GPM is a computationally intensive problem with an enormous amount of coarse-grain parallelism and therefore, attractive for hardware acceleration. Unfortunately, existing GPM accelerators have not used the best known algorithms and optimizations, and thus offer questionable benefits over software implementations. We present FlexMiner, a software/hardware co-designed GPM accelerator that improves the efficiency without compromising the generality or productivity of state-of-the-art software GPM frameworks. FlexMiner exploits massive amount of coarse-grain parallelism in GPM by deploying a large number of specialized processing elements. For efficient searches, the FlexMiner hardware accepts pattern-specific execution plans, which are generated automatically by the FlexMiner compiler from the given pattern(s). To avoid repetitive computation on neighborhood connectivity, we provide dedicated on-chip storage to memoize reusable connectivity information in a connectivity map (c-map) which is implemented with low-cost yet high-throughput hardware. The on-chip memories in FlexMiner are managed dynamically using heuristics derived by the compiler, and thus are fully utilized. We have evaluated FlexMiner with 4 GPM applications on a wide range of real-world graphs. Our cycle-accurate simulation shows that FlexMiner with 64 PEs achieves a 10.6X speedup over the state-of-the-art software GPM framework.
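The c-map idea can be illustrated with triangle counting: connectivity queries against the current vertex's neighborhood repeat many times, so materializing that neighborhood once (FlexMiner does this in dedicated on-chip storage) turns each check into a constant-time probe. The graph and code below are a toy sketch, not the accelerator's datapath.

```python
# toy undirected graph as adjacency lists (edges: 0-1, 0-2, 0-3, 1-2, 2-3)
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

def count_triangles(adj):
    count = 0
    for v, neigh in adj.items():
        cmap = set(neigh)                     # "c-map" for v, built once per vertex
        for u in neigh:
            if u <= v:
                continue                      # avoid double counting
            # every connectivity check is a constant-time c-map probe
            count += sum(1 for w in adj[u] if w > u and w in cmap)
    return count

print(count_triangles(graph))                 # -> 2 triangles: {0,1,2} and {0,2,3}
```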
DOI: 10.1109/ISCA52012.2021.00052
PolyGraph: exposing the value of flexibility for graph processing accelerators
作者: Dadu, Vidushi and Liu, Sihao and Nowatzki, Tony
关键词: No keywords
Abstract
Because of the importance of graph workloads and the limitations of CPUs/GPUs, many graph processing accelerators have been proposed. The basic approach of prior accelerators is to focus on a single graph algorithm variant (eg. bulk-synchronous + slicing). While helpful for specialization, this leaves performance potential from flexibility on the table and also complicates understanding the relationship between graph types, workloads, algorithms, and specialization.In this work, we explore the value of flexibility in graph processing accelerators. First, we identify a taxonomy of key algorithm variants. Then we develop a template architecture (PolyGraph) that is flexible across these variants while being able to modularly integrate specialization features for each.Overall we find that flexibility in graph acceleration is critical. If only one variant can be supported, asynchronous-updates/priority-vertex-scheduling/graph-slicing is the best design, achieving 1.93X speedup over the best-performing accelerator, GraphPulse. However, static flexibility per-workload can further improve performance by 2.71X. With dynamic flexibility per-phase, performance further improves by up to 50%.
DOI: 10.1109/ISCA52012.2021.00053
Large-scale graph processing on FPGAs with caches for thousands of simultaneous misses
作者: Asiatici, Mikhail and Ienne, Paolo
关键词: nonblocking cache, graph, MOMS, FPGA, DRAM
Abstract
Efficient large-scale graph processing is crucial to many disciplines. Yet, while graph algorithms naturally expose massive parallelism opportunities, their performance is limited by the memory system because of irregular memory accesses. State-of-the-art FPGA graph processors, such as ForeGraph and FabGraph, address the memory issues by using scratchpads and regularly streaming edges from DRAM, but then they end up wasting bandwidth on unneeded data. Yet, where classic caches and scratchpads fail to deliver, FPGAs make powerful unorthodox solutions possible. In this paper, we resort to extreme nonblocking caches that handle tens of thousands of outstanding read misses. They significantly increase the ability of memory systems to coalesce multiple accelerator accesses into fewer DRAM memory requests; essentially, when latency is not the primary concern, they bring the advantages expected from a very large cache at a fraction of the cost. We prove our point with an adaptable graph accelerator running on Amazon AWS f1; our implementation takes into account all practical aspects of such a design, including the challenges involved when working with modern multidie FPGAs. Running classic algorithms (PageRank, SCC, and SSSP) on large graphs, we achieve 3X geometric mean speedup compared to state-of-the-art FPGA accelerators, 1.1–5.8X higher bandwidth efficiency and 3.0–15.3X better power efficiency than multicore CPUs, and we support much larger graphs than the state-of-the-art on GPUs.
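A rough model of the coalescing effect described above: with a large pool of outstanding-miss entries, accesses that touch a DRAM line whose miss is already in flight merge into the existing entry instead of issuing another request. The sketch below assumes an effectively unbounded entry pool and a synthetic access stream, so it only illustrates the counting argument, not the FPGA design.

```python
from collections import defaultdict
import random

LINE_BYTES = 64
random.seed(0)
# irregular byte addresses over a 1 MiB region -- a synthetic stand-in for
# pointer-chasing graph accesses
accesses = [random.randrange(1 << 20) for _ in range(100_000)]

mshrs = defaultdict(list)     # line address -> accesses merged into one in-flight miss
dram_requests = 0
for addr in accesses:
    line = addr // LINE_BYTES
    if line not in mshrs:
        dram_requests += 1    # primary miss: allocate an entry, one DRAM request
    mshrs[line].append(addr)  # primary or secondary: the access waits on that entry

# assumes an effectively unbounded entry pool and that latency is not the concern
print("accelerator accesses:", len(accesses))
print("DRAM requests       :", dram_requests)
print("coalescing factor   : %.1fx" % (len(accesses) / dram_requests))
```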
DOI: 10.1109/ISCA52012.2021.00054
Cost-efficient overclocking in immersion-cooled datacenters
作者: Jalili, Majid and Manousakis, Ioannis and Goiri, Íñigo
关键词: workload performance, server overclocking, power management, datacenter cooling
Abstract
Cloud providers typically use air-based solutions for cooling servers in datacenters. However, increasing transistor counts and the end of Dennard scaling will result in chips with thermal design power that exceeds the capabilities of air cooling in the near future. Consequently, providers have started to explore liquid cooling solutions (e.g., cold plates, immersion cooling) for the most power-hungry workloads. By keeping the servers cooler, these new solutions enable providers to operate server components beyond the normal frequency range (i.e., overclocking them) all the time. Still, providers must tradeoff the increase in performance via overclocking with its higher power draw and any component reliability implications.In this paper, we argue that two-phase immersion cooling (2PIC) is the most promising technology, and build three prototype 2PIC tanks. Given the benefits of 2PIC, we characterize the impact of overclocking on performance, power, and reliability. Moreover, we propose several new scenarios for taking advantage of overclocking in cloud platforms, including oversubscribing servers and virtual machine (VM) auto-scaling. For the auto-scaling scenario, we build a system that leverages overclocking for either hiding the latency of VM creation or postponing the VM creations in the hopes of not needing them. Using realistic cloud workloads running on a tank prototype, we show that overclocking can improve performance by 20%, increase VM packing density by 20%, and improve tail latency in auto-scaling scenarios by 54%. The combination of 2PIC and overclocking can reduce platform cost by up to 13% compared to air cooling.
DOI: 10.1109/ISCA52012.2021.00055
CryoGuard: a near refresh-free robust DRAM design for cryogenic computing
作者: Lee, Gyu-Hyeon and Na, Seongmin and Byun, Ilkwon and Min, Dongmoon and Kim, Jangwoo
关键词: temperature-aware design, low-power design, emerging technologies, DRAM
Abstract
Cryogenic computing, which runs a computer device at an extremely low temperature, is highly promising thanks to the significant reduction of the wire latency and leakage current. A recently proposed cryogenic DRAM design achieved the promising performance improvement, but it also reveals that it must reduce the DRAM’s dynamic power to overcome the huge cooling cost at 77 K. Therefore, researchers now target to reduce the cryogenic DRAM’s refresh power by utilizing its significantly increased retention time driven by the reduced leakage current. To achieve the goal, however, architects should first answer many fundamental questions regarding the reliability and then design a refresh-free, but still robust cryogenic DRAM by utilizing the analysis result.In this work, we propose a near refresh-free, but robust cryogenic DRAM (NRFC-DRAM), which can almost eliminate its refresh overhead while ensuring reliable operations at 77 K. For the purpose, we first evaluate various DRAM samples of multiple vendors by conducting a thorough analysis to accurately estimate the cryogenic DRAM’s retention time and reliability. Our analysis identifies a new critical challenge such that reducing DRAM’s refresh rate can make the memory highly unreliable because normal memory operations can now appear as row-hammer attacks at 77 K. Therefore, NRFC-DRAM requires a cost-effective, cryogenic-friendly protection mechanism against the new row-hammer-like “faults” at 77 K.To resolve the challenge, we present CryoGuard, our cryogenic-friendly row-hammer protection method to ensure the NRFC-DRAM’s reliable operations at 77 K. With CryoGuard applied, NRFC-DRAM reduces the overall power consumption by 25.9% even with its cooling cost included, whereas the existing cryogenic DRAM fails to reduce the power consumption.
DOI: 10.1109/ISCA52012.2021.00056
Superconducting computing with alternating logic elements
作者: Tzimpragos, Georgios and Volk, Jennifer and Wynn, Alex and Smith, James E. and Sherwood, Timothy
关键词: xSFQ, unordered codes, superconductor electronics, pipelining, alternating logic
Abstract
Although superconducting single flux quantum (SFQ) technologies offer the potential for low-latency operation with energy dissipation of the order of attojoules per gate, their inherently pulse-driven nature and stateful cells have led to designs in which every logic gate is clocked. This means that clocked buffers must be added to equalize logic path lengths, and every gate becomes a pipeline stage. We propose a different approach, where gates are clock-free and synchronous designs have a conventional look-and-feel. Despite being clock-free, however, the gates are state machines by nature. To properly manage these state machines, the logical clock cycle is composed of two synchronous alternating phases: the first of which implements the desired function, and the second of which returns the state machines to the ground state. Moreover, to address the challenges associated with the asynchronous implementation of Boolean NOT operations in pulse-based systems, values are represented as unordered binary codes - in particular, dual-rail codes. With unordered codes, AND and OR operations are functionally complete.We demonstrate that our new approach, xSFQ, with its dual-rail construction and alternating clock phases, along with “double-pumped” logical latches and a timing optimization through latch decomposition, is capable of implementing arbitrary digital designs without gate-level pipelining and the overheads that come with it. We evaluate energy-delay tradeoffs enabled by this approach through a mix of detailed analog circuit modeling, pulse-level discrete-event simulation, and high-level pipeline efficiency analysis. The resulting systems are shown to deliver energy-delay product (EDP) gains over conventional SFQ even with pipeline hazard ratios (HR) below 1%. For hazard ratios equal to 15% and 20% and a design resembling a RISC-V RV32I core (excluding the cost of interlock logic), xSFQ achieves 22x and 31x EDP savings, respectively.
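The dual-rail encoding that makes AND/OR functionally complete is easy to check in software: each logical bit travels on a (true, false) rail pair, inversion is a free rail swap, and any Boolean function follows. Plain Python booleans stand in for SFQ pulses below; this is a logic-level illustration only, not a circuit model.

```python
def encode(b):                 # logical bit -> (true_rail, false_rail)
    return (b, not b)

def NOT(x):                    # inversion is free: just swap the rails
    t, f = x
    return (f, t)

def AND(x, y):
    return (x[0] and y[0], x[1] or y[1])

def OR(x, y):
    return (x[0] or y[0], x[1] and y[1])

def XOR(x, y):                 # built from AND/OR and rail swaps only
    return OR(AND(x, NOT(y)), AND(NOT(x), y))

for a in (False, True):
    for b in (False, True):
        assert XOR(encode(a), encode(b))[0] == (a ^ b)
print("dual-rail XOR matches Boolean XOR on all inputs")
```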
DOI: 10.1109/ISCA52012.2021.00057
Failure sentinels: ubiquitous just-in-time intermittent computation via low-cost hardware support for voltage monitoring
作者: Williams, Harrison and Moukarzel, Michael and Hicks, Matthew
关键词: No keywords
Abstract
Energy harvesting systems support the deployment of low-power microcontrollers untethered by constant power sources or batteries, enabling long-lived deployments in a variety of applications previously limited by power or size constraints. However, the limitations of harvested energy mean that even the lowest-power microcontrollers operate intermittently—waiting for the harvester to slowly charge a buffer capacitor and rapidly discharging the capacitor to support a brief burst of computation. The challenges of the intermittent operation brought on by harvested energy drive a variety of hardware and software techniques that first enabled long-running computation, then focused on improving performance. Many of the most promising systems demand dynamic updates of available energy to inform checkpointing and mode decisions.Unfortunately, existing energy monitoring solutions based on analog circuits (e.g., analog-to-digital converters) are ill-matched for the task because their signal processing focus sacrifices power efficiency for increased performance—performance not required by current or future intermittent computation systems. This results in existing solutions consuming as much energy as the microcontroller, stealing energy from useful computation. To create a low-power energy monitoring solution that provides just enough performance for intermittent computation use cases, we design and implement Failure Sentinels, an on-chip, fully-digital energy monitor. Failure Sentinels leverages the predictable propagation delay response of digital logic gates to supply voltage fluctuations to measure available energy. Our design space exploration shows that Failure Sentinels provides 30–50mV of resolution at sample rates up to 10kHz, while consuming less than 2μA of current. Experiments show that Failure Sentinels increases the energy available for software computation by up to 77%, compared to current solutions. We also implement a RISC-V-based FPGA prototype that validates our design space exploration and shows the overheads of incorporating Failure Sentinels into a system-on-chip.
DOI: 10.1109/ISCA52012.2021.00058
SPACE: locality-aware processing in heterogeneous memory for personalized recommendations
作者: Kal, Hongju and Lee, Seokmin and Ko, Gun and Ro, Won Woo
关键词: recommendation system, near memory processing, locality, heterogeneous memory, embedding layer
Abstract
Personalized recommendation systems have become a major AI application in modern data centers. The main challenges in processing personalized recommendation inferences are the large memory footprint and high bandwidth requirement of embedding layers. To overcome the capacity limit and bandwidth congestion of on-chip memory, near memory processing (NMP) can be a promising solution. Recent work on accelerating personalized recommendations proposes a DIMM-based NMP design to solve the bandwidth problem and increases memory capacity. The performance of NMP is determined by the internal bandwidth and the prior DIMM-based approach utilizes more DIMMs to achieve higher operation throughput. However, extending the number of DIMMs could eventually lead to significant power consumption due to inefficient scaling. We propose SPACE, a novel heterogeneous memory architecture, which is efficient in terms of performance and energy. SPACE exploits a compute-capable 3D-stacked DRAM with DIMMs for personalized recommendations. Prior to designing the proposed system, we give a quantitative analysis of the user/item interactions and define the two localities: gather locality and reduction locality. In gather operations, we find only a small proportion of items are highly-accessed by users, and we call this gather locality. Also, we define reduction locality as the reusability of the gathered items in reduction operations. Based on the gather locality, SPACE allocates highly-accessed embedding items to the 3D-stacked DRAM to achieve the maximum bandwidth. Subsequently, by exploiting reduction locality, we utilize the remaining space of the 3D-stacked DRAM to store and reuse repeated partial sums, thereby minimizing the required number of element-wise reduction operations. As a result, the evaluation shows that SPACE achieves 3.2X performance improvement and 56% energy saving over the previous DIMM-based NMPs leveraging 3D-stacked DRAM with a 1/8 size of DIMMs. Also, compared to the state-of-the-art DRAM cache designs with the same NMP configuration, SPACE achieves an average 32.7% of performance improvement.
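The two localities defined above can be sketched directly: a skewed lookup stream mostly hits a small hot set of embedding rows (gather locality), and the same groups of rows are pooled repeatedly (reduction locality), so their partial sums can be cached. The table, skew, and group sizes below are invented for illustration and do not reflect SPACE's hardware organization.

```python
import random
random.seed(1)

NUM_ROWS, HOT_ROWS = 100_000, 1_000
table = [random.random() for _ in range(NUM_ROWS)]    # toy 1-D "embedding" rows
hot = set(range(HOT_ROWS))                            # rows pinned in stacked DRAM

def gather_tier(row):
    return "stacked" if row in hot else "dimm"

partial_sum_cache = {}                                # frozenset(rows) -> cached sum

def reduce_rows(rows):
    key = frozenset(rows)
    if key in partial_sum_cache:                      # reduction locality
        return partial_sum_cache[key]
    result = sum(table[r] for r in rows)
    partial_sum_cache[key] = result
    return result

# gather locality: a skewed stream mostly touches the hot rows
lookups = [random.randrange(HOT_ROWS if random.random() < 0.8 else NUM_ROWS)
           for _ in range(10_000)]
print("served from stacked DRAM: %.1f%%" %
      (100 * sum(gather_tier(r) == "stacked" for r in lookups) / len(lookups)))

# reduction locality: the same pooling groups recur, so most sums are reused
groups = [tuple(sorted(random.sample(range(HOT_ROWS), 4))) for _ in range(50)]
for _ in range(1000):
    reduce_rows(random.choice(groups))
print("distinct reductions actually computed:", len(partial_sum_cache), "of 1000")
```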
DOI: 10.1109/ISCA52012.2021.00059
ELSA: hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks
作者: Ham, Tae Jun and Lee, Yejin and Seo, Seong Hoon and Kim, Soosung and Choi, Hyunji and Jung, Sung Jun and Lee, Jae W.
关键词: neural network, hardware accelerator, attention
Abstract
The self-attention mechanism is rapidly emerging as one of the most important key primitives in neural networks (NNs) for its ability to identify the relations within input entities. The self-attention-oriented NN models such as Google Transformer and its variants have established the state-of-the-art on a very wide range of natural language processing tasks, and many other self-attention-oriented models are achieving competitive results in computer vision and recommender systems as well. Unfortunately, despite its great benefits, the self-attention mechanism is an expensive operation whose cost increases quadratically with the number of input entities that it processes, and thus accounts for a significant portion of the inference runtime. Thus, this paper presents ELSA (Efficient, Lightweight Self-Attention), a hardware-software co-designed solution to substantially reduce the runtime as well as energy spent on the self-attention mechanism. Specifically, based on the intuition that not all relations are equal, we devise a novel approximation scheme that significantly reduces the amount of computation by efficiently filtering out relations that are unlikely to affect the final output. With the specialized hardware for this approximate self-attention mechanism, ELSA achieves a geomean speedup of 58.1X as well as over three orders of magnitude improvements in energy efficiency compared to GPU on self-attention computation in modern NN models while maintaining less than 1% loss in the accuracy metric.
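Schematically, "filter unlikely relations before paying full price" looks like the following: cheap proxy scores prune key candidates, and exact attention runs only on the survivors. The proxy used here is a low-dimensional random projection, which is an assumption of this sketch rather than ELSA's approximation scheme; sizes and the keep ratio are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_small = 512, 64, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
P = rng.standard_normal((d, d_small)) / np.sqrt(d_small)   # cheap projection (assumed)

def approx_attention(q, keys, values, keep=0.1):
    proxy = (q @ P) @ (keys @ P).T                    # cheap approximate scores
    cand = np.argsort(proxy)[-int(len(keys) * keep):] # keep the most promising 10%
    scores = q @ keys[cand].T / np.sqrt(d)            # exact scores on survivors only
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values[cand]

out = np.stack([approx_attention(Q[i], K, V) for i in range(n)])
print(out.shape)   # (512, 64), computed with ~10x fewer exact dot products
```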
DOI: 10.1109/ISCA52012.2021.00060
Cambricon-Q: a hybrid architecture for efficient training
作者: Zhao, Yongwei and Liu, Chang and Du, Zidong and Guo, Qi and Hu, Xing and Zhuang, Yimin and Zhang, Zhenxing and Song, Xinkai and Li, Wei and Zhang, Xishan and Li, Ling and Xu, Zhiwei and Chen, Tianshi
关键词: No keywords
Abstract
Deep neural network (DNN) training is notoriously time-consuming, and quantization is promising to improve the training efficiency with reduced bandwidth/storage requirements and computation costs. However, state-of-the-art quantized algorithms with negligible training accuracy loss, which require on-the-fly statistic-based quantization over a great amount of data (e.g., neurons and weights) and high-precision weight update, cannot be effectively deployed on existing DNN accelerators. To address this problem, we propose the first customized architecture for efficient quantized training with negligible accuracy loss, which is named as Cambricon-Q. Cambricon-Q features a hybrid architecture consisting of an ASIC acceleration core and a near-data-processing (NDP) engine. The acceleration core mainly targets at improving the efficiency of statistic-based quantization with specialized computing units for both statistical analysis (e.g., determining maximum) and data reformating, while the NDP engine avoids transferring the high-precision weights from the off-chip memory to the acceleration core. Experimental results show that on the evaluated benchmarks, Cambricon-Q improves the energy efficiency of DNN training by 6.41X and 1.62X, performance by 4.20X and 1.70X compared to GPU and TPU, respectively, with only ⩽ 0.4% accuracy degradation compared with full precision training.
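The statistic-based quantization step that the acceleration core specializes can be sketched with a generic max-scaling quantizer: scan the tensor for its maximum magnitude, derive a scale, and reformat to int8 while the high-precision master copy stays put (Cambricon-Q keeps it near memory). The quantizer below is a generic sketch, not the paper's exact scheme.

```python
import numpy as np

def quantize_int8(x):
    """Generic on-the-fly max-scaling quantizer (illustrative, not the paper's)."""
    scale = float(np.abs(x).max()) / 127.0 or 1.0     # statistics pass over the tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale                                   # data-reformatting pass

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)  # high-precision master
q, s = quantize_int8(weights_fp32)                    # low-precision copy for compute
err = float(np.abs(dequantize(q, s) - weights_fp32).max())
print("max abs quantization error: %.5f (scale = %.5f)" % (err, s))
```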
DOI: 10.1109/ISCA52012.2021.00061
TENET: a framework for modeling tensor dataflow based on relation-centric notation
作者: Lu, Liqiang and Guan, Naiqing and Wang, Yuyue and Jia, Liancheng and Luo, Zizhang and Yin, Jieming and Cong, Jason and Liang, Yun
关键词: No keywords
Abstract
Accelerating tensor applications on spatial architectures provides high performance and energy-efficiency, but requires accurate performance models for evaluating various dataflow alternatives. Such modeling relies on the notation of tensor dataflow and the formulation of performance metrics. Recent proposed compute-centric and data-centric notations describe the dataflow using imperative directives. However, these two notations are less expressive and thus lead to limited optimization opportunities and inaccurate performance models.In this paper, we propose a framework TENET that models hardware dataflow of tensor applications. We start by introducing a relation-centric notation, which formally describes the hardware dataflow for tensor computation. The relation-centric notation specifies the hardware dataflow, PE interconnection, and data assignment in a uniform manner using relations. The relation-centric notation is more expressive than the compute-centric and data-centric notations by using more sophisticated affine transformations. Another advantage of relation-centric notation is that it inherently supports accurate metrics estimation, including data reuse, bandwidth, latency, and energy. TENET computes each performance metric by counting the relations using integer set structures and operators. Overall, TENET achieves 37.4% and 51.4% latency reduction for CONV and GEMM kernels compared with the state-of-the-art data-centric notation by identifying more sophisticated hardware dataflows.
DOI: 10.1109/ISCA52012.2021.00062
Ripple: profile-guided instruction cache replacement for data center applications
作者: Khan, Tanvir Ahmed and Zhang, Dexin and Sriraman, Akshitha and Devietti, Joseph and Pokam, Gilles and Litz, Heiner and Kasikci, Baris
关键词: No keywords
Abstract
Modern data center applications exhibit deep software stacks, resulting in large instruction footprints that frequently cause instruction cache misses degrading performance, cost, and energy efficiency. Although numerous mechanisms have been proposed to mitigate instruction cache misses, they still fall short of ideal cache behavior, and furthermore, introduce significant hardware overheads. We first investigate why existing I-cache miss mitigation mechanisms achieve sub-optimal performance for data center applications. We find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced cache line evictions that are not handled by existing replacement policies. Existing replacement policies are unable to mitigate wasteful evictions since they lack complete knowledge of a data center application’s complex program behavior.To make existing replacement policies aware of these eviction-inducing program behaviors, we propose Ripple, a novel software-only technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. Ripple carefully identifies program contexts that lead to I-cache misses and sparingly injects “cache line eviction” instructions in suitable program locations at link time. We evaluate Ripple using nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve speedup that is closer to that of an ideal I-cache. Specifically, Ripple achieves an average performance improvement of 1.6% (up to 2.13%) over prior work due to a mean 19% (up to 28.6%) I-cache miss reduction.
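In spirit, the profiling pass looks for instruction-cache lines whose next reuse is farther away than the cache can hold and records the surrounding context as a spot to inject an eviction hint at link time. The sketch below works on a synthetic line trace with a made-up capacity and shows only the analysis direction, not Ripple's actual algorithm.

```python
I_CACHE_LINES = 4                                  # toy capacity
trace = [1, 2, 3, 4, 5, 1, 2, 6, 7, 8, 9, 3]       # profiled I-cache line IDs

eviction_hints = []                                # (trace position, line) candidates
for i, line in enumerate(trace):
    # reuse distance: distinct lines touched before this line is needed again
    distinct = set()
    for nxt in trace[i + 1:]:
        if nxt == line:
            break
        distinct.add(nxt)
    else:
        distinct = None                            # never reused in the profile
    if distinct is None or len(distinct) >= I_CACHE_LINES:
        eviction_hints.append((i, line))           # spot to inject an eviction hint

print(eviction_hints)
```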
DOI: 10.1109/ISCA52012.2021.00063
Quantifying server memory frequency margin and using it to improve performance in HPC systems
作者: Zhang, Da and Panwar, Gagandeep and Kotra, Jagadish B. and DeBardeleben, Nathan and Blanchard, Sean and Jian, Xun
关键词: memory system, memory frequency margin, fault tolerance, availability, HPC
Abstract
To maintain strong reliability, memory manufacturers label server memories at much slower data rates than the highest data rates at which they can still operate correctly for most (e.g., 99.999%+ of) accesses; we refer to the gap between these two data rates as memory frequency margin. While many prior works have studied memory latency margins in a different context of consumer memories, none has publicly studied memory frequency margin (either for consumer or server memories).To close this knowledge gap in the public domain, we perform the first public study to characterize frequency margins in commodity server memory modules. Through our large-scale study, we find that under standard voltage and cooling, they can operate 27% faster, on average, without error(s) for 99.999%+ of accesses even at high temperatures.The current practice of conservatively operating server memory is far from ideal; it slows down 99.999%+ of accesses to benefit the <0.001% of accesses that would be erroneous at a faster data rate. An ideal system should only pay this reliability tax for the <0.001% of accesses that actually need it.Towards unleashing ideal performance, our second contribution is performing the first exploration on exploiting server memory frequency margin to maximize performance. We focus on High-Performance Computing (HPC) systems, where performance is paramount. We propose exploiting HPC systems’ abundant free memory in the common case to store copies of every data block and operate the copies unreliably fast to speedup common-case accesses; we use the safely-operated original blocks for recovery when the unsafely-operated copies become corrupted. We refer to our idea as Heterogeneously-accessed Dual Module Redundancy (Hetero-DMR).Hetero-DMR improves node-level performance by 18%, on average across two CPU memory hierarchies and six HPC benchmark suites, while weighted by different frequency margins and different levels of memory utilization. We also use a real system to emulate the speedup of Hetero-DMR over a conventional system; it closely matches simulation. Our system-wide simulations show applying Hetero-DMR to an HPC system provides 1.4x average speedup on job turnaround time. To facilitate adoption, Hetero-DMR also rigorously preserves system reliability and works for commodity DIMMs and CPU-memory interfaces.
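Behaviorally, Hetero-DMR reads the overclocked copy and falls back to (and repairs from) the safely-clocked original when an error is detected. The sketch below models detection with a coin flip at roughly the <0.001% rate the abstract cites; the data structures are toy stand-ins, not the proposed memory organization.

```python
import random
random.seed(42)

ERR_RATE = 1e-5                 # roughly the <0.001% erroneous-access rate cited above
safe, fast = {}, {}             # safely-clocked originals / overclocked copies

def write(addr, value):
    safe[addr] = value
    fast[addr] = value          # copy placed in otherwise-free memory, clocked fast

def read(addr):
    if random.random() < ERR_RATE:      # e.g., ECC flags an error on the fast copy
        fast[addr] = safe[addr]         # recover from the safely-operated original
    return fast[addr]                   # common case: served at the faster data rate

for a in range(1000):
    write(a, a * a)
assert all(read(a) == a * a for a in range(1000))
print("all reads correct despite occasional fast-copy errors")
```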
DOI: 10.1109/ISCA52012.2021.00064
Revamping storage class memory with hardware automated memory-over-storage solution
作者: Zhang, Jie and Kwon, Miryeong and Gouk, Donghyun and Koh, Sungjoon and Kim, Nam Sung and Kandemir, Mahmut Taylan and Jung, Myoungsoo
关键词: No keywords
Abstract
Large persistent memories such as NVDIMM have been perceived as a disruptive memory technology, because they can maintain the state of a system even after a power failure and allow the system to recover quickly. However, overheads incurred by a heavy software-stack intervention seriously negate the benefits of such memories. First, to significantly reduce the software stack overheads, we propose HAMS, a hardware automated Memory-over-Storage (MoS) solution. Specifically, HAMS aggregates the capacity of NVDIMM and ultra-low latency flash archives (ULL-Flash) into a single large memory space, which can be used as a working memory expansion or persistent memory expansion, in an OS-transparent manner. HAMS resides in the memory controller hub and manages its MoS address pool over conventional DDR and NVMe interfaces; it employs a simple hardware cache to serve all the memory requests from the host MMU after mapping the storage space of ULL-Flash to the memory space of NVDIMM. Second, to make HAMS more energy-efficient and reliable, we propose an “advanced HAMS” which removes unnecessary data transfers between NVDIMM and ULL-Flash after optimizing the datapath and hardware modules of HAMS. This approach unleashes the ULL-Flash and its NVMe controller from the storage box and directly connects the HAMS datapath to NVDIMM over the conventional DDR4 interface. Our evaluations show that HAMS and advanced HAMS can offer 97% and 119% higher system performance than a software-based NVDIMM design, while costing 41% and 45% lower energy, respectively.
DOI: 10.1109/ISCA52012.2021.00065
NASGuard: a novel accelerator architecture for robust neural architecture search (NAS) networks
作者: Wang, Xingbin and Zhao, Boyan and Hou, Rui and Awad, Amro and Tian, Zhihong and Meng, Dan
关键词: robust NAS network, adversarial example, DNN accelerator
Abstract
Due to the wide deployment of deep learning applications in safety-critical systems, robust and secure execution of deep learning workloads is imperative. Adversarial examples, where the inputs are carefully designed to mislead the machine learning model is among the most challenging attacks to detect and defeat. The most dominant approach for defending against adversarial examples is to systematically create a network architecture that is sufficiently robust. Neural Architecture Search (NAS) has been heavily used as the de facto approach to design robust neural network models, by using the accuracy of detecting adversarial examples as a key metric of the neural network’s robustness. While NAS has been proven effective in improving the robustness (and accuracy in general), the NAS-generated network models run noticeably slower on typical DNN accelerators than the hand-crafted networks, mainly because DNN accelerators are not optimized for robust NAS-generated models. In particular, the inherent multi-branch nature of NAS-generated networks causes unacceptable performance and energy overheads.To bridge the gap between the robustness and performance efficiency of deep learning applications, we need to rethink the design of AI accelerators to enable efficient execution of robust (auto-generated) neural networks. In this paper, we propose a novel hardware architecture, NASGuard, which enables efficient inference of robust NAS networks. NASGuard leverages a heuristic multi-branch mapping model to improve the efficiency of the underlying computing resources. Moreover, NASGuard addresses the load imbalance problem between the computation and memory-access tasks from multi-branch parallel computing. Finally, we propose a topology-aware performance prediction model for data prefetching, to fully exploit the temporal and spatial localities of robust NAS-generated architectures. We have implemented NASGuard with Verilog RTL. The evaluation results show that NASGuard achieves an average speedup of 1.74X over the baseline DNN accelerator.
DOI: 10.1109/ISCA52012.2021.00066
NASA: accelerating neural network design with a NAS processor
作者: Ma, Xiaohan and Si, Chang and Wang, Ying and Liu, Cheng and Zhang, Lei
关键词: No keywords
Abstract
Neural network search (NAS) projects a promising direction to automate the design process of efficient and powerful neural network architectures. Nevertheless, the NAS techniques have to dynamically generate a large number of candidate neural networks, and iteratively train and evaluate these on-line generated network architectures, thus they are extremely time-consuming even when deployed on large GPU clusters, which dramatically hinders the adoption of NAS. Though recently there are many specialized architectures proposed to accelerate the training or inference of neural networks, we observe that existing neural network accelerators are typically targeted at static neural network architectures, and they are not suitable to accelerate the evaluation of the dynamical neural network candidates evolving during the NAS process, which cannot be deployed onto current accelerators via the off-line compilation.To enable rapid and energy-efficient NAS in compact singlechip solutions, we propose NASA, a specialized architecture for one-shot based NAS acceleration. It is able to generate, schedule, and evaluate the candidate neural network architectures for the target machine learning workload with high speed, significantly alleviating the processing bottleneck of one-shot NAS. Motivated by the observation that there are considerable computation sharing opportunities among the different neural network candidates generated in one-shot NAS, NASA is equipped with an on-chip network fusion unit to remove the redundant computation during the network mapping stage. In addition, the NASA accelerator can partition and re-schedule the candidate neural network architectures at fine-granularity to maximize the chance of data reuse and improve the utilization of the accelerator arrays integrated to accelerate network evaluation. According to our experiments on multiple one-shot NAS tasks, NASA achieves 33.52X performance speedup and 214.33X energy consumption reduction on average when compared to a CPU-GPU system.
DOI: 10.1109/ISCA52012.2021.00067
PMNet: in-network data persistence
作者: Seemakhupt, Korakit and Liu, Sihang and Senevirathne, Yasas and Shahbaz, Muhammad and Khan, Samira
关键词: tail latency, switch, programmable network, persistent memory, in-network processing, data center, RPC, NIC
Abstract
To guarantee data persistence, storage workloads (such as key-value stores and databases) typically use a synchronous protocol that places the network and server stack latency on the critical path of request processing. The use of the fast and byte-addressable persistent memory (PM) has helped mitigate the storage overhead of the server stack; yet, networking is still a dominant factor in the end-to-end latency of request processing. Emerging programmable network devices can reduce network latency by moving parts of the applications’ compute into the network (e.g., caching results for read requests); however, for update requests, the client still has to stall on the server to commit the updates, persistently.In this work, we introduce in-network data persistence that extends the data-persistence domain from servers to the network, and present PMNet, a programmable data plane (e.g., switch or NIC) with PM for persisting data in the network. PMNet logs incoming update requests and acknowledges clients directly without having them wait on the server to commit the request. In case of a failure, the logged requests act as redo logs for the server to recover. We implement PMNet on an FPGA and evaluate its performance using common PM workloads, including key-value stores and PM-backed applications. Our evaluation shows that PMNet can improve the throughput of update requests by 4.31X on average, and the 99th-percentile tail latency by 3.23X.
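At the protocol level, the idea is that the data plane logs an update into its persistent memory and acknowledges the client immediately, while the server applies updates asynchronously and replays the in-network redo log after a crash. The queues and crash model below are toy stand-ins for the switch/NIC pipeline, not PMNet's implementation.

```python
from collections import deque

pm_redo_log = deque()           # persistent memory on the switch/NIC data plane
server_store = {}               # the PM-backed key-value store on the server

def client_put(key, value):
    pm_redo_log.append((key, value))    # logged (persisted) in the network ...
    return "ACK"                        # ... so the client is acked immediately

def server_apply(budget):
    for _ in range(min(budget, len(pm_redo_log))):
        key, value = pm_redo_log[0]
        server_store[key] = value       # commit to server persistent memory ...
        pm_redo_log.popleft()           # ... then trim the corresponding log entry

def recover_after_server_crash():
    for key, value in pm_redo_log:      # redo whatever was acked but never applied
        server_store[key] = value

client_put("a", 1)
client_put("b", 2)
server_apply(budget=1)                  # pretend the server crashes before applying "b"
recover_after_server_crash()
assert server_store == {"a": 1, "b": 2}
print("store recovered:", server_store)
```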
DOI: 10.1109/ISCA52012.2021.00068
Exploiting long-distance interactions and tolerating atom loss in neutral atom quantum architectures
作者: Baker, Jonathan M. and Litteken, Andrew and Duckering, Casey and Hoffmann, Henry and Bernien, Hannes and Chong, Frederic T.
关键词: quantum computing, neutral atoms, compiler
Abstract
Quantum technologies currently struggle to scale beyond moderate scale prototypes and are unable to execute even reasonably sized programs due to prohibitive gate error rates or coherence times. Many software approaches rely on heavy compiler optimization to squeeze extra value from noisy machines but are fundamentally limited by hardware. Alone, these software approaches help to maximize the use of available hardware but cannot overcome the inherent limitations posed by the underlying technology.An alternative approach is to explore the use of new, though potentially less developed, technology as a path towards scalability. In this work we evaluate the advantages and disadvantages of a Neutral Atom (NA) architecture. NA systems offer several promising advantages such as long range interactions and native multiqubit gates which reduce communication overhead, overall gate count, and depth for compiled programs. Long range interactions, however, impede parallelism with restriction zones surrounding interacting qubit pairs. We extend current compiler methods to maximize the benefit of these advantages and minimize the cost.Furthermore, atoms in an NA device have the possibility to randomly be lost over the course of program execution which is extremely detrimental to total program execution time as atom arrays are slow to load. When the compiled program is no longer compatible with the underlying topology, we need a fast and efficient coping mechanism. We propose hardware and compiler methods to increase system resilience to atom loss dramatically reducing total computation time by circumventing complete reloads or full recompilation every cycle.
DOI: 10.1109/ISCA52012.2021.00069
Software-hardware co-optimization for computational chemistry on superconducting quantum processors
作者: Li, Gushu and Shi, Yunong and Javadi-Abhari, Ali
关键词: superconducting quantum processor, software-hardware co-optimization, quantum computing, computational chemistry
Abstract
Computational chemistry is the leading application to demonstrate the advantage of quantum computing in the near term. However, large-scale simulation of chemical systems on quantum computers is currently hindered due to a mismatch between the computational resource needs of the program and those available in today’s technology. In this paper we argue that significant new optimizations can be discovered by co-designing the application, compiler, and hardware. We show that multiple optimization objectives can be coordinated through the key abstraction layer of Pauli strings, which are the basic building blocks of computational chemistry programs. In particular, we leverage Pauli strings to identify critical program components that can be used to compress program size with minimal loss of accuracy. We also leverage the structure of Pauli string simulation circuits to tailor a novel hardware architecture and compiler, leading to significant execution overhead reduction by up to 99%. While exploiting the high-level domain knowledge reveals significant optimization opportunities, our hardware/software framework is not tied to a particular program instance and can accommodate the full family of computational chemistry problems with such structure. We believe the co-design lessons of this study can be extended to other domains and hardware technologies to hasten the onset of quantum advantage.
DOI: 10.1109/ISCA52012.2021.00070
Designing calibration and expressivity-efficient instruction sets for quantum computing
作者: Lao, Lingling and Murali, Prakash and Martonosi, Margaret and Browne, Dan
关键词: quantum computing, instruction set architecture, compilation
Abstract
Near-term quantum computing (QC) systems have limited qubit counts, high gate (instruction) error rates, and typically support a minimal instruction set having one type of two-qubit gate (2Q). To reduce program instruction counts and improve application expressivity, vendors have proposed, and shown proof-of-concept demonstrations of richer instruction sets such as XY gates (Rigetti) and fSim gates (Google). These instruction sets comprise of families of 2Q gate types parameterized by continuous qubit rotation angles. That is, it allows a large set of different physical operations to be realized on the qubits, based on the input angles. However, having such a large number of gate types is problematic because each gate type has to be calibrated periodically, across the full system, to obtain high fidelity implementations. This results in substantial recurring calibration overheads even on current systems which use only a few gate types. Our work aims to navigate this tradeoff between application expressivity and calibration overhead, and identify what instructions vendors should implement to get the best expressivity with acceptable calibration time.Studying this tradeoff is challenging because of the diversity in QC application requirements, the need to optimize applications for widely different hardware gate types and noise variations across gate types. Therefore, our work develops NuOp, a flexible compilation pass based on numerical optimization, to efficiently decompose application operations into arbitrary hardware gate types. Using NuOp and four important quantum applications, we study the instruction set proposals of Rigetti and Google, with realistic noise simulations and a calibration model. Our experiments show that implementing 4–8 types of 2Q gates is sufficient to attain nearly the same expressivity as a full continuous gate family, while reducing the calibration overhead by two orders of magnitude. With several vendors proposing rich gate families as means to higher fidelity, our work has potential to provide valuable instruction set design guidance for near-term QC systems.
DOI: 10.1109/ISCA52012.2021.00071
Albireo: energy-efficient acceleration of convolutional neural networks via silicon photonics
作者: Shiflett, Kyle and Karanth, Avinash and Bunescu, Razvan and Louri, Ahmed
关键词: silicon photonics, optical computing, hardware acceleration, deep neural networks
Abstract
With the end of Dennard scaling, highly-parallel and specialized hardware accelerators have been proposed to improve the throughput and energy-efficiency of deep neural network (DNN) models for various applications. However, collective data movement primitives such as multicast and broadcast that are required for multiply-and-accumulate (MAC) computation in DNN models are expensive, and require excessive energy and latency when implemented with electrical networks. This consequently limits the scalability and performance of electronic hardware accelerators. Emerging technology such as silicon photonics can inherently provide efficient implementation of multicast and broadcast operations, making photonics more amenable to exploiting parallelism within DNN models. Moreover, when coupled with other unique features such as low energy consumption, high channel capacity with wavelength-division multiplexing (WDM), and high speed, silicon photonics could potentially provide a viable technology for scaling DNN acceleration. In this paper, we propose Albireo, an analog photonic architecture for scaling DNN acceleration. By characterizing photonic devices such as microring resonators (MRRs) and Mach-Zehnder modulators (MZMs) using photonic simulators, we develop realistic device models and outline their capability for system-level acceleration. Using the device models, we develop an efficient broadcast combined with multicast data distribution by leveraging parameter sharing through unique WDM dot product processing. We evaluate the energy and throughput performance of Albireo on DNN models such as ResNet18, MobileNet and VGG16. When compared to current state-of-the-art electronic accelerators, Albireo increases throughput by 110X, and improves energy-delay product (EDP) by an average of 74X with current photonic devices. Furthermore, by considering moderate and aggressive photonic scaling, the proposed Albireo design shows that EDP can be reduced by at least 229X.
DOI: 10.1109/ISCA52012.2021.00072
IntroSpectre: a pre-silicon framework for discovery and analysis of transient execution vulnerabilities
作者: Ghaniyoun, Moein and Barber, Kristin and Zhang, Yinqian and Teodorescu, Radu
关键词: No keywords
Abstract
Transient execution vulnerabilities originate in the extensive speculation implemented in modern high-performance microprocessors. Identifying all possible vulnerabilities in complex designs is very challenging. One of the challenges stems from the lack of visibility into the transient micro-architectural state of the processor. Prior work has used covert channels to identify data leakage from transient state, which limits the systematic discovery of all potential leakage sources. This paper presents INTROSPECTRE, a pre-silicon framework for early discovery of transient execution vulnerabilities. INTROSPECTRE addresses the lack of visibility into the micro-architectural processor state by integrating into the register transfer level (RTL) design flow, gaining full access to the internal state of the processor. Full visibility into the processor state enables INTROSPECTRE to perform a systematic leakage analysis that includes all micro-architectural structures, allowing it to identify potential leakage that may not be reachable with known side channels. We implement INTROSPECTRE on an RTL simulator and use it to perform transient leakage analysis on the RISC-V BOOM processor. We identify multiple transient leakage scenarios, most of which had not been highlighted on this processor design before.
DOI: 10.1109/ISCA52012.2021.00073
Maya: using formal control to obfuscate power side channels
作者: Pothukuchi, Raghavendra Pradyumna and Pothukuchi, Sweta Yamini and Voulgaris, Petros G. and Schwing, Alexander and Torrellas, Josep
关键词: power side channels, physical side channels, obfuscation, machine learning, control theory
Abstract
The security of computers is at risk because of information leaking through their power consumption. Attackers can use advanced signal measurement and analysis to recover sensitive data from this side channel. To address this problem, this paper presents Maya, a simple and effective defense against power side channels. The idea is to use formal control to re-shape the power dissipated by a computer in an application-transparent manner, preventing attackers from learning any information about the applications that are running. With formal control, a controller can reliably keep power close to a desired target function even when runtime conditions change unpredictably. By selecting the target function intelligently, the controller can make power follow any desired shape, appearing to carry activity information which, in reality, is unrelated to the application. Maya can be implemented in privileged software, firmware, or simple hardware. In this paper, we implement Maya on three machines using privileged threads only, and show its effectiveness and ease of deployment. Maya has already thwarted a newly-developed remote power attack.
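The reshaping idea can be illustrated with a toy simulation (a sketch only: the plant model, gains, and target trace below are invented, and Maya itself relies on formal robust control rather than this simple proportional-integral loop). A controller that only observes measured power adjusts an actuator, such as balloon-thread intensity or a DVFS-like knob, so that power tracks an application-independent target shape:

import math, random

def simulate(steps=200, kp=0.6, ki=0.002):
    actuator, integ = 0.0, 0.0
    trace = []
    for t in range(steps):
        target = 60 + 15 * math.sin(2 * math.pi * t / 50)           # chosen target shape (sine)
        activity = 20 if (t // 40) % 2 == 0 else 5                  # "secret" application phases
        measured = 20 + activity + actuator + random.gauss(0, 0.5)  # toy power model (W)
        err = target - measured
        integ += err
        actuator = max(0.0, actuator + kp * err + ki * integ)       # incremental PI update, clamped
        trace.append((t, target, measured))
    return trace

for t, tgt, meas in simulate()[::40]:
    print(f"t={t:3d}  target={tgt:5.1f} W  measured={meas:5.1f} W")

After a few steps the measured trace follows the target regardless of which activity phase the application is in, which is the property that hides the application's power signature from an observer of the power rail.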
DOI: 10.1109/ISCA52012.2021.00074
Demystifying the system vulnerability stack: transient fault effects across the layers
作者: Papadimitriou, George and Gizopoulos, Dimitris
关键词: transient faults, system vulnerability stack, silent data corruptions, microprocessors, microarchitecture-level fault injection, crash
Abstract
In this paper, we revisit the system vulnerability stack for transient faults. We reveal severe pitfalls in widely used vulnerability measurement approaches, which separate the hardware and the software layers. We rely on microarchitecture-level fault injection to derive very tight full-system vulnerability measurements. For our architectural and microarchitectural measurements, we employ GeFIN, a state-of-the-art fault injector built on top of the gem5 simulator, while for software-level measurements we employ the LLFI fault injector. Analyzing two different Arm ISAs and two different microarchitectures for each ISA, we quantify the sources and the magnitude of error of architecture- and software-level vulnerability evaluation methods, which aim to reproduce the effects of hardware faults. We show that widely applied methodologies for system resilience evaluation fail to capture important fault manifestation and propagation aspects and lead to misleading findings, reporting vulnerability results opposite to those of a comprehensive cross-layer analysis. To justify the validity of our findings, we employ a state-of-the-art software-based fault tolerance technique and evaluate its impact at all layers through a case study. Our evaluation shows that although higher-level methods can report significant vulnerability improvements (up to 3.8x vulnerability reduction), the actual cross-layer vulnerability of the protected system can be degraded (increased) by up to 30% for the selected benchmarks. Our analysis firmly suggests that only accurate methodologies for full-system vulnerability evaluation of a microprocessor can guide informed transient fault protection decisions, either at the hardware or at the software layer.
DOI: 10.1109/ISCA52012.2021.00075
No-FAT: architectural support for low overhead memory safety checks
作者: Ziad, Mohamed Tarek Ibn and Arroyo, Miguel A. and Manzhosov, Evgeny and Piersma, Ryan and Sethumadhavan, Simha
关键词: systems security, spectre-V1, microarchitecture, memory safety, fuzzing, bounds checking
Abstract
Memory safety continues to be a significant software reliability and security problem, and low-overhead, low-complexity hardware solutions have eluded computer designers. In this paper, we explore a pathway to deployable memory safety defenses. Our technique builds on a recent trend in software: the use of binning memory allocators. We observe that if memory allocation sizes (e.g., malloc sizes) are made an architectural feature, then it is possible to overcome many of the thorny issues with traditional approaches to memory safety, such as compatibility with unsecured software and significant performance degradation. We show that our architecture, No-FAT, incurs an overhead of 8% on SPEC CPU2017 benchmarks, and our VLSI measurements show low power and area overheads. Finally, as No-FAT’s hardware is aware of the memory allocation sizes, it effectively mitigates certain speculative attacks (e.g., Spectre-V1) with no additional cost. When our solution is used for pre-deployment fuzz testing, it can improve fuzz testing bandwidth by an order of magnitude compared to state-of-the-art approaches.
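The key property the abstract relies on can be shown in a few lines: if every allocation in a region comes from the same size bin, an allocation's base and bound can be recomputed from any interior pointer with simple arithmetic, so no per-pointer metadata is needed. The region layout and sizes below are invented for illustration and are not No-FAT's actual encoding:

BIN_REGIONS = {           # region start address -> bin (allocation) size in bytes; invented layout
    0x10000000: 32,
    0x20000000: 64,
    0x30000000: 4096,
}
REGION_SPAN = 0x10000000  # each region covers this much of the address space

def bounds_check(ptr, access_size):
    region_start = ptr & ~(REGION_SPAN - 1)
    bin_size = BIN_REGIONS[region_start]
    base = ptr - ((ptr - region_start) % bin_size)   # recover the allocation base
    ok = ptr + access_size <= base + bin_size        # access stays inside the allocation
    return hex(base), ok

print(bounds_check(0x20000050, 16))   # ('0x20000040', True): fits in its 64-byte bin
print(bounds_check(0x20000078, 16))   # ('0x20000040', False): would spill into the next bin

Because the bound is derived from the allocator's binning rule rather than from fat pointers or shadow metadata, pointers remain ordinary machine words, which is central to the compatibility argument the abstract makes.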
DOI: 10.1109/ISCA52012.2021.00076
Ghost routing to enable oblivious computation on memory-centric networks
作者: Ro, Yeonju and Jin, Seongwook and Huh, Jaehyuk and Kim, John
关键词: routing, oblivious computation, memory-centric network
Abstract
With offloading of data to the cloud, ensuring privacy and securing data has become more important. However, encrypting data alone is insufficient as the memory address itself can leak sensitive information. In this work, we exploit a packetized memory interface to provide secure memory access and support oblivious computation in a system with multiple memory modules interconnected with a multi-hop, memory-centric network. While the memory address can be encrypted with a packetized memory interface, simply encrypting the address does not provide fully oblivious computation since coarse-grain memory access patterns can be leaked. In this work, we first propose a scalable encryption microarchitecture with source-based routing where the packet is only encrypted once at the source and latency overhead in intermediate routers is minimized. We then define secure routing in memory-centric networks to enable oblivious computation such that memory access patterns across the memory modules are completely obfuscated. We explore different naive secure routing algorithms to ensure oblivious computation, but they come with high performance overhead. To minimize performance overhead, we propose ghost packets, which replace dummy packets with existing network traffic. We also propose Ghost routing, which batches multiple ghost packets together to minimize the bandwidth loss of naive secure routing while exploiting random routing.
DOI: 10.1109/ISCA52012.2021.00077
QUAC-TRNG: high-throughput true random number generation using quadruple row activation in commodity DRAM chips
作者: Olgun, Ataberk and Patel, Minesh and Yağlıkçı, A. Giray
关键词: No keywords
Abstract
True random number generators (TRNG) sample random physical processes to create large amounts of random numbers for various use cases, including security-critical cryptographic primitives, scientific simulations, machine learning applications, and even recreational entertainment. Unfortunately, not every computing system is equipped with dedicated TRNG hardware, limiting the application space and security guarantees for such systems. To open the application space and enable security guarantees for the overwhelming majority of computing systems that do not necessarily have dedicated TRNG hardware (e.g., processing-in-memory systems), we develop QUAC-TRNG, a new high-throughput TRNG that can be fully implemented in commodity DRAM chips, which are key components in most modern systems. QUAC-TRNG exploits the new observation that a carefully-engineered sequence of DRAM commands activates four consecutive DRAM rows in rapid succession. This QUadruple ACtivation (QUAC) causes the bitline sense amplifiers to non-deterministically converge to random values when we activate four rows that store conflicting data because the net deviation in bitline voltage fails to meet reliable sensing margins. We experimentally demonstrate that QUAC reliably generates random values across 136 commodity DDR4 DRAM chips from one major DRAM manufacturer. We describe how to develop an effective TRNG (QUAC-TRNG) based on QUAC. We evaluate the quality of our TRNG using the commonly-used NIST statistical test suite for randomness and find that QUAC-TRNG successfully passes each test. Our experimental evaluations show that QUAC-TRNG reliably generates true random numbers with a throughput of 3.44 Gb/s (per DRAM channel), outperforming the state-of-the-art DRAM-based TRNG by 15.08X and 1.41X for basic and throughput-optimized versions, respectively. We show that QUAC-TRNG utilizes DRAM bandwidth better than the state-of-the-art, achieving up to 2.03X the throughput of a throughput-optimized baseline when scaling bus frequencies to 12 GT/s.
DOI: 10.1109/ISCA52012.2021.00078
A RISC-V in-network accelerator for flexible high-performance low-power packet processing
作者: Di Girolamo, Salvatore and Kurth, Andreas and Calotoiu, Alexandru and Benz, Thomas and Schneider, Timo and Beránek, Jakub
关键词: specialized architecture, sPIN, packet processing, in-network compute
Abstract
The ability to offload data and control tasks to the network is becoming increasingly important, especially since network speeds are growing faster than CPU frequencies. In-network compute alleviates the host CPU load by running tasks directly in the network, enabling additional computation/communication overlap and potentially improving overall application performance. However, sustaining the bandwidths provided by next-generation networks, e.g., 400 Gbit/s, can become a challenge. sPIN is a programming model for in-NIC compute, where users specify handler functions that are executed on the NIC for each incoming packet belonging to a given message or flow. It enables a CUDA-like acceleration, where the NIC is equipped with lightweight processing elements that process network packets in parallel. We investigate the architectural specialties that a sPIN NIC should provide to enable high-performance, low-power, and flexible packet processing. We introduce PsPIN, a first open-source sPIN implementation, based on a multi-cluster RISC-V architecture and designed according to the identified architectural specialties. We investigate the performance of PsPIN with cycle-accurate simulations, showing that it can process packets at 400 Gbit/s for several use cases, introducing minimal latencies (26 ns for 64 B packets) and occupying a total area of 18.5 mm2 (22 nm FDSOI).
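A back-of-envelope check makes the parallelism requirement concrete (it ignores Ethernet framing overhead; only the 400 Gbit/s, 64 B, and 26 ns figures come from the abstract, the rest is derived):

line_rate_bps = 400e9
pkt_bytes = 64
per_pkt_latency_ns = 26

inter_arrival_ns = pkt_bytes * 8 / line_rate_bps * 1e9
in_flight = per_pkt_latency_ns / inter_arrival_ns
print(f"packet inter-arrival time: {inter_arrival_ns:.2f} ns")             # ~1.28 ns
print(f"packets that must be in flight to hide 26 ns: ~{in_flight:.0f}")   # ~20

So roughly twenty handler executions must be in flight at once to keep up with 64 B packets at line rate, which is consistent with the multi-cluster, parallel-handler design described above.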
DOI: 10.1109/ISCA52012.2021.00079
Leaky buddies: cross-component covert channels on integrated CPU-GPU systems
作者: Dutta, Sankha Baran and Naghibijouybari, Hoda and Abu-Ghazaleh, Nael and Marquez, Andres and Barker, Kevin
关键词: No keywords
Abstract
Graphics Processing Units (GPUs) are ubiquitous components used across the range of today’s computing platforms, from phones and tablets, through personal computers, to high-end server-class platforms. With the increasing importance of graphics and video workloads, recent processors are shipped with GPU devices that are integrated on the same chip. Integrated GPUs share some resources with the CPU and, as a result, there is potential for microarchitectural attacks from the GPU to the CPU or vice versa. We consider the potential for covert channel attacks that arise either from shared microarchitectural components (such as caches) or through shared contention domains (e.g., shared buses). We illustrate these two types of channels by developing two reliable covert channel attacks. The first covert channel uses the shared LLC cache in Intel’s integrated GPU architectures. The second is a contention-based channel targeting the ring bus connecting the CPU and GPU to the LLC. This is the first demonstrated microarchitectural attack crossing the component boundary (GPU to CPU or vice versa). Cross-component channels introduce a number of new challenges that we had to overcome, since they occur across heterogeneous components that use different computation models and are interconnected using asymmetric memory hierarchies. We also exploit GPU parallelism to increase the bandwidth of the communication, even without relying on a common clock. The LLC-based channel achieves a bandwidth of 120 kbps with a low error rate of 2%, while the contention-based channel delivers up to 400 kbps with a 0.8% error rate. We also demonstrate a proof-of-concept prime-and-probe side channel attack that probes the full LLC from the GPU.
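To put the reported raw rates and error rates on a common footing, one can compute a standard binary-symmetric-channel capacity estimate; the raw numbers below come from the abstract, but the capacity figures are this estimate, not results from the paper:

import math

def bsc_capacity(raw_bps, err):
    # Shannon capacity of a binary symmetric channel with bit-error probability err
    h = -err * math.log2(err) - (1 - err) * math.log2(1 - err)
    return raw_bps * (1 - h)

print(f"LLC channel : ~{bsc_capacity(120e3, 0.02) / 1e3:.0f} kbps effective")    # ~103 kbps
print(f"ring channel: ~{bsc_capacity(400e3, 0.008) / 1e3:.0f} kbps effective")   # ~373 kbps

Even after discounting for errors, the contention-based ring channel remains several times faster than the cache-based one.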
DOI: 10.1109/ISCA52012.2021.00080
IChannels: exploiting current management mechanisms to create covert channels in modern processors
作者: Haj-Yahya, Jawad and Kim, Jeremie S. and Yağlıkçı, A. Giray
关键词: No keywords
Abstract
To operate efficiently across a wide range of workloads with varying power requirements, a modern processor applies different current management mechanisms, which briefly throttle instruction execution while they adjust voltage and frequency to accommodate power-hungry instructions (PHIs) in the instruction stream. Doing so 1) reduces the power consumption of non-PHI instructions in typical workloads and 2) optimizes system voltage regulators’ cost and area for the common use case while limiting current consumption when executing PHIs. However, these mechanisms may compromise a system’s confidentiality guarantees. In particular, we observe that multi-level side-effects of the throttling caused by PHI-related current management mechanisms can be detected by two different software contexts (i.e., sender and receiver) running on 1) the same hardware thread, 2) co-located Simultaneous Multi-Threading (SMT) threads, and 3) different physical cores. Based on these new observations on current management mechanisms, we develop a new set of covert channels, IChannels, and demonstrate them in real modern Intel processors (which span more than 70% of the entire client and server processor market). Our analysis shows that IChannels provides more than 24X the channel capacity of state-of-the-art power management covert channels. We propose practical and effective mitigations to each covert channel in IChannels by leveraging the insights we gain through a rigorous characterization of real systems.
DOI: 10.1109/ISCA52012.2021.00081
ZeRØ: zero-overhead resilient operation under pointer integrity attacks
作者: Ziad, Mohamed Tarek Ibn and Arroyo, Miguel A. and Manzhosov, Evgeny and Sethumadhavan, Simha
关键词: pointer integrity, memory safety, exploit mitigation, code-reuse defenses, caches
Abstract
A large class of today’s systems require high levels of availability and security. Unfortunately, state-of-the-art security solutions tend to induce crashes and raise exceptions when under attack, trading off availability for security. In this work, we propose ZeRØ, a zero-overhead approach to resilient operation under pointer integrity attacks.
DOI: 10.1109/ISCA52012.2021.00082
NN-baton: DNN workload orchestration and chiplet granularity exploration for multichip accelerators
作者: Tan, Zhanhong and Cai, Hongyu and Dong, Runpei and Ma, Kaisheng
关键词: scheduling, neural network, design space exploration, deep learning, chiplet, accelerator, MCM
Abstract
The revolution of machine learning poses an unprecedented demand for computation resources, pushing for more transistors on a single monolithic chip, which is not sustainable in the post-Moore era. Multichip integration of small functional dies, called chiplets, can reduce the manufacturing cost, improve the fabrication yield, and achieve die-level reuse for different system scales. DNN workload mapping and hardware design space exploration on such multichip systems are critical, but are missing at the current stage. This work provides a hierarchical and analytical framework to describe the DNN mapping on a multichip accelerator and analyze the communication overhead. Based on this framework, we propose an automatic tool called NN-Baton with a pre-design flow and a post-design flow. The pre-design flow aims to guide the chiplet granularity exploration with given area and performance budgets for the target workload. The post-design flow focuses on the workload orchestration on different computation levels - package, chiplet, and core - in the hierarchy. Compared to Simba, NN-Baton generates mapping strategies that save 22.5%~44% energy under the same computation and memory configurations. The architecture exploration demonstrates that area is a decisive factor for the chiplet granularity. For a 2048-MAC system under a 2 mm2 chiplet area constraint, the 4-chiplet implementation with 4 cores and 16 lanes of 8-wide vector-MACs is consistently the best computation allocation across several benchmarks. In contrast, the optimal memory allocation policy in the hierarchy typically depends on the neural network model.
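A quick sanity check of the computation allocation highlighted above, restating only numbers from the abstract, shows how the package/chiplet/core/lane hierarchy accounts for the full MAC budget:

chiplets, cores, lanes, vec_width = 4, 4, 16, 8
total_macs = chiplets * cores * lanes * vec_width
print(f"{chiplets} chiplets x {cores} cores x {lanes} lanes x {vec_width}-wide vector-MAC "
      f"= {total_macs} MACs")   # 2048, matching the studied system's budget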
DOI: 10.1109/ISCA52012.2021.00083
Snafu: an ultra-low-power, energy-minimal CGRA-generation framework and architecture
作者: Gobieski, Graham and Atli, Ahmet Oguz and Mai, Kenneth and Lucia, Brandon and Beckmann, Nathan
关键词: ultra-low power, reconfigurable computing, internet of things (IoT), energy-minimal design, dataflow, CGRA
Abstract
Ultra-low-power (ULP) devices are becoming pervasive, enabling many emerging sensing applications. Energy efficiency is paramount in these applications, as efficiency determines device lifetime in battery-powered deployments and performance in energy-harvesting deployments. Unfortunately, existing designs fall short because ASICs’ upfront costs are too high and prior ULP architectures are too inefficient or inflexible. We present SNAFU, the first framework to flexibly generate ULP coarse-grain reconfigurable arrays (CGRAs). SNAFU provides a standard interface for processing elements (PEs), making it easy to integrate new types of PEs for new applications. Unlike prior high-performance, high-power CGRAs, SNAFU is designed from the ground up to minimize energy consumption while maximizing flexibility. SNAFU saves energy by configuring PEs and routers for a single operation to minimize switching activity; by minimizing buffering within the fabric; by implementing a statically routed, bufferless, multi-hop network; and by executing operations in-order to avoid expensive tag-token matching. We further present SNAFU-ARCH, a complete ULP system that integrates an instantiation of the SNAFU fabric alongside a scalar RISC-V core and memory. We implement SNAFU in RTL and evaluate it on an industrial sub-28 nm FinFET process across a suite of common sensing benchmarks. SNAFU-ARCH operates at <1 mW, orders of magnitude less power than most prior CGRAs. SNAFU-ARCH uses 41% less energy and runs 4.4X faster than the prior state-of-the-art general-purpose ULP architecture. Moreover, we conduct three comprehensive case studies to quantify the cost of programmability in SNAFU. We find that SNAFU-ARCH is close to ASIC designs built in the same technology, using just 2.6X more energy on average.
DOI: 10.1109/ISCA52012.2021.00084
SARA: scaling a reconfigurable dataflow accelerator
作者: Zhang, Yaqi and Zhang, Nathan and Zhao, Tian and Vilim, Matt and Shahbaz, Muhammad and Olukotun, Kunle
关键词: scalability, plasticine, domain-specific compiler, RDA, CGRA
Abstract
The need for speed in modern data-intensive workloads and the rise of “dark silicon” in the semiconductor industry are pushing for larger, faster, and more energy- and area-efficient architectures, such as Reconfigurable Dataflow Accelerators (RDAs). Nevertheless, challenges remain in developing mechanisms to effectively utilize the compute power of these large-scale RDAs. To address these challenges, we present SARA, a compiler that employs a novel mapping strategy to efficiently utilize large-scale RDAs. Starting from a single-threaded imperative abstraction, SARA spatially maps a program onto an RDA’s distributed resources, exploiting dataflow parallelism within and across hyperblocks to saturate the compute throughput of an RDA. SARA introduces (a) compiler-managed memory consistency (CMMC), a control paradigm that hierarchically pipelines a nested and data-dependent control-flow graph onto a dataflow architecture, and (b) a compilation flow that decomposes the program graph across distributed heterogeneous resources to hide low-level RDA constraints from programmers. Our evaluation shows that SARA achieves close to perfect performance scaling on Plasticine, a recently proposed RDA. Over a mix of deep-learning, graph-processing, and streaming applications, SARA achieves a 1.9X geo-mean speedup over a Tesla V100 GPU using only 12% of the silicon area.
DOI: 10.1109/ISCA52012.2021.00085
HASCO: towards agile hardware and software co-design for tensor computation
作者: Xiao, Qingcheng and Zheng, Size and Wu, Bingzhe and Xu, Pengcheng and Qian, Xuehai and Liang, Yun
关键词: No keywords
Abstract
Tensor computations overwhelm traditional general-purpose computing devices due to their large amounts of data and operations. They call for a holistic solution composed of both hardware acceleration and software mapping. Hardware/software (HW/SW) co-design optimizes the hardware and software in concert and produces high-quality solutions. There are two main challenges in the co-design flow. First, multiple methods exist to partition tensor computation, each with a different impact on performance and energy efficiency. In addition, the hardware part must be implemented using the intrinsic functions of spatial accelerators, and it is hard for programmers to identify and analyze the partitioning methods manually. Second, the overall design space composed of HW/SW partitioning, hardware optimization, and software optimization is huge, and it needs to be efficiently explored. To this end, we propose an agile co-design approach, HASCO, that provides an efficient HW/SW solution to dense tensor computation. We use tensor syntax trees as the unified IR, based on which we develop a two-step approach to identify partitioning methods. For each method, HASCO explores the hardware and software design spaces. We propose different algorithms for the explorations, as they have distinct objectives and evaluation costs. Concretely, we develop a multi-objective Bayesian optimization algorithm to explore hardware optimization. For software optimization, we use heuristic and Q-learning algorithms. Experiments demonstrate that HASCO achieves a 1.25X to 1.44X latency reduction through HW/SW co-design compared with developing the hardware and software separately.
DOI: 10.1109/ISCA52012.2021.00086
SpZip: architectural support for effective data compression in irregular applications
作者: Yang, Yifan and Emer, Joel S. and Sanchez, Daniel
关键词: No keywords
Abstract
Irregular applications, such as graph analytics and sparse linear algebra, exhibit frequent indirect, data-dependent accesses to single or short sequences of elements that cause high main memory traffic and limit performance. Data compression is a promising way to accelerate irregular applications by reducing memory traffic. However, software compression adds substantial overheads, and prior hardware compression techniques work poorly on the complex access patterns of irregular applications. We present SpZip, an architectural approach that makes data compression practical for irregular algorithms. SpZip accelerates the traversal, decompression, and compression of the data structures used by irregular applications. In addition, these activities run in a decoupled fashion, hiding both memory access and decompression latencies. To support the wide range of access patterns in these applications, SpZip is programmable, and uses a novel Dataflow Configuration Language to specify programs that traverse and generate compressed data. Our SpZip implementation leverages dataflow execution and time-multiplexing to implement programmability cheaply. We evaluate SpZip on a simulated multicore system running a broad set of graph and linear algebra algorithms. SpZip outperforms prior state-of-the-art software-only (hardware-accelerated) systems by gmean 3.0X (1.5X) and reduces memory traffic by 1.7X (1.4X). These benefits stem from both reducing data movement due to compression, and offloading expensive traversal and (de)compression operations.
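A software analogue of the decoupled traverse-and-decompress idea may help fix intuition (the paper's Dataflow Configuration Language and hardware fetchers are not reproduced here; the delta encoding and the tiny graph below are invented): a producer walks a compressed sparse structure and decompresses it ahead of the consumer, which only ever sees plain indices.

row_ptr = [0, 3, 5, 9]                     # CSR row pointers for a tiny 3-vertex graph
deltas  = [2, 3, 4,  1, 6,  0, 2, 2, 5]    # delta-encoded column indices per row

def neighbors(v):
    # the generator plays the role of the decoupled fetcher: it decompresses
    # (prefix-sums) the deltas lazily, so the consumer never touches the compressed form
    col = 0
    for d in deltas[row_ptr[v]:row_ptr[v + 1]]:
        col += d
        yield col

for v in range(3):
    print(v, list(neighbors(v)))   # 0 [2, 5, 9]   1 [1, 7]   2 [0, 2, 4, 9]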
DOI: 10.1109/ISCA52012.2021.00087
Dual-side sparse tensor core
作者: Wang, Yang and Zhang, Chen and Xie, Zhiqiang and Guo, Cong and Liu, Yunxin and Leng, Jingwen
关键词: pruning, neural networks, graphics processing units, general sparse matrix-matrix multiplication, convolution
Abstract
Leveraging sparsity in deep neural network (DNN) models is promising for accelerating model inference. Yet existing GPUs can only leverage the sparsity from weights but not activations, which are dynamic, unpredictable, and hence challenging to exploit. In this work, we propose a novel architecture to efficiently harness the dual-side sparsity (i.e., weight and activation sparsity). We take a systematic approach to understand the (dis)advantages of previous sparsity-related architectures and propose a novel, unexplored paradigm that combines an outer-product computation primitive and a bitmap-based encoding format. We demonstrate the feasibility of our design with minimal changes to the existing production-scale inner-product-based Tensor Core. We propose a set of novel ISA extensions and co-design the matrix-matrix multiplication and convolution algorithms, which are the two dominant computation patterns in today’s DNN models, to exploit our new dual-side sparse Tensor Core. Our evaluation shows that our design can fully unleash the dual-side DNN sparsity and improve the performance by up to one order of magnitude with small hardware overhead.
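The outer-product-plus-bitmap paradigm can be sketched in a few lines of NumPy (a functional illustration only, not the proposed Tensor Core datapath; in hardware the nonzero positions would come from bitmaps rather than index arrays):

import numpy as np

def pack(vec):
    idx = np.flatnonzero(vec)        # positions a bitmap would encode
    return idx, vec[idx]             # packed nonzero values

def dual_sparse_matmul(A, B):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for k in range(K):                   # outer-product formulation: one rank-1 update per k
        rows, a_vals = pack(A[:, k])     # sparse column of A
        cols, b_vals = pack(B[k, :])     # sparse row of B
        C[np.ix_(rows, cols)] += np.outer(a_vals, b_vals)   # only matched nonzeros do work
    return C

rng = np.random.default_rng(0)
A = rng.random((4, 6)) * (rng.random((4, 6)) > 0.6)   # both operands roughly 60% sparse
B = rng.random((6, 5)) * (rng.random((6, 5)) > 0.6)
assert np.allclose(dual_sparse_matmul(A, B), A @ B)
print("dual-side sparse outer-product result matches dense matmul")

One common motivation for the outer-product form is that the work per step scales with the product of the two nonzero counts, so sparsity on either side directly removes multiplications, whereas an inner-product pipeline must first search for matching positions.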
DOI: 10.1109/ISCA52012.2021.00088
RingCNN: exploiting algebraically-sparse ring tensors for energy-efficient CNN-based computational imaging
作者: Huang, Chao-Tsung
关键词: regular sparsity, hardware accelerator, convolutional neural network, computational imaging
Abstract
In the era of artificial intelligence, convolutional neural networks (CNNs) are emerging as a powerful technique for computational imaging. They have shown superior quality for reconstructing fine textures from badly-distorted images and have the potential to bring next-generation cameras and displays to our daily life. However, CNNs demand intensive computing power for generating high-resolution videos and defy conventional sparsity techniques when rendering dense details. Therefore, finding new possibilities in regular sparsity is crucial to enable large-scale deployment of CNN-based computational imaging. In this paper, we consider a fundamental yet under-explored approach, algebraic sparsity, for energy-efficient CNN acceleration. We propose to build CNN models based on ring algebra that defines multiplication, addition, and non-linearity for n-tuples properly. The essential sparsity then immediately follows, e.g. an n-times reduction in the number of real-valued weights. We define and unify several variants of ring algebras into a modeling framework, RingCNN, and make comparisons in terms of image quality and hardware complexity. On top of that, we further devise a novel ring algebra which minimizes complexity with component-wise products and achieves the best quality using directional ReLU. Finally, we design an accelerator, eRingCNN, to accommodate the proposed ring algebra, in particular with regular ring-convolution arrays for efficient inference and on-the-fly directional ReLU blocks for fixed-point computation. We implement two configurations, n = 2 and 4 (50% and 75% sparsity), with 40 nm technology to support advanced denoising and super-resolution at up to 4K UHD 30 fps. Layout results show that they can deliver an equivalent 41 TOPS using 3.76 W and 2.22 W, respectively. Compared to the real-valued counterpart, our ring convolution engines for n = 2 achieve 2.00X energy efficiency and 2.08X area efficiency with similar or even better image quality. With n = 4, the efficiency gains in energy and area further increase to 3.84X and 3.77X with only a 0.11 dB drop in peak signal-to-noise ratio (PSNR). The results show that RingCNN exhibits great architectural advantages, providing near-maximum hardware efficiency and graceful quality degradation simultaneously.
DOI: 10.1109/ISCA52012.2021.00089
GoSPA: an energy-efficient high-performance globally optimized sparse convolutional neural network accelerator
作者: Deng, Chunhua and Sui, Yang and Liao, Siyu and Qian, Xuehai and Yuan, Bo
关键词: sparse, hardware accelerator, convolution, CNN, ASIC
Abstract
The co-existence of activation sparsity and model sparsity in convolutional neural network (CNN) models makes sparsity-aware CNN hardware designs very attractive. The existing sparse CNN accelerators utilize an intersection operation to search and identify the key positions of the matched entries between two sparse vectors, and hence avoid unnecessary computations. However, these state-of-the-art designs still suffer from three major architecture-level drawbacks: 1) the hardware cost of the intersection operation is high; 2) the computation phase stalls frequently due to the strong data dependency between the intersection and computation phases; and 3) the explicit intersection operation incurs unnecessary data transfer. By leveraging the knowledge of the complete sparse 2-D convolution, this paper proposes two key ideas that overcome all three drawbacks. First, an implicit on-the-fly intersection is proposed to realize the optimal solution for intersection between one static stream and one dynamic stream, which is the case for sparse neural network inference. Second, by leveraging the global computation structure of 2-D convolution, we propose a specialized computation reordering to ensure that each activation is transferred only if necessary and only once. Based on these two key ideas, we develop GoSPA, an energy-efficient high-performance Globally Optimized SParse CNN Accelerator. GoSPA is implemented with CMOS 28nm technology. Compared with the state-of-the-art sparse CNN architecture, GoSPA achieves average speedups of 1.38X, 1.28X, 1.23X, 1.17X, 1.21X and 1.28X on AlexNet, VGG, GoogLeNet, MobileNet, ResNet and ResNeXt workloads, respectively. GoSPA also achieves 5.38X, 4.96X, 4.79X, 5.02X, 4.86X and 2.06X energy efficiency improvements on AlexNet, VGG, GoogLeNet, MobileNet, ResNet and ResNeXt, respectively. In a more comprehensive comparison including DRAM access, GoSPA also shows significant performance improvement over the existing designs.
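The implicit, on-the-fly intersection between a static stream (weights, fixed after deployment) and a dynamic stream (activations) can be sketched as follows; the data layout is flattened to 1-D and the numbers are invented, so this conveys only the idea, not GoSPA's datapath:

static_weights = {0: 0.5, 3: -1.2, 4: 0.7, 9: 2.0}   # nonzero weight positions, preprocessed once

def process_activation_stream(stream):
    acc = 0.0
    for pos, act in stream:                  # dynamic nonzero activations as they arrive
        w = static_weights.get(pos)          # implicit intersection: one lookup, no separate phase
        if w is not None:
            acc += w * act                   # only matched pairs reach the multipliers
    return acc

activations = [(3, 0.25), (5, 1.0), (9, -0.5)]        # (position, value) pairs
print(process_activation_stream(activations))          # approx. -1.3

Because the static side is fixed, it can be preprocessed into whatever lookup structure the hardware prefers, so the dynamic side never stalls waiting for an explicit intersection result, which is the dependency the abstract identifies as a bottleneck in prior designs.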
DOI: 10.1109/ISCA52012.2021.00090