ASPLOS 2025 | 逸翎清晗🌈

Accelerating Number Theoretic Transform with Multi-GPU Systems for Efficient Zero Knowledge Proof

Authors: Ji, Zhuoran and Zhao, Jianyu and Gao, Peimin and Yin, Xiangkai and Ju, Lei
Keywords: multi-gpu systems, number theoretic transform, zero knowledge proof

Abstract

Zero-knowledge proofs validate statements without revealing any information, pivotal for applications such as verifiable outsourcing and digital currencies. However, their broad adoption is limited by the prolonged proof generation times, mainly due to two operations: Multi-Scalar Multiplication (MSM) and Number Theoretic Transform (NTT). While MSM has been efficiently accelerated using multi-GPU systems, NTT has not, due to the high inter-GPU communication overhead incurred by its permutation data access pattern. This paper identifies the necessity of multi-GPU NTT support for end-to-end proof generation. It introduces UniNTT, an NTT algorithm tailored for multi-GPU systems. The data access pattern of NTT incurs communication across all levels of the multi-GPU hierarchy (i.e., warp, thread block, GPU, and multi-GPU), complicating the implementation of multi-GPU NTT. To this end, UniNTT proposes a novel, overhead-free decomposition approach that recursively decomposes an NTT into smaller NTTs, enabling all hierarchy levels to execute the same NTT computations at different scales. It promotes a uniform design of NTT optimizations based on an abstract hardware model, which are then tailored and applied to different levels of the hierarchy. UniNTT not only simplifies the optimization process but also shows that optimizations typically specific to one level can also be effectively generalized to others. Experiments show that UniNTT achieves an average 4.26×…
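
The recursive "one big NTT into smaller NTTs" idea can be illustrated with the classic four-step (Cooley-Tukey) decomposition. The sketch below is a generic Python illustration, not UniNTT's own overhead-free scheme; the toy prime `P`, the root of unity, and the helper names are assumptions made for the example.

```python
# A minimal, generic sketch of splitting one NTT into smaller NTTs
# (classic four-step Cooley-Tukey decomposition); UniNTT's own
# overhead-free decomposition is not reproduced here.
P = 17  # toy NTT-friendly prime; 9 is a primitive 8th root of unity mod 17

def ntt(a, w, p=P):
    """Naive O(n^2) NTT of `a` using primitive n-th root of unity `w` mod `p`."""
    n = len(a)
    return [sum(a[j] * pow(w, i * j, p) for j in range(n)) % p for i in range(n)]

def ntt_decomposed(a, w, n1, n2, p=P):
    """Length n1*n2 NTT expressed as n1-point and n2-point NTTs plus twiddles."""
    assert len(a) == n1 * n2
    # Step 1: view the input as an n1 x n2 matrix, with index j = j1 + n1*j2.
    mat = [[a[j1 + n1 * j2] for j2 in range(n2)] for j1 in range(n1)]
    # Step 2: n2-point NTT on every row (root w^n1).
    mat = [ntt(row, pow(w, n1, p), p) for row in mat]
    # Step 3: multiply by twiddle factors w^(j1*k2).
    mat = [[mat[j1][k2] * pow(w, j1 * k2, p) % p for k2 in range(n2)]
           for j1 in range(n1)]
    # Step 4: n1-point NTT down every column (root w^n2).
    cols = [ntt([mat[j1][k2] for j1 in range(n1)], pow(w, n2, p), p)
            for k2 in range(n2)]
    # Step 5: read the output with index k = k2 + n2*k1.
    return [cols[k2][k1] for k1 in range(n1) for k2 in range(n2)]

a = list(range(8))
assert ntt_decomposed(a, 9, 2, 4) == ntt(a, 9)  # both compute the same 8-point NTT
```

The appeal for a hardware hierarchy is that the smaller row and column NTTs have exactly the same structure as the outer one, so the same kernel design can be reused at each level.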

DOI: 10.1145/3669940.3707241


Accelerating Retrieval-Augmented Generation

Authors: Quinn, Derrick and Nouri, Mohammad and Patel, Neel and Salihu, John and Salemi, Alireza and Lee, Sukhan and Zamani, Hamed and Alian, Mohammad
Keywords: database acceleration, dense retrieval, retrieval-augmented generation (rag)

Abstract

An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4–27.9×…
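
The "exact retrieval" being accelerated is, at its core, an exhaustive nearest-neighbor scan over document embeddings. The sketch below shows that CPU baseline in plain numpy (inner-product top-k); IKS's CXL hardware and its interface are not modeled, and the array shapes are made-up examples.

```python
import numpy as np

def exact_top_k(query, doc_embeddings, k=5):
    """Brute-force (exact) nearest-neighbor search by inner product.

    This is the kind of scan a near-memory accelerator would offload;
    it touches every document vector exactly once.
    """
    scores = doc_embeddings @ query          # one dot product per document
    top = np.argpartition(-scores, k)[:k]    # unordered top-k candidates
    return top[np.argsort(-scores[top])]     # sorted by score, best first

rng = np.random.default_rng(0)
docs = rng.standard_normal((100_000, 768)).astype(np.float32)  # toy corpus
q = rng.standard_normal(768).astype(np.float32)
print(exact_top_k(q, docs, k=5))  # indices of the 5 best documents
```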

DOI: 10.1145/3669940.3707264


AnA: An Attentive Autonomous Driving System

Authors: Choe, Wonkyo and Wang, Rongxiang and Lin, Felix Xiaozhu
Keywords: autonomous driving systems, edge computing

Abstract

In an autonomous driving system (ADS), the perception module is crucial to driving safety and efficiency. Unfortunately, the perception in today's ADS remains oblivious to driving decisions, in contrast to how humans drive. Our idea is to refactor the ADS so that (1) the ADS guides its perception with the driving knowledge in situ; (2) the perception differentiates between awareness and attention. We propose a system called AnA with three novel mechanisms: (1) a query interface for the planning to express its interest in perception; (2) a query executor that maps queries to an optimal set of perception tasks; (3) a monitor for handling abnormal task executions with driving knowledge. On challenging driving benchmarks, AnA outperforms competitive baselines: it responds to adversarial events in a timely manner, reducing collisions by 2x; it reduces compute usage by 44% without compromising driving safety. We attribute AnA's efficacy to its attentive driving, a human-like behavior that improves resource proportionality.

DOI: 10.1145/3669940.3707261


AnyKey: A Key-Value SSD for All Workload Types

Authors: Park, Chanyoung and Lee, Jungho and Liu, Chun-Yi and Kang, Kyungtae and Kandemir, Mahmut Taylan and Choi, Wonil
Keywords: key-value solid-state drive, log-structured merge-tree, storage management software, tail latency

Abstract

Key-value solid-state drives (KV-SSDs) are considered a potential storage solution for large-scale key-value (KV) store applications. Unfortunately, the existing KV-SSD designs are tuned for a specific type of workload, namely, those in which the sizes of the values are much larger than the sizes of the keys. Interestingly, there also exists, in practice, another type of workload in which the sizes of keys are relatively large. We re-evaluate the current KV-SSD designs using such unexplored workloads and document their significantly degraded performance. Observing that the performance problem stems from the increased size of the metadata, we subsequently propose a novel KV-SSD design, called AnyKey, which prevents the size of the metadata from increasing under varying sizes of keys. Our detailed evaluation using a wide range of real-life workloads indicates that AnyKey outperforms the state-of-the-art KV-SSD design under different types of workloads with varying sizes of keys and values.

DOI: 10.1145/3669940.3707279


ARC: Warp-level Adaptive Atomic Reduction in GPUs to Accelerate Differentiable Rendering

Authors: Durvasula, Sankeerth and Zhao, Adrian and Chen, Fan and Liang, Ruofan and Sanjaya, Pawan Kumar and Guan, Yushi and Giannoula, Christina and Vijaykumar, Nandita
Keywords: atomics, differentiable rendering, gaussian splatting, graphics processing unit, machine learning

Abstract

Differentiable rendering is widely used in emerging applications that represent any 3D scene as a model trained using gradient descent from 2D images. Recent works (e.g., 3D Gaussian Splatting) use rasterization to enable rendering photo-realistic imagery at high speeds from these learned 3D models. These rasterization-based differentiable rendering methods have been demonstrated to be very promising, providing state-of-the-art quality for various important tasks. However, training a model to represent a scene is still time-consuming even on powerful GPUs. In this work, we observe that the gradient computation step during model training is a significant bottleneck due to the large number of atomic operations. These atomics overwhelm the atomic units in the L2 cache of GPUs, causing long stalls. To address this, we leverage the observations that during gradient computation: (1) for most warps, all threads atomically update the same memory locations; and (2) warps generate varying amounts of atomic traffic. We propose ARC, a primitive that accelerates atomic operations based on two key ideas. First, we enable warp-level reduction at the GPU cores using registers to leverage the locality in intra-warp atomic updates. Second, we distribute atomic computation between the cores and the L2 atomic units to increase the throughput of atomic computation. We propose two implementations of ARC: ARC-HW, a hardware-based approach, and ARC-SW, a software-only approach. We demonstrate significant speedups with ARC of 2.6×…
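
The first key idea, reducing within a warp before touching the L2 atomic units, can be simulated in a few lines. This is a Python thought experiment counting atomic operations, not the CUDA/hardware primitive itself; the address, values, and warp size are illustrative assumptions.

```python
from collections import Counter

def atomics_without_arc(warp_updates):
    """Every thread issues its own atomic add: one L2 atomic per update."""
    memory, atomics = Counter(), 0
    for addr, val in warp_updates:
        memory[addr] += val
        atomics += 1
    return memory, atomics

def atomics_with_warp_reduction(warp_updates):
    """ARC-style idea: threads in a warp that target the same address first
    reduce in registers, then issue a single atomic per distinct address."""
    reduced = Counter()
    for addr, val in warp_updates:        # intra-warp, register-level reduction
        reduced[addr] += val
    memory, atomics = Counter(), 0
    for addr, val in reduced.items():     # one atomic per distinct address
        memory[addr] += val
        atomics += 1
    return memory, atomics

# A 32-thread warp where every thread updates the same gradient slot.
warp = [(0x1000, 1.0)] * 32
assert atomics_without_arc(warp)[0] == atomics_with_warp_reduction(warp)[0]
print(atomics_without_arc(warp)[1], "atomics vs",
      atomics_with_warp_reduction(warp)[1])   # 32 atomics vs 1
```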

DOI: 10.1145/3669940.3707238


Automatic Tracing in Task-Based Runtime Systems

Authors: Yadav, Rohan and Bauer, Michael and Broman, David and Garland, Michael and Aiken, Alex and Kjolstad, Fredrik
Keywords: dynamic analysis, runtime systems, tracing

Abstract

Implicitly parallel task-based runtime systems often perform dynamic analysis to discover dependencies in and extract parallelism from sequential programs. Dependence analysis becomes expensive as task granularity drops below a threshold. Tracing techniques have been developed where programmers annotate repeated program fragments (traces) issued by the application, and the runtime system memoizes the dependence analysis for those fragments, greatly reducing overhead when the fragments are executed again. However, manual trace annotation can be brittle and not easily applicable to complex programs built through the composition of independent components. We introduce Apophenia, a system that automatically traces the dependence analysis of task-based runtime systems, removing the burden of manual annotations from programmers and enabling new and complex programs to be traced. Apophenia identifies traces dynamically through a series of dynamic string analyses, which find repeated program fragments in the stream of tasks issued to the runtime system. We show that Apophenia is able to come within 0.92x–1.03x of the performance of manually traced programs, and is able to effectively trace previously untraced programs to yield speedups of between 0.91x–2.82x on the Perlmutter and Eos supercomputers.
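
Finding a repeated fragment in the task stream is essentially a string-matching problem over task identifiers. The toy function below detects a trailing repeat in a list of task names; it is a stand-in for the idea only, not Apophenia's actual (far more efficient) dynamic string analysis, and the function name and task names are invented.

```python
def find_repeated_trace(task_stream, min_len=2):
    """Return (start, length) if the most recent `length` tasks exactly repeat
    the `length` tasks before them; otherwise None.

    A toy stand-in for the dynamic string analysis described in the abstract;
    a real system would use incremental, near-linear-time matching.
    """
    n = len(task_stream)
    for length in range(n // 2, min_len - 1, -1):   # prefer the longest repeat
        if task_stream[n - length:] == task_stream[n - 2 * length:n - length]:
            return n - length, length
    return None

stream = ["halo", "grad", "apply", "halo", "grad", "apply"]
print(find_repeated_trace(stream))  # (3, 3): the last 3 tasks repeat the prior 3
```

Once such a repeat is found, the runtime can memoize the dependence analysis for that fragment and replay it on the next occurrence.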

DOI: 10.1145/3669940.3707237


BatchZK: A Fully Pipelined GPU-Accelerated System for Batch Generation of Zero-Knowledge Proofs

Authors: Lu, Tao and Chen, Yuxun and Wang, Zonghui and Wang, Xiaohang and Chen, Wenzhi and Zhang, Jiaheng
Keywords: gpu acceleration, linear-time encoder, merkle tree, pipeline, sumcheck protocol, zero-knowledge proof

Abstract

Zero-knowledge proof (ZKP) is a cryptographic primitive that enables one party to prove the validity of a statement to other parties without disclosing any secret information. With its widespread adoption in applications such as blockchain and verifiable machine learning, the demand for generating zero-knowledge proofs has increased dramatically. In recent years, considerable efforts have been directed toward developing GPU-accelerated systems for proof generation. However, these previous systems only explored efficiently generating a single proof by reducing latency rather than batch generation to provide high throughput. We propose a fully pipelined GPU-accelerated system for batch generation of zero-knowledge proofs. Our system has three features to improve throughput. First, we design a pipelined approach that enables each GPU thread to continuously execute its designated proof generation task without being idle. Second, our system supports recent efficient ZKP protocols with their computational modules: sum-check protocol, Merkle tree, and linear-time encoder. We customize these modules to fit our pipelined execution. Third, we adopt a dynamic loading method for the data required for proof generation, reducing the required device memory. Moreover, multi-stream technology enables the overlap of data transfers and GPU computations, reducing overhead caused by data exchanges between host and device memory. We implement our system and evaluate it on various GPU cards. The results show that our system achieves more than 259.5×…
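
One of the three computational modules named above is the Merkle tree. For readers unfamiliar with it, here is a plain single-threaded reference version in Python; the pipelined, batched GPU customization described in the abstract is not modeled, and the leaf contents are arbitrary examples.

```python
import hashlib

def merkle_root(leaves):
    """Build a binary Merkle tree over `leaves` (bytes) and return its root.

    Single-threaded reference for the Merkle-tree module mentioned in the
    abstract; a GPU pipeline would hash many such trees concurrently.
    """
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"commitment-%d" % i for i in range(8)])
print(root.hex())
```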

DOI: 10.1145/3669940.3707270


ByteFS: System Support for (CXL-based) Memory-Semantic Solid-State Drives

Authors: Li, Shaobo and Zhou, Yirui (Eric) and Ren, Hao and Huang, Jian
Keywords: byte-addressable ssd, cxl, file system, memory-semantic storage

Abstract

Unlike non-volatile memory that resides on the processor memory bus, memory-semantic solid-state drives (SSDs) support both byte and block access granularity via PCIe or CXL interconnects. They provide scalable memory capacity using NAND flash at a much lower cost. In addition, they have different performance characteristics for their dual byte/block interfaces, while offering essential memory semantics for upper-level software. Such a byte-accessible storage device provides new implications for software system design. In this paper, we develop a new file system, named ByteFS, by rethinking the design primitives of file systems and SSD firmware to exploit the advantages of both byte and block-granular data accesses. ByteFS supports byte-granular data persistence to retain the persistence nature of SSDs. It extends the core data structure of file systems by enabling dual byte/block-granular data accesses. To facilitate the support for byte-granular writes, ByteFS manages the internal DRAM of SSD firmware in a log-structured manner and enables data coalescing to reduce unnecessary I/O traffic to flash chips. ByteFS also enables coordinated data caching between the host page cache and the SSD cache to best utilize the precious memory resource. We implement ByteFS on both a real programmable SSD and an emulated memory-semantic SSD for sensitivity study. Compared to state-of-the-art file systems for non-volatile memory and conventional SSDs, ByteFS outperforms them by up to 2.7×…

DOI: 10.1145/3669940.3707250


Cinnamon: A Framework for Scale-Out Encrypted AI

Authors: Jayashankar, Siddharth and Chen, Edward and Tang, Tom and Zheng, Wenting and Skarlatos, Dimitrios
Keywords: accelerators, encrypted ai, fully homomorphic encryption, parallelism

Abstract

Fully homomorphic encryption (FHE) is a promising cryptographic solution that enables computation on encrypted data, but its adoption remains a challenge due to steep performance overheads. Although recent FHE architectures have made valiant efforts to narrow the performance gap, they not only have massive monolithic chip designs but also only target small ML workloads. We present Cinnamon, a framework for accelerating state-of-the-art ML workloads that are encrypted using FHE. Cinnamon accelerates encrypted computing by exploiting parallelism at all levels of a program, using novel algorithms, compilers, and hardware techniques to create a scale-out design for FHE as opposed to a monolithic chip design. Our evaluation of the Cinnamon framework on small programs shows a 2.3×…

DOI: 10.1145/3669940.3707260


ClosureX: Compiler Support for Correct Persistent Fuzzing

Authors: Ranjan, Rishi and Paterson, Ian and Hicks, Matthew
Keywords: fuzzing, security and privacy, software testing, system security

Abstract

Fuzzing is a widely adopted and pragmatic methodology for bug hunting as a means of software hardening. Research reveals that increasing fuzzing throughput directly increases the bug discovery rate. The highest performance fuzzing strategy is persistent fuzzing, which reuses a single process for all test cases by looping back to the start upon completion, instead of exiting. This eliminates all process creation, initialization, and tear-down costs—which are on par with execution cost. Unfortunately, persistent fuzzing leads to semantically inconsistent program states because process state changes from one test case remain for subsequent test cases. This semantic inconsistency results in missed crashes, false crashes, and overall incorrectness that undermines fuzzer effectiveness. We observe that existing fuzzing execution mechanisms exist on a continuum, based on the amount of state that gets discarded and restored between test cases. We present ClosureX, a fuzzing execution mechanism that sits at a new spot on this state restoration continuum, where only test-case-execution-specific state is reset. This fine-grained state restoration provides near-persistent performance with the correctness of heavyweight state restoration. We construct ClosureX as a set of LLVM passes that integrate with AFL++. Our evaluation on ten popular open-source fuzzing targets shows that ClosureX maintains semantic correctness, while increasing test case execution rate by over 3.5x, on average, compared to AFL++. ClosureX also finds bugs more consistently and 1.9x faster than AFL++, with ClosureX discovering 15 0-day bugs (4 CVEs).

DOI: 10.1145/3669940.3707281


Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms

Authors: Reidys, Benjamin and Zardoshti, Pantea and Goiri, Íñigo
Keywords: cloud computing, memory oversubscription, resource management, temporal patterns

Abstract

Cloud platforms remain underutilized despite multiple proposals to improve their utilization (e.g., disaggregation, harvesting, and oversubscription). Our characterization of the resource utilization of virtual machines (VMs) in Azure reveals that, while CPU is the main underutilized resource, we need to provide a solution to manage all resources holistically. We also observe that many VMs exhibit complementary temporal patterns, which can be leveraged to improve the oversubscription of underutilized resources. Based on these insights, we propose Coach: a system that exploits temporal patterns for all-resource oversubscription in cloud platforms. Coach uses long-term predictions and an efficient VM scheduling policy to exploit temporally complementary patterns. We introduce a new general-purpose VM type, called CoachVM, where we partition each resource allocation into a guaranteed and an oversubscribed portion. Coach monitors the oversubscribed resources to detect contention and mitigate any potential performance degradation. We focus on memory management, which is particularly challenging due to memory's sensitivity to contention and the overhead required to reassign it between CoachVMs. Our experiments show that Coach enables platforms to host up to ~26% more VMs with minimal performance degradation.
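
Why complementary temporal patterns help can be seen with a back-of-the-envelope calculation: the peak of the combined demand is much smaller than the sum of the individual peaks. The traces below are made up purely for illustration.

```python
# Two VMs with complementary hourly CPU-utilization patterns (made-up traces).
day_vm = [80, 85, 90, 80, 20, 10, 10, 15]    # busy during working hours
night_vm = [15, 10, 10, 20, 85, 90, 80, 75]  # busy overnight batch jobs

sum_of_peaks = max(day_vm) + max(night_vm)                   # naive reservation: 180
peak_of_sum = max(d + n for d, n in zip(day_vm, night_vm))   # actual need: 105

print(f"reserve {sum_of_peaks} units without oversubscription, "
      f"{peak_of_sum} with temporal-pattern-aware packing")
```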

DOI: 10.1145/3669940.3707226


Composing Distributed Computations Through Task and Kernel Fusion

Authors: Yadav, Rohan and Sundram, Shiv and Lee, Wonchan and Garland, Michael and Bauer, Michael and Aiken, Alex and Kjolstad, Fredrik
Keywords: composable software, distributed programming

Abstract

We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that enables the necessary analyses for the fusion of distributed tasks to be performed in a scalable manner. We pair task fusion with a JIT compiler to fuse together the kernels within fused tasks. We show empirically that Diffuse's intermediate representation is general enough to be a target for two real-world, task-based libraries (cuPyNumeric and Legate Sparse), letting Diffuse find optimization opportunities across function and library boundaries. Diffuse accelerates unmodified applications developed by composing task-based libraries by 1.86x on average (geo-mean), and by between 0.93x and 10.7x on up to 128 GPUs. Diffuse also finds optimization opportunities missed by the original application developers, enabling high-level Python programs to match or exceed the performance of an explicitly parallel MPI library.

DOI: 10.1145/3669940.3707216


Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning

Authors: Cheng, Shenggan and Lin, Shengjie and Diao, Lansong and Wu, Hao and Wang, Siyu and Si, Chang and Liu, Ziming and Zhao, Xuanlei and Du, Jiangsu and Lin, Wei and You, Yang
Keywords: collective communication, distributed deep learning, fine-grained overlap, gpus

Abstract

With the exponential growth of deep learning (DL), there arises an escalating need for scalability. Despite significant advancements in communication hardware capabilities, the time consumed by communication remains a bottleneck during training. Various existing optimizations are coupled within parallel systems to implement specific computation-communication overlaps. These approaches pose challenges in terms of performance, programmability, and generality. In this paper, we introduce Concerto, a compiler framework designed to address these challenges by automatically optimizing and scheduling communication. We formulate the scheduling problem as a resource-constrained project scheduling problem and use an off-the-shelf solver to obtain a near-optimal schedule, and we use auto-decomposition to create overlap opportunities for critical (synchronous) communication. Our evaluation shows Concerto can match or outperform state-of-the-art parallel frameworks, including Megatron-LM, JAX/XLA, DeepSpeed, and Alpa, all of which include extensive hand-crafted optimization. Unlike previous works, Concerto decouples the parallel approach from communication optimization, and can therefore generalize to a wide variety of parallelisms without manual optimization.

DOI: 10.1145/3669940.3707223


Cooperative Graceful Degradation in Containerized Clouds

Authors: Agrawal, Kapil and Abdu Jyothi, Sangeetha
Keywords: cloud resilience, graceful degradation, service level objectives (slos)

Abstract

Cloud resilience is crucial for cloud operators and the myriad of applications that rely on the cloud. Today, we lack a mechanism that enables cloud operators to perform graceful degradation of applications while satisfying the application's availability requirements. In this paper, we put forward a vision for automated cloud resilience management with cooperative graceful degradation between applications and cloud operators. First, we investigate techniques for graceful degradation and identify an opportunity for cooperative graceful degradation in public clouds. Second, leveraging criticality tags on containers, we propose diagonal scaling—turning off non-critical containers during capacity crunch scenarios—to maximize the availability of critical services. Third, we design Phoenix, an automated cloud resilience management system that maximizes critical service availability of applications while also considering operator objectives, thereby improving the overall resilience of the infrastructure during failures. We experimentally show that the Phoenix controller running atop Kubernetes can improve critical service availability by up to 2×…
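
Diagonal scaling by criticality tag can be pictured with a small greedy sketch: under a capacity crunch, keep containers in criticality order until the remaining capacity runs out. This is only an illustration of the policy idea; the service names, tags, and the absence of inter-service dependencies are assumptions, not Phoenix's actual controller logic.

```python
def diagonal_scale(containers, available_cpu):
    """Keep the most critical containers that fit, turn the rest off.

    `containers` is a list of (name, criticality_tag, cpu_demand), where a
    lower tag means more critical. Toy policy only: real controllers must
    also respect dependencies between services.
    """
    keep, off, used = [], [], 0
    for name, tag, cpu in sorted(containers, key=lambda c: c[1]):
        if used + cpu <= available_cpu:
            keep.append(name)
            used += cpu
        else:
            off.append(name)
    return keep, off

services = [("checkout", 1, 4), ("search", 1, 6), ("recommendations", 2, 8),
            ("analytics", 3, 6), ("batch-reports", 3, 10)]
print(diagonal_scale(services, available_cpu=12))
# (['checkout', 'search'], ['recommendations', 'analytics', 'batch-reports'])
```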

DOI: 10.1145/3669940.3707244


Copper and Wire: Bridging Expressiveness and Performance for Service Mesh Policies

Authors: Saxena, Divyanshu and Zhang, William and Pailoor, Shankara and Dillig, Isil and Akella, Aditya
Keywords: cloud computing, microservices, service mesh

Abstract

Distributed microservice applications require a convenient means of controlling L7 communication between services. Service meshes have emerged as a popular approach to achieving this. However, current service mesh frameworks are difficult to use – they burden developers in realizing even simple communication policies, lack compatibility with diverse dataplanes, and introduce performance and resource overheads. We identify the root causes of these drawbacks and propose a ground-up new mesh architecture that overcomes them. We develop novel abstractions for mesh communication, a new mesh policy language centered on these abstractions to enable expressive policies, and a novel control plane that enables using minimal dataplane resources for policy enforcement. We develop the precise semantics of our language abstractions and demonstrate how our control plane can use them to execute policies correctly and optimally. We build and evaluate a prototype on realistic workloads and policies and open-source production traces. Our results show that complex policies can be specified in up to 6.75×…

DOI: 10.1145/3669940.3707257


CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS

Authors: Xu, Jiahui and Josipovic, Lana
Keywords: dataflow circuits, high-level synthesis, resource sharing

Abstract

Dynamically scheduled high-level synthesis (HLS) automatically translates software code (e.g., C/C++) into dataflow circuits: networks of compute units that communicate via handshake signals. These signals schedule the circuit at runtime, allowing it to handle irregular control flow or unpredictable memory accesses efficiently, thus giving it a performance advantage over statically scheduled circuits produced by standard HLS tools. To make HLS of dataflow circuits attractive and practical, we need various resource-optimization strategies to complement their performance advantage. A crucial technique is resource sharing: scarce and expensive resources (e.g., floating-point arithmetic units) are shared between multiple operations. However, this approach faces unique challenges in dataflow circuits, as an uncareful sharing strategy leads to performance degradation and circuit deadlock. This work presents CRUSH, a strategy that enables efficient functional unit sharing in dynamically scheduled HLS. CRUSH systematically avoids sharing-introduced deadlocks: it decouples interactions of operations in the shared resource to break resource dependencies. CRUSH maintains the benefit of dynamism: it does not constrain circuit execution with a complex deadlock avoidance mechanism and seizes sharing opportunities enabled by out-of-order access to the shared unit. CRUSH is practical: it employs scalable and effective heuristics for sharing decisions. Compared to a prior strategy, CRUSH achieves an average reduction of 12% in DSPs, 15% in FFs, and 90% in optimization runtime. CRUSH has been integrated into the Dynamatic HLS compiler (https://github.com/EPFL-LAP/dynamatic).

DOI: 10.1145/3669940.3707273


DarwinGame: Playing Tournaments for Tuning Applications in Noisy Cloud Environments

Authors: Basu Roy, Rohan and Gadepally, Vijay and Tiwari, Devesh
Keywords: performance interference, performance tuning, tuning in cloud

Abstract

This work introduces a new subarea of performance tuning – performance tuning in a shared interference-prone computing environment. We demonstrate that existing tuners are significantly suboptimal by design because of their inability to account for interference during tuning. Our solution, DarwinGame, employs a tournament-based design to systematically compare application executions with different tunable parameter configurations, enabling it to identify the relative performance of different tunable parameter configurations in a noisy environment. Compared to existing solutions, DarwinGame achieves more than 27% reduction in execution time, with less than 0.5% performance variability. DarwinGame is the first performance tuner that will help developers tune their applications in shared, interference-prone, cloud environments.
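
The tournament-based design can be pictured with a toy single-elimination bracket over tunable configurations. Everything below is invented for illustration: the noise model, the bracket shape, and the `noisy_runtime`/`play_match` helpers are not DarwinGame's actual mechanism; the point is only that head-to-head comparisons under similar noise are more robust than comparing raw measurements taken at different times.

```python
import random

def noisy_runtime(config, noise=0.3):
    """Pretend execution: a made-up true cost plus interference noise."""
    true_cost = abs(config - 7) + 10          # config 7 is secretly the best
    return true_cost + random.uniform(0, noise * true_cost)

def play_match(a, b, rounds=5):
    """Run both configs back-to-back several times so they see similar noise,
    and advance whichever wins the majority of head-to-head rounds."""
    wins_a = sum(noisy_runtime(a) < noisy_runtime(b) for _ in range(rounds))
    return a if wins_a > rounds // 2 else b

def tournament(configs):
    """Single-elimination bracket over tunable-parameter configurations."""
    contenders = list(configs)
    while len(contenders) > 1:
        contenders = [play_match(contenders[i], contenders[i + 1])
                      for i in range(0, len(contenders), 2)]
    return contenders[0]

print(tournament(range(16)))  # usually 7, the configuration with the lowest true cost
```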

DOI: 10.1145/3669940.3707259


Debugger Toolchain Validation via Cross-Level Debugging

Authors: Yang, Yibiao and Sun, Maolin and Wu, Jiangchang and Li, Qingyang and Zhou, Yuming
Keywords: cross-level debugging, debugger, validation

Abstract

Ensuring the correctness of debugger toolchains is of paramount importance, as they play a vital role in understanding and resolving programming errors during software development. Bugs hidden within these toolchains can significantly mislead developers. Unfortunately, comprehensive testing of debugger toolchains is lacking due to the absence of effective test oracles. Existing studies on debugger toolchain validation have primarily focused on validating the debug information within optimized executables by comparing the traces between debugging optimized and unoptimized executables (i.e., different executables) in the debugger, under the assumption that the traces obtained from debugging unoptimized executables serve as a reliable oracle. However, these techniques suffer from inherent limitations, as compiler optimizations can drastically alter source code elements, variable representations, and instruction order, rendering the traces obtained from debugging different executables incomparable and failing to uncover bugs in debugger toolchains when debugging unoptimized executables. To address these limitations, we propose a novel concept called Cross-Level Debugging (CLD) for validating the debugger toolchain. CLD compares the traces obtained from debugging the same executable using source-level and instruction-level strategies within the same debugger. The core insight of CLD is that the execution traces obtained from different debugging levels for the same executable should adhere to specific relationships, regardless of whether the executable is generated with or without optimization. We formulate three key relations in CLD: reachability preservation of program locations, order preservation for reachable program locations, and value consistency at program locations, which apply to traces at different debugging levels. We implement Devil, a practical framework that employs these relations for debugger toolchain validation. We evaluate the effectiveness of Devil using two widely used production debugger toolchains, GDB and LLDB. Ultimately, Devil successfully identified 27 new bug reports, of which 18 have been confirmed and 12 have been fixed by developers.

DOI: 10.1145/3669940.3707271


Design and Operation of Shared Machine Learning Clusters on Campus

Authors: Xu, Kaiqiang and Sun, Decang and Wang, Hao and Ren, Zhenghang and Wan, Xinchen and Liao, Xudong and Wang, Zilong and Zhang, Junxue and Chen, Kai
Keywords: multi-tenant cluster operations, resource management, shared gpu cluster

Abstract

The rapid advancement of large machine learning (ML) models has driven universities worldwide to invest heavily in GPU clusters. Effectively sharing these resources among multiple users is essential for maximizing both utilization and accessibility. However, managing shared GPU clusters presents significant challenges, ranging from system configuration to fair resource allocation among users. This paper introduces SING, a full-stack solution tailored to simplify shared GPU cluster management. Aimed at addressing the pressing need for efficient resource sharing with limited staffing, SING enhances operational efficiency by reducing maintenance costs and optimizing resource utilization. We provide a comprehensive overview of its four extensible architectural layers, explore the features of each layer, and share insights from real-world deployment, including usage patterns and incident management strategies. As part of our commitment to advancing shared ML cluster management, we open-source SING's resources to support the development and operation of similar systems.

DOI: 10.1145/3669940.3707266


Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity

Authors: Lv, Cunchi and Shi, Xiao and Lei, Zhengyu and Huang, Jinyue and Tan, Wenting and Zheng, Xiaohui and Zhao, Xiaofang
Keywords: co-scaling, gpu resourcing-on-demand, introspective elasticity, serverless deep learning

Abstract

Serverless computing, with its ease of management, auto-scaling, and cost-effectiveness, is widely adopted by deep learning (DL) applications. DL workloads, especially with large language models, require substantial GPU resources to ensure QoS. However, serverless DL systems are prone to producing GPU fragments (e.g., 15%-94%) due to the dynamicity of workloads and coarse-grained static GPU allocation mechanisms, gradually eroding the benefits offered by serverless elasticity. Different from classical serverless systems that only scale horizontally, we present introspective elasticity (IE), a fine-grained and adaptive two-dimensional co-scaling mechanism to support GPU resourcing-on-demand for serverless DL tasks. Based on this insight, we build Dilu, a cross-layer and GPU-based serverless DL system with IE support. First, Dilu provides multi-factor profiling for DL tasks with efficient pruning search methods. Second, Dilu adheres to resourcing-complementary principles in scheduling to improve GPU utilization with QoS guarantees. Third, Dilu adopts an adaptive 2D co-scaling method to enhance the elasticity of GPU provisioning in real time. Evaluations show that it can dynamically adjust the resourcing of various DL functions with low GPU fragmentation (10%-46% GPU defragmentation), high throughput (up to 1.8×…

DOI: 10.1145/3669940.3707251


D-VSync: Decoupled Rendering and Displaying for Smartphone Graphics

Authors: Wu, Yuanpei and Du, Dong and Xu, Chao and Xia, Yubin and Fu, Ming and Zang, Binyu and Chen, Haibo
Keywords: graphics system, operating system, rendering service, vertical synchronization

Abstract

Rendering service, which typically orchestrates screen display and UI through Vertical Synchronization (VSync), is an indispensable system service for user experiences of smartphone OSes (e.g., Android, OpenHarmony, and iOS). The recent trend of large high-frame-rate screens, stunning visual effects, and physics-based animations has placed unprecedented pressure on the VSync-based rendering architecture, leading to higher frame drops and longer rendering latency. This paper proposes Decoupled Vertical Synchronization (D-VSync), which decouples execution and displaying in the rendering service. D-VSync allows frames to be rendered a number of VSync periods before being physically displayed on the screen. The key insight behind D-VSync, which resolves the limitation of VSync, is that the decoupling enables sporadic long frames to utilize the computational power saved by common short frames, therefore providing a larger time window to tolerate workload fluctuations. Evaluation results of 75 common OS use cases and apps on OpenHarmony (Mate 40 Pro, Mate 60 Pro), 25 popular apps on Android (Google Pixel 5), and simulations of 15 mobile games show that compared to VSync, D-VSync on average reduces frame drops by 72.7%, user-perceptible stutters by 72.3%, and rendering latency by 31.1%, with only 0.13%-0.37% more power consumption. D-VSync has been integrated into HarmonyOS NEXT.

DOI: 10.1145/3669940.3707235


Early Termination for Hyperdimensional Computing Using Inferential Statistics

Authors: Yi, Pu (Luke) and Yang, Yifan and Lee, Chae Young and Achour, Sara
Keywords: edge machine learning, emerging hardware technologies, hyperdimensional computing, program optimization, unconventional computing

Abstract

Hyperdimensional Computing (HDC) is a brain-inspired, lightweight computing paradigm that has shown great potential for inference on the edge and on emerging hardware technologies, achieving state-of-the-art accuracy on certain classification tasks. HDC classifiers are inherently error resilient and support early termination of inference to approximate classification results. Practitioners have developed heuristic methods to terminate inference early for individual inputs, reducing the computation of inference at the cost of accuracy. These techniques lack statistical guarantees and may unacceptably degrade classification accuracy or terminate inference later than is needed to obtain an accurate result. We present Omen, the first dynamic HDC optimizer that uses inferential statistics to terminate inference early while providing accuracy guarantees. To realize Omen, we develop a statistical view of HDC that reframes HD computations as statistical sampling and testing tasks, enabling the use of statistical tests. We evaluate Omen on 19 benchmark instantiations of four classification tasks. Omen is computationally efficient, delivering up to 7.21–12.18×…
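
The flavor of "terminate early once a statistical test is confident" can be sketched as follows. This is not Omen's actual test: it treats per-dimension comparisons as Bernoulli votes between two classes, applies a Hoeffding bound at periodic checkpoints, and for simplicity ignores the multiple-testing correction that repeated peeking would require in a system with real guarantees.

```python
import math, random

def early_terminating_vote(votes_for_a, delta=0.01, check_every=50):
    """Scan per-dimension votes (1 = class A, 0 = class B) and stop as soon as
    a Hoeffding bound says the current majority is unlikely to flip.

    Sketch only: a production optimizer would correct for repeated testing.
    """
    wins = 0
    for n, v in enumerate(votes_for_a, start=1):
        wins += v
        if n % check_every == 0:
            margin = math.sqrt(math.log(2 / delta) / (2 * n))
            if abs(wins / n - 0.5) > margin:
                return ("A" if wins / n > 0.5 else "B"), n
    return ("A" if wins >= len(votes_for_a) / 2 else "B"), len(votes_for_a)

random.seed(0)
votes = [1 if random.random() < 0.6 else 0 for _ in range(10_000)]  # 10k dimensions
label, used = early_terminating_vote(votes)
print(label, f"decided after {used} of {len(votes)} dimensions")
```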

DOI: 10.1145/3669940.3707254


Earth+: On-Board Satellite Imagery Compression Leveraging Historical Earth Observations

Authors: Du, Kuntai and Cheng, Yihua and Olsen, Peder and Noghabi, Shadi and Jiang, Junchen
Keywords: downlink optimization, earth observation satellites, reference image, satellite imagery compression

Abstract

Due to limited downlink (satellite-to-ground) capacity, over 90% of the images captured by the earth-observation satellites are not downloaded to the ground. To overcome the downlink limitation, we present Earth+, a new on-board satellite imagery compression system that identifies and downloads only changed areas in each image compared to the latest on-board reference images of the same location. The key idea of Earth+ is that it obtains the latest on-board reference images by letting the ground stations upload images recently captured by all satellites in the constellation. To the best of our knowledge, Earth+ is the first system that leverages images across an entire satellite constellation to enable more images to be downloaded to the ground (by better satellite imagery compression). Our evaluation shows that to download images of the same area, Earth+ can reduce the downlink usage by 3.3×…
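
The "download only the changed areas" step boils down to tile-wise change detection against the on-board reference. The numpy sketch below illustrates that step only; the tile size, threshold, and plain mean-difference metric are made-up choices, not Earth+'s actual compression pipeline.

```python
import numpy as np

def changed_tiles(new_img, reference, tile=64, threshold=12.0):
    """Return (row, col) tile indices whose mean absolute difference from the
    on-board reference exceeds `threshold`; only these would be downlinked."""
    h, w = new_img.shape
    changed = []
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            diff = np.abs(new_img[r:r + tile, c:c + tile].astype(np.int16)
                          - reference[r:r + tile, c:c + tile].astype(np.int16))
            if diff.mean() > threshold:
                changed.append((r // tile, c // tile))
    return changed

rng = np.random.default_rng(7)
ref = rng.integers(0, 256, (512, 512), dtype=np.uint8)
new = ref.copy()
new[128:192, 256:320] = 255          # simulate a changed area
print(changed_tiles(new, ref))       # only the tile covering the changed area
```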

DOI: 10.1145/3669940.3707222


EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation

Authors: Su, Weigao and Shrivastav, Vishal
Keywords: ethernet phy, in-network scheduler, memory disaggregation

Abstract

Achieving low remote memory access latency remains the primary challenge in realizing memory disaggregation over Ethernet within the datacenters. We present EDM that attempts to overcome this challenge using two key ideas. First, while existing network protocols for remote memory access over the Ethernet, such as TCP/IP and RDMA, are implemented on top of the Ethernet MAC layer, EDM takes a radical approach by implementing the entire network protocol stack for remote memory access within the Physical layer (PHY) of the Ethernet. This overcomes fundamental latency and bandwidth overheads imposed by the MAC layer, especially for small memory messages. Second, EDM implements a centralized, fast, in-network scheduler for memory traffic within the PHY of the Ethernet switch. Inspired by the classic Parallel Iterative Matching (PIM) algorithm, the scheduler dynamically reserves bandwidth between compute and memory nodes by creating virtual circuits in the PHY, thus eliminating queuing delay and layer 2 packet processing delay at the switch for memory traffic, while maintaining high bandwidth utilization. Our FPGA testbed demonstrates that EDM’s network fabric incurs a latency of only ~300 ns for remote memory access in an unloaded network, which is an order of magnitude lower than state-of-the-art Ethernet-based solutions such as RoCEv2 and comparable to emerging PCIe-based solutions such as CXL. Larger-scale network simulations indicate that even at high network loads, EDM’s average latency remains within 1.3x its unloaded latency.

DOI: 10.1145/3669940.3707221


Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs

Authors: Azami, Noushin and Fallin, Alex and Burtscher, Martin
Keywords: cpu and gpu parallelization, data compression, floating-point data, lossless compression

Abstract

The amount of scientific data being produced, transferred, and processed increases rapidly. Whereas GPUs have made faster processing possible, storage limitations and slow data transfers remain key bottlenecks. Data compression can help, but only if it does not create a new bottleneck. This paper presents four new lossless compression algorithms for single- and double-precision data that compress well and are fast even though they are fully compatible between CPUs and GPUs. Averaged over many SDRBench inputs, our implementations outperform most of the 18 compressors from the literature we compare to in compression ratio, compression throughput, and decompression throughput. Moreover, they outperform all of them in either throughput or compression ratio on the two CPUs and two GPUs we used for evaluation. For example, on an RTX 4090 GPU, our fastest code compresses and decompresses at over 500 GB/s while delivering one of the highest compression ratios.
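
Why lossless floating-point compression works at all can be shown with a generic transform: XOR each value's bit pattern with the previous one, so smooth scientific data yields long runs of zero bytes that a general-purpose coder shrinks well. This is a common technique used here purely for illustration; it is not one of the paper's four algorithms, and the toy data is invented.

```python
import struct, zlib

def xor_transform(values):
    """XOR each double's bit pattern with the previous one; neighboring values
    in smooth data share high-order bits, so the XORed stream compresses well."""
    bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
    prev, out = 0, []
    for b in bits:
        out.append(b ^ prev)
        prev = b
    return b"".join(struct.pack("<Q", x) for x in out)

def inverse_xor_transform(blob):
    prev, values = 0, []
    for (x,) in struct.iter_unpack("<Q", blob):
        prev ^= x
        values.append(struct.unpack("<d", struct.pack("<Q", prev))[0])
    return values

data = [1.0 + i * 1e-6 for i in range(10_000)]   # a smooth toy time series
compressed = zlib.compress(xor_transform(data))
assert inverse_xor_transform(zlib.decompress(compressed)) == data  # lossless
print(f"{len(data) * 8} bytes -> {len(compressed)} bytes")
```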

DOI: 10.1145/3669940.3707280


Enhancing CGRA Efficiency Through Aligned Compute and Communication Provisioning

Authors: Li, Zhaoying and Dangi, Pranav and Yin, Chenyang and Bandara, Thilini Kaushalya and Juneja, Rohan and Tan, Cheng and Bai, Zhenyu and Mitra, Tulika
Keywords: coarse-grained reconfigurable array (cgra), dataflow computing, motifs

Abstract

Coarse-grained Reconfigurable Arrays (CGRAs) are domain-agnostic accelerators that enhance the energy efficiency of resource-constrained edge devices. The CGRA landscape is diverse, exhibiting trade-offs between performance, efficiency, and architectural specialization. However, CGRAs often overprovision communication resources relative to their modest computing capabilities. This occurs because the theoretically provisioned programmability for CGRAs often proves superfluous in practical implementations. In this paper, we propose Plaid, a novel CGRA architecture and compiler that aligns compute and communication capabilities, thereby significantly improving energy and area efficiency while preserving its generality and performance. We demonstrate that the dataflow graph, representing the target application, can be decomposed into smaller, recurring communication patterns called motifs. The primary contribution is the identification of these structural motifs within the dataflow graphs and the development of an efficient collective execution and routing strategy tailored to these motifs. The Plaid architecture employs a novel collective processing unit that can execute multiple operations of a motif and route related data dependencies together. The Plaid compiler can hierarchically map the dataflow graph and judiciously schedule the motifs. Our design achieves a 43% reduction in power consumption and 46% area savings compared to the baseline high-performance spatio-temporal CGRA, all while preserving its generality and performance levels. In comparison to the baseline energy-efficient spatial CGRA, Plaid offers a 1.4×…

DOI: 10.1145/3669940.3707230


Exo 2: Growing a Scheduling Language

Authors: Ikarashi, Yuka and Qian, Kevin and Droubi, Samir and Reinking, Alex and Bernstein, Gilbert Louis and Ragan-Kelley, Jonathan
Keywords: high-performance computing, meta-programming, performance engineering, user-schedulable languages

Abstract

User-schedulable languages (USLs) help programmers productively optimize programs by providing safe means of transforming them. Current USLs are designed to give programmers exactly the control they want, while automating all other concerns. However, there is no universal answer for what performance-conscious programmers want to control, how they want to control it, and what they want to automate, even in relatively narrow domains. We claim that USLs should, instead, be designed to grow. We present Exo 2, a scheduling language that enables users to define new scheduling operations externally to the compiler. By composing a set of trusted, fine-grained primitives, users can safely write their own scheduling library to build up desired automation. We identify actions (ways of modifying code), inspection (ways of interrogating code), and references (ways of pointing to code) as essential for any user-extensible USL. We fuse these ideas into a new mechanism called Cursors that enables the creation of scheduling libraries in user code. We demonstrate libraries that amortize scheduling effort across more than 80 high-performance kernels, reducing total scheduling code by an order of magnitude and delivering performance competitive with state-of-the-art implementations on three different platforms.

DOI: 10.1145/3669940.3707218


Fast On-device LLM Inference with NPUs

Authors: Xu, Daliang and Zhang, Hao and Yang, Liming and Liu, Ruiqi and Huang, Gang and Xu, Mengwei and Liu, Xuanzhe
Keywords: large language model, mobile computing, npu

Abstract

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements in mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present llm.npu, the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. llm.npu enhances NPU offloading efficiency by reconstructing the prompt and model at three levels: (1) At the prompt level, it divides variable-length prompts into multiple fixed-sized chunks while maintaining data dependencies; (2) At the tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) At the block level, it schedules Transformer blocks in an out-of-order manner to the CPU/GPU and NPU based on their hardware affinity and sensitivity to accuracy. Compared to competitive baselines, llm.npu achieves 22.4x faster prefill speed and 30.7x energy savings on average, and up to 32.8x speedup in an end-to-end real-world application. For the first time, llm.npu achieves more than 1,000 tokens/sec prefilling for a billion-sized model.

DOI: 10.1145/3669940.3707239


Faster Chaitin-like Register Allocation via Grammatical Decompositions of Control-Flow Graphs

Authors: Cai, Xuran and Goharshady, Amir Kafshdar and Hitarth, S. and Lam, Chun Kit
Keywords: control-flow graphs, graph decompositions, register allocation, sparsity

Abstract

It is well-known that control-flow graphs (CFGs) of structured programs are sparse. This sparsity has been previously formalized in terms of graph parameters such as treewidth and pathwidth and used to design faster parameterized algorithms for numerous compiler optimization, model checking and program analysis tasks. In this work, we observe that the known graph sparsity parameters fail to exactly capture the kind of sparsity exhibited by CFGs. For example, while all structured CFGs have a treewidth of at most 7, not every graph with a treewidth of 7 or less is realizable as a CFG. As a result, current parameterized algorithms are solving the underlying graph problems over a more general family of graphs than the CFGs. To address this problem, we design a new but natural concept of graph decomposition based on a grammar that precisely captures the set of graphs that can be realized as CFGs of programs. We show that our notion of decomposition enables the same type of dynamic programming algorithms that are often used in treewidth/pathwidth-based methods. As two concrete applications, using our grammatical decomposition of CFGs, we provide asymptotically more efficient algorithms for two variants of the classical problem of register allocation as defined by Chaitin, i.e., assigning program variables to a limited number of registers such that variables with intersecting lifetimes are not assigned to the same register. Note that Chaitin's formulation of register allocation does not allow live-range splitting. Our algorithms are asymptotically faster not only in comparison with the non-parameterized solutions for these problems, but also compared to the state-of-the-art treewidth/pathwidth-based approaches in the literature. For minimum-cost register allocation over a fixed number of registers, we provide an algorithm with a runtime of O(|G| ⋅ |𝕈|^(5⋅r)), where |G| is the size of the program, 𝕈 is the set of program variables, and r is the number of registers. In contrast, the previous treewidth-based algorithm had a runtime of O(|G| ⋅ |𝕈|^(16⋅r)). For the decision problem of spill-free register allocation, our algorithm's runtime is O(|G| ⋅ r^(5⋅r + 5)), whereas the previous works had a runtime of O(|G| ⋅ r^(16⋅r)). Finally, we provide extensive experimental results on spill-free register allocation, showcasing the scalability of our approach in comparison to previous state-of-the-art methods. Most notably, our approach can handle real-world instances with up to 20 registers, whereas previous works could only scale to 8. This is a significant improvement since most ubiquitous architectures, such as the x86 family, have 16 registers. For such architectures, our approach is the first-ever exact algorithm that scales up to solve the real-world instances of spill-free register allocation.
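
To make Chaitin's formulation concrete: variables whose live ranges intersect interfere and must occupy different registers. The sketch below builds a tiny interference graph from live ranges and checks spill-free allocation by brute force; it illustrates the problem only, not the paper's grammatical-decomposition algorithm, and the live ranges are invented.

```python
from itertools import product

def interference_graph(live_ranges):
    """Edges between variables whose lifetimes [start, end] overlap."""
    names = list(live_ranges)
    return {(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if live_ranges[a][0] <= live_ranges[b][1]
            and live_ranges[b][0] <= live_ranges[a][1]}

def spill_free_allocation(live_ranges, r):
    """Brute-force search for an assignment to r registers (exponential; the
    paper's contribution is doing this efficiently on real CFGs)."""
    names = list(live_ranges)
    edges = interference_graph(live_ranges)
    for colors in product(range(r), repeat=len(names)):
        assign = dict(zip(names, colors))
        if all(assign[a] != assign[b] for a, b in edges):
            return assign
    return None  # spilling would be required

ranges = {"x": (0, 4), "y": (2, 6), "z": (5, 9), "w": (7, 8)}
print(spill_free_allocation(ranges, r=2))  # {'x': 0, 'y': 1, 'z': 0, 'w': 1}
```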

DOI: 10.1145/3669940.3707286


FleetIO: Managing Multi-Tenant Cloud Storage with Multi-Agent Reinforcement Learning

Authors: Sun, Jinghan and Reidys, Benjamin and Li, Daixuan and Chang, Jichuan and Snir, Marc and Huang, Jian
Keywords: cloud storage, reinforcement learning, storage virtualization

Abstract

Cloud platforms have been virtualizing storage devices like flash-based solid-state drives (SSDs) to make effective use of storage resources. They enable either software-isolated or hardware-isolated instances to facilitate storage sharing between multi-tenant applications. However, for decades, they have had to combat the fundamental tussle between performance isolation and resource utilization: they suffer from either long tail latency caused by weak isolation or low storage utilization caused by strong isolation. In this paper, we present FleetIO, a learning-based storage virtualization framework that employs reinforcement learning (RL) for managing virtualized SSDs. FleetIO explores the unique features of RL to handle the dynamic changes of application workloads and storage states, and integrates storage scheduling into the RL decision-making process. It achieves both performance isolation and improved storage utilization by enabling dynamic fine-grained storage harvesting across collocated application instances, while minimizing its negative impact on their service-level objectives (SLOs). FleetIO clusters workloads into different types (e.g., latency-sensitive and bandwidth-intensive) based on the collected I/O traces at runtime, and fine-tunes the RL reward functions for each type of workload. We implement FleetIO on a real programmable SSD board and evaluate it with diverse cloud applications. We show that FleetIO improves the overall storage utilization of the shared SSD by up to 1.4×…

DOI: 10.1145/3669940.3707229


Forecasting GPU Performance for Deep Learning Training and Inference

Authors: Lee, Seonho and Phanishayee, Amar and Mahajan, Divya
Keywords: deep learning, gpu performance forecasting, ml for systems, training and inference

Abstract

Deep learning kernels exhibit a high level of predictable memory accesses and compute patterns, making GPU architectures well-suited for their execution. Moreover, software and runtime systems for GPUs further enable optimizations that aim to better utilize the stream multiprocessors, on-chip bandwidth, multiple levels of cache hierarchy, and off-chip high-bandwidth memory. In the context of deep learning, the entire space of models and GPUs is constantly evolving, as newer models emerge with simultaneous upgrades to the device. However, access to newer GPUs is often limited, raising important questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a forecasting framework to predict the performance of a diverse range of deep learning models, for both training and inference, on unseen GPUs, without requiring actual execution of the target model on the target GPU. The framework leverages both GPU hardware behavior and software library optimizations to estimate the end-to-end performance of these models. We observe that prior work in this area suffers from high absolute error percentages when forecasting performance on unseen models and new GPUs, as they attempt to model the complex task of predicting the latency of a deep learning kernel on a GPU directly using a machine learning approach. Instead, with NeuSight, we decompose the prediction into smaller problems, while bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate the end-to-end latency. As such, NeuSight outperforms prior work across a variety of deep learning workloads and the most up-to-date GPUs. It reduces the percentage error from 121.4% and 30.8% to 2.3% when predicting the latency of the GPT3 model for training and inference on H100, respectively, compared to state-of-the-art prior work, where neither GPT3 nor H100 was used to train any framework.
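
A typical example of the "fundamental performance laws" used to bound such predictions is the roofline lower bound: a kernel can finish no faster than its compute time or its memory time, whichever is larger. The sketch below uses placeholder peak numbers, not any specific GPU's datasheet values or NeuSight's actual model.

```python
def roofline_lower_bound_ms(flops, bytes_moved, peak_tflops, peak_gbps):
    """Lower bound on kernel latency from compute and memory peaks alone.

    Frameworks like NeuSight can use such laws to keep learned per-tile
    predictions physically plausible (numbers here are placeholders).
    """
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    memory_ms = bytes_moved / (peak_gbps * 1e9) * 1e3
    return max(compute_ms, memory_ms)

# A 4096x4096x4096 fp16 matmul: 2*N^3 FLOPs, roughly 3 matrices of traffic.
n = 4096
print(roofline_lower_bound_ms(flops=2 * n**3,
                              bytes_moved=3 * n * n * 2,
                              peak_tflops=300, peak_gbps=2000))
```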

DOI: 10.1145/3669940.3707265


Frugal: Efficient and Economic Embedding Model Training with Commodity GPUs

Authors: Xie, Minhui and Zeng, Shaoxun and Guo, Hao and Gao, Shiwei and Lu, Youyou
Keywords: deep learning model training, embedding models, heterogeneous computing, machine learning system

Abstract

Embedding models show superiority in learning representations of massive ID-type features in sparse learning scenarios such as recommendation systems (e.g., user/item IDs) and graph learning (e.g., node/edge IDs). Commodity GPUs are highly favored for their cost-efficient computing power, which is ideally suited for the low computing demand of memory-intensive embedding models. However, directly running embedding model training on commodity GPUs yields poor performance because of their deficient communication resources (including low communication bandwidth and no PCIe P2P support). This paper presents Frugal, an embedding model training system tailored for commodity GPUs. Based on the observation that the communication between commodity GPUs must be bounced through host memory (due to no PCIe P2P support), the key idea of Frugal is proactive flushing, where each GPU proactively flushes its own parameters that other GPUs will access into host memory, thereby decoupling half of the communication overhead to non-critical paths. To alleviate the communication contention of proactive flushing on foreground training processes, Frugal assigns priorities to each flush operation, and prioritizes flushing parameters that GPUs will access while deferring others. Further, Frugal tailors a two-level priority queue to ensure high scalability for operations involving priorities. Frugal has been applied to train embedding models including recommendation models and graph embeddings. Experiments indicate that Frugal can significantly increase training throughput on commodity GPUs, and achieve similar throughput compared to existing systems on datacenter GPUs with 4.0-4.3×…

DOI: 10.1145/3669940.3707245


FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Authors: Pan, Xinglin and Lin, Wenxiang and Zhang, Lin and Shi, Shaohuai and Tang, Zhenheng and Wang, Rui and Li, Bo and Chu, Xiaowen
Keywords: distributed deep learning, large language model, mixture-of-experts, scheduling, training system

Abstract

Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42×…

DOI: 10.1145/3669940.3707272


Fusion: An Analytics Object Store Optimized for Query Pushdown

Authors: Lu, Jianan and Raina, Ashwini and Cidon, Asaf and Freedman, Michael J.
Keywords: data analytics, distributed storage, erasure codes

Abstract

The prevalence of disaggregated storage in public clouds has led to increased latency in modern OLAP cloud databases, particularly when handling ad-hoc and highly-selective queries on large objects. To address this, cloud databases have adopted computation pushdown, executing query predicates closer to the storage layer. However, existing pushdown solutions are inefficient in erasure-coded storage. Cloud storage employs erasure coding that partitions analytics file objects into fixed-sized blocks and distributes them across storage nodes. Consequently, when a specific part of the object is queried, the storage system must reassemble the object across nodes, incurring significant network latency. In this work, we present Fusion, an object store for analytics that is optimized for query pushdown on erasure-coded data. It co-designs its erasure coding and file placement topologies, taking into account popular analytics file formats (e.g., Parquet). Fusion employs a novel stripe construction algorithm that prevents fragmentation of computable units within an object, and minimizes storage overhead during erasure coding. Compared to existing erasure-coded stores, Fusion improves median and tail latency by 64% and 81%, respectively, on TPC-H, and by up to 40% and 48%, respectively, on real-world SQL queries. Fusion achieves this while incurring a modest 1.2% storage overhead compared to the optimal.

DOI: 10.1145/3669940.3707234


GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

作者: Jeon, Byungsoo and Wu, Mengdi and Cao, Shiyi and Kim, Sunghyun and Park, Sunghyun and Aggarwal, Neeraj and Unger, Colin and Arfeen, Daiyaan and Liao, Peiyuan and Miao, Xupeng and Alizadeh, Mohammad and Ganger, Gregory R. and Chen, Tianqi and Jia, Zhihao
关键词: deep neural network, distributed systems, parallelism, training

Abstract

Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device (e.g., a GPU). Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN computation for different micro-batches of training samples in a pipelined fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally-independent operators, resulting in reduced memory requirements and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6×

DOI: 10.1145/3669940.3707220


HALO: Loop-aware Bootstrapping Management for Fully Homomorphic Encryption

作者: Cheon, Seonyoung and Lee, Yongwoo and Youm, Hoyun and Kim, Dongkwan and Yun, Sungwoo and Jeong, Kunmo and Lee, Dongyoon and Kim, Hanjun
关键词: bootstrapping, ckks, compiler, fully homomorphic encryption, loop optimization, privacy-preserve machine learning

Abstract

Thanks to its ability to compute on encrypted data, fully homomorphic encryption (FHE) is an attractive solution for privacy-preserving computation. Despite its advantages, FHE is applicable only to small programs because repeated FHE multiplications deplete the level of a ciphertext, which is finite. Bootstrapping reinitializes the level, thus allowing support for larger programs. However, its high computational overhead and the risk of level underflow require sophisticated bootstrapping placement, thereby increasing the programming burden. Although a recently proposed compiler automates the bootstrapping placement, its applicability is still limited due to the lack of loop support. This work proposes the first loop-aware bootstrapping management compiler, called HALO, which optimizes bootstrapping placement in an FHE program with a loop. To correctly support bootstrapping-enabled loops, HALO matches encryption types and levels between live-in and loop-carried ciphertexts in the loops. To reduce the bootstrapping overheads, HALO decreases the number of bootstrapping operations within a loop body by packing the loop-carried variables into a single ciphertext, reduces wasted levels in a short loop body by unrolling the loop, and optimizes the bootstrapping latency by adjusting the target level of bootstrapping as needed. For seven machine learning programs with flat and nested loops, HALO shows a 27% performance speedup compared to the state-of-the-art compiler that places bootstrapping operations on fully unrolled loops. In addition, HALO reduces the compilation time and code size by geometric means of 209.12x and 11.0x, respectively, compared to that compiler.

DOI: 10.1145/3669940.3707275


Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow

作者: Mei, Yixuan and Zhuang, Yonghao and Miao, Xupeng and Yang, Juncheng and Jia, Zhihao and Vinayak, Rashmi
关键词: cloud computing, distributed systems, large language model serving, system for ml

Abstract

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving in heterogeneous GPU clusters. The key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem on directed, weighted graphs, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs on heterogeneous GPUs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous clusters ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 3.3x and reduces prompting and decoding latency by up to 66% and 24%, respectively, compared to existing approaches. Helix is available at https://github.com/Thesys-lab/Helix-ASPLOS25.
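As a toy illustration of the max-flow framing, the sketch below builds a small directed graph with invented GPU nodes and capacities and asks networkx for the maximum flow; Helix itself additionally solves an MILP over this formulation to pick model placements and request schedules:

```python
# A toy rendering of the max-flow view of heterogeneous serving: nodes are
# GPU instances, edge capacities model per-link network bandwidth and
# per-GPU compute; all numbers are made up for illustration.
import networkx as nx

G = nx.DiGraph()
# source -> first pipeline stage, two GPU kinds, sink = completed requests
G.add_edge("src", "gpuA", capacity=120)   # fast GPU, fast link
G.add_edge("src", "gpuB", capacity=40)    # slow GPU, slow link
G.add_edge("gpuA", "gpuC", capacity=60)   # pipeline stage hand-off
G.add_edge("gpuB", "gpuC", capacity=35)
G.add_edge("gpuC", "sink", capacity=90)

flow_value, flow = nx.maximum_flow(G, "src", "sink")
print(flow_value)        # upper bound on sustainable request throughput
print(flow["gpuA"])      # how much traffic each outgoing link should carry
```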

DOI: 10.1145/3669940.3707215


H-Houdini: Scalable Invariant Learning

作者: Dinesh, Sushant and Zhu, Yongye and Fletcher, Christopher W.
关键词: abductive reasoning, constant-time programming, formal verification, hardware attacks and defenses, hardware verification, incremental invariant learning, invariant learning, microarchitectural security, scalable verification

Abstract

Formal verification is a critical task in hardware design today. Yet, while there has been significant progress in improving technique automation and efficiency, scaling to large hardware designs remains a significant challenge. We address this challenge by proposing H-HOUDINI: a new algorithm for (mostly) push-button inductive invariant learning that scales to large hardware designs. H-HOUDINI combines the strengths of Machine Learning Inspired Synthesis (MLIS) and SAT-based Incremental Learning. The key advance is a method that replaces the monolithic SMT-style checks made by MLIS with a carefully-constructed hierarchy of smaller, incremental SMT checks that can be parallelized, memoized and reassembled into the original 'monolithic' invariant in a correct-by-construction fashion. We instantiate H-HOUDINI as VeloCT, a framework that proves hardware security properties by learning relational invariants. We benchmark VeloCT on the 'safe instruction set synthesis' problem in microarchitectural security. Here, VeloCT automatically (with no expert annotations) learns an invariant for the RISC-V Rocketchip in under 10s (2880x faster than the state of the art). Further, VeloCT is the first work to scale to the RISC-V out-of-order BOOM and can (mostly automatically) verify all BOOM variants (ranging from Small to Mega) in between 6.95 and 199.1 minutes.

DOI: 10.1145/3669940.3707263


Instruction-Aware Cooperative TLB and Cache Replacement Policies

作者: Chasapis, Dimitrios and Vavouliotis, Georgios and Jiménez
关键词: address translation, replacement policy, tlb management, translation lookaside buffer, virtual memory

Abstract

Modern server and data center applications are characterized not only by big datasets, but also by large instruction footprints that incur frequent cache and Translation Lookaside Buffer (TLB) misses due to instruction accesses. Instruction TLB misses are particularly problematic since they cause pipeline stalls that significantly harm performance. This paper proposes cooperative last-level TLB (STLB) and L2 cache (L2C) replacement policies targeting workloads with large instruction footprints. We propose Instruction Translation Prioritization (iTP), an STLB replacement policy that maximizes the number of instruction hits in the STLB at the expense of increasing data page walks. To compensate for the increase in data page walks, we propose extended Page Table Prioritization (xPTP), a new L2C replacement policy that amplifies the benefits of iTP by effectively reducing L2C misses due to data page walks. Our proposal, iTP+xPTP, combines iTP at the STLB and xPTP at the L2C. In addition, iTP+xPTP employs an adaptive mechanism that switches between the xPTP and LRU policies at the L2C based on the pressure placed on the virtual memory subsystem. Our proposal improves single-core geometric mean performance by 18.9% over a baseline that uses the LRU replacement policy at both the STLB and L2C across a set of contemporary server workloads. Under SMT co-location, the corresponding performance uplift is 11.4%. Finally, we show that our proposal outperforms state-of-the-art STLB and cache replacement policies.

DOI: 10.1145/3669940.3707247


Marionette: A RowHammer Attack via Row Coupling

作者: Baek, Seungmin and Wi, Minbok and Park, Seonyong and Nam, Hwayong and Kim, Michael Jaemin and Kim, Nam Sung and Ahn, Jung Ho
关键词: coupled row, dram, reliability, rowhammer, rowpress, security

Abstract

A body of recent work has revealed that, from the perspective of the processor-memory interface, two different rows in a DRAM bank are connected to the same wordline but two separate row buffers (bitline sense amplifiers) in certain DRAM chips. Such a pair of rows is referred to as a "coupled-row pair." Coupled-row pairs pose a substantial security threat as RowHammer bitflips can be caused not only by the conventional, adjacent aggressor rows but also by their coupled rows, which are distant in the physical address space. We investigate the impact of a coupled row on both FPGA-based infrastructure and server systems. In RowHammer attacks, coupled rows have hammering strength nearly identical to aggressor rows, and these attacks are invisible to conventional, processor-side mitigation solutions. By exploiting these observations, we present Marionette, a new type of RowHammer attack that exploits coupled rows to extend the existing RowHammer attack surface. First, coupled rows enable an attacker to evade two types of existing software-based RowHammer defenses: tracking- and isolation-based defenses. We induce RowHammer bitflips successfully against tracking-based RowHammer defenses by silently hammering coupled rows. We also show the feasibility of RowHammer bitflips in an isolation-based inter-VM RowHammer defense by breaking DRAM-subarray-level isolation. Second, we successfully conduct an existing RowHammer exploit in a server under a tracking-based RowHammer defense. In a native server system, Marionette enhances the success rate of the RowHammer exploit by up to 1.66x. Lastly, we explore lightweight mitigation schemes for Marionette by exposing the coupled-row relationship to systems.

DOI: 10.1145/3669940.3707242


Medusa: Accelerating Serverless LLM Inference with Materialization

作者: Zeng, Shaoxun and Xie, Minhui and Gao, Shiwei and Chen, Youmin and Lu, Youyou
关键词: llm, machine learning system, serverless computing

Abstract

Serverless is a promising paradigm for providing scalable, cost-efficient, and easy-to-use model inference services. However, the cold start of model inference functions requires loading models onto the devices, which incurs high latencies and undermines the benefits of serverless computing. In LLMs, things get even worse since two extra stages are introduced: a KV cache initialization stage that profiles and anticipates memory reservation for the KV cache, and a capturing stage that dynamically constructs CUDA graphs for different batch sizes. Both stages are paramount to inference performance, but become the main culprits of cold-start latency. This paper proposes Medusa to mitigate the long cold-start latency through state materialization. Instead of dynamic profiling and construction at runtime, Medusa materializes the CUDA graphs as well as the information needed by the KV cache initialization in the offline phase, and restores them efficiently in the online phase. Medusa further introduces two novel techniques – offline-online cooperated parameters restoration and triggering-kernels enhanced kernel address restoration – to tackle non-deterministic issues in CUDA graphs. Medusa successfully materializes and restores CUDA graphs across 10 models (with a total of 139364 CUDA graph nodes), and reduces the latency of model loading by 42.5%. Under real-world LLM inference workloads, Medusa reduces the tail latency of the time to first token (TTFT) by 53.0%.

DOI: 10.1145/3669940.3707285


MetaSapiens: Real-Time Neural Rendering with Efficiency-Aware Pruning and Accelerated Foveated Rendering

作者: Lin, Weikai and Feng, Yu and Zhu, Yuhao
关键词: foveated rendering, gaussian splatting, hardware accelerator

Abstract

Point-Based Neural Rendering (PBNR) is emerging as a promising class of rendering techniques, which are permeating all aspects of society, driven by a growing demand for real-time, photorealistic rendering in AR/VR and digital twins. Achieving real-time PBNR on mobile devices is challenging. This paper proposes MetaSapiens, a PBNR system that for the first time delivers real-time neural rendering on mobile devices while maintaining human visual quality. MetaSapiens combines three techniques. First, we present an efficiency-aware pruning technique to optimize rendering speed. Second, we introduce a Foveated Rendering (FR) method for PBNR, leveraging humans' low visual acuity in peripheral regions to relax rendering quality and improve rendering speed. Finally, we propose an accelerator design for FR, addressing the load imbalance issue in (FR-based) PBNR. Our evaluation shows that our system achieves an order of magnitude speedup over existing PBNR models without sacrificing subjective visual quality, as confirmed by a user study. The code and demo are available at: https://horizon-lab.org/metasapiens/.

DOI: 10.1145/3669940.3707227


Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis

作者: Huang, Haiyu and Chen, Cheng and Chen, Kunyi and Chen, Pengfei and Yu, Guangba and He, Zilong and Wang, Yilun and Zhang, Huxing and Zhou, Qi
关键词: cheng chen, guangba yu, haiyu huang, huxing zhang, kunyi chen, pengfei chen, qi zhou., yilun wang, zilong he

Abstract

Distributed traces contain valuable information but are often massive in volume, posing a core challenge in tracing framework design: balancing the tradeoff between preserving essential trace information and reducing trace volume. To address this tradeoff, previous approaches typically used a '1 or 0' sampling strategy: retaining sampled traces while completely discarding unsampled ones. However, based on an empirical study of real-world production traces, we discover that the '1 or 0' strategy actually fails to effectively balance this tradeoff. To achieve a more balanced outcome, we shift the strategy from the '1 or 0' paradigm to the 'commonality + variability' paradigm. The core of the 'commonality + variability' paradigm is to first parse traces into common patterns and variable parameters, then aggregate the patterns and filter the parameters. We propose a cost-efficient tracing framework, Mint, which implements the 'commonality + variability' paradigm on the agent side to enable capturing all requests. Our experiments show that Mint can capture all traces and retain more trace information while optimizing trace storage (reduced to an average of 2.7%) and network overhead (reduced to an average of 4.2%). Moreover, experiments also demonstrate that Mint is lightweight enough for production use.
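A minimal sketch of the 'commonality + variability' idea, assuming a toy regex-based parser (the regex, span strings, and storage layout are illustrative, not Mint's agent):

```python
# Parse each span into a template (commonality) plus its variable parameters,
# keep one copy of each template and only the parameters per request.
import re
from collections import defaultdict

VAR = re.compile(r"\b(?:0x[0-9a-f]+|\d+(?:\.\d+)?|[0-9a-f-]{36})\b")

def parse(span: str):
    params = VAR.findall(span)        # variability: per-request values
    pattern = VAR.sub("<*>", span)    # commonality: the shared template
    return pattern, params

store = defaultdict(list)             # pattern -> list of parameter tuples
for span in [
    "GET /order/4711 took 12.3 ms from 0x7f3a",
    "GET /order/4712 took 9.8 ms from 0x7f3b",
]:
    pattern, params = parse(span)
    store[pattern].append(params)

print(len(store))                     # 1 common pattern
print(store)                          # variability kept per request
```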

DOI: 10.1145/3669940.3707287


MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters

作者: Qureshi, Moinuddin and Qazi, Salman
关键词: abo, dos, dram, prac, rowhammer, security

Abstract

Rowhammer has worsened over the last decade. Existing in-DRAM solutions, such as TRR, were broken with simple patterns. In response, the DDR5 specifications have been extended to support Per-Row Activation Counting (PRAC), with counters inlined with each row, and ALERT-Back-Off (ABO) to stop the memory controller if the DRAM needs more time to mitigate. Although PRAC+ABO represent a strong advance in Rowhammer protection, they are only a framework, and the actual security depends on the implementation. In this paper, we first show that a prior work, Panopticon (which formed the basis for PRAC+ABO), is insecure, as our Jailbreak pattern can cause 1150 activations on an attack row for Panopticon configured for a threshold of 128. We then propose MOAT, a provably secure design, which uses two internal thresholds: ETH, an Eligibility Threshold for mitigating a row, and ATH, an ALERT Threshold for initiating an ABO. As JEDEC specifications permit a few activations between consecutive ALERTs, we also study how an attacker can exploit such activations to inflict more activations than ATH on an attack row and thus increase the tolerated Rowhammer threshold. Our analysis shows that MOAT configured with ATH=64 can safely tolerate a Rowhammer threshold of 99. Finally, we also study performance attacks and denial-of-service due to ALERTs. Our evaluations, with SPEC and GAP workloads, show that MOAT with ATH=64 incurs an average slowdown of 0.27% and requires 7 bytes of SRAM per bank.
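A toy model of the two-threshold idea (the thresholds and the reset policy are illustrative, not MOAT's exact state machine):

```python
# Every activation increments the row's inlined counter; rows at or above ETH
# are eligible for mitigation, and reaching ATH raises an ALERT so the memory
# controller backs off while the DRAM mitigates.
from collections import defaultdict

ETH, ATH = 32, 64

class Bank:
    def __init__(self):
        self.count = defaultdict(int)     # per-row activation counters (PRAC)

    def activate(self, row):
        self.count[row] += 1
        if self.count[row] >= ATH:
            self.alert_back_off()         # ABO: stall the memory controller

    def alert_back_off(self):
        # Mitigate (refresh neighbours of) every eligible row, then reset.
        for r in [r for r, c in self.count.items() if c >= ETH]:
            self.count[r] = 0             # stand-in for issuing a mitigation

bank = Bank()
for _ in range(64):
    bank.activate(0x1234)
print(bank.count[0x1234])                 # back to 0 after the ALERT
```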

DOI: 10.1145/3669940.3707278


MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

作者: Cao, Shiyi and Liu, Shu and Griggs, Tyler and Schafhalter, Peter and Liu, Xiaoxuan and Sheng, Ying and Gonzalez, Joseph E. and Zaharia, Matei and Stoica, Ion
关键词: batch inference, cpu offloading, moe

Abstract

Efficient deployment of large language models, particularly Mixture of Experts (MoE) models, on resource-constrained platforms presents significant challenges in terms of computational efficiency and memory utilization. The MoE architecture, renowned for its ability to increase model capacity without a proportional increase in inference cost, greatly reduces the token generation latency compared with dense models. However, the large model size makes MoE models inaccessible to individuals without high-end GPUs. In this paper, we propose a high-throughput MoE batch inference system, MoE-Lightning, that significantly outperforms past work. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization, and a performance model, HRM, based on a Hierarchical Roofline Model that we introduce to help find policies with higher throughput than existing systems. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB). When the theoretical system throughput is bounded by the GPU memory, MoE-Lightning can reach the throughput upper bound with 2-3x less CPU memory, significantly increasing resource utilization. MoE-Lightning also supports efficient batch inference for much larger MoEs (e.g., Mixtral 8x22B and DBRX) on multiple low-cost GPUs (e.g., 2–4 T4s).
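A back-of-the-envelope sketch of a hierarchical roofline in the spirit of HRM, simplified to reuse one arithmetic intensity across levels (all bandwidth and peak numbers are invented; the paper's model accounts for the different traffic at each level):

```python
# Attainable throughput is capped by peak compute and by every memory/IO
# level the working set crosses (GPU HBM, PCIe, host DRAM in this toy setup).
def roofline(flops_per_byte, peak_flops, bw_bytes_per_s):
    return min(peak_flops, flops_per_byte * bw_bytes_per_s)

def hierarchical_roofline(flops_per_byte, peak_flops, levels):
    # levels: {name: bandwidth in bytes/s}; the slowest level wins
    return min(roofline(flops_per_byte, peak_flops, bw) for bw in levels.values())

levels = {"gpu_hbm": 300e9, "pcie": 16e9, "cpu_dram": 60e9}
for intensity in (1, 8, 64):      # FLOPs per byte moved across each level
    print(intensity, hierarchical_roofline(intensity, 65e12, levels))
```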

DOI: 10.1145/3669940.3707267


MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization

作者: Li, Shuaiting and Wang, Chengxuan and Deng, Juncan and Wang, Zeyu and Ye, Zewen and Wang, Zongsheng and Shen, Haibin and Huang, Kejie
关键词: neural networks, pruning, systolic arrays, vector quantization

Abstract

Vector quantization (VQ) is a hardware-friendly DNN compression method that can reduce the storage cost and weight-loading data width of hardware accelerators. However, conventional VQ techniques lead to significant accuracy loss because the important weights are not well preserved. To tackle this problem, a novel approach called MVQ is proposed, which aims at better approximating important weights with a limited number of codewords. At the algorithm level, our approach removes the less important weights through N:M pruning and then minimizes the vector clustering error between the remaining weights and the codewords by a masked k-means algorithm. Only the distances between the unpruned weights and the codewords are computed, which are then used to update the codewords. At the architecture level, we implement vector quantization on an EWS (enhanced weight stationary) CNN accelerator and propose a sparse systolic array design to maximize the benefits brought by masked vector quantization. Our algorithm is validated on various models for image classification, object detection, and segmentation tasks. Experimental results demonstrate that MVQ not only outperforms conventional vector quantization methods at comparable compression ratios but also reduces FLOPs. Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3×
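A compact sketch of the algorithm-level idea, masked k-means after N:M pruning (the shapes, the 2:4 pattern, and the codebook size are illustrative, not MVQ's configuration):

```python
# Apply N:M pruning, then run k-means where distances and centroid updates
# only use the unpruned positions of each weight vector.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 8))          # weight vectors of dimension 8
N, M, K = 2, 4, 16                         # keep 2 of every 4 weights, 16 codewords

# N:M pruning mask: keep the N largest-magnitude entries in each group of M.
groups = np.abs(W).reshape(-1, M)
keep = np.argsort(groups, axis=1)[:, -N:]
mask = np.zeros_like(groups, dtype=bool)
np.put_along_axis(mask, keep, True, axis=1)
mask = mask.reshape(W.shape)

C = W[rng.choice(len(W), K, replace=False)].copy()     # initial codebook
for _ in range(20):
    # masked distances: pruned positions do not influence assignment
    d = ((mask[:, None, :] * (W[:, None, :] - C[None, :, :])) ** 2).sum(-1)
    assign = d.argmin(1)
    for k in range(K):                      # masked centroid update
        sel = assign == k
        if not sel.any():
            continue
        num = (mask[sel] * W[sel]).sum(0)
        den = mask[sel].sum(0)
        C[k] = np.where(den > 0, num / np.maximum(den, 1), C[k])

W_hat = mask * C[assign]                    # quantized, pruned weights
print(np.mean((mask * W - W_hat) ** 2))     # clustering error on kept weights
```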

DOI: 10.1145/3669940.3707268


Nazar: Monitoring and Adapting ML Models on Mobile Devices

作者: Hao, Wei and Wang, Zixi and Hong, Lauren and Li, Lingxiao and Karayanni, Nader and Dasbach-Prisk, AnMei and Mao, Chengzhi and Yang, Junfeng and Cidon, Asaf
关键词: drift adaptation, machine learning system design, mobile devices, root cause analysis

Abstract

ML models are increasingly run locally on mobile devices for low-latency inference and offline operation. However, it is hard for ML operators to track on-device model accuracy, which can degrade unpredictably (e.g. due to local data drift). We design Nazar, the first end-to-end system for continuously monitoring and adapting models on mobile devices without requiring feedback from users. Our key observation is that accuracy degradation is often due to a specific root cause, which may affect a large group of devices. Once Nazar detects a degradation affecting a large number of devices, it automatically pinpoints the root causes and adapts the model specifically to them. Evaluation on two computer vision datasets shows that Nazar consistently boosts accuracy compared to existing approaches by up to 19.4%.

DOI: 10.1145/3669940.3707246


Optimizing Datalog for the GPU

作者: Sun, Yihao and Shovon, Ahmedur Rahman and Gilray, Thomas and Kumar, Sidharth and Micinski, Kristopher
关键词: analytic databases, datalog, gpu

Abstract

Modern Datalog engines (e.g., LogicBlox, Soufflé, …)

DOI: 10.1145/3669940.3707274


Optimizing Quantum Circuits, Fast and Slow

作者: Xu, Amanda and Molavi, Abtin and Tannu, Swamit and Albarghouthi, Aws
关键词: quantum computing, quantum-circuit optimization, superoptimization, unitary synthesis

Abstract

Optimizing quantum circuits is critical: the number of quantum operations needs to be minimized for a successful evaluation of a circuit on a quantum processor. In this paper we unify two disparate ideas for optimizing quantum circuits: rewrite rules, which are fast, standard optimizer passes, and unitary synthesis, which is slow, requiring a search through the space of circuits. We present a clean, unifying framework for thinking of rewriting and resynthesis as abstract circuit transformations. We then present a radically simple algorithm, guoq, for optimizing quantum circuits that exploits the synergies of rewriting and resynthesis. Our extensive evaluation demonstrates the ability of guoq to strongly outperform existing optimizers on a wide range of benchmarks.

DOI: 10.1145/3669940.3707240


PartIR: Composing SPMD Partitioning Strategies for Machine Learning

作者: Alabed, Sami and Belov, Daniel and Chrzaszcz, Bart and Franco, Juliana and Grewe, Dominik and Maclaurin, Dougal and Molloy, James and Natan, Tom and Norman, Tamara and Pan, Xiaoyue and Paszke, Adam and Rink, Norman A. and Schaarschmidt, Michael and Sitdikov, Timur and Swietlik, Agnieszka and Vytiniotis, Dimitrios and Wee, Joel
关键词: distributed systems, language/compiler support for machine learning, machine learning model partitioning, parallelism, scalability, spmd, systems for machine learning, systems for tensor computing

Abstract

Training modern large neural networks (NNs) requires a combination of parallelization strategies, including data, model, or optimizer sharding. To address the growing complexity of these strategies, we introduce PartIR, a hardware-and-runtime agnostic NN partitioning system. PartIR is: 1) Expressive: It allows for the composition of multiple sharding strategies, whether user-defined or automatically derived; 2) Decoupled: the strategies are separate from the ML implementation; and 3) Predictable: It follows a set of well-defined general rules to partition the NN. PartIR utilizes a schedule-like API that incrementally rewrites the ML program intermediate representation (IR) after each strategy, allowing simulators and users to verify the strategy’s performance. PartIR has been successfully used both for training large models and across diverse model architectures, demonstrating its predictability, expressiveness, and performance.

DOI: 10.1145/3669940.3707284


PCcheck: Persistent Concurrent Checkpointing for ML

作者: Strati, Foteini and Friedman, Michal and Klimovic, Ana
关键词: computing methodologies, fault tolerant training

Abstract

Training large-scale machine learning (ML) models is expensive and time-intensive, consuming many hardware accelerators for days or weeks. As the scale of hardware deployments and training time continue to grow, the probability of failures also increases. The desire to use cheaper cloud resources, such as spot VMs, to lower costs also dramatically increases the frequency of failures. The standard approach to dealing with failures is to periodically pause training and checkpoint model parameters to persistent storage. Unfortunately, today's checkpointing mechanisms introduce high overhead when applied at high frequencies, yet frequent checkpointing is necessary to avoid long recovery times. We present a concurrent checkpointing mechanism, PCcheck, that allows frequent checkpointing with minimal overhead. Our framework supports persisting checkpoints to SSD and persistent main memory (PMEM) for both single-machine and distributed settings. PCcheck enables checkpointing as frequently as every 10 iterations for detailed monitoring and fast recovery times in case of failures, while maintaining minimal (3%) overhead on training throughput.
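A minimal sketch of the concurrent-checkpointing pattern: the training loop only pays for a quick in-memory copy while a background thread persists it (pickle, the interval, and the file paths are placeholders; PCcheck targets SSD/PMEM with its own persistence protocol):

```python
import pickle, threading
import numpy as np

def persist(snapshot, path):
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)            # slow I/O happens off the critical path
        f.flush()

params = {"w": np.zeros(1024), "b": np.zeros(16)}
writer = None
for step in range(1, 101):
    params["w"] += 0.01 * np.random.randn(1024)    # stand-in for a training step
    if step % 10 == 0:                             # checkpoint every 10 iterations
        if writer is not None:
            writer.join()                          # at most one writer in flight
        snapshot = {k: v.copy() for k, v in params.items()}   # fast, on the critical path
        writer = threading.Thread(target=persist, args=(snapshot, f"ckpt_{step}.pkl"))
        writer.start()
if writer is not None:
    writer.join()
```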

DOI: 10.1145/3669940.3707255


Performance Prediction of On-NIC Network Functions with Multi-Resource Contention and Traffic Awareness

作者: Wu, Shaofeng and Su, Qiang and Niu, Zhixiong and Xu, Hong
关键词: network function, performance prediction, resource contention, smartnic

Abstract

Network function (NF) offloading to SmartNICs has been widely used in modern data centers, offering benefits in host resource saving and programmability. Co-running NFs on the same SmartNIC can cause performance interference due to contention for onboard resources. To meet performance SLAs while ensuring efficient resource management, operators need mechanisms to predict NF performance under such contention. However, existing solutions lack SmartNIC-specific knowledge and exhibit limited traffic awareness, leading to poor accuracy for on-NIC NFs. This paper proposes Yala, a novel performance prediction system for on-NIC NFs. Yala builds upon the key observation that co-located NFs contend for multiple resources, including onboard accelerators and the memory subsystem. It also facilitates traffic awareness according to the behaviors of individual resources to maintain accuracy as the external traffic attributes vary. Evaluation using BlueField SmartNICs shows that Yala improves prediction accuracy by 78.8% and reduces SLA violations by 92.2% compared to state-of-the-art approaches, and enables new practical use cases.

DOI: 10.1145/3669940.3707232


PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption

作者: Tan, Yifan and Tan, Cheng and Mi, Zeyu and Chen, Haibo
关键词: confidential virtual machine, large language model, nvidia confidential computing

Abstract

Confidential computing on GPUs, like the NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, reaching up to a 52.8% and 88.2% throughput drop when serving OPT-30B and OPT-66B, respectively. To address this challenge, we introduce PipeLLM, a user-transparent runtime system. PipeLLM removes the overhead by overlapping the encryption and GPU computation through pipelining, an idea inspired by CPU instruction pipelining, thereby effectively concealing the latency increase caused by encryption. The primary technical challenge is that, unlike CPUs, the encryption module lacks prior knowledge of the specific data needing encryption until it is requested by the GPUs. To this end, we propose speculative pipelined encryption to predict the data requiring encryption by analyzing the serving patterns of LLMs. Further, we have developed an efficient, low-cost pipeline relinquishing approach for instances of incorrect predictions. Our experiments show that compared with vanilla systems without confidential computing (e.g., vLLM, PEFT, and FlexGen), PipeLLM incurs modest overhead (<19.6% in throughput) across various LLM sizes, from 13B to 175B. PipeLLM's source code is available at https://github.com/SJTU-IPADS/PipeLLM.

DOI: 10.1145/3669940.3707224


pulse: Accelerating Distributed Pointer-Traversals on Disaggregated Memory

作者: Tang, Yupeng and Lee, Seung-seob and Bhattacharjee, Abhishek and Khandelwal, Anurag
关键词: disaggregated memory, fpgas, near-memory processing, pointer-traversals, programmable networks

Abstract

Caches at CPU nodes in disaggregated memory architectures amortize the high data access latency over the network. However, such caches are fundamentally unable to improve performance for workloads requiring pointer traversals across linked data structures. We argue for accelerating these pointer traversals closer to disaggregated memory in a manner that preserves expressiveness for supporting various linked structures, ensures energy efficiency and performance, and supports distributed execution. We design pulse, a distributed pointer-traversal framework for rack-scale disaggregated memory to meet all the above requirements. Our evaluation of pulse shows that it enables low-latency, high-throughput, and energy-efficient execution for a wide range of pointer traversal workloads on disaggregated memory that fare poorly with caching alone.

DOI: 10.1145/3669940.3707253


QECC-Synth: A Layout Synthesizer for Quantum Error Correction Codes on Sparse Architectures

作者: Yin, Keyi and Zhang, Hezi and Fang, Xiang and Shi, Yunong and Humble, Travis S. and Li, Ang and Ding, Yufei
关键词: qec circuit synthesis, quantum computing, quantum error correction

Abstract

Quantum Error Correction (QEC) codes are essential for achieving fault-tolerant quantum computing (FTQC). However, their implementation faces significant challenges due to the disparity between the dense qubit connectivity they require and sparse hardware architectures. Current approaches often either underutilize QEC circuit features or focus on manual designs tailored to specific codes and architectures, limiting their capability and generality. In response, we introduce QECC-Synth, an automated compiler for QEC code implementation that addresses these challenges. We leverage the ancilla bridge technique, tailored to the requirements of QEC circuits, and introduce a systematic classification of its design space flexibilities. We then formalize this problem using the MaxSAT framework to optimize these flexibilities. Evaluation shows that our method significantly outperforms existing methods while demonstrating broader applicability across diverse QEC codes and hardware architectures.

DOI: 10.1145/3669940.3707236


RANGE-BLOCKS: A Synchronization Facility for Domain-Specific Architectures

作者: Kumar, Anagha Molakalmur Anil and Prasanna, Aditya and Shriraman, Arrvindh
关键词: domain-specific architectures, hierarchical data-structures, locks, synchronization

Abstract

Current domain-specific architectures (DSAs) work predominantly with static data structures and find it challenging to insert or remove data (they only support in-place updates). However, as DSAs target real-world applications, it is necessary to support mutable and dynamically resizable data structures. DSAs cannot support dynamic data structures because they lack a synchronization facility; they are forced to either use address-based atomics or batch updates on the host. Unfortunately, both approaches introduce prohibitive performance penalties and require large caches for the locks. Range-Blocks (RBlox) is a hardware synchronization facility for DSAs to support dynamic data structures. Our idea is to use key ranges to capture synchronization boundaries and tap into the inherent parallelism of the data-structure layout. We make two novel observations that enable a practical hardware implementation: i) Range locks are symbolic and can compactly represent mutexes on multiple nested objects. Thus, any operation requires fewer range locks, and a small on-chip table (2 KB) suffices compared to the large caches (256 KB) needed for address-based locks [79, 81]. ii) Ranges also explicitly represent the region of interest, and we can instantly achieve mutual exclusion (instead of relying on ordering). On a 128-tile dataflow DSA, we improve performance by 15×
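A software analogue of the range-lock idea (the table layout and API are illustrative; the paper implements this as a small on-chip structure):

```python
# A small table of held key ranges: an operation may proceed only if its
# range overlaps no currently held range, so disjoint subtrees run in parallel.
import threading

class RangeLockTable:
    def __init__(self):
        self._held = []                    # list of (lo, hi) key ranges
        self._mu = threading.Lock()

    def try_acquire(self, lo, hi):
        with self._mu:
            for a, b in self._held:
                if lo <= b and a <= hi:    # ranges overlap -> conflict
                    return False
            self._held.append((lo, hi))    # symbolic lock on every key in [lo, hi]
            return True

    def release(self, lo, hi):
        with self._mu:
            self._held.remove((lo, hi))

t = RangeLockTable()
assert t.try_acquire(100, 199)             # e.g. insert into one subtree
assert not t.try_acquire(150, 160)         # nested range conflicts
assert t.try_acquire(200, 299)             # disjoint range proceeds in parallel
t.release(100, 199)
```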

DOI: 10.1145/3669940.3707225


RASSM: Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling

作者: Jain, Anirudh and Gupta, Pulkit and Conte, Thomas M.
关键词: auto-tiling, caching, multicore, sddmm, sparse computations, sparse matrix, sparse signatures, spmm

Abstract

Single-Sparse-Matrix Kernels (SSMKs) such as SpMM, SDDMM, SpMV, and SpTS form the backbone of applications such as data analytics, graph processing, finite-element analysis, and machine learning (including GNNs and LLMs). This paper introduces Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling (RASSM), an input-dependent, adaptive 2-dimensional tiling technique for SSMKs. The adaptation leverages the concept of a residue matrix: a data structure that compactly captures the pattern of non-zeros in the sparse matrix. With residues, we show it is possible to make intelligent decisions on adaptive tile sizes, resulting in increased cache occupancy. Residues allow for optimizations across both spatial and temporal locality. RASSM improves data movement and overall performance as compared to prior techniques. For example, using spatial analysis for SpMM on commodity server CPUs, RASSM has a 1.30X speedup over MKL, 1.32X over J-Stream, 1.20X over ASpT, 1.11X over CSF-4 uniform-shape, and 1.10X over CSF-4 uniform-occupancy. RASSM with temporal analysis improves this to 1.36X (vs. MKL), 1.38X (vs. J-Stream), 1.26X (vs. ASpT), 1.17X (vs. CSF-4 uniform-shape), and 1.16X (vs. CSF-4 uniform-occupancy).
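A small sketch of how a residue-matrix-style summary can drive tile-size selection (the matrix, the bytes-per-non-zero estimate, and the cache budget are invented, not RASSM's actual policy):

```python
# Count the non-zeros falling in each candidate tile of a sparse matrix, then
# shrink the tile until the densest tile's working set fits a cache budget.
import numpy as np
import scipy.sparse as sp

A = sp.random(4096, 4096, density=0.01, format="coo", random_state=0)

def residue(A, tr, tc):
    """Non-zero count per (tr x tc) tile, derived only from the sparsity pattern."""
    tiles_c = -(-A.shape[1] // tc)
    idx = (A.row // tr) * tiles_c + (A.col // tc)
    counts = np.bincount(idx, minlength=-(-A.shape[0] // tr) * tiles_c)
    return counts.reshape(-1, tiles_c)

CACHE_BYTES = 64 * 1024
tr = tc = 1024
while residue(A, tr, tc).max() * 12 > CACHE_BYTES:   # ~12 bytes per stored non-zero
    tr //= 2
    tc //= 2
print(tr, tc, residue(A, tr, tc).max())
```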

DOI: 10.1145/3669940.3707219


ReSBM: Region-based Scale and Minimal-Level Bootstrapping Management for FHE via Min-Cut

作者: Liu, Yan and Lai, Jianxin and Li, Long and Sui, Tianxiang and Xiao, Linjie and Yuan, Peng and Zhang, Xiaojing and Zhu, Qing and Chen, Wenguang and Xue, Jingling
关键词: bootstrapping, ckks, fhe, machine learning, rns-ckks, scale management

Abstract

The RNS-CKKS scheme in Fully Homomorphic Encryption (FHE) supports crucial features for privacy-preserving machine learning, such as fixed-point arithmetic and SIMD-style vectorization. Yet, managing the escalation of ciphertext scales from homomorphic multiplications, which risks capacity overflow, along with bootstrapping, presents significant challenges. These complexities are exacerbated by the need to efficiently handle scale and bootstrapping at compile time while ensuring rapid encrypted inference. In this paper, we present ReSBM, a novel compiler technique that simultaneously optimizes scale and bootstrapping for encrypted inference under RNS-CKKS. By partitioning a program's data flow graph (DFG) into regions with a uniform multiplicative depth of one, ReSBM ensures that placements of Scale Management Operations (SMOs) and bootstraps affect only the latency of a region, not the scales and levels of its live-out ciphertexts. Our region-based approach tackles the NP-hard challenge of optimal bootstrapping placement with hierarchical strategies: (1) optimal intra-region SMO and bootstrapping placement using min-cut, (2) bootstrapping-guided rescaling region identification across a sequence of regions, culminating in tentative bootstrapping at two terminal regions, and (3) minimal-level bootstrap placement across the DFG, elevating ciphertexts only to the necessary minimal level. Validation across a variety of complex models on CPUs shows that ReSBM not only compiles these models more rapidly than a leading method but also boosts encrypted inference efficiency by an average of 12.1% when compared to another leading method. Consequently, ReSBM substantially improves the practical deployment of large models for encrypted inference, surpassing existing methods in terms of both compilation speed and inference performance.

DOI: 10.1145/3669940.3707276


Rethinking Java Performance Analysis

作者: Blackburn, Stephen M. and Cai, Zixian and Chen, Rui and Yang, Xi and Zhang, John and Zigman, John
关键词: garbage collection, java, performance analysis

Abstract

Representative workloads and principled methodologies are the foundation of performance analysis, which in turn provides the empirical grounding for much of the innovation in systems research. However, benchmarks are hard to maintain, methodologies are hard to develop, and our field moves fast. The tension between our fast-moving fields and their need to maintain their methodological foundations is a serious challenge. This paper explores that challenge through the lens of Java performance analysis. Lessons we draw extend to other languages and other fields of computer science. In this paper we: i) introduce a complete overhaul of the DaCapo benchmark suite [6], characterizing 22 new and/or refreshed workloads across 47 dimensions, using principal components analysis to demonstrate their diversity, ii) demonstrate new methodologies and how they are integrated into an easy-to-use framework, iii) use this framework to conduct an analysis of the state of the art in production Java performance, and iv) motivate the need to invest in renewed methodologies and workloads, using as an example a review of contemporary production garbage collector performance. We highlight the danger of allowing methodologies to lag innovation and respond with a suite and new methodologies that nudge forward some of our field's methodological foundations. We offer guidance on maintaining the empirical rigor we need to encourage profitable research directions and quickly identify unprofitable ones.

DOI: 10.1145/3669940.3707217


Robustness Verification for Checking Crash Consistency of Non-volatile Memory

作者: Han, Zhilei and He, Fei
关键词: crash consistency, non-volatile memory, persistent memory, program verification, robustness

Abstract

Emerging non-volatile memory (NVM) technologies provide performance competitive with DRAM and ensure data persistence in the event of system failure. However, NVM exhibits weak ordering of the stores committed to it, and therefore requires extra effort from developers to flush pending writes. To ensure the correctness of this error-prone task, it is crucial to develop a rigorous method to check the crash consistency of programs running on NVM devices. Most existing solutions are testing-based and rely on user guidance to dynamically detect such deficiencies. In this paper, we present a fully automated method to verify robustness, a newly established property for ensuring crash consistency of such programs. The method is based on the observation that reachability of a post-crash non-volatile state under a given pre-crash execution can be reduced to validity of the pre-crash execution with additional ordering constraints. Our robustness verification algorithm employs a search-based framework to explore all partial executions and states, and checks whether any non-volatile state is reachable under a certain pre-crash execution. Once a reachable non-volatile state is obtained, we further check its reachability under the memory consistency model. The algorithm is implemented in a prototype tool, PMVerify, that leverages symbolic encoding of the program and utilizes an SMT solver to efficiently explore all executions and states. The method is integrated into the DPLL(T) framework to optimize the robustness checking algorithm. Experiments on the PMDK example benchmark show that PMVerify is competitive with the state-of-the-art dynamic tool, PSan, in terms of robustness violation detection.

DOI: 10.1145/3669940.3707269


RTL Verification for Secure Speculation Using Contract Shadow Logic

作者: Tan, Qinhan and Yang, Yuheng and Bourgeat, Thomas and Malik, Sharad and Yan, Mengjia
关键词: formal verification, hardware-software contract, shadow logic, speculative execution attacks

Abstract

Modern out-of-order processors face speculative execution attacks. Despite various proposed software and hardware mitigations to prevent such attacks, new attacks keep arising from unknown vulnerabilities. Thus, a formal and rigorous evaluation of the ability of hardware designs to deal with speculative execution attacks is urgently desired. This paper proposes a formal verification technique called Contract Shadow Logic that can considerably improve RTL verification scalability with little manual effort while being applicable to different defense mechanisms. In this technique, we leverage computer architecture design insights to improve verification performance for checking security properties formulated as software-hardware contracts for secure speculation. Our verification scheme is accessible to computer architects and requires minimal formal-method expertise. We evaluate our technique on multiple RTL designs, including three out-of-order processors. The experimental results demonstrate that our technique exhibits a significant advantage in finding attacks on insecure designs and deriving complete proofs on secure designs, when compared to the baseline and two state-of-the-art verification schemes, LEAVE and UPEC.

DOI: 10.1145/3669940.3707243


Segue & ColorGuard: Optimizing SFI Performance and Scalability on Modern Architectures

作者: Narayan, Shravan and Garfinkel, Tal and Johnson, Evan and Yedidia, Zachary and Wang, Yingchen and Brown, Andrew and Vahldiek-Oberwagner, Anjo and LeMay, Michael and Huang, Wenyong and Wang, Xin and Sun, Mingqiu and Tullsen, Dean and Stefan, Deian
关键词: optimization, sandboxing, sfi, wasm

Abstract

Software-based fault isolation (SFI) enables in-process isolation through compiler instrumentation of memory accesses, and is a critical part of WebAssembly (Wasm). We present two optimizations that improve SFI performance and scalability: Segue uses x86-64 segmentation to reduce the cost of instrumentation on memory accesses, e.g., it eliminates 44.7% of Wasm's overhead on a Wasm-compatible subset of SPEC CPU 2006 and reduces the overhead of Wasm-sandboxed font rendering in Firefox by 75%; ColorGuard leverages memory tagging (e.g., MPK) to enable up to a 15×

DOI: 10.1145/3669940.3707249


Selectively Uniform Concurrency Testing

作者: Zhao, Huan and Wolff, Dylan and Mathur, Umang and Roychoudhury, Abhik
关键词: concurrency bugs detection, concurrency testing, probabilistic sampling, random testing

Abstract

Buggy behaviors in concurrent programs are notoriously elusive, as they may manifest only in a few of the exponentially many possible thread interleavings. Randomized concurrency testing techniques probabilistically sample from (instead of enumerating) the vast search space and have been shown to be both an effective and a scalable class of algorithms for the automated discovery of concurrency bugs. In this work we focus on a key desirable characteristic of black-box randomized concurrency testing algorithms: uniformity of exploration. Unfortunately, prior randomized algorithms acutely fall short on uniformity and, as a result, struggle to expose bugs that only manifest in a few, infrequent interleavings. Towards this, we show that a strategy for sampling uniformly over the interleaving space is indeed achievable with minimal additional information for broad classes of programs. Moreover, when applied to a carefully selected subset of program events, this interleaving-uniformity strategy allows for an effective exploration of program behaviors. We present an online randomized concurrency testing algorithm named Selectively Uniform Random Walk (SURW) that builds on these insights. SURW is the first of its class to achieve interleaving-uniformity for a wide class of programs, or an arbitrary subset of events thereof. This property translates to effective behavioral exploration should a subset with desirable characteristics be selected. Extensive evaluation on leading concurrency benchmarks suggests that SURW is able to expose more bugs, and significantly faster, than comparable randomized algorithms. In addition, we show that SURW is able to explore both the space of interleavings and behaviors more uniformly on real-world programs.
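The core interleaving-uniformity fact in miniature: picking the next thread with probability proportional to its remaining event count samples every interleaving equally often. SURW itself works online, on a selected subset of events, and without exact counts; the sketch below only shows the intuition:

```python
import random
from collections import Counter

def sample_interleaving(remaining):          # e.g. {"T1": 2, "T2": 1}
    counts = dict(remaining)
    order = []
    while any(counts.values()):
        threads = [t for t, n in counts.items() if n > 0]
        weights = [counts[t] for t in threads]   # proportional to remaining events
        t = random.choices(threads, weights=weights)[0]
        order.append(t)
        counts[t] -= 1
    return tuple(order)

freq = Counter(sample_interleaving({"T1": 2, "T2": 1}) for _ in range(30000))
print(freq)   # the 3 possible interleavings each show up ~10000 times
```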

DOI: 10.1145/3669940.3707214


SmoothE: Differentiable E-Graph Extraction

作者: Cai, Yaohui and Yang, Kaixin and Deng, Chenhui and Yu, Cunxi and Zhang, Zhiru
关键词: compilers, equivalence graph, machine learning for systems, programming languages

Abstract

E-graphs have gained increasing popularity in compiler optimization, program synthesis, and theorem proving tasks. They enable compact representation of many equivalent expressions and facilitate transformations via rewrite rules without phase ordering limitations. A major benefit of using e-graphs is the ability to explore a large space of equivalent expressions, allowing the extraction of an expression that best meets certain optimization objectives (or cost models). However, current e-graph extraction methods often face unfavorable scalability-quality trade-offs and only support simple linear cost functions, limiting their applicability to more realistic optimization problems. In this work, we propose SmoothE, a differentiable e-graph extraction algorithm designed to handle complex cost models and optimized for GPU acceleration. More specifically, we approach the e-graph extraction problem from a probabilistic perspective, where the original discrete optimization is relaxed to a continuous differentiable form. This formulation supports any differentiable cost function and enables efficient searching for solutions using gradient descent. We implement SmoothE in PyTorch to leverage the advancements of the modern machine learning ecosystem. Additionally, we introduce performance optimization techniques to exploit sparsity and data parallelism. We evaluate SmoothE on a variety of realistic e-graphs from five different applications using three distinct cost models, including both linear and non-linear ones. Our experiments demonstrate that SmoothE consistently achieves a favorable trade-off between scalability and solution quality.

DOI: 10.1145/3669940.3707262


SuperNoVA: Algorithm-Hardware Co-Design for Resource-Aware SLAM

作者: Kim, Seah and Hsiao, Roger and Nikolic, Borivoje and Demmel, James and Shao, Yakun Sophia
关键词: accelerator, algorithm-hardware co-design, ar/vr, resource management, robotics, slam

Abstract

Simultaneous Localization and Mapping (SLAM) plays a crucial role in robotics, autonomous systems, and augmented and virtual reality (AR/VR) applications by enabling devices to understand and map unknown environments. However, deploying SLAM in AR/VR applications poses significant challenges, including the demand for high accuracy, real-time processing, and efficient resource utilization, especially on compact and lightweight devices. To address these challenges, we propose SuperNoVA, which enables high-accuracy, real-time, large-scale SLAM in resource-constrained settings through a full-stack system, spanning from algorithm to hardware. In particular, SuperNoVA dynamically constructs a subgraph to meet the latency target while preserving accuracy, virtualizes hardware resources for efficient graph processing, and implements a novel hardware architecture to accelerate the SLAM backend efficiently. Evaluation results demonstrate that, for a large-scale AR dataset, SuperNoVA reduces full SLAM backend computation latency by 89.5% compared to the baseline out-of-order CPU and 78.6% compared to the baseline embedded GPU, and reduces the maximum pose error by 89% over existing SLAM solutions, while always meeting the latency target.

DOI: 10.1145/3669940.3707258


Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads

作者: Zhao, Wei and Jayarajan, Anand and Pekhimenko, Gennady
关键词: cloud infrastructure, deep learning, gpu sharing, performance isolation, systems for machine learning

Abstract

GPU underutilization is a significant concern in many production deep learning clusters, leading to prolonged job queues and increased operational expenses. A promising solution to this inefficiency is GPU sharing, which improves resource utilization by allowing multiple workloads to execute concurrently on a single GPU. However, deploying GPU sharing in production settings faces critical obstacles due to the limitations of existing mechanisms, including high integration costs, inadequate performance isolation, and limited application compatibility. To address these issues, we introduce Tally, a non-intrusive GPU sharing mechanism that provides robust performance isolation and comprehensive workload compatibility. The key to Tally’s robust performance isolation capability lies in its fine-grained thread-block-level GPU kernel scheduling strategy, which allows the system to effectively mitigate interference caused by workload co-execution. We evaluate Tally on a diverse range of workloads and show that it incurs an average overhead of only 7.2% on the 99th-percentile latency of high-priority inference tasks when executed concurrently with best-effort training workloads, compared to 188.9% overhead exhibited by the state-of-the-art GPU sharing systems like TGS, while achieving over 80% of TGS’s system throughput.

DOI: 10.1145/3669940.3707282


Target-Aware Implementation of Real Expressions

作者: Saiki, Brett and Brough, Jackson and Regehr, Jonas and Ponce, Jesus and Pradeep, Varun and Akhileshwaran, Aditya and Tatlock, Zachary and Panchekha, Pavel
关键词: domain-specific compilation, equality saturation, floating point, optimization

Abstract

New low-precision accelerators, vector instruction sets, and library functions make maximizing the accuracy and performance of numerical code increasingly challenging. Two lines of work—traditional compilers and numerical compilers—attack this problem from opposite directions. Traditional compiler backends optimize for specific target environments but are limited in their ability to balance performance and accuracy. Numerical compilers trade off accuracy and performance, or even improve both, but ignore the target environment. We join aspects of both to produce Chassis, a target-aware numerical compiler. Chassis compiles mathematical expressions to operators from a target description, which lists the real expressions each operator approximates and estimates its cost and accuracy. Chassis then uses an iterative improvement loop to optimize for speed and accuracy. Specifically, a new instruction selection modulo equivalence algorithm efficiently searches for faster target-specific programs, while a new cost-opportunity heuristic supports iterative improvement. We demonstrate Chassis's capabilities on 9 different targets, including hardware ISAs, math libraries, and programming languages. Chassis finds better accuracy and performance trade-offs than both Clang (by 3.5×

DOI: 10.1145/3669940.3707277


Tela: A Temporal Load-Aware Cloud Virtual Disk Placement Scheme

作者: Tan, Difan and Li, Jiawei and Wang, Hua and Li, Xiaoxiao and Liu, Wenbo and Qin, Zijin and Zhou, Ke and Xie, Ming and Tao, Mengling
关键词: cloud block storage, load balancing, resource management

Abstract

Cloud Block Storage (CBS) relies on Cloud Virtual Disks (CVDs) to provide block interfaces to Cloud Virtual Machines. The process of allocating user-subscribed CVDs to physical storage warehouses in cloud data centers, known as CVD placement, significantly impacts resource utilization, load balancing, and I/O performance. However, previous works have failed to account for temporal fluctuations in cloud loads, resulting in imbalanced loads, low resource utilization, and frequent warehouse overloads. To address these issues, we propose Tela, the first temporal load-aware CVD placement scheme. Using a series of interpretable models, Tela predicts the temporal load characteristics and values of CVDs, as well as potential peak loads in warehouses. Guided by these predictions, Tela places CVDs into warehouses according to their load patterns, aiming for peak shaving and load balancing while preventing overloads. Experimental results show that Tela significantly outperforms the state-of-the-art scheme, reducing overload occurrences by 86.8-93.8%, reducing P99 overload duration by 92.6%, and decreasing load imbalance by 36.7-44.4%.
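A greedy sketch of temporal load-aware, peak-shaving placement (the synthetic load series and capacity are invented; Tela additionally predicts loads with interpretable models and guards against overload explicitly):

```python
# Represent each CVD and each warehouse as a predicted load time series and
# put the CVD where the resulting peak load grows the least.
import numpy as np

rng = np.random.default_rng(1)
HOURS = 24
warehouses = [np.zeros(HOURS) for _ in range(3)]          # aggregated predicted load
cvds = [np.clip(rng.normal(10, 4, HOURS), 0, None) for _ in range(50)]

def place(cvd, warehouses, capacity=250.0):
    best, best_peak = None, None
    for i, w in enumerate(warehouses):
        peak = (w + cvd).max()
        if peak <= capacity and (best_peak is None or peak < best_peak):
            best, best_peak = i, peak
    return best

for cvd in cvds:
    i = place(cvd, warehouses)
    if i is None:
        raise RuntimeError("all warehouses would overload")
    warehouses[i] += cvd

print([round(w.max(), 1) for w in warehouses])            # balanced peaks
```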

DOI: 10.1145/3669940.3707252


UniZK: Accelerating Zero-Knowledge Proof with Unified Hardware and Flexible Kernel Mapping

作者: Wang, Cheng and Gao, Mingyu
关键词: domain-specific acceleration, mapping, zero-knowledge proof

Abstract

Zero-knowledge proof (ZKP) is an important cryptographic tool that sees wide applications in real-world scenarios where privacy must be protected, including privacy-preserving blockchains and zero-knowledge machine learning. Existing ZKP acceleration approaches using GPUs, FPGAs, and ASICs focus only on classic protocols that rely on expensive elliptic curve arithmetics. Emerging ZKP protocols based on hash functions can greatly reduce the algorithmic complexity, but they also introduce much more diverse computation kernels that cannot be efficiently handled by a single accelerator chip if dedicated units for each kernel are used. Our approach is to leverage a unified hardware architecture that is able to efficiently support the common primitives in ZKP, and then use smart mapping strategies to flexibly map various kernels to such hardware while ensuring high resource utilization. We design UniZK as such a ZKP accelerator, with a systolic-array-based hardware architecture enhanced with extra local links and a new vector processing mode. We propose novel mapping strategies to support diverse kernels including number theoretic transforms, hash functions, and general polynomial computations. UniZK provides 97x and 46x speedups on average compared to the CPU and GPU implementations of the same protocols, and is also 840x faster than previous ZKP accelerators using different protocols.

DOI: 10.1145/3669940.3707228


Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency

作者: Wang, Zibo and Zhang, Yijia and Wei, Fuchun and Wang, Bingqiang and Liu, Yanlin and Hu, Zhiheng and Zhang, Jingyi and Xu, Xiaoxin and He, Jian and Wang, Xiaoliang and Dou, Wanchun and Chen, Guihai and Tian, Chen
关键词: ai accelerator, fine-grained dvfs, genetic algorithm, performance model, power model

Abstract

Recent advancements in deep learning have significantly increased AI processors' energy consumption, which is becoming a critical factor limiting AI development. Dynamic Voltage and Frequency Scaling (DVFS) stands as a key method in power optimization. However, due to the latency of DVFS control in AI processors, previous works typically apply DVFS control at the granularity of a program's entire duration or sub-phases, rather than at the level of AI operators. The advent of millisecond-level DVFS capabilities on the latest Ascend NPU platforms enables us to set the frequency individually for single or multiple operators, opening up the opportunity to further enhance energy efficiency through fine-grained DVFS control. To ensure performance is unaffected by DVFS, our work builds performance and power models for each operator. Through in-depth timeline analysis, we demonstrate that the cycle count of an operator can be modeled as a convex piecewise linear function of frequency, resulting in a performance model with an average error of 1.96%. Moreover, we build power models that incorporate temperature-dependent terms, which enhances the models' precision and results in an average error of 4.62%. Based on our performance and power models as well as the fine-grained DVFS functionality of the Ascend NPU, we propose a DVFS strategy that integrates operator classification, preprocessing, and a genetic algorithm-based search. Experiments on applications including GPT-3 training achieve a 13.44% reduction in AICore (the computing component within the Ascend NPU) power and a 4.95% reduction in NPU chip power, while limiting performance degradation to 1.76%.
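A toy rendering of the per-operator performance model: cycles(f) as a convex piecewise-linear function (a max of affine pieces), runtime as cycles(f)/f, and a simple search for the lowest frequency within a slowdown budget (all coefficients, the frequency grid, and the 2% budget are invented; the paper fits these per operator and adds a temperature-aware power model on top):

```python
import numpy as np

freqs_mhz = np.arange(800, 1801, 50)

def cycles(f_mhz):
    # max of affine pieces: a memory-bound plateau and a compute-bound piece
    pieces = [0.0 * f_mhz + 9.0e6, 6.0e3 * f_mhz + 1.0e6]
    return np.maximum.reduce(pieces)

def runtime_ms(f_mhz):
    return cycles(f_mhz) / (f_mhz * 1e6) * 1e3

t_ref = runtime_ms(freqs_mhz.max())
ok = [f for f in freqs_mhz if runtime_ms(f) <= 1.02 * t_ref]   # <=2% slowdown
print(int(min(ok)), round(float(runtime_ms(min(ok))), 3), round(float(t_ref), 3))
```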

DOI: 10.1145/3669940.3707231
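
To make the modeling idea concrete, here is a small C++ sketch with invented coefficients (none taken from the paper): cycle count as a convex piecewise-linear function of frequency, i.e. a maximum of affine pieces; a power model with a temperature-dependent leakage term; and an exhaustive scan over candidate DVFS points standing in for the paper's genetic-algorithm search.

```cpp
// Sketch of the per-operator performance/power modeling idea, with
// invented coefficients (not the fitted models from the paper).
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Cycle count as a convex piecewise-linear function of frequency f (MHz):
// the maximum of a few affine pieces. A compute-bound operator is flat;
// a memory-bound one gains cycles as the core clock outruns memory.
double cycles(double f_mhz) {
    std::vector<std::pair<double, double>> pieces = {
        {0.0, 1.2e6},    // compute-bound floor: constant cycle count
        {800.0, 4.0e5},  // memory-bound piece: slope * f + intercept
    };
    double c = 0.0;
    for (auto [slope, intercept] : pieces)
        c = std::max(c, slope * f_mhz + intercept);
    return c;
}

double exec_time_ms(double f_mhz) {  // time = cycles / frequency
    return cycles(f_mhz) / (f_mhz * 1e3);
}

// Power model with a temperature-dependent leakage term
// (all coefficients are placeholders).
double power_w(double f_mhz, double temp_c) {
    const double dynamic_per_mhz = 0.12, leak0 = 20.0, leak_per_c = 0.4;
    return dynamic_per_mhz * f_mhz + leak0 + leak_per_c * (temp_c - 25.0);
}

int main() {
    const double temp = 60.0, slowdown_budget = 1.02;  // allow <= 2% slowdown
    const double base_f = 1500.0;                      // nominal frequency
    double base_t = exec_time_ms(base_f);
    double best_f = base_f, best_e = power_w(base_f, temp) * base_t;
    // Exhaustive scan over DVFS points stands in for the GA-based search.
    for (double f = 800.0; f <= 1500.0; f += 50.0) {
        double t = exec_time_ms(f), e = power_w(f, temp) * t;  // energy in mJ
        if (t <= base_t * slowdown_budget && e < best_e) {
            best_f = f; best_e = e;
        }
    }
    std::printf("chosen frequency: %.0f MHz, energy: %.3f mJ\n", best_f, best_e);
}
```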


vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

作者: Prabhu, Ramya and Nayak, Ajay and Mohan, Jayashree and Ramjee, Ramachandran and Panwar, Ashish
关键词: fragmentation, kv cache, large language models, memory management

Abstract

PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation - a phenomenon that crippled the batch size (and consequently throughput) in prior systems. However, in trying to allocate physical memory at runtime, PagedAttention ends up changing the virtual memory layout of the KV cache from contiguous to non-contiguous. Such a design leads to non-trivial programming and performance overheads. We present vAttention - an approach that mitigates fragmentation in physical memory while retaining the virtual memory contiguity of the KV cache. We achieve this by decoupling the allocation of virtual and physical memory using CUDA virtual memory management APIs. We also introduce various LLM-specific optimizations to address the limitations of CUDA virtual memory support. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention: it supports various attention kernels out-of-the-box and improves LLM serving throughput by up to 1.23×.

DOI: 10.1145/3669940.3707256
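
The mechanism the abstract describes, decoupling virtual from physical KV-cache memory, rests on the CUDA driver's virtual memory management API. The sketch below shows that API flow in isolation: reserve one contiguous virtual range, then create and map physical chunks on demand as the sequence grows. Chunk sizing, error recovery, and the paper's LLM-specific optimizations are omitted; the sizes and growth-step count are placeholders.

```cpp
// Reserve a contiguous virtual range up front, back it with physical
// memory on demand -- the CUDA VMM primitives vAttention builds on.
// Simplified sketch: single GPU, no error recovery, toy sizes.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK(x) do { CUresult r = (x); if (r != CUDA_SUCCESS) { \
    std::fprintf(stderr, "CUDA error %d at %s\n", (int)r, #x); std::exit(1); } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev; CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t granularity = 0;
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // 1. Reserve virtual address space for the whole (max-length) KV cache.
    const size_t max_bytes = 32 * granularity;   // toy upper bound: 32 chunks
    CUdeviceptr base;
    CHECK(cuMemAddressReserve(&base, max_bytes, 0, 0, 0));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    // 2. As decoding proceeds, map one physical chunk per growth step.
    std::vector<CUmemGenericAllocationHandle> chunks;
    size_t mapped = 0;
    while (mapped < 8 * granularity) {           // pretend 8 growth steps
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, granularity, &prop, 0));
        CHECK(cuMemMap(base + mapped, granularity, 0, h, 0));
        CHECK(cuMemSetAccess(base + mapped, granularity, &access, 1));
        chunks.push_back(h);
        mapped += granularity;
    }
    std::printf("KV cache: %zu bytes physically backed, virtually contiguous at %p\n",
                mapped, (void*)base);

    // 3. Teardown: unmap and release physical chunks, then the VA range.
    for (size_t i = 0; i < chunks.size(); ++i) {
        CHECK(cuMemUnmap(base + i * granularity, granularity));
        CHECK(cuMemRelease(chunks[i]));
    }
    CHECK(cuMemAddressFree(base, max_bytes));
    CHECK(cuCtxDestroy(ctx));
}
```

Because the virtual range never changes, attention kernels can keep indexing the KV cache as one flat, contiguous tensor, which is what lets vAttention support existing kernels out of the box.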


ZRAID: Leveraging Zone Random Write Area (ZRWA) for Alleviating Partial Parity Tax in ZNS RAID

作者: Kim, Minwook and Jeong, Seongyeop and Kim, Jin-Soo
关键词: partial parity, raid, zone random write area (zrwa), zoned namespaces

Abstract

The Zoned Namespace (ZNS) SSD is an innovative technology that aims to mitigate the block interface tax associated with conventional SSDs. However, constructing a RAID system using ZNS SSDs presents a significant challenge in managing partial parity for incomplete stripes. Previous research permanently logs partial parity in a limited number of reserved zones, which not only creates bottlenecks in throughput but also exacerbates write amplification, thereby reducing the device’s lifetime. We refer to these inefficiencies as the partial parity tax. In this paper, we present ZRAID, a software ZNS RAID layer that leverages the newly added Zone Random Write Area (ZRWA) feature in the ZNS Command Set, to alleviate partial parity tax. ZRWA enables in-place updates within a confined area near the write pointer. ZRAID temporarily stores partial parity within the ZRWA of data zones. Thus, partial parity writes are distributed across multiple data zones, effectively eliminating throughput bottlenecks. Furthermore, any expired partial parity in the ZRWA is overwritten by subsequent data, avoiding unnecessary flash writes. With the introduction of ZRWA, ZRAID can leverage general schedulers, overcoming the queue depth limitations of ZNS-compatible schedulers. Our evaluation with actual ZNS SSDs demonstrates a significant improvement in write throughput: up to 34.7% in the fio microbenchmark, and an average of 14.5% in db_bench on RocksDB, along with up to a 1.6x reduction in flash write amplification.

DOI: 10.1145/3669940.3707248
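
As a toy illustration of the quantity being staged: the partial parity of an incomplete RAID-5 stripe is simply the running XOR of the data chunks written so far. ZRAID's contribution is where that value lives, in the ZRWA of a data zone where it can be updated in place and later overwritten by data, not how it is computed; the C++ sketch below shows only the arithmetic.

```cpp
// Toy illustration of partial parity for an incomplete RAID-5 stripe.
// ZRAID stages this value in the ZRWA of a data zone (in-place updatable,
// later overwritten by data) instead of logging it to reserved zones.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr size_t kChunk = 8;                  // bytes per chunk (toy size)
using Chunk = std::array<uint8_t, kChunk>;

// Running XOR of the chunks written so far; once the stripe is complete,
// this equals the final parity.
void accumulate(Chunk& parity, const Chunk& data) {
    for (size_t i = 0; i < kChunk; ++i) parity[i] ^= data[i];
}

int main() {
    Chunk partial{};                          // staged in a data zone's ZRWA
    Chunk d0{{1, 2, 3, 4, 5, 6, 7, 8}};
    Chunk d1{{8, 7, 6, 5, 4, 3, 2, 1}};
    Chunk d2{{0, 1, 0, 1, 0, 1, 0, 1}};

    accumulate(partial, d0);                  // 1 of 3 data chunks written
    accumulate(partial, d1);                  // 2 of 3: partial parity is
                                              // updated in place
    accumulate(partial, d2);                  // stripe complete: this is the
                                              // final parity to persist
    for (uint8_t b : partial) std::printf("%02x ", (unsigned)b);
    std::printf("\n");                        // prints: 09 04 05 00 01 04 05 08
}
```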



目录
  • Accelerating Number Theoretic Transform with Multi-GPU Systems for Efficient Zero Knowledge Proof
  • Accelerating Retrieval-Augmented Generation
  • AnA: An Attentive Autonomous Driving System
  • AnyKey: A Key-Value SSD for All Workload Types
  • ARC: Warp-level Adaptive Atomic Reduction in GPUs to Accelerate Differentiable Rendering
  • Automatic Tracing in Task-Based Runtime Systems
  • BatchZK: A Fully Pipelined GPU-Accelerated System for Batch Generation of Zero-Knowledge Proofs
  • ByteFS: System Support for (CXL-based) Memory-Semantic Solid-State Drives
  • Cinnamon: A Framework for Scale-Out Encrypted AI
  • ClosureX: Compiler Support for Correct Persistent Fuzzing
  • Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms
  • Composing Distributed Computations Through Task and Kernel Fusion
  • Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning
  • Cooperative Graceful Degradation in Containerized Clouds
  • Copper and Wire: Bridging Expressiveness and Performance for Service Mesh Policies
  • CRUSH: A Credit-Based Approach for Functional Unit Sharing in Dynamically Scheduled HLS
  • DarwinGame: Playing Tournaments for Tuning Applications in Noisy Cloud Environments
  • Debugger Toolchain Validation via Cross-Level Debugging
  • Design and Operation of Shared Machine Learning Clusters on Campus
  • Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
  • D-VSync: Decoupled Rendering and Displaying for Smartphone Graphics
  • Early Termination for Hyperdimensional Computing Using Inferential Statistics
  • Earth+: On-Board Satellite Imagery Compression Leveraging Historical Earth Observations
  • EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation
  • Efficient Lossless Compression of Scientific Floating-Point Data on CPUs and GPUs
  • Enhancing CGRA Efficiency Through Aligned Compute and Communication Provisioning
  • Exo 2: Growing a Scheduling Language
  • Fast On-device LLM Inference with NPUs
  • Faster Chaitin-like Register Allocation via Grammatical Decompositions of Control-Flow Graphs
  • FleetIO: Managing Multi-Tenant Cloud Storage with Multi-Agent Reinforcement Learning
  • Forecasting GPU Performance for Deep Learning Training and Inference
  • Frugal: Efficient and Economic Embedding Model Training with Commodity GPUs
  • FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
  • Fusion: An Analytics Object Store Optimized for Query Pushdown
  • GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
  • HALO: Loop-aware Bootstrapping Management for Fully Homomorphic Encryption
  • Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow
  • H-Houdini: Scalable Invariant Learning
  • Instruction-Aware Cooperative TLB and Cache Replacement Policies
  • Marionette: A RowHammer Attack via Row Coupling
  • Medusa: Accelerating Serverless LLM Inference with Materialization
  • MetaSapiens: Real-Time Neural Rendering with Efficiency-Aware Pruning and Accelerated Foveated Rendering
  • Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis
  • MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters
  • MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
  • MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization
  • Nazar: Monitoring and Adapting ML Models on Mobile Devices
  • Optimizing Datalog for the GPU
  • Optimizing Quantum Circuits, Fast and Slow
  • PartIR: Composing SPMD Partitioning Strategies for Machine Learning
  • PCcheck: Persistent Concurrent Checkpointing for ML
  • Performance Prediction of On-NIC Network Functions with Multi-Resource Contention and Traffic Awareness
  • PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
  • pulse: Accelerating Distributed Pointer-Traversals on Disaggregated Memory
  • QECC-Synth: A Layout Synthesizer for Quantum Error Correction Codes on Sparse Architectures
  • RANGE-BLOCKS: A Synchronization Facility for Domain-Specific Architectures
  • RASSM: Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling
  • ReSBM: Region-based Scale and Minimal-Level Bootstrapping Management for FHE via Min-Cut
  • Rethinking Java Performance Analysis
  • Robustness Verification for Checking Crash Consistency of Non-volatile Memory
  • RTL Verification for Secure Speculation Using Contract Shadow Logic
  • Segue & ColorGuard: Optimizing SFI Performance and Scalability on Modern Architectures
  • Selectively Uniform Concurrency Testing
  • SmoothE: Differentiable E-Graph Extraction
  • SuperNoVA: Algorithm-Hardware Co-Design for Resource-Aware SLAM
  • Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads
  • Target-Aware Implementation of Real Expressions
  • Tela: A Temporal Load-Aware Cloud Virtual Disk Placement Scheme
  • UniZK: Accelerating Zero-Knowledge Proof with Unified Hardware and Flexible Kernel Mapping
  • Using Analytical Performance/Power Model and Fine-Grained DVFS to Enhance AI Accelerator Energy Efficiency
  • vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
  • ZRAID: Leveraging Zone Random Write Area (ZRWA) for Alleviating Partial Parity Tax in ZNS RAID