WiseGraph: Optimizing GNN with Joint Workload Partition of Graph and Operations
Authors: Huang, Kezhao and Zhai, Jidong and Zheng, Liyan and Wang, Haojie and Jin, Yuyang and Zhang, Qihao and Zhang, Runqing and Zheng, Zhen and Yi, Youngmin and Shen, Xipeng
Keywords: Efficient training method, Graph neural network, Parallelism
Abstract
Graph Neural Networks (GNNs) have emerged as an important workload for learning on graphs. As graph data grows larger and GNN model architectures become more complex, developing an efficient GNN system grows more important. Because GNNs perform heavy neural computation over a large graph, it is crucial to partition the entire workload into smaller parts for parallel execution and optimization. However, existing approaches partition graph data and GNN operations separately, resulting in inefficiency and large data movement overhead. To address this problem, we present WiseGraph, a GNN training framework that explores the joint optimization space of graph data partition and GNN operation partition. To bridge the gap between the two classes of partitions, we propose gTask, a workload abstraction tailored to GNNs, which not only describes existing GNN partition strategies as special cases but also exposes new optimization opportunities. Based on gTasks, WiseGraph effectively generates partition plans adaptive to the input graph data and GNN models. Evaluation on five typical GNN models shows that WiseGraph outperforms existing GNN frameworks by 2.04×.
Core Graph: Exploiting Edge Centrality to Speedup the Evaluation of Iterative Graph Queries
Authors: Jiang, Xiaolin and Afarin, Mahbod and Zhao, Zhijia and Abu-Ghazaleh, Nael and Gupta, Rajiv
Keywords: GPU, core graph, in-memory, iterative graph algorithms, out-of-core
Abstract
When evaluating an iterative graph query over a large graph, systems incur significant overheads due to repeated graph transfer across the memory hierarchy coupled with repeated (redundant) propagation of values over the edges in the graph. One approach to reducing these overheads combines a small proxy graph with the large original graph in a two-phase query evaluation. The first phase evaluates the query on the proxy graph, incurring low overheads and producing mostly precise results. The second phase uses these mostly precise results to bootstrap query evaluation on the larger original graph, producing fully precise results. The effectiveness of this approach depends on the quality of the proxy graph; prior methods find proxy graphs that are either large or produce highly imprecise results. We present a new form of proxy graph named the Core Graph (CG) that is not only small but also produces highly precise results. A CG is a subgraph of the larger input graph that contains all vertices but, on average, only 10.7% of the edges, and yet it produces precise results for 94.5-99.9% of the vertices across different queries. Finding such an effective CG is based on our key new insight: a small subset of non-zero-centrality edges determines the converged results of nearly all vertices across different queries. We develop techniques to identify a CG that produces precise results for most vertices, along with optimizations to efficiently compute precise results for the remaining vertices. Across six kinds of graph queries and four input graphs, CGs improved the performance of the GPU-based Subway system by up to 4.48×.
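To make the two-phase idea above concrete, here is a minimal sketch using single-source shortest paths as the iterative query: phase one converges on a small core subgraph, and phase two relaxes over the full edge set starting from those mostly-precise distances. The graph representation, the choice of core edges, and the function names are illustrative assumptions, not Core Graph's implementation.

```python
# Two-phase iterative query evaluation, sketched with Bellman-Ford-style SSSP.
import math

def sssp(vertices, edges, source, dist=None):
    """Relax edges until no distance changes; optionally start from prior results."""
    if dist is None:
        dist = {v: math.inf for v in vertices}
        dist[source] = 0.0
    changed, rounds = True, 0
    while changed:
        changed = False
        rounds += 1
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
    return dist, rounds

vertices = [0, 1, 2, 3, 4]
all_edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 4, 5.0), (3, 4, 1.0)]
core_edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]  # hypothetical high-centrality subset

# Phase 1: cheap evaluation on the small core graph (distances are safe upper bounds).
approx, r1 = sssp(vertices, core_edges, source=0)
# Phase 2: bootstrap the full graph from the mostly-precise phase-1 results.
exact, r2 = sssp(vertices, all_edges, source=0, dist=approx)
print(exact, r1, r2)
```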
LSGraph: A Locality-centric High-performance Streaming Graph Engine
Authors: Qi, Hao and Wu, Yiyang and He, Ligang and Zhang, Yu and Luo, Kang and Cai, Minzhi and Jin, Hai and Zhang, Zhan and Zhao, Jin
Keywords: Data locality, Graph analytics, Graph engine, Graph update, Streaming graph
Abstract
Streaming graph processing has been broadly employed across various application domains. It involves applying edge updates to the graph and then performing analytics on the updated graph. However, existing solutions either suffer from poor data locality and high computational complexity during streaming graph analytics, or incur high overhead in searching and moving graph data to keep neighbors ordered during streaming graph updates. This paper presents a novel locality-centric streaming graph engine, called LSGraph, that enables both efficient graph analytics and efficient graph updates. The main novelty of this engine is a differentiated, hierarchically indexed streaming graph representation that achieves efficient data search and movement for graph updates while simultaneously maintaining data locality and ordered neighbors for efficient graph analytics. In addition, a locality-aware streaming graph update mechanism efficiently regulates the distance of data movement, minimizing memory access overhead during graph updates. We have implemented LSGraph and conducted a systematic evaluation on both real-world and synthetic datasets. Compared with three cutting-edge streaming graph engines, i.e., Terrace, Aspen, and PaC-tree, LSGraph achieves 2.98×.
Contigra: Graph Mining with Containment Constraints
Authors: Che, Joanna and Jamshidi, Kasra and Vora, Keval
Keywords: graph mining, keyword search, maximality, motifs, nested queries, quasi-cliques, subgraph exploration
Abstract
While graph mining systems employ efficient task-parallel strategies to quickly explore subgraphs of interest (or matches), they remain oblivious to containment constraints like maximality and minimality, resulting in expensive constraint checking on every explored match as well as redundant explorations that limit their scalability. In this paper, we develop Contigra for efficient graph mining with containment constraints. We first model the impact of constraints in terms of dependencies across exploration tasks, and then exploit these dependencies to develop: (a) task fusion, which merges correlated tasks to increase cache reuse; (b) task promotion, which allows explorations to continue from available subgraphs and skip re-exploring subgraphs from scratch; (c) task cancellation, which avoids unnecessary constraint checking and prioritizes faster constraint validations; and (d) task skipping, which safely skips certain exploration and validation tasks. Experimental results show that Contigra scales to graph mining workloads with containment constraints that could not be handled by existing state-of-the-art systems.
Halflife: An Adaptive Flowlet-based Load Balancer with Fading Timeout in Data Center Networks
Authors: Liu, Sen and Gao, Yongbo and Chen, Zixuan and Ye, Jiarui and Xu, Haiyang and Liang, Furong and Yan, Wei and Tian, Zerui and Sun, Quanwei and Guo, Zehua and Xu, Yang
Keywords: No keywords
Abstract
Modern data centers (DCs) employ various traffic load balancers to achieve high bisection bandwidth. Among them, flowlet switching has shown remarkable performance in both load balancing and friendliness to upper-layer protocols (e.g., TCP). However, flowlet-based load balancers suffer from the inflexibility of the flowlet timeout value (FTV) and deliver sub-optimal performance under various application workloads. To this end, we propose Halflife, a novel flowlet-based load balancer that leverages fading FTVs to reroute traffic promptly under different workloads without any prior knowledge. Halflife not only balances traffic better, but also avoids the performance degradation caused by frequent oscillation or shifting of flows between paths. Furthermore, Halflife’s fading mechanism is not only compatible with most flowlet-based load balancers, such as CONGA and LetFlow, but also improves their performance when leveraging flowlet switching in RDMA networks. Through testbed experiments and simulations, we show that Halflife improves the performance of CONGA and LetFlow by 10% ~ 150%, and that it outperforms other load balancers by 30% ~ 200% across most application workloads.
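As a rough illustration of the flowlet idea and of what a "fading" timeout could look like, the sketch below reroutes a flow onto a new path whenever the inter-packet gap exceeds the current timeout, and shrinks the timeout as a flowlet ages so long-lived flowlets become easier to split. The fading schedule, constants, and class layout are illustrative assumptions, not Halflife's published algorithm.

```python
# Flowlet switching with a fading timeout value (illustrative sketch only).
import random

class FadingFlowlet:
    def __init__(self, paths, initial_ftv_us=500.0, min_ftv_us=50.0):
        self.paths = paths
        self.initial_ftv = initial_ftv_us
        self.min_ftv = min_ftv_us
        self.ftv = initial_ftv_us          # current flowlet timeout value
        self.last_pkt_ts = None
        self.path = random.choice(paths)

    def route(self, ts_us):
        if self.last_pkt_ts is not None:
            gap = ts_us - self.last_pkt_ts
            if gap > self.ftv:
                # Gap exceeded the (faded) timeout: start a new flowlet on a new path.
                self.path = random.choice(self.paths)
                self.ftv = self.initial_ftv
            else:
                # Flowlet continues: fade the timeout so long flowlets reroute sooner.
                self.ftv = max(self.min_ftv, self.ftv * 0.5)
        self.last_pkt_ts = ts_us
        return self.path

lb = FadingFlowlet(paths=["path-A", "path-B", "path-C"])
print([lb.route(ts) for ts in (0, 100, 150, 900, 950)])
```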
Hoda: a High-performance Open vSwitch Dataplane with Multiple Specialized Data Paths
Authors: Pan, Heng and He, Peng and Li, Zhenyu and Zhang, Pan and Wan, Junjie and Zhou, Yuhao and Duan, XiongChun and Zhang, Yu and Xie, Gaogang
Keywords: No keywords
Abstract
Open vSwitch (OvS) has been widely used in cloud networks because of its programmability and flexibility. However, we observe a huge performance drop when it runs practical cloud networking services (e.g., tunneling and firewalling). Our further analysis reveals that the root cause lies in the gap between the need to support various selections of packet header fields and the one-size-fits-all data path of vanilla OvS. Motivated by this, we design Hoda, a high-performance OvS dataplane with multiple specialized data paths. Specifically, Hoda constructs a specialized parser and microflow cache for each OpenFlow program to achieve lightweight parsing and caching. We also propose a configurable version of Hoda that introduces configuration knobs in the data path to ease specialization. Experiments with real-life OpenFlow rules show that Hoda achieves up to 1.7×.
Astraea: Towards Fair and Efficient Learning-based Congestion Control
Authors: Liao, Xudong and Tian, Han and Zeng, Chaoliang and Wan, Xinchen and Chen, Kai
Keywords: Congestion Control, Reinforcement Learning, Transport Protocol
Abstract
Recent years have witnessed a plethora of learning-based solutions for congestion control (CC) that demonstrate better performance than traditional TCP schemes. However, they fail to provide consistently good convergence properties, including fairness, fast convergence, and stability, due to the mismatch between their objective functions and these properties. Despite being intuitive, integrating these properties into existing learning-based CC is challenging because: 1) their training environments are designed to optimize the performance of a single flow and are incapable of cooperative multi-flow optimization, and 2) there is no directly measurable metric that represents these properties in the training objective function. We present Astraea, a new learning-based congestion control that ensures fast convergence to fairness with stability. At the heart of Astraea is a multi-agent deep reinforcement learning framework that explicitly optimizes these convergence properties during training by learning an interactive policy between multiple competing flows, while maintaining high performance. We further build a faithful multi-flow environment that emulates the competing behaviors of concurrent flows, explicitly expressing convergence properties so they can be optimized during training. We have fully implemented Astraea, and our comprehensive experiments show that Astraea quickly converges to the fairness point and exhibits better stability than its counterparts. For example, Astraea achieves near-optimal bandwidth sharing (i.e., fairness) when multiple flows compete for the same bottleneck, and delivers up to 8.4×.
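One way to make a convergence property directly measurable in a training objective, as the abstract argues is needed, is to fold a fairness index over concurrent flows into the per-step reward. The sketch below uses Jain's fairness index and made-up weights purely for illustration; it is not Astraea's actual objective function.

```python
# Jain's fairness index folded into a toy multi-flow reward (illustrative only).
def jain_fairness(throughputs):
    n = len(throughputs)
    s = sum(throughputs)
    sq = sum(x * x for x in throughputs)
    return (s * s) / (n * sq) if sq > 0 else 1.0

def reward(throughputs, avg_delay, w_tput=1.0, w_delay=0.5, w_fair=2.0):
    # Reward high average throughput, penalize delay, and reward fair sharing.
    return (w_tput * sum(throughputs) / len(throughputs)
            - w_delay * avg_delay
            + w_fair * jain_fairness(throughputs))

print(reward([10.0, 10.0], avg_delay=1.0))   # fair sharing scores higher
print(reward([19.0, 1.0], avg_delay=1.0))    # unfair sharing is penalized via the fairness term
```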
Unison: A Parallel-Efficient and User-Transparent Network Simulation Kernel
Authors: Bai, Songyuan and Zheng, Hao and Tian, Chen and Wang, Xiaoliang and Liu, Chang and Jin, Xin and Xiao, Fu and Xiang, Qiao and Dou, Wanchun and Chen, Guihai
Keywords: Data center networks, Network simulation, Parallel discrete-event simulation
Abstract
Discrete-event simulation (DES) is a prevalent tool for evaluating network designs. Although DES offers full fidelity and generality, its slow performance limits its application. To speed up DES, many network simulators employ parallel discrete-event simulation (PDES). However, adapting existing network simulation models to PDES requires complex reconfiguration and often yields limited performance improvement. In this paper, we address this gap by proposing Unison, a parallel-efficient and user-transparent network simulation kernel that adopts fine-grained partitioning and load-adaptive scheduling optimized for network scenarios. We prototype Unison based on ns-3. Existing ns-3 network simulation models can be seamlessly transitioned to Unison. Testbed experiments on commodity servers demonstrate that Unison can achieve a 40× speedup.
Serialization/Deserialization-free State Transfer in Serverless Workflows
Authors: Lu, Fangming and Wei, Xingda and Huang, Zhuobin and Chen, Rong and Wu, Minyu and Chen, Haibo
Keywords: No keywords
Abstract
Serialization and deserialization dominate the state transfer time of serverless workflows, leading to substantial performance penalties during workflow execution. We identify the key reason as the lack of an ability to efficiently access the (remote) memory of another function. We propose RMMap, an OS primitive for remote memory map. It allows a serverless function to directly access the memory of another function, even if that function is located remotely. RMMap is the first to completely eliminate serialization and deserialization when transferring state between any pair of functions in (unmodified) serverless workflows. To make remote memory map efficient and feasible, we co-design it with fast networking (RDMA), the OS, the language runtime, and the serverless platform. Evaluations using real-world serverless workloads show that integrating RMMap with Knative reduces serverless workflow execution time on Knative by up to 2.6×.
Occam: A Programming System for Reliable Network Management
Authors: Xing, Jiarong and Hsu, Kuo-Feng and Xia, Yiting and Cai, Yan and Li, Yanping and Zhang, Ying and Chen, Ang
Keywords: Network management, Reliability
Abstract
The complexity of large networks makes their management a daunting task. State-of-the-art network management tools use workflow systems for automation, but they do not adequately address the substantial challenges in operation reliability. This paper presents Occam, a programming system that simplifies the development of reliable network management tasks. We leverage the fact that most modern network management systems are backed by a source-of-truth database, and thus customize database techniques to the context of network management. Occam exposes an easy-to-use programming model for network operators to express the key management logic, while shielding them from reliability concerns, such as operational conflicts and task atomicity. Instead, the Occam runtime provides these reliability guardrails automatically. Our evaluation demonstrates Occam’s effectiveness in simplifying management tasks, minimizing network vulnerable time, and assisting with failure recovery.
Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation
Authors: Liu, Guodong and Miao, Youshan and Lin, Zhiqi and Shi, Xiaoxiang and Maleki, Saeed and Yang, Fan and Bao, Yungang and Wang, Sa
Keywords: automatic parallelization, deep learning, distributed system
Abstract
Many parallel mechanisms, including data parallelism, tensor parallelism, and pipeline parallelism, have been proposed and combined to support training increasingly large deep neural networks (DNNs) on massive GPU devices. Given a DNN model and a GPU cluster, finding the optimal configuration that combines these parallelism mechanisms is an NP-hard problem. Widely adopted mathematical programming approaches search a configuration subspace, but they are still too costly when scaling to large models over numerous devices. Aceso is a scalable parallel-mechanism auto-configuring system that operates iteratively. For a given parallel configuration, Aceso identifies a performance bottleneck and then, by summarizing all possible configuration adjustments with their resource consumption changes, infers their performance impact on the bottleneck and selects one that mitigates it. This process repeats for many iterations until a desired final configuration is found. Unlike mathematical programming approaches that examine a configuration subspace to find the optimal solution, Aceso searches the configuration space in a stochastic manner by repeatedly identifying and alleviating bottlenecks. Resolving one bottleneck at a time significantly reduces the configuration search cost and allows Aceso to find configurations that would usually be missed by subspace search approaches. We implemented and tested Aceso on representative DNN models. Evaluations show that it can scale to 1K-layer models. Compared to state-of-the-art systems, Aceso achieves up to 1.33×.
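The iterative "identify a bottleneck, estimate the impact of each candidate adjustment, apply the best one" loop described above can be captured in a few lines. The sketch below, with a toy cost model over (tensor-parallel, pipeline-parallel) configurations, is only meant to show the control flow; Aceso's bottleneck identification and impact inference are far more detailed.

```python
# Iterative bottleneck alleviation over parallel configurations (illustrative sketch).
def optimize(config, estimate_time, find_bottleneck, candidate_adjustments, max_iters=100):
    best, best_time = config, estimate_time(config)
    for _ in range(max_iters):
        bottleneck = find_bottleneck(best)               # e.g., slowest stage or memory pressure
        candidates = candidate_adjustments(best, bottleneck)
        if not candidates:
            break
        nxt = min(candidates, key=estimate_time)         # adjustment with the best estimated impact
        nxt_time = estimate_time(nxt)
        if nxt_time >= best_time:
            break                                        # no adjustment helps; stop
        best, best_time = nxt, nxt_time
    return best, best_time

# Toy usage: configurations are (tensor-parallel degree, pipeline stages) pairs
# scored by a made-up cost model.
def est(cfg):
    tp, pp = cfg
    return 100.0 / (tp * pp) + 2.0 * tp + 1.5 * pp       # compute cost vs. comm/bubble cost

def bottleneck(cfg):
    return "communication" if est(cfg) > 30 else "compute"

def adjustments(cfg, _b):
    tp, pp = cfg
    return [(t, p) for t, p in [(tp * 2, pp), (tp, pp * 2), (max(1, tp // 2), pp)] if t * p <= 64]

print(optimize((1, 1), est, bottleneck, adjustments))
```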
Totoro: A Scalable Federated Learning Engine for the Edge
Authors: Ching, Cheng-Wei and Chen, Xin and Kim, Taehwan and Ji, Bo and Wang, Qingyang and Da Silva, Dilma and Hu, Liting
Keywords: Distributed and parallel systems for machine learning, edge computing, federated learning
Abstract
Federated Learning (FL) is an emerging distributed machine learning (ML) technique that enables in-situ model training and inference on decentralized edge devices. We propose Totoro, a novel scalable FL engine that enables massive numbers of FL applications to run simultaneously on edge networks. The key insight is to use a distributed hash table (DHT)-based peer-to-peer (P2P) model to re-architect the centralized FL system design into a fully decentralized one. In contrast to previous studies in which many FL applications share one centralized parameter server, Totoro assigns a dedicated parameter server to each individual application. Any edge node can act as any application’s coordinator, aggregator, client selector, worker (participant device), or any combination of the above, thereby radically improving scalability and adaptivity. Totoro introduces three innovations to realize its design: a locality-aware P2P multi-ring structure, a publish/subscribe-based forest abstraction, and a bandit-based exploitation-exploration path planning model. Real-world experiments on 500 Amazon EC2 servers show that Totoro scales gracefully with the number of FL applications and edge nodes, and speeds up the total training time by 1.2×.
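The "dedicated parameter server per application, picked from a P2P ring" idea can be illustrated with plain consistent hashing: each application ID hashes onto the ring and the next node clockwise becomes its aggregator. The node names and single-ring structure below are illustrative assumptions; Totoro's locality-aware multi-ring design is considerably richer.

```python
# Consistent-hashing assignment of FL applications to edge-node aggregators (sketch).
import hashlib, bisect

class Ring:
    def __init__(self, nodes):
        self._points = sorted((self._h(n), n) for n in nodes)
        self._keys = [p for p, _ in self._points]

    @staticmethod
    def _h(s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def owner(self, app_id):
        # First ring position clockwise from the application's hash, wrapping around.
        i = bisect.bisect(self._keys, self._h(app_id)) % len(self._points)
        return self._points[i][1]

ring = Ring([f"edge-node-{i}" for i in range(8)])
for app in ["fl-app-keyboard", "fl-app-traffic", "fl-app-health"]:
    print(app, "->", ring.owner(app))   # each application gets its own dedicated aggregator
```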
FLOAT: Federated Learning Optimizations with Automated Tuning
Authors: Khan, Ahmad Faraz and Khan, Azal Ahmad and Abdelmoniem, Ahmed M. and Fountain, Samuel and Butt, Ali R. and Anwar, Ali
Keywords: Federated Learning, Machine Learning Systems, Resource Management
Abstract
Federated Learning (FL) has emerged as a powerful approach that enables collaborative distributed model training without the need for data sharing. However, FL grapples with inherent heterogeneity challenges, leading to issues such as stragglers, dropouts, and performance variation. Selecting the clients that run an FL instance is crucial, but existing strategies introduce biases and participation issues and do not consider resource efficiency. Communication and training acceleration solutions proposed to increase client participation also fall short due to the dynamic nature of system resources. We address these challenges by designing FLOAT, a novel framework that boosts FL client resource awareness. FLOAT dynamically optimizes resource utilization to meet training deadlines and mitigates stragglers and dropouts through various optimization techniques, leading to better model convergence and improved performance. FLOAT leverages multi-objective Reinforcement Learning with Human Feedback (RLHF) to automate the selection of the optimization techniques and their configurations, tailoring them to individual client resource conditions. Moreover, FLOAT integrates seamlessly into existing FL systems, remaining non-intrusive and versatile for both asynchronous and synchronous FL settings. In our evaluations, FLOAT increases accuracy by up to 53% and reduces client dropouts by up to 78×.
DeTA: Minimizing Data Leaks in Federated Learning via Decentralized and Trustworthy Aggregation
Authors: Cheng, Pau-Chen and Eykholt, Kevin and Gu, Zhongshu and Jamjoom, Hani and Jayaram, K. R. and Valdez, Enriquillo and Verma, Ashish
Keywords: Decentralized Aggregation, Federated Learning, Parameter Shuffling, Trusted Aggregation
Abstract
Federated learning (FL) relies on a central authority to oversee and aggregate model updates contributed by multiple participating parties in the training process. This centralization of sensitive model updates naturally raises concerns about the trustworthiness of the central aggregation server, as well as the potential risks associated with server failures or breaches, which could result in the loss or leakage of model updates. Moreover, recent attacks have demonstrated that, by obtaining leaked model updates, malicious actors can reconstruct substantial amounts of private data belonging to training participants. This underscores the critical necessity of rethinking the existing FL system architecture to mitigate emerging attacks in the evolving threat landscape. One straightforward approach is to fortify the central aggregator with confidential computing (CC), which offers hardware-assisted protection for runtime computation and can be remotely verified for execution integrity. However, a growing number of security vulnerabilities have surfaced in tandem with the adoption of CC, indicating that depending solely on this singular defense may not provide the requisite resilience to thwart data leaks. To address the security challenges inherent in the centralized aggregation paradigm and enhance system resilience, we introduce DeTA, an FL system architecture that employs a decentralized and trustworthy aggregation strategy with a defense-in-depth design. In DeTA, FL parties locally divide and shuffle their model updates at the parameter level, creating random partitions designated for multiple aggregators, all of which are shielded within CC execution environments. Moreover, to accommodate the multi-aggregator FL ecosystem, we have implemented a two-phase authentication protocol that enables new parties to verify all CC-protected aggregators and establish secure channels to upstream their model updates. With DeTA, model aggregation algorithms can function without any alterations. However, each aggregator is now oblivious to model architectures, possessing only a fragmented and shuffled view of each model update. This approach effectively mitigates attacks aimed at tampering with the aggregation process or exploiting leaked model updates, while also preserving training accuracy and minimizing performance overheads.
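The parameter-level "divide and shuffle" step can be pictured with a small sketch: a party permutes its flattened update with a private seed and deals the shuffled parameters across several aggregators, so each aggregator holds only a fragmented, reordered slice; the same seed inverts the mapping after aggregation. The partitioning rule and seed handling below are illustrative assumptions, not DeTA's protocol.

```python
# Parameter-level shuffling and partitioning across multiple aggregators (sketch).
import random

def shuffle_and_partition(update, num_aggregators, seed):
    """Split a flattened model update into shuffled per-aggregator partitions."""
    rng = random.Random(seed)              # seed known only to the owning party
    perm = list(range(len(update)))
    rng.shuffle(perm)                      # hide which position each parameter came from
    parts = [[] for _ in range(num_aggregators)]
    for rank, idx in enumerate(perm):
        parts[rank % num_aggregators].append(update[idx])
    return parts

def reassemble(parts, num_aggregators, seed, length):
    """Invert the shuffle given the same seed (e.g., after aggregation)."""
    rng = random.Random(seed)
    perm = list(range(length))
    rng.shuffle(perm)
    iters = [iter(p) for p in parts]
    flat = [None] * length
    for rank in range(length):
        flat[perm[rank]] = next(iters[rank % num_aggregators])
    return flat

update = [0.1, -0.2, 0.3, 0.05, -0.4, 0.7]
parts = shuffle_and_partition(update, num_aggregators=3, seed=2024)
assert reassemble(parts, 3, seed=2024, length=len(update)) == update
print(parts)   # each aggregator sees only a fragmented, reordered slice
```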
ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling
Authors: Shi, Shaohuai and Pan, Xinglin and Wang, Qiang and Liu, Chengjian and Ren, Xiaozhe and Hu, Zhongzhe and Yang, Yu and Li, Bo and Chu, Xiaowen
Keywords: Distributed Deep Learning, Large Language Model, Mixture-of-Experts, Scheduling
Abstract
In recent years, large-scale models have been scaled to trillions of parameters with sparsely activated mixture-of-experts (MoE), which significantly improves model quality while requiring only a sub-linear increase in computational cost. However, MoE layers require the input data to be dynamically routed to particular GPUs for computation during distributed training. The highly dynamic nature of data routing and the high communication costs of MoE lead to low scaling efficiency of the training system on GPU clusters. In this work, we propose ScheMoE, an extensible and efficient MoE training system with several features. 1) ScheMoE provides a generic scheduling framework that allows the communication and computation tasks of MoE training to be scheduled in an optimal way. 2) ScheMoE integrates our novel all-to-all collective, which better utilizes intra- and inter-connect bandwidths. 3) ScheMoE supports easy extension with customized all-to-all collectives and data compression approaches while still benefiting from our scheduling algorithm. Extensive experiments on a 32-GPU cluster show that ScheMoE outperforms existing state-of-the-art MoE systems, Tutel and Faster-MoE, by 9%-30%.
Dashing and Star: Byzantine Fault Tolerance with Weak Certificates
Authors: Duan, Sisi and Zhang, Haibin and Sui, Xiao and Huang, Baohan and Mu, Changchun and Di, Gang and Wang, Xiaoyun
Keywords: No keywords
Abstract
State-of-the-art Byzantine fault-tolerant (BFT) protocols assuming partial synchrony, such as SBFT and HotStuff, use regular certificates obtained from 2f + 1 (partial) signatures. We show that weak certificates, obtained from only f + 1 signatures, can assist in designing more robust and more efficient BFT protocols. We design and implement two BFT systems: Dashing (a family of two HotStuff-style BFT protocols) and Star (a parallel BFT framework). We first present Dashing1, which targets both efficiency and robustness using weak certificates. Dashing1 is also network-adaptive in the sense that it can leverage network connection discrepancy to improve performance. We show that Dashing1 outperforms HotStuff in various failure-free and failure scenarios. We then present Dashing2, which enables a one-phase fast path using strong certificates from 3f + 1 signatures. We further leverage weak certificates to build Star, a highly scalable BFT framework that delivers transactions from n - f replicas. Star compares favorably with existing protocols in terms of liveness, communication, state transfer, scalability, and robustness under failures. We demonstrate that Dashing achieves 47%-107% higher peak throughput than HotStuff in experiments on Amazon EC2. Meanwhile, unlike all known BFT protocols whose performance degrades as f grows large, the peak throughput of Star increases as f grows. When deployed in a WAN with 91 replicas across five continents, Star achieves an impressive throughput of 256 ktx/sec, 2.38× that of Narwhal.
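For readers less familiar with the certificate sizes mentioned above, the helper below spells out the quorum arithmetic for n = 3f + 1 replicas: a weak certificate of f + 1 signatures is guaranteed to contain at least one correct replica, a regular certificate needs 2f + 1, and a strong certificate (the fast path) needs 3f + 1. This is just the standard arithmetic, not code from Dashing or Star.

```python
# Certificate sizes for BFT with n = 3f + 1 replicas.
def quorums(f):
    n = 3 * f + 1
    return {
        "n": n,
        "weak_certificate": f + 1,          # contains at least one honest replica
        "regular_certificate": 2 * f + 1,   # any two such quorums intersect in an honest replica
        "strong_certificate": 3 * f + 1,    # all replicas signed (enables a one-phase fast path)
    }

for f in (1, 2, 10):
    print(quorums(f))
```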
Bandle: Asynchronous State Machine Replication Made Efficient
Authors: Wang, Bo and Liu, Shengyun and Dong, He and Wang, Xiangzhe and Xu, Wenbo and Zhang, Jingjing and Zhong, Ping and Zhang, Yiming
Keywords: State machine replication, asynchrony, consensus
Abstract
State machine replication (SMR) uses consensus as its core component for reaching agreement among a group of processes, in order to provide fault-tolerant services. Most SMR protocols, such as Paxos and Raft, are designed in the partial synchrony model. Partially synchronous protocols rely on timing assumptions to elect a special role (such as the leader), which may become the performance bottleneck under a heavy workload. From an engineering perspective, partially synchronous protocols have to wait for a pre-defined period of time and implement a (complicated) failover mechanism in order to replace the faulty leader. In contrast, asynchronous protocols are immune to such problems. This paper presents Bandle, a simple and highly efficient asynchronous SMR protocol. Instead of electing a special role, Bandle evenly assigns sequence numbers to each process and proceeds in a leaderless manner. We further propose a binary agreement protocol, referred to as FlashBA, which decides whether a given proposal can be committed. FlashBA is inspired by Ben-Or’s randomized algorithm but leverages a promise mechanism to achieve optimal latency (i.e., one message delay in the best case). An empirical study on the Amazon EC2 platform shows that Bandle delivers exceptional performance when deployed within a data center and across the globe.
Characterization and Reclamation of Frozen Garbage in Managed FaaS Workloads
Authors: Zhao, Ziming and Wu, Mingyu and Chen, Haibo and Zang, Binyu
Keywords: Function-as-a-Service, Garbage Collection, Language Runtime
Abstract
FaaS (function-as-a-service) is becoming a popular workload in cloud environments due to its virtues such as auto-scaling and pay-as-you-go. High-level languages like JavaScript and Java are commonly used in FaaS for programmability, but their managed runtimes complicate memory management in the cloud. This paper first observes the issue of frozen garbage, which arises when cached function instances are frozen: their threads are paused, but unused memory (i.e., garbage) is not reclaimed, due to the semantic gap between FaaS and the managed runtime. This paper presents the first characterization of the negative effects induced by frozen garbage across various functions, which uncovers that it can occupy more than half of FaaS instances’ memory resources on average. To this end, this paper proposes Desiccant, a freeze-aware memory manager for managed workloads in FaaS, which reclaims the idle memory consumed by frozen garbage in managed runtime instances and thus notably improves memory efficiency. The evaluation on various FaaS workloads shows that Desiccant can reduce FaaS functions’ peak memory consumption by up to 6.72×.
Pronghorn: Effective Checkpoint Orchestration for Serverless Hot-Starts
Authors: Kohli, Sumer and Kharbanda, Shreyas and Bruno, Rodrigo and Carreira, Joao and Fonseca, Pedro
Keywords: No keywords
Abstract
Serverless computing allows developers to easily deploy and scale stateless functions in ephemeral workers. As a result, serverless computing has been widely used for many applications, such as computer vision, video processing, and HTML generation. However, we find that the stateless nature of serverless computing wastes many of the important benefits modern language runtimes have to offer. A notable example is the extensive profiling and Just-in-Time (JIT) compilation effort that runtimes invest to achieve acceptable performance for popular high-level languages, such as Java, JavaScript, and Python. Unfortunately, when modern language runtimes are naively adopted in serverless computing, all of these efforts are lost upon worker eviction. Checkpoint-restore methods alleviate the problem by resuming workers from snapshots taken after initialization. However, production-grade language runtimes can take up to thousands of invocations to fully optimize a single function, rendering naive checkpoint-restore policies ineffective. This paper proposes Pronghorn, a serverless snapshot orchestrator that automatically monitors function performance and decides (1) when it is the right moment to take a snapshot and (2) which snapshot to use for new workers. Pronghorn is agnostic to the underlying platform and JIT runtime, which eases its integration into existing runtimes and worker deployment environments (containers, virtual machines, etc.). On a set of representative serverless benchmarks, Pronghorn provides end-to-end median latency improvements of 37.2% across 9 out of 13 benchmarks (20-58% latency reduction) compared to state-of-the-art checkpointing policies.
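A snapshot policy in the spirit of "wait until the JIT's gains have flattened out" can be sketched as a simple plateau test over recent invocation latencies. The window size, threshold, and function below are illustrative assumptions, not Pronghorn's actual policy.

```python
# Take a snapshot only once JIT-driven latency improvement has plateaued (sketch).
def should_snapshot(latencies_ms, window=50, improvement_threshold=0.02):
    if len(latencies_ms) < 2 * window:
        return False                              # not enough history yet
    prev = sum(latencies_ms[-2 * window:-window]) / window
    curr = sum(latencies_ms[-window:]) / window
    improvement = (prev - curr) / prev
    return improvement < improvement_threshold    # gains have flattened out

history = [900 - 8 * min(i, 100) for i in range(300)]   # latency drops while the JIT warms up
print(should_snapshot(history))                          # True: improvement has plateaued
```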
Improving Resource and Energy Efficiency for Cloud 3D through Excessive Rendering Reduction
Authors: Liu, Tianyi and Lucas, Jerry and He, Sen and Liu, Tongping and Wang, Xiaoyin and Wang, Wei
Keywords: Cloud Graphics Rendering, FPS gaps, OnDemand Rendering, Priority Frames, Resource and Energy Efficiency
Abstract
The rise of cloud gaming makes interactive 3D applications an emerging type of data center workload. However, excessive rendering in current cloud 3D systems leads to large gaps between the cloud and client frame rates (FPS, frames per second), wasting resources and power. Although FPS regulation can remove excessive rendering, due to the highly varying frame processing time and the use of rendering delays, existing cloud FPS regulation solutions suffer from low FPS and high motion-to-photon (MtP) latency, causing violations of Quality-of-Service (QoS) requirements. In this paper, we present a novel cloud FPS regulation solution, called OnDemand Rendering (ODR). ODR employs multi-buffering, dynamic rendering delay/acceleration, and input processing prioritization to reduce excessive rendering and ensure QoS satisfaction. ODR was evaluated in our private cloud and on Google Cloud. Evaluation results show that ODR effectively removed excessive rendering, improving DRAM performance by 19% and reducing power usage by 16% compared to no FPS regulation. Better memory efficiency also allowed ODR to increase client FPS by 5.5%. Moreover, ODR reduced average MtP latency by more than 92% and outperformed existing FPS regulation solutions. More importantly, ODR’s high FPS and low latency make it feasible to deploy 3D applications on conventional public clouds.
Draconis: Network-Accelerated Scheduling for Microsecond-Scale Workloads
Authors: Udayashankar, Sreeharsha and Abdel-Hadi, Ashraf and Mashtizadeh, Ali and Al-Kiswany, Samer
Keywords: No keywords
Abstract
We present Draconis, a novel scheduler for workloads in the range of tens to hundreds of microseconds. Draconis challenges the popular belief that programmable switches cannot house the complex data structures, such as queues, needed to support an in-network scheduler. Using programmable switches, Draconis achieves the low scheduling tail latency and high throughput needed to support these microsecond-scale workloads on large clusters. Furthermore, Draconis supports a wide range of complex scheduling policies, including locality-aware scheduling, priority-based scheduling, and resource-based scheduling. Draconis reduces the 99th-percentile scheduling latencies by 3×.
Snatch: Online Streaming Analytics at the Network Edge
Authors: Xiao, Yunming and Zhao, Yibo and Lin, Sen and Kuzmanovic, Aleksandar
Keywords: Network Edge, Online Streaming Analytics
Abstract
In recent years, we have witnessed a growing trend of content hyper-giants deploying server infrastructure and services close to end-users, in “eyeball” networks. Still, one of the services that has remained largely unaffected by this trend is online streaming analytics. This is despite the fact that most “big data” is received in real time and is most valuable at the time of arrival. The inability to process requests at the network edge stems from a common setting in which user profiles, necessary for analytics, are stored deep in the data center back-ends. This setting also carries privacy concerns, as such user profiles are individually identifiable, yet users are almost blind to what data is associated with their identities and how that data is analyzed. In this paper, we revise this arrangement and plant encrypted semantic cookies at the user end. Without altering any of the existing protocols, this enables capturing and analytically pre-processing user requests soon after they are generated, at edge ISPs or content providers’ off-nets. It also ensures that user anonymity is preserved during analytics. We design and implement Snatch, a QUIC-based streaming analytics prototype, and demonstrate that it speeds up user analytics by up to 200×, and by 10-30× in the common case.
Blaze: Holistic Caching for Iterative Data Processing
Authors: Song, Won Wook and Eo, Jeongyoon and Um, Taegeon and Jeon, Myeongjae and Chun, Byung-Gon
Keywords: Caching, Data Processing, Distributed Systems
Abstract
Modern data processing workloads, such as machine learning and graph processing, involve iterative computations that converge generated models to higher accuracy. An effective caching mechanism is vital to expedite iterative computations, since the intermediate data that needs to be kept in memory grows larger over iterations, often exceeding the memory capacity. However, existing systems handle intermediate data through separate operational layers (e.g., caching, eviction, and recovery), with each layer working independently in a greedy or cost-agnostic manner. These layers typically rely on user annotations and past access patterns, failing to make globally optimal decisions for the workload. To overcome these limitations, Blaze introduces a unified caching mechanism that integrates the separate operational layers. Blaze dynamically captures the workload structure and metrics using profiling and inductive regression, and automatically estimates the potential data caching efficiency associated with different operational decisions based on the profiled information. To achieve this, Blaze incorporates potential data recovery costs across stages into a single cost optimization function, which informs the optimal partition state and location. This approach reduces the significant disk I/O overheads caused by oversized partitions and the recomputation overheads for partitions with long lineages, while efficiently utilizing the constrained memory space. Our evaluations demonstrate that Blaze can accelerate end-to-end application completion time by 2.02-2.52×.
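A single cost function that weighs the two recovery paths mentioned above (recomputation along the lineage versus re-reading a spilled partition from disk) could look like the sketch below, which keeps in memory the partitions whose cheapest recovery would be most expensive. The cost terms and the greedy packing are illustrative placeholders, not Blaze's model.

```python
# Unified recovery-cost estimate driving cache/evict decisions (illustrative sketch).
def recovery_cost(p, disk_bw=500e6, stage_cost_s=2.0):
    recompute = p["lineage_len"] * stage_cost_s        # recompute along the lineage
    disk_read = p["size_bytes"] / disk_bw              # or re-read the spilled partition
    return min(recompute, disk_read)                   # cheapest way to get it back

def plan_caching(partitions, mem_budget):
    # Keep in memory the partitions that would be most expensive to recover.
    ranked = sorted(partitions, key=recovery_cost, reverse=True)
    cached, used = [], 0
    for p in ranked:
        if used + p["size_bytes"] <= mem_budget:
            cached.append(p["id"])
            used += p["size_bytes"]
    return cached

parts = [
    {"id": "rdd7", "size_bytes": 2e9, "lineage_len": 12},
    {"id": "rdd3", "size_bytes": 8e9, "lineage_len": 1},
    {"id": "rdd9", "size_bytes": 1e9, "lineage_len": 30},
]
print(plan_caching(parts, mem_budget=4e9))
```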
TTLs Matter: Efficient Cache Sizing with TTL-Aware Miss Ratio Curves and Working Set Sizes
Authors: Sultan, Sari and Shakiba, Kia and Lee, Albert and Chen, Paul and Stumm, Michael
Keywords: Cache Sizing, HyperLogLog (HLL), In-memory Caches, Key-Value Stores, Miss Ratio Curve (MRC), Time to Live (TTL), Working Set Size (WSS)
Abstract
In-memory caches play a pivotal role in optimizing distributed systems by significantly reducing query response times. Correctly sizing these caches is critical, especially considering that prominent organizations use terabytes and even petabytes of DRAM for these caches. The Miss Ratio Curve (MRC) and Working Set Size (WSS) are the most widely used tools for sizing these caches. Modern cache workloads employ Time-to-Live (TTL) limits to define the lifespan of cached objects, a feature essential for ensuring data freshness and adhering to regulations like GDPR. Surprisingly, none of the existing MRC and WSS tools accommodate TTLs. Based on 28 real-world cache workloads that contain 113 billion accesses, we show that taking TTL limits into consideration allows an average of 69% (and up to 99%) lower memory footprint for in-memory caches without a degradation in the hit rate. This paper describes how TTLs can be integrated into today’s most important MRC generation and WSS estimation algorithms. We also describe how the widely used HyperLogLog (HLL) cardinality estimator can be extended to accommodate TTLs, and show how it can be used to efficiently estimate the WSS. Our extended algorithms maintain comparable performance levels to the original algorithms. All our extended approximate algorithms are efficient, run in constant space, and enable more resource-efficient and cost-effective cache management.
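The intuition that TTLs shrink the working set (an object only needs cache space if it will be accessed again before its TTL expires) can be checked with a small exact pass over a toy trace, as sketched below. This brute-force definition of a TTL-aware WSS is a simplification for illustration; the paper's contribution is computing such quantities approximately, in constant space, for example via a TTL-aware HyperLogLog.

```python
# TTL-aware working set size over a toy trace (exact, brute-force sketch).
def wss(trace, ttl=None):
    """trace: list of (timestamp, key, size). Returns bytes worth caching."""
    next_access, last_seen = {}, {}
    for ts, key, size in trace:
        if key in last_seen:
            next_access.setdefault((key, last_seen[key]), ts)
        last_seen[key] = ts
    useful = set()
    for ts, key, size in trace:
        nxt = next_access.get((key, ts))
        # An access is worth caching only if the object is reused before its TTL expires.
        if nxt is not None and (ttl is None or nxt - ts <= ttl):
            useful.add(key)
    sizes = {key: size for _, key, size in trace}
    return sum(sizes[k] for k in useful)

trace = [(0, "a", 100), (1, "b", 100), (50, "a", 100), (400, "b", 100)]
print(wss(trace))          # both a and b are re-accessed -> 200 bytes
print(wss(trace, ttl=60))  # b's reuse (399s later) exceeds the TTL -> 100 bytes
```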
Trinity: A Fast Compressed Multi-attribute Data Store
Authors: Mao, Ziming and Srinivasan, Kiran and Khandelwal, Anurag
Keywords: No keywords
Abstract
With the proliferation of attribute-rich machine-generated data, emerging real-time monitoring, diagnosis, and visualization tools ingest and analyze such data across multiple attributes simultaneously. Due to the sheer volume of this data, applications need storage-efficient and performant data representations to analyze it efficiently. We present Trinity, a system that simultaneously facilitates query and storage efficiency over large volumes of multi-attribute records. Trinity accomplishes this through a new dynamic, succinct, multi-dimensional data structure, MdTrie. MdTrie employs a combination of a novel Morton code generalization, a multi-attribute query algorithm, and a self-indexed trie structure to achieve the above goals. Our evaluation of Trinity on real-world use cases shows that, compared to state-of-the-art systems, it supports (1) 7.2-59.6×.
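The Morton (z-order) code that MdTrie generalizes interleaves the bits of several attributes into one key, so records that are close in every attribute share key prefixes and can live in a single trie. The encoder/decoder below is a plain textbook Morton code with illustrative bit widths, not Trinity's generalized variant.

```python
# Morton (z-order) encoding of a multi-attribute record into one interleaved key.
def morton_encode(values, bits=16):
    code = 0
    for b in range(bits):                       # bit position, LSB first
        for d, v in enumerate(values):          # one attribute per dimension
            code |= ((v >> b) & 1) << (b * len(values) + d)
    return code

def morton_decode(code, dims, bits=16):
    values = [0] * dims
    for b in range(bits):
        for d in range(dims):
            values[d] |= ((code >> (b * dims + d)) & 1) << b
    return values

record = [12, 7, 300]                  # e.g., hypothetical (cpu_util, error_code, latency_ms)
key = morton_encode(record)
assert morton_decode(key, dims=3) == record
print(hex(key))
```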
FLOWS: Balanced MRC Profiling for Heterogeneous Object-Size Cache
Authors: Guo, Xiaojun and Wang, Hua and Zhou, Ke and Jiang, Hong and Han, Yaodong and Xing, Guangjie
Keywords: MRC profiling, cache optimization, object cache
Abstract
While Miss Ratio Curve (MRC) profiling methods based on spatial sampling are effective at modeling cache behavior, previous MRC studies lack in-depth analysis of profiling errors and primarily target homogeneous object-size scenarios. This causes imbalanced errors when existing MRC approaches are employed in caches with heterogeneous object sizes. For instance, on CDN traces, the error of the Byte Miss Ratio Curve (BMRC) can be two orders of magnitude larger than that of the Object Miss Ratio Curve (OMRC). In this paper, we reveal an important insight from our experimental analysis: the source of profiling inaccuracy is twofold, namely “imbalanced requests” and the heterogeneous object-size distribution. To this end, we propose FLOWS, a Filtered LOw-variance Weighted Sampling approach, which addresses the root causes of the problem by combining a Cache Filter, designed to balance sampled requests, with a Weighted Sampling technique, designed to reduce byte-level estimation error. FLOWS constructs a more accurate MRC for traces with heterogeneous content popularity and object sizes. Evaluation on real-world traces demonstrates that FLOWS reduces the error of BMRC and OMRC profiling by 16×.
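The weighted-sampling side of the idea can be illustrated with a Horvitz-Thompson style estimator: sample objects with a probability that grows with their size and re-weight every sampled access by 1/p, so byte-level statistics are no longer dominated by whether a few huge objects happened to be sampled. The probabilities, constants, and estimator below are illustrative, not FLOWS's exact scheme.

```python
# Size-aware spatial sampling with inverse-probability re-weighting (illustrative sketch).
import hashlib

def sample_prob(size, base_rate=0.01, ref_size=1024):
    # Larger objects get a higher sampling probability, capped at 1.0.
    return min(1.0, base_rate * size / ref_size)

def is_sampled(key, p):
    # Hash-based (spatial) sampling: the decision is a deterministic function of the key.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) / float(16 ** 32)
    return h < p

def estimate_total_bytes(accesses):
    """accesses: iterable of (key, size). Unbiased byte-count estimate."""
    total = 0.0
    for key, size in accesses:
        p = sample_prob(size)
        if is_sampled(key, p):
            total += size / p      # Horvitz-Thompson style re-weighting
    return total

accesses = [("obj%d" % i, 1_000_000 if i % 10 == 0 else 100) for i in range(10_000)]
print(round(estimate_total_bytes(accesses)))   # close to the true ~1.0009e9 bytes
```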
CCL-BTree: A Crash-Consistent Locality-Aware B+-Tree for Reducing XPBuffer-Induced Write Amplification in Persistent Memory
Authors: Li, Zhenxin and He, Shuibing and Dang, Zheng and Hong, Peiyi and Zhang, Xuechen and Wang, Rui and Wu, Fei
Keywords: B+-Tree, Index Structures, Persistent Memory
Abstract
In persistent B+-Trees, random updates of small key-value (KV) pairs cause severe XPBuffer-induced write amplification (XBI-amplification) because the CPU cacheline size is smaller than the media access granularity of persistent memory (PM). We observe that XBI-amplification directly determines application performance when the PM bandwidth is exhausted in multi-threaded scenarios. However, no existing work efficiently addresses the XBI-amplification issue while maintaining superior range query performance. In this paper, we design a novel crash-consistent locality-aware B+-Tree (CCL-BTree). It preserves the key order between adjacent leaf nodes for efficient range queries and proposes a leaf-node-centric buffering strategy that merges writes and then flushes them together to reduce the number of flushes to the PM media. For crash consistency, all buffered KVs are recorded in write-ahead logs. CCL-BTree further devises write-conservative logging to skip unnecessary log operations, and locality-aware garbage collection to avoid random PM writes when reclaiming log data. Our experiments show that CCL-BTree reduces XBI-amplification by up to 81% and improves insert throughput by up to 9.35×.
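The leaf-node-centric buffering idea (absorb small updates in a per-leaf buffer, persist them first in a write-ahead log, and flush to the PM-resident leaf only once a media-write-granularity's worth has accumulated) is sketched below. The 256-byte granularity, entry size, and in-memory stand-ins for PM structures are illustrative assumptions, not CCL-BTree's actual layout.

```python
# Leaf-node-centric write buffering with write-ahead logging (illustrative sketch).
XPLINE = 256          # assumed PM media access granularity in bytes
ENTRY = 16            # assumed size of one small KV pair

class BufferedLeaf:
    def __init__(self, log):
        self.pm_leaf = {}      # stands in for the sorted PM-resident leaf node
        self.buffer = {}       # DRAM-side merge buffer
        self.log = log         # stands in for the PM write-ahead log

    def insert(self, key, value):
        self.log.append((key, value))       # persisted first, for crash recovery
        self.buffer[key] = value            # absorb and merge repeated updates
        if len(self.buffer) * ENTRY >= XPLINE:
            self.flush()

    def flush(self):
        self.pm_leaf.update(self.buffer)    # one batched write to the PM leaf
        self.buffer.clear()                 # buffered entries' log space becomes reclaimable

log = []
leaf = BufferedLeaf(log)
for i in range(20):
    leaf.insert(i, i * 10)                  # 20 small inserts -> one batched flush at 16 entries
print(len(leaf.pm_leaf), len(leaf.buffer), len(log))   # 16 4 20
```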
Wormhole Filters: Caching Your Hash on Persistent Memory
Authors: Wang, Hancheng and Dai, Haipeng and Gu, Rong and Lu, Youyou and Zheng, Jiaqi and Dai, Jingsong and Chen, Shusen and Chen, Zhiyuan and Li, Shuaituan and Chen, Guihai
Keywords: Approximate Membership Query, Cuckoo Filter, Persistent Memory, Probabilistic Data Structure
Abstract
Approximate membership query (AMQ) data structures can determine, approximately but with high efficiency, whether an element is in a set. They are widely used in distributed systems, database systems, bioinformatics, IoT applications, data stream mining, and more. However, the memory consumption of AMQ data structures grows rapidly as the data scale grows, which limits the system’s ability to process massive amounts of data. Emerging persistent memory provides close-to-DRAM access speed and terabyte-level capacity, enabling AMQ data structures to handle massive data. Nevertheless, existing AMQ data structures perform poorly on persistent memory due to intensive random accesses and/or sequential writes. Therefore, we propose a novel AMQ data structure called the wormhole filter, which achieves high performance on persistent memory by reducing random accesses and sequential writes. In addition, we reduce the number of log records to lower recovery overhead. Theoretical analysis and experimental results show that wormhole filters significantly outperform competitive state-of-the-art AMQ data structures. For example, wormhole filters achieve 23.26×.
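For context, the AMQ interface the paper targets (insert an item, then answer "possibly present" with no false negatives and a small false-positive rate) can be shown with a toy fingerprint table. The single-hash bucket layout below is purely illustrative and is not the wormhole filter's PM-friendly design; cuckoo filters, for example, additionally relocate fingerprints between two candidate buckets.

```python
# Toy fingerprint-table AMQ: no false negatives, small false-positive rate (sketch only).
import hashlib

class FingerprintFilter:
    def __init__(self, num_buckets=1 << 16, slots=4, fp_bits=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.slots = slots
        self.fp_mask = (1 << fp_bits) - 1

    def _index_fp(self, item):
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        return h % len(self.buckets), ((h >> 64) & self.fp_mask) or 1  # nonzero fingerprint

    def insert(self, item):
        i, fp = self._index_fp(item)
        if len(self.buckets[i]) >= self.slots:
            return False                       # bucket full; a real filter would relocate
        self.buckets[i].append(fp)
        return True

    def contains(self, item):
        i, fp = self._index_fp(item)
        return fp in self.buckets[i]

f = FingerprintFilter()
f.insert("chr1:12345")
print(f.contains("chr1:12345"), f.contains("chr1:99999"))   # True, and (very likely) False
```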
Dordis: Efficient Federated Learning with Dropout-Resilient Differential Privacy
Authors: Jiang, Zhifeng and Wang, Wei and Chen, Ruichuan
Keywords: Client Dropout, Distributed Differential Privacy, Federated Learning, Pipeline, Secure Aggregation
Abstract
Federated learning (FL) is increasingly deployed among multiple clients to train a shared model over decentralized data. To address privacy concerns, FL systems need to safeguard clients’ data from disclosure during training and control data leakage through trained models when they are exposed to untrusted domains. Distributed differential privacy (DP) offers an appealing solution in this regard, as it achieves a balanced tradeoff between privacy and utility without a trusted server. However, existing distributed DP mechanisms are impractical in the presence of client dropout, resulting in poor privacy guarantees or degraded training accuracy. In addition, these mechanisms suffer from severe efficiency issues. We present Dordis, a distributed differentially private FL framework that is highly efficient and resilient to client dropout. Specifically, we develop a novel ‘add-then-remove’ scheme that enforces the required noise level precisely in each training round, even if some sampled clients drop out. This ensures that the privacy budget is utilized prudently, despite unpredictable client dynamics. To boost performance, Dordis operates as a distributed parallel architecture, encapsulating the communication and computation operations into stages. It automatically divides the global model aggregation into several chunk-aggregation tasks and pipelines them for optimal speedup. Large-scale deployment evaluations demonstrate that Dordis efficiently handles client dropout in various realistic FL scenarios, achieving the optimal privacy-utility tradeoff and accelerating training by up to 2.4×.
Accelerating Privacy-Preserving Machine Learning With GeniBatch
Authors: Huang, Xinyang and Zhang, Junxue and Cheng, Xiaodian and Zhang, Hong and Jin, Yilun and Hu, Shuihai and Tian, Han and Chen, Kai
Keywords: batch compiler, homomorphic encryption, privacy-preserving machine learning
Abstract
Cross-silo privacy-preserving machine learning (PPML) adopts Partial Homomorphic Encryption (PHE) for secure data combination and high-quality model training across multiple organizations (e.g., medical and financial). However, PHE introduces significant computation and communication overheads due to data inflation. Batch optimization is an encouraging direction for mitigating the problem by compressing multiple data items into a single ciphertext. While promising, it is impractical for a large number of cross-silo PPML applications due to limited support for vector operations and severe data corruption. In this paper, we present GeniBatch, a batch compiler that translates a PPML program using PHE into an efficient program with batch optimization. GeniBatch adopts a set of conversion rules to support all vector operations required in cross-silo PPML and ensures end-to-end result consistency before and after compilation. By proposing bit-reserving algorithms, GeniBatch avoids bit overflow, ensuring the correctness of compiled programs, and maximizes the compression ratio. We have integrated GeniBatch into FATE, a representative cross-silo PPML framework, and provide SIMD APIs to harness hardware acceleration. Experiments across six popular applications show that GeniBatch achieves up to 22.6×.
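The core batching idea (pack several small values into one plaintext while reserving headroom bits per slot so that homomorphic additions cannot overflow into neighboring slots) can be sketched in a few lines. The slot widths below are illustrative, and the addition is done on plaintexts only to show the slot-wise behavior; in a real PHE pipeline (e.g., with an additively homomorphic scheme such as Paillier) the addition would happen on ciphertexts.

```python
# Bit-reserving batch packing: many small values in one big integer (illustrative sketch).
VALUE_BITS = 16          # bits needed by each original value
RESERVE_BITS = 8         # headroom so up to 2**8 packed values can be summed without overflow
SLOT = VALUE_BITS + RESERVE_BITS

def pack(values):
    out = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << VALUE_BITS)
        out |= v << (i * SLOT)
    return out

def unpack(packed, n):
    mask = (1 << SLOT) - 1
    return [(packed >> (i * SLOT)) & mask for i in range(n)]

a = pack([3, 1000, 7])
b = pack([5, 24, 60000])
# Adding the packed integers acts slot-wise because the reserved bits absorb any carries.
print(unpack(a + b, 3))   # [8, 1024, 60007]
```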