Effective Performance Issue Diagnosis with Value-Assisted Cost Profiling
Authors: Weng, Lingmei and Hu, Yigong and Huang, Peng and Nieh, Jason and Yang, Junfeng
Keywords: program analysis, profilers, debugging
Abstract
Diagnosing performance issues is often difficult, especially when they occur only during some program executions. Profilers can help with performance debugging, but are ineffective when the most costly functions are not the root causes of performance issues. To address this problem, we introduce a new profiling methodology, value-assisted cost profiling, and a tool, vProf. Our insight is that capturing the values of variables can greatly help diagnose performance issues. vProf continuously records values while profiling normal and buggy program executions. It identifies anomalies in the values and the functions where they occur to pinpoint the real root causes of performance issues. Using a set of 15 real-world performance bugs in four widely used applications, we show that vProf is effective at diagnosing all of the issues while other state-of-the-art tools diagnose only a few of them. We further use vProf to diagnose longstanding performance issues in these applications that have been unresolved for over four years.
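To make the methodology concrete, the following is a minimal sketch of value-assisted anomaly detection, written here for illustration rather than taken from vProf; the function name lookup, the variable chain_len, and the z-score test are all assumptions, and the real tool piggybacks value recording on sampling-based profiling.

    # Toy value-assisted cost profiling (illustration only, not vProf's code):
    # record variable values per function in normal and buggy runs, then flag
    # values whose distribution shifts between the two.
    import statistics

    def record(samples, func, var, value):
        samples.setdefault((func, var), []).append(value)

    def anomalies(normal, buggy, threshold=3.0):
        # Flag (func, var) pairs whose buggy-run mean deviates from the
        # normal-run mean by more than `threshold` standard deviations.
        flagged = []
        for key, base in normal.items():
            if key not in buggy or len(base) < 2:
                continue
            mu, sigma = statistics.mean(base), statistics.stdev(base)
            if sigma > 0 and abs(statistics.mean(buggy[key]) - mu) > threshold * sigma:
                flagged.append(key)
        return flagged

    normal, buggy = {}, {}
    for n in [100, 102, 98, 101]:      # hash-chain length seen by lookup()
        record(normal, "lookup", "chain_len", n)
    for n in [100, 5000, 4800]:        # degenerate chain in the buggy run
        record(buggy, "lookup", "chain_len", n)
    print(anomalies(normal, buggy))    # [('lookup', 'chain_len')]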
Foxhound: Server-Grade Observability for Network-Augmented Applications
Authors: Castanheira, Lucas and Schaeffer-Filho, Alberto and Benson, Theophilus A.
Keywords: programmable networks, telemetry, debugging, tracing, INC
Abstract
There is a growing move to offload functionality, e.g., TCP or key-value stores, into programmable networks - either on SmartNICs or programmable switches. While offloading promises significant performance boosts, these programmable devices often provide little visibility into their performance. Moreover, many existing tools for analyzing and debugging performance problems, e.g., distributed tracing, do not extend into these devices. Motivated by this lack of visibility, we present the design and implementation of an observability framework called Foxhound, which introduces a co-designed query language, compiler, and storage abstraction layer for expressing, capturing and analyzing distributed traces and their performance data across an infrastructure comprising servers and programmable data planes. While general, Foxhound’s query language offers optimized constructs which can circumvent limitations of programmable devices by pushing down operations to hardware. We have evaluated Foxhound using a Tofino switch and a large-scale simulator. Our evaluations show that our storage layer can support common tracing tasks and detect associated problems at scale.
OFence: Pairing Barriers to Find Concurrency Bugs in the Linux Kernel
Authors: Lepers, Baptiste and Giet, Josselin and Zwaenepoel, Willy and Lawall, Julia
Keywords: kernel, memory barrier, static analysis
Abstract
Knowing which functions may execute concurrently is key to finding concurrency-related bugs. Existing tools infer the possibility of concurrency using dynamic analysis or by pairing functions that use the same locks. Code that relies on more relaxed concurrency controls is, by and large, out of the reach of existing concurrency-related bug-tracking tools. In this paper, we propose a new heuristic to automatically infer the possibility of concurrency in lockless code that relies on memory barriers (memory fences) for correctness, a task made complex by the fact that barriers do not have a unique identifier and do not have a clearly delimited scope. To infer the possibility of concurrency between barriers, we propose a novel heuristic based on matching objects accessed before and after barriers. Our approach is based on the observation that barriers work in pairs: if a write memory barrier orders writes to a set of objects, then there should be a read barrier that orders reads to the same set of objects. This pairing strategy allows us to infer which barriers are meant to run concurrently and, in turn, check the code surrounding the barriers for concurrency-related bugs. As an example of a type of concurrency bug, we focus on bugs related to the incorrect placement of reads or writes relative to barriers. When we detect incorrect read or write placements in the code, we automatically produce a patch to fix them. We evaluate our heuristic on the Linux kernel. Our analysis runs in 8 minutes. We fixed 12 incorrect ordering constraints that could have resulted in hard-to-debug data corruption or kernel crashes. The patches have been merged in the mainline kernel. None of the bugs could have been found using existing static analysis heuristics.
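A drastically simplified rendition of the pairing heuristic, in Python; this is a sketch of the idea, not OFence's implementation, and the site names and object sets are invented:

    # A write barrier orders writes {data before flag}; its matching read
    # barrier should order the same objects' reads in the mirrored order.
    def pair_barriers(write_sites, read_sites):
        # Each site: (name, objects accessed before, objects accessed after).
        pairs = []
        for wname, w_before, w_after in write_sites:
            for rname, r_before, r_after in read_sites:
                # Correct reader: check the flag written after the write
                # barrier first, then read the data written before it.
                if w_before == r_after and w_after == r_before:
                    pairs.append((wname, rname))
        return pairs

    writes = [("init_buf:smp_wmb", {"buf->data"}, {"buf->ready"})]
    reads = [("use_buf:smp_rmb", {"buf->ready"}, {"buf->data"}),
             ("misplaced:smp_rmb", {"buf->data"}, {"buf->ready"})]
    print(pair_barriers(writes, reads))  # only the correctly placed reader pairs

An unmatched read site, like the misplaced one above, is the kind of incorrect placement such a tool would flag and patch.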
Pocket: ML Serving from the Edge
Authors: Park, Misun and Bhardwaj, Ketan and Gavrilovska, Ada
Keywords: visual analytics, edge computing, IPC, runtime-as-a-service, isolation, resource management, containers, ML serving
Abstract
One of the major challenges in serving ML applications is the resource pressure introduced by the underlying ML frameworks. This becomes a bigger problem at resource-constrained, multi-tenant edge server locations, where it is necessary to scale to a larger number of clients with a fixed resource envelope. Naive approaches which simply minimize the resource budget allocation of each application result in performance degradation that voids the benefits expected from operating at the edge. This paper presents Pocket - a new approach for serving ML applications in settings like the edge, based on a shared ML runtime backend as a service and lightweight ML application pocket containers. Key to realizing Pocket is the use of lightweight IPC, support for cross-client isolation, and a novel resource amplification method which inlines resource reallocation with IPC. The latter ensures just-in-time assignment of the limited edge resources where they’re most needed, thereby reducing contention effects and boosting overall performance and efficiency. Experimental evaluations demonstrate that Pocket can scale to 1.3–20×…
Efficient and Safe I/O Operations for Intermittent Systems
Authors: Yildiz, Eren and Ahmed, Saad and Islam, Bashima and Hester, Josiah and Yildirim, Kasim Sinan
Keywords: peripherals, batteryless internet of things, energy harvesting, intermittent computing
Abstract
Task-based intermittent software systems always re-execute peripheral input/output (I/O) operations upon power failures since tasks have all-or-nothing semantics. Re-executed I/O wastes significant time and energy and risks memory inconsistency. This paper presents EaseIO, a new task-based intermittent system that remedies these problems. The EaseIO programming interface introduces re-execution semantics for I/O operations to facilitate safe and efficient I/O management for intermittent applications. The EaseIO compiler front-end considers the programmer-annotated I/O re-execution semantics to preserve the task’s energy efficiency and idempotency. The EaseIO runtime introduces regional privatization to eliminate memory inconsistency caused by idempotence bugs. Our evaluation shows that EaseIO reduces the wasted useful I/O work by up to 3×…
ICE: Collaborating Memory and Process Management for User Experience on Resource-limited Mobile Devices
Authors: Li, Changlong and Liang, Yu and Ausavarungnirun, Rachata and Zhu, Zongwei and Shi, Liang and Xue, Chuan Jason
Keywords: mobile device, user experience, process freezing, memory management
Abstract
Mobile devices with limited resources are prevalent because of their relatively low price. Providing a good user experience with limited resources has been a big challenge. This paper finds that foreground applications are often unexpectedly interfered with by background applications’ memory activities. Improving user experience on resource-limited mobile devices calls for a strong collaboration between memory and process management. This paper proposes a framework, Ice, to optimize the user experience on resource-limited mobile devices. With Ice, processes that will cause frequent refaults in the background are identified and frozen accordingly. The frozen application will be thawed when memory conditions allow. Evaluation on resource-limited mobile devices demonstrates that the user experience is effectively improved with Ice. Specifically, Ice boosts the frame rate by 1.57x on average over the state-of-the-art.
Diagnosing Kernel Concurrency Failures with AITIA
Authors: Jeong, Dae R. and Jung, Minkyu and Lee, Yoochan and Lee, Byoungyoung and Shin, Insik and Kwon, Youngjin
Keywords: debugging, concurrency bug, operating system, failure diagnosis
Abstract
Kernel concurrency failures are notoriously difficult to identify, and it is even harder to diagnose their fundamental reason, the root cause. Kernel concurrency bugs frequently involve challenging patterns such as multi-variable races, data races with asynchronous kernel threads, and pervasive benign races. We perform an in-depth study of real-world kernel concurrency bugs and elicit three requirements: comprehensiveness, pattern-agnosticism, and conciseness. To fulfill the requirements, this paper defines the root cause as a chained sequence of data races, called a causality chain. A causality chain is presented as a comprehensive form to explain how a failure eventually happens in the presence of multi-variable races rather than simply pointing out a few instructions related to the root cause. To build a causality chain, this work proposes two practical approaches: Least Interleaving First Search to reproduce a concurrency failure, and Causality Analysis to identify the root cause. Causality Analysis runs the kernel to confirm which data races contribute to the failure among all detected data races. The approach is pattern-agnostic because it dynamically tests data races without counting on pre-defined patterns. While testing data races, Causality Analysis rules out failure-irrelevant data races such as benign races, producing a concise causality chain. Aitia is a system implementing the two approaches. By evaluating Aitia with 22 real-world concurrency failures, we show that Aitia can successfully build their causality chains. With Aitia, we found the root causes of six unfixed bugs; three of these bugs were concurrently fixed, and the root causes of the other three were confirmed by kernel developers.
WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection
Authors: Stoica, Bogdan Alexandru and Lu, Shan and Musuvathi, Madanlal and Nath, Suman
Keywords: reliability, debugging, delay injection, concurrency bugs, order violations, memory ordering bugs
Abstract
Concurrency bugs are difficult to detect, reproduce, and diagnose, as they manifest under rare timing conditions. Recently, active delay injection has proven efficient for exposing one such type of bug — thread-safety violations — with low overhead, high coverage, and minimal code analysis. However, how to efficiently apply active delay injection to broader classes of concurrency bugs is still an open question. We aim to answer this question by focusing on MemOrder bugs — a type of concurrency bug caused by incorrect timing between a memory access to a particular object and the object’s initialization or deallocation. We first show experimentally that the current state-of-the-art delay injection technique leads to high overhead and low detection coverage since MemOrder bugs exhibit particular characteristics that cause high delay density and interference. Based on these insights, we propose Waffle — a delay injection tool that tailors key design points to better match the nature of MemOrder bugs. Evaluating our tool on 11 popular open-source multi-threaded C# applications shows that Waffle can expose more bugs with less overhead than state-of-the-art techniques.
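The essence of active delay injection fits in a few lines. The sketch below is a conceptual toy (in Python rather than the C# targets of the paper, with invented names), not Waffle itself:

    # Stall a thread right before a risky access to widen the window in
    # which a racing deallocation or late initialization can land.
    import random, threading, time

    INJECT_PROB = 0.5   # real tools inject sparsely; raised here for the demo
    DELAY_S = 0.01

    def maybe_delay():
        # Probabilistic delay at an instrumented access site.
        if random.random() < INJECT_PROB:
            time.sleep(DELAY_S)

    shared = {"obj": [1, 2, 3]}

    def reader():
        maybe_delay()
        obj = shared["obj"]              # access racing with deallocation
        print("ok" if obj is not None else "exposed: access after deallocation")

    def writer():
        shared["obj"] = None             # simulates premature deallocation

    t1 = threading.Thread(target=reader); t2 = threading.Thread(target=writer)
    t1.start(); t2.start(); t1.join(); t2.join()

The tool's contribution lies in where and how often to inject: placing delays densely around every access, as prior techniques do, is exactly what the abstract identifies as the source of overhead and interference for MemOrder bugs.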
Model Checking Guided Testing for Distributed Systems
Authors: Wang, Dong and Dou, Wensheng and Gao, Yu and Wu, Chenao and Wei, Jun and Huang, Tao
Keywords: testing, model checking, distributed system
Abstract
Distributed systems have become the backbone of cloud computing. Incorrect system designs and implementations can greatly impair the reliability of distributed systems. Although a distributed system design modelled in a formal specification can be verified by formal model checking, it is still challenging to figure out whether its corresponding implementation conforms to the verified specification. An incorrect system implementation can violate its verified specification and cause intricate bugs. In this paper, we propose a novel distributed system testing technique, Model checking guided testing (Mocket), to fill the gap between the specification and its implementation in a distributed system. Specifically, we use the state space generated by formal model checking to guide the testing of the system implementation, and unearth bugs in the target distributed system. To evaluate the feasibility and effectiveness of Mocket, we apply Mocket to three popular distributed systems, and find 3 previously unknown bugs in them.
MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks
Authors: Waleffe, Roger and Mohoney, Jason and Rekatsinas, Theodoros and Venkataraman, Shivaram
Keywords: multi-hop sampling, GNN training, GNNs
Abstract
We study training of Graph Neural Networks (GNNs) for large-scale graphs. We revisit the premise of using distributed training for billion-scale graphs and show that for graphs that fit in main memory or the SSD of a single machine, out-of-core pipelined training with a single GPU can outperform state-of-the-art (SoTA) multi-GPU solutions. We introduce MariusGNN, the first system that utilizes the entire storage hierarchy—including disk—for GNN training. MariusGNN introduces a series of data organization and algorithmic contributions that 1) minimize the end-to-end time required for training and 2) ensure that models learned with disk-based training exhibit accuracy similar to those fully trained in memory. We evaluate MariusGNN against SoTA systems for learning GNN models and find that single-GPU training in MariusGNN achieves the same level of accuracy up to 8×…
Accelerating Graph Mining Systems with Subgraph Morphing
Authors: Jamshidi, Kasra and Xu, Harry and Vora, Keval
Keywords: graph system performance, frequent subgraph mining, motifs, subgraph exploration
Abstract
Graph mining applications analyze the structural properties of large graphs. These applications are computationally expensive because finding structural patterns requires checking subgraph isomorphism, which is NP-complete. This paper exploits the sub-structural similarities across different patterns by employing Subgraph Morphing to accurately infer the results for a given set of patterns from the results of a completely different set of patterns that are less expensive to compute. To enable Subgraph Morphing in practice, we develop efficient query transformation techniques as well as automatic result conversion strategies for different application scenarios. We have implemented Subgraph Morphing in four state-of-the-art graph mining and subgraph matching systems: Peregrine, AutoMine/GraphZero, GraphPi, and BigJoin; a thorough evaluation demonstrates that Subgraph Morphing improves the performance of these four systems by 34×…
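As a toy analogue of inferring one pattern's result from cheaper patterns, the sketch below (an illustration, not the paper's algorithm) obtains the count of induced open wedges from a degree-based wedge total and the triangle count, without ever matching the open-wedge pattern directly:

    # Non-induced wedges come from degrees alone; triangles come from
    # neighbor intersections; the induced open-wedge count is then derived.
    edges = {(0, 1), (1, 2), (0, 2), (2, 3)}
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Non-induced wedges: sum over v of C(deg(v), 2), no isomorphism test.
    wedges = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())

    # Triangles: each is seen once per edge, hence the division by 3.
    tris = sum(1 for u, v in edges for w in adj[u] & adj[v]) // 3

    # Every triangle closes 3 wedges; subtracting yields the induced
    # open-wedge count as a pure arithmetic conversion of other results.
    open_wedges = wedges - 3 * tris
    print(wedges, tris, open_wedges)   # 5 1 2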
TEA: A General-Purpose Temporal Graph Random Walk Engine
Authors: Huan, Chengying and Song, Shuaiwen Leon and Pandey, Santosh and Liu, Hang and Liu, Yongchao and Lepers, Baptiste and He, Changhua and Chen, Kang and Jiang, Jinlei and Wu, Yongwei
Keywords: temporal graph, graph algorithm, random walk
Abstract
Many real-world graphs are temporal in nature, where the temporal information indicates when a particular edge is changed (e.g., edge insertion and deletion). Performing random walks on such temporal graphs is of paramount value. The state-of-the-art sampling strategies are tailored for conventional static graphs and thus cannot effectively tackle the dynamic nature of temporal graphs due to several significant efficiency challenges, i.e., high sampling complexity, gigantic index space, and poor programmability. In this paper, we present TEA, the first highly efficient general-purpose TEmporal grAph random walk engine. At its core, TEA introduces a new hybrid sampling approach that combines two Monte Carlo sampling methods to drastically reduce space complexity and achieve high sampling speed. TEA further employs a series of algorithmic and system-level optimizations to remarkably improve the sampling efficiency, as well as provide streaming graph support. Finally, we introduce a temporal-centric programming model to ease the implementation of various random walk algorithms on temporal graphs. Experimental results demonstrate that TEA can achieve up to 3 orders of magnitude speedups over the state-of-the-art random walk engines on large temporal graphs.
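For flavor, here is one temporal random-walk step using rejection sampling, one of the Monte Carlo ingredients such a hybrid can combine; the exponential time-decay weight and all names here are assumptions for illustration, not TEA's exact algorithm:

    # Rejection sampling needs no per-vertex index: propose a temporal
    # neighbor uniformly, then accept with a time-dependent probability.
    import math, random

    def temporal_step(neighbors, t_now):
        # neighbors: [(vertex, edge_time)]; only edges after t_now are
        # valid, and more recent edges are preferred via an exponential weight.
        valid = [(v, t) for v, t in neighbors if t > t_now]
        if not valid:
            return None
        t_min = min(t for _, t in valid)
        while True:
            v, t = random.choice(valid)        # O(1)-space uniform proposal
            if random.random() < math.exp(-(t - t_min)):
                return v, t                    # accepted: take this hop

    random.seed(7)
    nbrs = [(1, 5.0), (2, 5.5), (3, 9.0), (4, 2.0)]
    print(temporal_step(nbrs, t_now=4.0))      # (4, 2.0) is filtered out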
ALT: Breaking the Wall between Data Layout and Loop Optimizations for Deep Learning Compilation
Authors: Xu, Zhiying and Xu, Jiafan and Peng, Hongding and Wang, Wei and Wang, Xiaoliang and Wan, Haoran and Dai, Haipeng and Xu, Yixu and Cheng, Hao and Wang, Kun and Chen, Guihai
Keywords: deep learning systems, code generation and synthesis, compiler techniques and optimizations
Abstract
Deep learning models rely on highly optimized tensor libraries for efficient inference on heterogeneous hardware. Current deep compilers typically predetermine layouts of tensors and then optimize loops of operators. However, such a unidirectional and one-off workflow strictly separates graph-level optimization and operator-level optimization into different system layers, missing opportunities for unified tuning. This paper proposes ALT, a deep compiler that performs joint graph-level layout optimization and operator-level loop optimization. ALT provides a generic transformation module to manipulate layouts and loops with easy-to-use primitive functions. ALT further integrates an auto-tuning module that jointly optimizes graph-level data layouts and operator-level loops while guaranteeing efficiency. Experimental results show that ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single operator performance (e.g., 1.5×…
REFL: Resource-Efficient Federated Learning
Authors: Abdelmoniem, Ahmed M. and Sahu, Atal Narayan and Canini, Marco and Fahmy, Suhaib A.
Keywords: none
Abstract
Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve the fairness of the selection process; however, this can result in inefficient use of resources and lower quality training. In this work, we systematically address the question of resource efficiency in FL, showing the benefits of intelligent participant selection, and incorporation of updates from straggling participants. We demonstrate how these factors enable resource efficiency while also improving trained model quality.
Tabi: An Efficient Multi-Level Inference System for Large Language Models
Authors: Wang, Yiding and Chen, Kai and Tan, Haisheng and Guo, Kun
Keywords: attention-based transformer, machine learning inference
Abstract
Today’s trend of building ever larger language models (LLMs), while pushing the performance of natural language processing, adds significant latency to the inference stage. We observe that due to the diminishing returns of adding parameters to LLMs, a smaller model could make the same prediction as a costly LLM for a majority of queries. Based on this observation, we design Tabi, an inference system with a multi-level inference engine that serves queries using small models and optional LLMs for demanding applications. Tabi is optimized for discriminative models (i.e., not generative LLMs) in a serving framework. Tabi uses the calibrated confidence score to decide whether to return the accurate results of small models extremely fast or re-route them to LLMs. For re-routed queries, it uses attention-based word pruning and weighted ensemble techniques to offset the system overhead and accuracy loss. We implement and evaluate Tabi with multiple tasks and models. Our results show that Tabi achieves 21%-40% average latency reduction (with comparable tail latency) over the state-of-the-art while meeting LLM-grade high accuracy targets.
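The routing core can be pictured with a short sketch; the models, temperature, and threshold below are placeholder assumptions, not Tabi's components:

    # Serve with the small model when its calibrated confidence clears a
    # threshold; otherwise re-route the query to the large model.
    import math

    TEMPERATURE = 1.5     # calibration temperature (assumed, tuned offline)
    THRESHOLD = 0.9

    def softmax(logits, temp):
        exps = [math.exp(l / temp) for l in logits]
        s = sum(exps)
        return [e / s for e in exps]

    def serve(query, small_model, large_model):
        probs = softmax(small_model(query), TEMPERATURE)
        conf, label = max(zip(probs, range(len(probs))))
        if conf >= THRESHOLD:
            return label, "small"             # fast path, no LLM involved
        return large_model(query), "large"    # re-routed (the real system
                                              # also prunes low-attention words)

    small = lambda q: [4.0, 0.5] if "easy" in q else [1.1, 1.0]
    large = lambda q: 1
    print(serve("easy case", small, large))   # served by the small model
    print(serve("hard case", small, large))   # re-routed to the LLM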
Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access
Authors: Jeong, Jinwoo and Baek, Seungsu and Ahn, Jeongseob
Keywords: parallel-transmission, direct-host-access, DNN model serving
Abstract
As deep learning (DL) inference has been widely adopted for building user-facing applications in many domains, it is increasingly important for DL inference servers to achieve high throughput while preserving bounded latency. DL inference requests can be immediately served if the corresponding model is already in the GPU memory. Otherwise, the server needs to load the model from host to GPU, adding a significant delay to inference. This paper proposes DeepPlan to minimize inference latency while provisioning DL models from host to GPU in server environments. First, we take advantage of the direct-host-access facility provided by commodity GPUs, allowing particular layers of models to be accessed in host memory directly from the GPU without loading. Second, we parallelize model transmission across multiple GPUs to reduce the time for loading models from host to GPU. We show that a single inference can achieve a 1.94×…
DiLOS: Do Not Trade Compatibility for Performance in Memory Disaggregation
Authors: Yoon, Wonsup and Ok, Jisu and Oh, Jinyoung and Moon, Sue and Kwon, Youngjin
Keywords: unikernel, disaggregated data center, memory disaggregation
Abstract
Memory disaggregation has reshaped the landscape of datacenters by physically separating compute and memory nodes, achieving improved utilization. As early efforts, kernel paging-based approaches offer a transparent virtual memory abstraction for remote memory with paging schemes but suffer from expensive page fault handling. This paper revisits the paging-based approaches and re-examines the performance of their paging schemes. We posit that the overhead of the paging-based approaches is not a fundamental limitation. We propose DiLOS, a new library operating system (LibOS) specialized for paging-based memory disaggregation. We have revamped the page fault handler to do away with the swap cache and incorporated known techniques in our prefetcher, page manager, and communication module for performance optimization. Furthermore, we provide APIs to augment the LibOS with application semantics. We present two app-aware guides in DiLOS: app-aware prefetching and a bandwidth-reducing memory allocator. Through extensive evaluation of microbenchmarks and applications, we demonstrate that DiLOS outperforms the state-of-the-art kernel paging-based system (Fastswap) by up to 2.24×…
vTMM: Tiered Memory Management for Virtual Machines
Authors: Sha, Sai and Li, Chuandong and Luo, Yingwei and Wang, Xiaolin and Wang, Zhenlin
Keywords: PML, hot set, virtual machine, tiered memory
Abstract
The memory demand of virtual machines (VMs) is increasing, while the traditional DRAM-only memory system has limited capacity and high power consumption. A tiered memory system can effectively expand memory capacity and increase cost efficiency. Virtualization introduces new challenges for memory tiering, specifically enforcing performance isolation, minimizing context switching, and providing resource overcommit. However, none of the state-of-the-art designs consider virtualization, and thus none address these challenges; we observe that a VM with tiered memory incurs up to a 2×…
Making Dynamic Page Coalescing Effective on Virtualized Clouds
Authors: Jia, Weiwei and Zhang, Jiyuan and Shan, Jianchen and Ding, Xiaoning
Keywords: operating systems, virtualization, memory management, cloud computing
Abstract
Using huge pages has become a mainstream method to reduce address translation overhead for big memory workloads in modern computer systems. To create huge pages, system software usually uses page coalescing methods to dynamically combine contiguous base pages. Though page coalescing methods help effectively reduce address translation overhead on native systems, as the paper shows, their effectiveness is substantially undermined on virtualized platforms. The paper identifies this problem and analyzes the causes. It reveals and experimentally confirms that only huge guest pages backed by huge host pages can effectively reduce address translation overhead. Existing page coalescing methods only aim to increase huge pages at each layer, and fail to consider this cross-layer requirement on the alignment of huge pages. To address this issue, the paper designs Gemini as a cross-layer solution that guides the formation and allocation of huge pages in the guest and the host. With Gemini, the memory management at one layer is aware of the huge pages at the other layer, and carefully manages the memory regions corresponding to these huge pages. This increases the potential of forming and allocating huge pages from these regions and minimizes the associated cost. Gemini then guides page coalescing and huge page allocation to first consider these regions before other memory regions. Because huge pages are preferentially formed and allocated from these regions and less from other regions, huge guest pages backed by huge host pages can be increased without aggravating the adverse effects incurred by excessive huge pages. Extensive evaluation based on the prototype implementation in Linux/KVM and diverse real-world applications, such as key-value store, web server, and AI workloads, shows that Gemini can reduce TLB misses by up to 83% and improve application performance by up to 126%, compared to state-of-the-art page coalescing methods.
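The paper's key condition can be stated in a few lines of code; this check is an illustration with a stand-in page-table representation, not Gemini's implementation:

    # A 2MB guest huge page only pays off if its guest-physical range maps
    # to one 2MB-aligned, contiguous host-physical region.
    HUGE = 2 * 1024 * 1024          # 2MB huge page
    BASE = 4096                     # 4KB base page

    def backed_by_host_huge_page(gpa_start, gpa_to_hpa):
        # gpa_to_hpa maps each 4KB guest-physical page to its host address.
        hpa0 = gpa_to_hpa[gpa_start]
        if hpa0 % HUGE != 0:
            return False            # host side not huge-page aligned
        return all(gpa_to_hpa[gpa_start + off] == hpa0 + off
                   for off in range(0, HUGE, BASE))

    # Guest page mapped into an aligned, contiguous host region: effective.
    good = {off: 0x40000000 + off for off in range(0, HUGE, BASE)}
    print(backed_by_host_huge_page(0, good))     # True
    # The same mapping shifted by 4KB on the host breaks the alignment.
    bad = {off: 0x40001000 + off for off in range(0, HUGE, BASE)}
    print(backed_by_host_huge_page(0, bad))      # False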
Omni-Paxos: Breaking the Barriers of Partial Connectivity
Authors: Ng, Harald and Haridi, Seif and Carbone, Paris
Keywords: reconfiguration, partial connectivity, state machine replication, consensus
Abstract
Omni-Paxos is a system for state machine replication that is completely resilient to partial network partitions, a major source of service disruptions in recent years. Omni-Paxos achieves its resilience through a decoupled design that separates the execution and state of leader election from log replication. The leader election builds on the concept of quorum-connected servers, with the sole focus on connectivity. Additionally, by decoupling reconfiguration from log replication, Omni-Paxos provides flexible and parallel log migration that improves the performance and robustness of reconfiguration. Our evaluation showcases two benefits over state-of-the-art protocols: (1) guaranteed recovery in at most four election timeouts under extreme partial network partitions, and (2) up to 8x shorter reconfiguration periods with 46% less I/O at the leader.
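The quorum-connectivity idea can be sketched compactly; the code below is a toy rendition (ballot numbers and heartbeats omitted), not Omni-Paxos' protocol logic:

    # A server is a leader candidate only if it can directly reach a
    # majority, regardless of which other links happen to be down.
    def quorum_connected(server, links, n):
        # links: set of frozensets {a, b} for live connections.
        peers = sum(1 for l in links if server in l)
        return peers + 1 > n / 2          # the server counts itself

    def elect(servers, links):
        candidates = [s for s in servers
                      if quorum_connected(s, links, len(servers))]
        return max(candidates, default=None)   # stand-in for ballot order

    servers = ["A", "B", "C", "D", "E"]
    links = {frozenset(p) for p in ["AB", "AC", "AD", "AE", "BC", "BD", "CD"]}
    # E sees only A (1 of 4 peers): not quorum-connected, never a candidate.
    print(elect(servers, links))   # 'D' (any quorum-connected server may win)

The point of the rule is that a partially partitioned server like E above can never grab leadership it could not exercise, which is how the protocol stays live under partial connectivity.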
CFS: Scaling Metadata Service for Distributed File System via Pruned Scope of Critical Sections
Authors: Wang, Yiduo and Wu, Yufei and Li, Cheng and Zheng, Pengfei and Cao, Biao and Sun, Yan and Zhou, Fei and Xu, Yinlong and Wang, Yao and Xie, Guangjun
Keywords: metadata management, distributed file system
Abstract
There is a fundamental tension between metadata scalability and POSIX semantics within distributed file systems. The bottleneck lies in the coordination, mainly locking, used for ensuring strong metadata consistency, namely, atomicity and isolation. CFS is a scalable, fully POSIX-compliant distributed file system that eliminates the metadata management bottleneck by pruning the scope of critical sections for reduced locking overhead. First, CFS adopts a tiered metadata organization to scale file attributes and the remaining namespace hierarchies independently with appropriate partitioning and indexing methods, eliminating cross-shard distributed coordination. Second, it further scales up single metadata shard performance with single-shard atomic primitives, shortening the metadata requests’ lifespan and removing spurious conflicts. Third, CFS drops the metadata proxy layer and instead employs lightweight, scalable client-side metadata resolution. CFS has been running in the production environment of Baidu AI Cloud for three years. Our evaluation with a 50-node cluster and microbenchmarks shows that CFS simultaneously improves the throughput of baselines like HopsFS and InfiniFS by 1.76–75.82×…
OLPart: Online Learning based Resource Partitioning for Colocating Multiple Latency-Critical Jobs on Commodity Computers
Authors: Chen, Ruobing and Shi, Haosen and Li, Yusen and Liu, Xiaoguang and Wang, Gang
Keywords: online learning, resource partitioning, performance interference, job colocating
Abstract
Colocating multiple jobs on the same server has been a commonly used approach for improving resource utilization in cloud environments. However, performance interference due to contention over shared resources makes resource partitioning an important research problem. Partitioning multiple resources coordinately is particularly challenging when multiple latency-critical (LC) jobs are colocated with best-effort (BE) jobs, since QoS needs to be protected for all the LC jobs. So far, this problem has not been well addressed in the literature. We propose an online learning based solution, named OLPart, for partitioning resources among multiple colocated LC jobs and BE jobs. OLPart is designed based on our observation that runtime performance counters can approximately indicate the resource sensitivities of jobs. Based on this finding, OLPart leverages contextual multi-armed bandits (CMAB) to design the partitioning solution, which employs the performance counters to enable an intelligent exploration of the search space. Applying CMAB to the resource partitioning problem faces several critical challenges; OLPart proposes several techniques to overcome them. OLPart does not require prior knowledge of jobs and incurs very small overhead. Evaluations demonstrate that OLPart is efficient and robust, outperforming state-of-the-art solutions by significant margins. OLPart is publicly available at https://github.com/crbnk/OpenOLPart.
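To picture the CMAB loop, here is a deliberately simplified epsilon-greedy contextual bandit; the arms, the single-counter context, and the reward model are invented for illustration and differ from OLPart's actual algorithm:

    # Context = performance counters, arm = candidate partition, reward =
    # measured job performance; everything is learned online, no profiling.
    import random

    random.seed(0)
    ARMS = [(4, 8), (8, 4), (6, 6)]     # toy (LC cores, BE cores) splits

    def features(counters):
        return (counters["llc_misses"] > 1e6,)   # crude sensitivity signal

    q = {}                              # (context, arm) -> running reward

    def choose(counters, eps=0.2):
        ctx = features(counters)
        if random.random() < eps:       # explore
            return random.choice(ARMS)
        return max(ARMS, key=lambda a: q.get((ctx, a), 0.0))

    def update(counters, arm, reward, lr=0.3):
        key = (features(counters), arm)
        q[key] = q.get(key, 0.0) + lr * (reward - q.get(key, 0.0))

    for step in range(300):
        counters = {"llc_misses": random.choice([5e5, 5e6])}
        arm = choose(counters)
        contended = counters["llc_misses"] > 1e6
        reward = 1.0 if (arm == (8, 4)) == contended else 0.2  # toy oracle
        update(counters, arm, reward)

    print("preferred partition under LLC contention:",
          max(ARMS, key=lambda a: q.get(((True,), a), 0.0)))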
Palette Load Balancing: Locality Hints for Serverless Functions
Authors: Abdi, Mania and Ginzburg, Samuel and Lin, Xiayue Charles and Faleiro, Jose and Chaudhry, Gohar Irfan and Goiri, Inigo and Bianchini, Ricardo and Berger, Daniel S and Fonseca, Rodrigo
Keywords: data-parallel processing, caching, serverless computing, cloud computing
Abstract
Function-as-a-Service (FaaS) serverless computing enables a simple programming model with almost unbounded elasticity. Unfortunately, current FaaS platforms achieve this flexibility at the cost of lower performance for data-intensive applications compared to a serverful deployment. The ability to have computation close to data is a key missing feature. We introduce Palette load balancing, which offers FaaS applications a simple mechanism to express locality to the platform, through hints we term “colors”. Palette maintains the serverless nature of the service - users are still not allocating resources - while allowing the platform to place successive invocations related to each other on the same executing node. We compare a prototype of the Palette load balancer to a state-of-the-art locality-oblivious load balancer on representative examples of three applications. For a serverless web application with a local cache, Palette improves the hit ratio by 6x. For a serverless version of Dask, Palette improves run times by 46% and 40% on Task Bench and TPC-H, respectively. On a serverless version of NumS, Palette improves run times by 37%. These improvements largely bridge the gap to a serverful implementation of the same systems.
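The core of a color-based balancer is small enough to sketch; this consistent-hashing toy is an illustration (the real system also weighs load and handles membership churn):

    # Hash the caller-supplied "color" hint onto a ring of workers so that
    # repeated invocations with the same color reuse the same node's cache.
    import bisect, hashlib

    def h(s):
        return int(hashlib.sha256(s.encode()).hexdigest(), 16) % (2 ** 32)

    class ColorBalancer:
        def __init__(self, nodes, vnodes=64):
            # Virtual nodes: adding or removing a worker only remaps a
            # small fraction of colors, preserving most warm caches.
            self.ring = sorted((h(f"{n}:{i}"), n)
                               for n in nodes for i in range(vnodes))
            self.keys = [k for k, _ in self.ring]

        def pick(self, color):
            i = bisect.bisect(self.keys, h(color)) % len(self.ring)
            return self.ring[i][1]

    lb = ColorBalancer(["node-a", "node-b", "node-c"])
    print(lb.pick("user:42"))   # repeated colors land on the same node...
    print(lb.pick("user:42"))   # ...so its warm state is reused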
With Great Freedom Comes Great Opportunity: Rethinking Resource Allocation for Serverless Functions
Authors: Bilal, Muhammad and Canini, Marco and Fonseca, Rodrigo and Rodrigues, Rodrigo
Keywords: resource allocation, optimization, serverless
Abstract
Current serverless offerings give users limited flexibility for configuring the resources allocated to their function invocations. This simplifies the interface for users to deploy serverless computations but creates deployments that are resource inefficient. In this paper, we take a principled approach to the problem of resource allocation for serverless functions, analyzing the effects of automating this choice in a way that leads to the best combination of performance and cost. In particular, we systematically explore the opportunities that come with decoupling memory and CPU resource allocations and also enabling the use of different VM types, and we find a rich trade-off space between performance and cost. The provider can use this in a number of ways, e.g., exposing all these parameters to the user; eliding preferences for performance and cost from users and simply offering the same performance at lower cost; or exposing a small number of choices for users to trade performance for cost. Our results show that, by decoupling memory and CPU allocation, there is the potential to have up to 40% lower execution cost than the preset coupled configurations that are the norm in current serverless offerings. Similarly, making the correct choice of VM instance type can provide up to 50% better execution time. Furthermore, we demonstrate that providers have the flexibility to choose different instance types for the same functions to maximize resource utilization while providing performance within 10–20% of the best resource configuration for each respective function.
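A worked toy example of why decoupling pays off, with made-up prices and timings rather than the paper's measurements:

    # A coupled offering forces 1 vCPU per 2GB, but a memory-bound function
    # needs memory, not CPU, so a decoupled allocation is strictly cheaper.
    def cost(duration_s, vcpus, mem_gb,
             cpu_price=0.000024, mem_price=0.0000025):   # $ per unit-second
        return duration_s * (vcpus * cpu_price + mem_gb * mem_price)

    # Runtime barely improves beyond 1 vCPU for this function.
    runtime = {1: 10.0, 2: 9.6, 4: 9.5}                  # seconds at N vCPUs

    coupled = cost(runtime[4], vcpus=4, mem_gb=8)        # 8GB drags in 4 vCPUs
    decoupled = cost(runtime[1], vcpus=1, mem_gb=8)      # pay for what is used
    print(f"coupled ${coupled:.6f}  decoupled ${decoupled:.6f}  "
          f"saving {(1 - decoupled / coupled):.0%}")     # roughly 60% here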
Groundhog: Efficient Request Isolation in FaaS
Authors: Alzayat, Mohamed and Mace, Jonathan and Druschel, Peter and Garg, Deepak
Keywords: none
Abstract
Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach isolates concurrent executions of functions in separate containers. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays. Although efficient, this container reuse has security implications for functions that are invoked on behalf of differently privileged users or administrative domains: bugs in a function’s implementation — or a third-party library/runtime it depends on — may leak private data from one invocation of the function to a subsequent one. Groundhog isolates sequential invocations of a function by efficiently reverting to a clean state, free from any private data, after each invocation. The system exploits two properties of typical FaaS platforms: each container executes at most one function at a time and legitimate functions do not retain state across invocations. This enables Groundhog to efficiently snapshot and restore function state between invocations in a manner that is independent of the programming language/runtime and does not require any changes to existing functions, libraries, language runtimes, or OS kernels. We describe the design and implementation of Groundhog and its integration with OpenWhisk, a popular production-grade open-source FaaS framework. On three existing benchmark suites, Groundhog isolates sequential invocations with modest overhead on end-to-end latency (median: 1.5%, 95p: 7%) and throughput (median: 2.5%, 95p: 49.6%), relative to an insecure baseline that reuses the container and runtime state.
Understanding and Optimizing Workloads for Unified Resource Management in Large Cloud Platforms
Authors: Lu, Chengzhi and Xu, Huanle and Ye, Kejiang and Xu, Guoyao and Zhang, Liping and Yang, Guodong and Xu, Chengzhong
Keywords: resource over-commitment, unified scheduling, cloud computing
Abstract
To fully utilize computing resources, cloud providers such as Google and Alibaba choose to co-locate online services with batch processing applications in their data centers. By implementing unified resource management policies, different types of complex computing jobs request resources in a consistent way, which can help data centers achieve globally optimal scheduling and provide higher-quality computing power. To understand this new scheduling paradigm, in this paper, we first present an in-depth study of Alibaba’s unified scheduling workloads. Our study focuses on the characterization of resource utilization, application performance, and scheduling scalability. We observe that although computing resources are significantly over-committed under unified scheduling, the resource utilization in Alibaba data centers is still low. In addition, existing resource usage predictors tend to make severe overestimations. At the same time, tasks within the same application behave fairly consistently, and the running performance of tasks can be well-profiled with respect to resource contention on the corresponding physical host. Based on these observations, in this paper, we design Optum, a unified data center scheduler for improving overall resource utilization while ensuring good performance for each application. Optum formulates an optimization problem to schedule unified task requests, aiming to balance the trade-off between utilization and resource contention. Optum also implements efficient heuristics to solve the optimization problem in a scalable manner. Large-scale experiments demonstrate that Optum can save up to 15% of resources without performance degradation compared to state-of-the-art unified scheduling schemes.
Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems
Authors: Tang, Lilia and Bhandari, Chaitanya and Zhang, Yongle and Karanika, Anna and Ji, Shuyang and Gupta, Indranil and Xu, Tianyin
Keywords: cloud system, root cause analysis, failure study, cross-system interaction
Abstract
Modern cloud systems are orchestrations of independent and interacting (sub-)systems, each specializing in important services (e.g., data processing, storage, resource management, etc.). Hence, cloud system reliability is affected not only by the reliability of each individual system, but also by the interplay between these systems. We observe that many recent production incidents of cloud systems are manifested through interactions across the system boundaries. However, there is a lack of systematic understanding of this emerging mode of failures, which we term cross-system interaction failures (or CSI failures). This hinders the development of better design, integration practices, and new tooling. In this paper, we discuss cross-system interaction failures based on analyses of (1) 11 CSI-failure-induced cloud incidents of Google, Azure, and AWS, and (2) 120 CSI failure cases of seven widely co-deployed open-source systems. We focus on understanding discrepancies between interacting systems as the root causes of CSI failures—CSI failures cannot be understood by analyzing one single system in isolation. This paper draws attention to this emerging failure mode, provides a comprehensive understanding of CSI failure patterns, and discusses potential approaches for mitigation. We advocate for cross-system testing and verification and demonstrate its potential by cross-testing the Spark-Hive data plane and exposing 15 new discrepancies.
LogGrep: Fast and Cheap Cloud Log Storage by Exploiting both Static and Runtime Patterns
Authors: Wei, Junyu and Zhang, Guangyan and Chen, Junchao and Wang, Yang and Zheng, Weimin and Sun, Tingtao and Wu, Jiesheng and Jiang, Jiangwei
Keywords: static pattern, runtime pattern, full-text query, data compression, cloud log
Abstract
In cloud systems, near-line logs are mainly used for debugging, which means they prefer a low query latency for a better user experience; like any other logs, they also prefer a low overall cost, including the storage cost of keeping compressed logs and the computation cost of compressing logs and executing queries. This paper proposes LogGrep, the first log compression and query tool that structurizes and organizes log data properly in fine-grained units by exploiting both static and runtime patterns. It first parses logs into variable vectors by exploiting static patterns and then extracts runtime pattern(s) automatically within each variable vector with a novel extraction method. Based on these runtime patterns, LogGrep further decomposes the variable vectors into fine-grained units called “Capsules” and stamps each Capsule with a summary of its values. During the query process, LogGrep can avoid decompressing and scanning Capsules that cannot possibly match the keywords, with the help of the extracted runtime patterns and the Capsule stamps. We evaluate LogGrep on 21 types of logs from the production environment of Alibaba Cloud, and 16 types of logs from public datasets. The results show that LogGrep can reduce query latency and overall cost by an order of magnitude compared to state-of-the-art works. These results confirm that exploiting both static and runtime patterns to structurize logs can achieve fast and cheap cloud log storage.
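A scaled-down Capsule with a value-range stamp illustrates the skip-scan idea; LogGrep's stamps and encodings are richer, and the range check below is only sound because each toy capsule stores whole values:

    # Queries consult the stamp first and decompress a capsule only when
    # the keyword could possibly occur inside it.
    import zlib

    class Capsule:
        def __init__(self, values):
            self.lo, self.hi = min(values), max(values)       # the stamp
            self.blob = zlib.compress("\n".join(values).encode())

        def maybe_contains(self, keyword):
            return self.lo <= keyword <= self.hi   # lexicographic range check

        def grep(self, keyword):
            if not self.maybe_contains(keyword):
                return []                  # skipped: no decompression at all
            lines = zlib.decompress(self.blob).decode().split("\n")
            return [l for l in lines if keyword in l]

    # One variable vector (request ids), split into two capsules.
    c1 = Capsule(["req-0001", "req-0420", "req-0999"])
    c2 = Capsule(["req-5000", "req-7777", "req-9999"])
    for c in (c1, c2):
        print(c.grep("req-7777"))   # only the second capsule is decompressed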
Aggregate VM: Why Reduce or Evict VM’s Resources When You Can Borrow Them From Other Nodes?
Authors: Chuang, Ho-Ren and Manaouil, Karim and Xing, Tong and Barbalace, Antonio and Olivier, Pierre and Heerekar, Balvansh and Ravindran, Binoy
Keywords: delegation, migration, DSM, distributed hypervisor, data center, resource fragmentation
Abstract
Hardware resource fragmentation is a common issue in data centers. Traditional solutions based on migration or overcommitment are unacceptably slow, and modern commercial or research solutions like Spot VM may reduce or evict a VM’s resources at any time. We propose an alternative solution that does not suffer from these drawbacks, the Aggregate VM. We introduce a new distributed hypervisor design, the resource-borrowing hypervisor, which creates Aggregate VMs: distributed VMs that temporarily aggregate fragmented resources belonging to different host machines, which requires mobility of virtual CPUs, memory, and IO devices. We implement a prototype, FragVisor, which runs guest software transparently. We also propose minimal modifications to the guest OS that can enable significant performance gains. We evaluate FragVisor over a set of microbenchmarks and IaaS-style real applications. Although Aggregate VMs are not a perfect fit for every type of application, some workloads enjoy significant speedups compared to overcommitted scenarios (up to 3.9x with 4 distributed vCPUs). We further demonstrate that FragVisor is faster than a state-of-the-art competitor, GiantVM (up to 2.5x).
R2C: AOCR-Resilient Diversity with Reactive and Reflective Camouflage
Authors: Berlakovich, Felix and Brunthaler, Stefan
Keywords: language-based security, software diversity, booby traps, booby-trapped pointers, reactive defenses, code-reuse attacks, address-oblivious code reuse, position-independent code reuse, randomization-based defenses
Abstract
Address-oblivious code reuse, AOCR for short, poses a substantial security risk, as it remains unchallenged. If neglected, adversaries have a reliable way to attack systems, offering an operational and profitable strategy. AOCR’s authors conclude that software diversity cannot mitigate AOCR, because it exposes fundamental limits to diversification. Reactive and reflective camouflage, or R2C for short, is a full-fledged, LLVM-based defense that thwarts AOCR by combining code and data diversification with reactive capabilities through booby traps. R2C includes optimizations using AVX2 SIMD instructions, compiles complex real-world software, such as browsers, and offers full support of C++. R2C thus proves that AOCR poses no fundamental limits to software diversification, but merely indicates that code diversification without data diversification is a dead end. An extensive evaluation along multiple dimensions proves the practicality of R2C. We evaluate the impact of our defense on performance, and find that R2C shows low performance impact on compute-intensive benchmarks (6.6–8.5% geometric mean on SPEC CPU 2017). A security evaluation indicates R2C’s resistance against different types of code-reuse attacks.