EuroSys 2021 | 逸翎清晗🌈

SmartHarvest: harvesting idle CPUs safely and efficiently in the cloud

Authors: Wang, Yawen and Arya, Kapil and Kogias, Marios and Vanga, Manohar and Bhandari, Aditya and Yadwadkar, Neeraja J. and Sen, Siddhartha and Elnikety, Sameh and Kozyrakis, Christos and Bianchini, Ricardo
Keywords: No keywords

Abstract

We can increase the efficiency of public cloud datacenters by harvesting allocated but temporarily idling CPU cores from customer virtual machines (VMs) to run batch or analytics workloads. Even small efficiency gains translate into substantial savings, since provisioning and operating a datacenter costs hundreds of millions of dollars per year. The main challenge is to harvest idle cores with little or no impact on customer VMs, which could be running latency-sensitive services and are essentially black boxes to the cloud provider. We introduce ElasticVM, a new VM type that can run batch workloads cheaply using mainly harvested cores. We also propose SmartHarvest, a system that dynamically manages the number of cores available to ElasticVMs in each fine-grained time window. SmartHarvest uses online learning to predict the core demand of primary (customer) VMs and to compute the number of cores that can be safely harvested. Our results show that SmartHarvest can harvest a significant amount of CPU resources without increasing the 99th-percentile tail latency of latency-critical primary workloads by more than 10%. Unlike static harvesting techniques that rely on offline profiling, SmartHarvest is robust to different primary workloads, batch workloads, and load changes. Finally, we show that the online learning in SmartHarvest is complementary to systems optimizations for VM management.
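
To make the control loop concrete, here is a minimal per-window sketch under stated assumptions; it is not SmartHarvest's actual algorithm. A stand-in predictor (the maximum over recent windows) estimates the primary VM's core demand, a fixed safety buffer is added, and whatever remains of the host's cores is offered to the ElasticVM. TOTAL_CORES, SAFETY_BUFFER, and predict_primary_demand are illustrative names.

    from collections import deque

    TOTAL_CORES = 16           # cores allocated to the primary VM (assumed)
    SAFETY_BUFFER = 1          # extra cores always kept for the primary VM (assumed knob)
    HISTORY = deque(maxlen=8)  # recent per-window core usage of the primary VM

    def predict_primary_demand() -> int:
        """Conservative stand-in for SmartHarvest's learned predictor."""
        return max(HISTORY, default=TOTAL_CORES)

    def cores_to_harvest(observed_usage: int) -> int:
        HISTORY.append(observed_usage)
        predicted = predict_primary_demand() + SAFETY_BUFFER
        return max(0, TOTAL_CORES - min(predicted, TOTAL_CORES))

    # One decision per fine-grained time window.
    for usage in [4, 6, 5, 12, 3]:
        print(f"primary used {usage:2d} cores -> offer {cores_to_harvest(usage)} to the ElasticVM")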

DOI: 10.1145/3447786.3456225


Tripoline: generalized incremental graph processing via graph triangle inequality

Authors: Jiang, Xiaolin and Xu, Chengshuo and Yin, Xizhe and Zhao, Zhijia and Gupta, Rajiv
Keywords: No keywords

Abstract

For compute-intensive iterative queries over a streaming graph, it is critical to evaluate the queries continuously and incrementally for best efficiency. However, existing incremental graph processing requires a priori knowledge of the query (e.g., the source vertex of a vertex-specific query); otherwise, it has to fall back to an expensive full evaluation that starts from scratch. To alleviate this restriction, this work presents a principled solution to generalizing incremental graph processing, such that queries without a priori knowledge can also be evaluated incrementally. The solution centers around the concept of graph triangle inequalities, an idea inspired by the classical triangle inequality in Euclidean space. Interestingly, similar principles can also be derived for many vertex-specific graph problems. These principles help establish rigorous constraints between the evaluation of one graph query and the results of another, thus enabling the latter to be reused to accelerate the former. Based on this finding, a novel streaming graph system, called Tripoline, is built which enables incremental evaluation of queries without a priori knowledge. Built on top of a state-of-the-art shared-memory streaming graph engine (Aspen), Tripoline natively supports high-throughput, low-cost graph updates. A systematic evaluation with a set of eight vertex-specific graph problems and four real-world large graphs confirms both the effectiveness of the proposed techniques and the efficiency of Tripoline.
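
The "graph triangle inequality" the abstract refers to is easiest to see for shortest-path-style queries: for any vertices q, r, v, d(q,v) <= d(q,r) + d(r,v) and d(q,v) >= |d(r,v) - d(r,q)|, so the stored results of an earlier query from r bound a new query from q. A small sketch of that idea on unweighted, undirected graphs follows; plain BFS stands in for the streaming engine, and the code is illustrative rather than Tripoline's API.

    from collections import deque

    def bfs_distances(graph, src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

    d_r = bfs_distances(graph, 0)   # reusable results of an earlier query from vertex 0
    q = 4                           # new query source, with no precomputed results
    for v in graph:
        lower = abs(d_r[v] - d_r[q])    # |d(r,v) - d(r,q)| <= d(q,v)
        upper = d_r[v] + d_r[q]         # d(q,v) <= d(r,v) + d(r,q)
        print(f"d({q},{v}) lies in [{lower}, {upper}]")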

DOI: 10.1145/3447786.3456226


Towards timeout-less transport in commodity datacenter networks

Authors: Lim, Hwijoon and Bai, Wei and Zhu, Yibo and Jung, Youngmok and Han, Dongsu
Keywords: RoCE, TCP, datacenter networking, low-latency transport

Abstract

Despite recent advances in datacenter networks, timeouts caused by congestion-induced packet losses remain a major cause of high tail latency. Priority-based Flow Control (PFC) was introduced to make the network lossless, but its head-of-line blocking nature causes various performance and management problems. In this paper, we ask whether it is possible to design a network that achieves (near-)zero timeouts using only commodity hardware in datacenters. Our answer is TLT, an extension to existing transports designed to eliminate timeouts. We are inspired by the observation that only certain types of packet drops cause timeouts. Therefore, instead of dropping packets blindly (TCP) or not dropping them at all (RoCEv2), TLT proactively drops some packets to ensure the delivery of more important ones, whose losses may cause timeouts. It classifies packets at the host and leverages color-aware thresholding, a feature widely supported by commodity switches, to proactively drop some less important packets. We implement TLT prototypes using VMA to test with real applications. Our testbed evaluation on Redis shows that TLT reduces 99th-percentile FCT by up to 91.7% when handling bursts of SET operations. In large-scale simulations, TLT augments diverse datacenter transports, from widely used (TCP, DCTCP, DCQCN) to state-of-the-art (IRN and HPCC), achieving up to 81% lower tail latency.
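
The key mechanism is to decide, at the host, which packets a switch may proactively drop under congestion. A toy sketch of that classification and of color-aware thresholding follows; the fields and thresholds are illustrative and do not reflect TLT's actual packet format or switch configuration.

    from dataclasses import dataclass

    @dataclass
    class Packet:
        flow: str
        seq: int
        is_retransmission: bool
        is_last_in_window: bool

    def color(pkt: Packet) -> str:
        # Losing a retransmission or the tail packet of a window tends to end in a timeout.
        if pkt.is_retransmission or pkt.is_last_in_window:
            return "protected"
        return "droppable"   # loss is recoverable by fast retransmit

    def switch_enqueue(pkt: Packet, queue_len: int, drop_threshold: int) -> bool:
        """Color-aware thresholding: shed 'droppable' packets once the queue is deep."""
        return not (color(pkt) == "droppable" and queue_len >= drop_threshold)

    burst = [Packet("f1", i, False, i == 9) for i in range(10)]
    kept = [p.seq for p in burst if switch_enqueue(p, queue_len=60, drop_threshold=50)]
    print("enqueued seqs:", kept)   # only the protected tail packet survives the congested queue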

DOI: 10.1145/3447786.3456227


PaSh: light-touch data-parallel shell processing

Authors: Vasilakis, Nikos and Kallas, Konstantinos and Mamouras, Konstantinos and Benetopoulos, Achilles and Cvetković, Lazar
Keywords: POSIX, Unix, automatic parallelization, pipelines, shell, source-to-source compiler

Abstract

This paper presents PaSh, a system for parallelizing POSIX shell scripts. Given a script, PaSh converts it to a dataflow graph, performs a series of semantics-preserving program transformations that expose parallelism, and then converts the dataflow graph back into a script—one that adds POSIX constructs to explicitly guide parallelism, coupled with PaSh-provided Unix-aware runtime primitives that address performance- and correctness-related issues. A lightweight annotation language allows command developers to express key parallelizability properties about their commands. An accompanying parallelizability study of POSIX and GNU commands—two large and commonly used groups—guides the annotation language and the optimized aggregator library that PaSh uses. PaSh's extensive evaluation over 44 unmodified Unix scripts shows significant speedups (0.89–61.1×).
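
The core transformation is the familiar split / parallel-apply / aggregate pattern, with a command-specific aggregator merging partial outputs. Below is a sketch of that pattern for a sort-like stage, with Python standing in for the shell and dataflow machinery; heapq.merge plays the role of the sorted-merge aggregator. This illustrates the idea only, it is not PaSh's code.

    import heapq
    from concurrent.futures import ProcessPoolExecutor

    def split(lines, n):
        chunk = max(1, len(lines) // n)
        return [lines[i:i + chunk] for i in range(0, len(lines), chunk)]

    def parallel_sort(lines, workers=4):
        # Run the "command" (sorted) on each chunk in parallel, then aggregate.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            partial = pool.map(sorted, split(lines, workers))
            return list(heapq.merge(*partial))

    if __name__ == "__main__":
        data = ["pear", "apple", "kiwi", "banana", "fig", "grape"]
        assert parallel_sort(data) == sorted(data)
        print(parallel_sort(data))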

DOI: 10.1145/3447786.3456228


FlexGraph: a flexible and efficient distributed framework for GNN training

Authors: Wang, Lei and Yin, Qiang and Tian, Chao and Yang, Jianbang and Chen, Rong and Yu, Wenyuan and Yao, Zihang and Zhou, Jingren
Keywords: No keywords

Abstract

Graph neural networks (GNNs) aim to learn a low-dimensional feature for each vertex in a graph from its high-dimensional input feature by iteratively aggregating the features of the vertex's neighbors. This paper presents FlexGraph, a distributed framework for training GNN models. FlexGraph is able to efficiently train GNN models with flexible definitions of neighborhood and hierarchical aggregation schemes, which are the two main characteristics associated with GNNs. In contrast, existing GNN frameworks are usually designed for GNNs with fixed definitions and aggregation schemes and cannot support different kinds of GNN models well simultaneously. Underlying FlexGraph are a simple GNN programming abstraction called NAU and a compact data structure for modeling various aggregation operations. To achieve better performance, FlexGraph is equipped with a hybrid execution strategy that selects proper and efficient operations according to context when aggregating neighborhood features, an application-driven workload balancing strategy that balances GNN training workload and reduces synchronization overhead, and a pipeline processing strategy that overlaps computation and communication. Using real-life datasets and the GNN models GCN, PinSage, and MAGNN, we verify that NAU makes FlexGraph more expressive than prior frameworks (e.g., DGL and Euler) that adopt GAS-like programming abstractions; for example, it can handle MAGNN, which is beyond the reach of DGL and Euler. The evaluation further shows that FlexGraph outperforms state-of-the-art GNN frameworks such as DGL and Euler in training time by 8.5× on average.

DOI: 10.1145/3447786.3456229


DZiG: sparsity-aware incremental processing of streaming graphs

Authors: Mariappan, Mugilan and Che, Joanna and Vora, Keval
Keywords: No keywords

Abstract

State-of-the-art streaming graph processing systems that provide Bulk Synchronous Parallel (BSP) guarantees remain oblivious to the computation sparsity present in iterative graph algorithms, which severely limits their performance. In this paper we propose DZiG, a high-performance streaming graph processing system that retains efficiency in the presence of sparse computations while still guaranteeing BSP semantics. At the heart of DZiG are: (1) a sparsity-aware incremental processing technique that expresses computations in a recursive manner to safely identify and prune updates (hence retaining sparsity); (2) a simple change-driven programming model that naturally exposes sparsity in iterative computations; and (3) an adaptive processing model that automatically changes the incremental computation strategy to limit its overheads when computations become very sparse. DZiG outperforms state-of-the-art streaming graph processing systems and pushes the boundary of dependency-driven processing for streaming graphs to over 10 million simultaneous mutations, orders of magnitude more than state-of-the-art systems.

DOI: 10.1145/3447786.3456230


Ethanos: efficient bootstrapping for full nodes on account-based blockchain

Authors: Kim, Jae-Yun and Lee, Junmo and Koo, Yeonjae and Park, Sanghyeon and Moon, Soo-Mook
Keywords: blockchain, state trie, synchronization

Abstract

Ethereum is a popular account-based blockchain whose number of accounts and transactions has skyrocketed, causing a data explosion. As a result, ordinary clients using PCs or smartphones cannot easily bootstrap as full nodes and must rely on other full nodes to verify transactions, exposing themselves to security risks. The most serious overhead comes from synchronizing the state of all accounts in the block's state trie, which takes several tens of gigabytes. Observing that more than 95% of the accounts are dormant, we propose a novel state optimization technique named Ethanos. Ethanos downsizes the state trie by periodically emptying it and then rebuilding it only with the accounts active in that period's transactions. Ethanos runs transactions using the accounts available in the current period's state trie as well as those available at the end of the previous period's state trie. An account in neither trie must first restore itself by sending a restore transaction. One important consequence of this state management is that a node can now bootstrap with only the latest period's state trie, yet can fully verify all transactions thereafter. We evaluated Ethanos with real Ethereum transactions over 300,000 blocks starting at block 7,000,000, emptying the state trie once a week. Our results show that Ethanos sharply reduces the state trie with only a tiny fraction of restore transactions. More importantly, unlike the Ethereum state trie, which continues to grow over time, the Ethanos state trie at the end of each period is bounded to a few hundred MB, even with more than one million accounts active per week.
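
The sweep-and-restore idea is small enough to model directly. Below is a toy sketch under obvious simplifications (plain dicts instead of Merkle-Patricia tries, balances instead of full accounts, no proofs): the current period's trie holds accounts touched recently, the previous period's trie is kept for cheap carry-over, and anything older needs an explicit restore transaction.

    class SweptState:
        def __init__(self):
            self.current = {}    # accounts active in the current period
            self.previous = {}   # state trie kept from the previous period

        def end_of_period(self):
            self.previous = self.current   # keep only the last period's trie
            self.current = {}              # "empty" the state trie

        def lookup(self, account):
            if account in self.current:
                return self.current[account]
            if account in self.previous:                    # carried over cheaply
                self.current[account] = self.previous[account]
                return self.current[account]
            return None                                     # dormant: needs a restore transaction

        def restore(self, account, balance):
            # In Ethanos a restore transaction carries a proof of the old balance.
            self.current[account] = balance

    state = SweptState()
    state.current["alice"] = 100
    state.end_of_period()           # alice survives one period in 'previous'
    state.end_of_period()           # untouched again: alice is now dormant
    print(state.lookup("alice"))    # None -> a restore transaction is required
    state.restore("alice", 100)
    print(state.lookup("alice"))    # 100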

DOI: 10.1145/3447786.3456231


Virtual machine preserving host updates for zero day patching in public cloud

Authors: Russinovich, Mark and Govindaraju, Naga and Raghuraman, Melur and Hepkin, David and Schwartz, Jamie and Kishan, Arun
Keywords: cloud computing, data centers, virtualization and security

Abstract

Host software updates are critical to ensure the security, reliability, and compliance of public clouds. Many updates require a virtualization component restart or operating system reboot. Virtual machines (VMs) running on the updated servers must either be restarted or live migrated off. Reboots can result in VM downtime on the order of ten minutes and have further impact on the workloads running in the VMs because cached state is lost. Live migration (LM) is a technology that can avoid the need to shut down VMs. However, LM requires turn space in the form of already-patched hosts, consumes network, CPU, and other resources that scale with the number and activity level of VMs, and has variable impact on VM performance and availability, making it too expensive and disruptive for zero-day security updates that must be applied across an entire fleet on the order of hours. We present a novel update technology, virtual machine preserving host updates (VM-PHU), that does not require turn space, consumes no network and little CPU, preserves VM state, and causes minimal VM blackout time that does not scale with VM resource usage. VM-PHU persists the memory and device state of all running guest VMs, reboots the host and virtualization components into updated code, restores the state of the VMs, and then resumes them. VM-PHU uses several techniques to minimize VM blackout time. One is kernel soft reboot (KSR), which transitions directly to an updated host operating system, bypassing firmware reset of the server and attached devices. To minimize resource consumption and VM disruption, VM-PHU leaves VM memory in physical memory pages and other state in persisted pages across the soft reboot, and it implements a mechanism called fast close to let a reboot proceed without waiting for the completion of in-flight VM I/Os to remote storage devices. We have implemented VM-PHU in Microsoft Azure, which hosts millions of servers, and show results from several zero-day updates that demonstrate VM blackout times on the order of seconds. VM-PHU provides significant benefits to both customers and public cloud vendors by minimizing application downtime while enabling fast and resource-efficient updates, including zero-day patches.

DOI: 10.1145/3447786.3456232


DGCL: an efficient communication library for distributed GNN training

Authors: Cai, Zhenkun and Yan, Xiao and Wu, Yidi and Ma, Kaihao and Cheng, James and Yu, Fan
Keywords: distributed and parallel training, graph neural networks, network communication

Abstract

Graph neural networks (GNNs) have gained increasing popularity in many areas such as e-commerce, social networks and bio-informatics. Distributed GNN training is essential for handling large graphs and reducing the execution time. However, for distributed GNN training, a peer-to-peer communication strategy suffers from high communication overheads. Also, different GPUs require different remote vertex embeddings, which leads to an irregular communication pattern and renders existing communication planning solutions unsuitable. We propose the distributed graph communication library (DGCL) for efficient GNN training on multiple GPUs. At the heart of DGCL is a communication planning algorithm tailored for GNN training, which jointly considers fully utilizing fast links, fusing communication, avoiding contention and balancing loads on different links. DGCL can be easily adopted to extend existing single-GPU GNN systems to distributed training. We conducted extensive experiments on different datasets and network configurations to compare DGCL with alternative communication schemes. In our experiments, DGCL reduces the communication time of the peer-to-peer communication by 77.5% on average and the training time for an epoch by up to 47%.

DOI: 10.1145/3447786.3456233


Zeus: locality-aware distributed transactions

Authors: Katsarakis, Antonios and Ma, Yijun and Tan, Zhaowei and Bainbridge, Andrew and Balkwill, Matthew and Dragojevic, Aleksandar and Grot, Boris and Radunovic, Bozidar and Zhang, Yongguang
Keywords: availability, dynamic sharding, locality, pipelining, replication, strict serializability, transactions

Abstract

State-of-the-art distributed in-memory datastores (FaRM, FaSST, DrTM) provide strongly-consistent distributed transactions with high performance and availability. Transactions in those systems are fully general; they can atomically manipulate any set of objects in the store, regardless of their location. To achieve this, these systems use complex distributed transactional protocols. Meanwhile, many workloads have a high degree of locality. For such workloads, distributed transactions are overkill, as most operations only access objects located on the same server, if sharded appropriately. In this paper, we show that for these workloads, a single-node transactional protocol combined with dynamic object re-sharding and asynchronously pipelined replication can provide the same level of generality with better performance, simpler protocols, and lower developer effort. We present Zeus, an in-memory distributed datastore that provides general transactions by bringing all objects involved in a transaction to the same server and executing a single-node transaction on them. Zeus is fault-tolerant and strongly consistent. At the heart of Zeus is a reliable dynamic object sharding protocol that can move 250K objects per second per server, allowing Zeus to process millions of transactions per second and outperform more traditional distributed transactions on a wide range of workloads that exhibit locality.
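
The design can be read as "bring the objects to the transaction, then commit locally". The sketch below illustrates that flow with plain dicts; ownership transfer, replication, and failure handling (the hard parts of Zeus) are deliberately elided, and all names are illustrative.

    class Datastore:
        def __init__(self, node_id, directory, storage):
            self.node_id = node_id
            self.directory = directory   # shared map: object key -> owning node
            self.storage = storage       # shared map: object key -> value

        def acquire(self, key):
            # In Zeus this is the reliable dynamic object re-sharding protocol;
            # here ownership transfer is just a map update.
            if self.directory[key] != self.node_id:
                self.directory[key] = self.node_id

        def run_transaction(self, keys, fn):
            for k in sorted(keys):       # acquire every object the transaction touches
                self.acquire(k)
            snapshot = {k: self.storage[k] for k in keys}
            for k, v in fn(snapshot).items():   # now a plain single-node transaction
                self.storage[k] = v

    directory = {"acct:a": 1, "acct:b": 2}
    storage = {"acct:a": 50, "acct:b": 10}
    node1 = Datastore(1, directory, storage)

    def transfer(objs):                  # move 20 from acct:a to acct:b
        return {"acct:a": objs["acct:a"] - 20, "acct:b": objs["acct:b"] + 20}

    node1.run_transaction(["acct:a", "acct:b"], transfer)
    print(storage)     # {'acct:a': 30, 'acct:b': 30}
    print(directory)   # both objects now owned by node 1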

DOI: 10.1145/3447786.3456234


Mitigating vulnerability windows with hypervisor transplant

Authors: Ngoc, Tu Dinh and Teabe, Boris and Tchana, Alain and Muller, Gilles and Hagimont, Daniel
Keywords: No keywords

Abstract

The vulnerability window of a hypervisor regarding a given security flaw is the time between the identification of the flaw and the integration of a correction/patch into the running hypervisor. Most vulnerability windows, regardless of severity, are long enough (several days) that attackers have time to perform exploits. Nevertheless, the number of critical vulnerabilities per year is low enough to allow an exceptional solution. This paper introduces hypervisor transplant, a solution for addressing the vulnerability window of critical flaws. It involves temporarily replacing the current datacenter hypervisor (e.g., Xen), which is subject to a critical security flaw, with a different hypervisor (e.g., KVM) that is not subject to the same vulnerability. We build HyperTP, a generic framework that combines two approaches in a unified way: in-place server micro-reboot-based hypervisor transplant (InPlaceTP) and live VM migration-based hypervisor transplant (MigrationTP). We describe the implementation of HyperTP and its extension for transplanting Xen with KVM and vice versa. We also show that HyperTP is easy to integrate with the OpenStack cloud computing platform. Our evaluation results show that HyperTP delivers satisfactory performance: (1) MigrationTP takes the same time and impacts virtual machines (VMs) with the same performance degradation as normal live migration. (2) The downtime imposed by InPlaceTP on VMs is on the same order of magnitude (1.7 seconds for a VM with 1 vCPU and 1 GB of RAM) as an in-place upgrade of homogeneous hypervisors based on server micro-reboot.

DOI: 10.1145/3447786.3456235


Efficient replication via timestamp stability

Authors: Enes, Vitor and Baquero, Carlos and Gotsman, Alexey and Sutra, Pierre
Keywords: consensus, fault tolerance, geo-replication

Abstract

Modern web applications replicate their data across the globe and require strong consistency guarantees for their most critical data. These guarantees are usually provided via state-machine replication (SMR). Recent advances in SMR have focused on leaderless protocols, which improve the availability and performance of traditional Paxos-based solutions. We propose Tempo - a leaderless SMR protocol that, in comparison to prior solutions, achieves superior throughput and offers predictable performance even in contended workloads. To achieve these benefits, Tempo timestamps each application command and executes it only after the timestamp becomes stable, i.e., all commands with a lower timestamp are known. Both the timestamping and stability detection mechanisms are fully decentralized, thus obviating the need for a leader replica. Our protocol furthermore generalizes to partial replication settings, enabling scalability in highly parallel workloads. We evaluate the protocol in both real and simulated geo-distributed environments and demonstrate that it outperforms state-of-the-art alternatives.
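
Execution by timestamp stability can be pictured with a tiny single-process toy: commands are buffered by timestamp and run only once every command with a smaller timestamp is known, so any replica applying the same rule executes the same order. Tempo's decentralized timestamping and stability detection are not modeled here; the watermark is simply handed in.

    import heapq

    class StableExecutor:
        def __init__(self):
            self.pending = []   # min-heap of (timestamp, command)
            self.log = []       # commands executed, in timestamp order

        def submit(self, timestamp, command):
            heapq.heappush(self.pending, (timestamp, command))

        def advance_watermark(self, watermark):
            # 'watermark' means: every command with timestamp <= watermark is known.
            while self.pending and self.pending[0][0] <= watermark:
                self.log.append(heapq.heappop(self.pending))

    ex = StableExecutor()
    ex.submit(3, "set x=1")
    ex.submit(1, "set y=2")
    ex.advance_watermark(1)     # only timestamp 1 is stable: "set y=2" executes
    ex.submit(2, "set z=3")
    ex.advance_watermark(3)     # now "set z=3" and "set x=1" execute, in order
    print(ex.log)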

DOI: 10.1145/3447786.3456236


ChameleonDB: a key-value store for Optane persistent memory

Authors: Zhang, Wenhui and Zhao, Xingsheng and Jiang, Song and Jiang, Hong
Keywords: Optane DC, key-value store, persistent-memory

Abstract

The emergence of Intel's Optane DC persistent memory (Optane Pmem) has drawn much interest in building persistent key-value (KV) stores that take advantage of its high throughput and low latency. A major challenge in these efforts stems from the fact that Optane Pmem is essentially a hybrid storage device with two distinct properties. On one hand, it is a high-speed byte-addressable device similar to DRAM. On the other hand, writes to the Optane media are conducted in units of 256 bytes, much like a block storage device. Existing KV store designs for persistent memory do not take the latter property into account, leading to high write amplification and constraining both write and read throughput. Meanwhile, directly reusing a KV store design intended for block devices, such as an LSM-based one, would cause much higher read latency due to the former property. In this paper, we propose ChameleonDB, a KV store designed specifically for this important hybrid memory/storage device by considering and exploiting both properties in one design. It uses an LSM-tree structure to efficiently admit writes with low write amplification. It uses an in-DRAM hash table to bypass the LSM tree's multiple levels for fast reads. Meanwhile, ChameleonDB may choose to opportunistically maintain the LSM multi-level structure in the background to achieve short recovery time after a system crash. ChameleonDB's hybrid structure is designed to absorb sudden bursts of write workload, which helps avoid long-tail read latency. Our experimental results show that ChameleonDB improves write throughput by 3.3×.
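
The hybrid structure described above can be sketched in a few lines: writes append to an LSM-style log (standing in for Optane's write-friendly media), while a volatile hash table remembers each key's latest position so point reads skip the multi-level search; after a crash the table is rebuilt from the log. Compaction, background LSM maintenance, and real persistent-memory APIs are omitted.

    class HybridKV:
        def __init__(self):
            self.log = []          # persistent, append-only (simulated PM)
            self.index = {}        # volatile DRAM hash table: key -> log offset

        def put(self, key, value):
            self.log.append((key, value))          # sequential, write-friendly
            self.index[key] = len(self.log) - 1    # fast path for later reads

        def get(self, key):
            pos = self.index.get(key)
            return None if pos is None else self.log[pos][1]

        def recover(self):
            """After a crash the DRAM index is lost; rebuild it from the log."""
            self.index = {k: i for i, (k, _) in enumerate(self.log)}

    kv = HybridKV()
    kv.put("a", 1); kv.put("b", 2); kv.put("a", 3)
    print(kv.get("a"))     # 3, one hash lookup instead of searching LSM levels
    kv.index.clear()       # simulate losing DRAM contents
    kv.recover()
    print(kv.get("b"))     # 2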

DOI: 10.1145/3447786.3456237


Achieving low tail-latency and high scalability for serializable transactions in edge computing

Authors: Chen, Xusheng and Song, Haoze and Jiang, Jianyu and Ruan, Chaoyi and Li, Cheng and Wang, Sen and Zhang, Gong and Cheng, Reynold and Cui, Heming
Keywords: distributed transaction, edge computing, scalability, tail-latency

Abstract

A distributed database utilizing widespread edge computing servers to provide low-latency data access with serializability guarantees is highly desirable for emerging edge computing applications. In an edge database, nodes are divided into regions, and a transaction is categorized as intra-region (IRT) or cross-region (CRT) based on whether it accesses data in different regions. In addition to serializability, we insist that a practical edge database should provide low tail latency for both IRTs and CRTs, and such low latency must be scalable to a large number of regions. Unfortunately, none of the existing geo-replicated serializable databases or edge databases can meet these requirements. In this paper, we present Dast (Decentralized Anticipate and STretch), the first edge database that meets these stringent performance requirements with serializability. Our key idea is to order transactions by anticipating when they are ready to execute: Dast binds an IRT to the latest timestamp and binds a CRT to a future timestamp, so that the coordination of CRTs does not block IRTs. Dast also carries a new stretchable clock abstraction to tolerate inaccurate anticipations and to handle cross-region data reads. Our evaluation shows that, compared to three relevant serializable databases, Dast's 99th-percentile latency was 87.9%–93.2% lower for IRTs and 27.7%–70.4% lower for CRTs; Dast's low latency is scalable to a large number of regions.

DOI: 10.1145/3447786.3456238


OFC: an opportunistic caching system for FaaS platforms

Authors: Mvondo, Djob and Bacou, Mathieu and Nguetchouang, Kevin and Ngale, Lucien and Pouget, Stéphane, et al.
Keywords: cache, cloud computing, functions as a service (FaaS), latency, serverless

Abstract

Cloud applications based on the “Functions as a Service” (FaaS) paradigm have become very popular. Yet, due to their stateless nature, they must frequently interact with an external data store, which limits their performance. To mitigate this issue, we introduce OFC, a transparent, vertically and horizontally elastic in-memory caching system for FaaS platforms, distributed over the worker nodes. OFC provides these benefits cost-effectively by exploiting two common sources of resource waste: (i) most cloud tenants overprovision the memory reserved for their functions because the memory footprint is non-trivially input-dependent, and (ii) FaaS providers keep function sandboxes alive for several minutes to avoid cold starts. Using machine learning models adjusted for typical function input data categories (e.g., multimedia formats), OFC estimates the actual memory resources required by each function invocation and hoards the remaining capacity to feed the cache. We build our OFC prototype based on enhancements to the OpenWhisk FaaS platform, the Swift persistent object store, and the RAMCloud in-memory store. Using a diverse set of workloads, we show that OFC improves the execution time of single-stage and pipelined functions by up to 82% and 60%, respectively.
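
The caching opportunity rests on a simple accounting rule: estimate what an invocation will really need, and lend the rest of the reserved sandbox memory to the cache. The sketch below uses a made-up size-based footprint estimate in place of OFC's learned models; RESERVED_MB, the safety margin, and the formula are all illustrative.

    RESERVED_MB = 512            # memory the tenant provisioned for the function (assumed)

    def estimate_footprint_mb(input_size_mb: float) -> float:
        """Toy stand-in for the learned per-category footprint model."""
        return 64 + 2.5 * input_size_mb

    def cache_donation_mb(input_size_mb: float, safety_margin_mb: float = 32) -> float:
        need = estimate_footprint_mb(input_size_mb) + safety_margin_mb
        return max(0.0, RESERVED_MB - need)

    for size in (4, 40, 160):
        print(f"{size:>4} MB input -> lend {cache_donation_mb(size):6.1f} MB to the cache")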

DOI: 10.1145/3447786.3456239


Odyssey: the impact of modern hardware on strongly-consistent replication protocols

Authors: Gavrielatos, Vasilis and Katsarakis, Antonios and Nagarajan, Vijay
Keywords: RDMA, availability, consistency, fault-tolerant, latency, linearizability, replication, throughput

Abstract

Get/Put Key-Value Stores (KVSes) rely on replication protocols to enforce consistency and guarantee availability. Today's modern hardware, with manycore servers and RDMA-capable networks, challenges the conventional wisdom on protocol design. In this paper, we investigate the impact of modern hardware on the performance of strongly-consistent replication protocols. First, we create an informal taxonomy of replication protocols, based on which we carefully select 10 protocols for analysis. Second, we present Odyssey, a framework tailored towards protocol implementation for multi-threaded, RDMA-enabled, in-memory, replicated KVSes. We implement all 10 protocols over Odyssey and perform the first apples-to-apples comparison of replication protocols over modern hardware. Our comparison characterizes the protocol design space, revealing the performance capabilities of different classes of protocols on modern hardware. Among other things, our results demonstrate that some protocols that were efficient on yesterday's hardware are not so today, because they cannot take advantage of the abundant parallelism and fast networking present in modern hardware. Conversely, some protocols that were inefficient on yesterday's hardware are very attractive today. We distill our findings into a concise set of general guidelines and recommendations for protocol selection and design in the era of modern hardware.

DOI: 10.1145/3447786.3456240


Parallelizing packet processing in container overlay networks

Authors: Lei, Jiaxin and Munikar, Manish and Suo, Kun and Lu, Hui and Rao, Jia
Keywords: No keywords

Abstract

Container networking, which provides connectivity among containers on multiple hosts, is crucial to building and scaling container-based microservices. While overlay networks are widely adopted in production systems, they cause significant performance degradation in both throughput and latency compared to physical networks. This paper seeks to understand the bottlenecks of in-kernel networking when running container overlay networks. Through profiling and code analysis, we find that a prolonged data path, due to packet transformation in overlay networks, is the culprit of the performance loss. Furthermore, existing scaling techniques in the Linux network stack are ineffective for parallelizing the prolonged data path of a single network flow. We propose Falcon, a fast and balanced container networking approach that scales the packet processing pipeline in overlay networks. Falcon pipelines software interrupts associated with different network devices of a single flow on multiple cores, thereby preventing excessive software interrupts from being serialized on and overloading a single core. Falcon further supports multiple network flows by effectively multiplexing and balancing software interrupts of different flows among available cores. We have developed a prototype of Falcon in Linux. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of Falcon, with significantly improved performance (by 300% for web serving) and reduced tail latency (by 53% for data caching).

DOI: 10.1145/3447786.3456241


Memory-mapped I/O on steroids

Authors: Papagiannis, Anastasios and Marazakis, Manolis and Bilas, Angelos
Keywords: I/O caching, Linux mmap, fast storage devices, key-value stores, memory-mapped I/O

Abstract

With current technology trends toward fast storage devices, the host-level I/O path is emerging as a main bottleneck for modern, data-intensive servers and applications. The need to improve I/O performance requires customizing various aspects of the I/O path, including the page cache and the method of accessing storage devices. In this paper, we present Aquila, a library OS that allows applications to reduce I/O overhead by customizing the memory-mapped I/O (mmio) path for files or storage devices. Compared to Linux mmap, Aquila (a) offers full mmio compatibility and protection to minimize application modifications, (b) allows applications to customize the DRAM I/O cache, its policies, and access to storage devices, and (c) significantly reduces I/O overhead. Aquila achieves its mmio compatibility, flexibility, and performance by placing the application in a privileged domain, non-root ring 0. We show the benefits of Aquila in two cases: (a) using mmio in key-value stores to reduce I/O overhead and (b) using mmio in graph processing applications to extend the memory heap over fast storage devices. Aquila requires 2.58× fewer CPU cycles per I/O operation compared to Linux mmap.

DOI: 10.1145/3447786.3456242


Confidential computing for OpenPOWER

Authors: Hunt, Guerney D. H. and Pai, Ramachandra and Le, Michael V. and Jamjoom, Hani and Bhattiprolu, Sukadev and Boivie, Rick and Dufour, Laurent and Frey, Brad and Kapur, Mohit and Goldman, Kenneth A. and Grimm, Ryan and Janakirman, Janani and Ludden, John M. and Mackerras, Paul and May, Cathy and Palmer, Elaine R. and Rao, Bharata Bhasker and Roy, Lawrence and Starke, William A. and Stuecheli, Jeff and Valdez, Enriquillo and Voigt, Wendel
Keywords: KVM, Linux, POWER9, TEE, confidential computing, enclave, firmware, secure computing, trusted execution environment, ultravisor

Abstract

This paper presents Protected Execution Facility (PEF), a virtual machine-based Trusted Execution Environment (TEE) for confidential computing on the Power ISA. PEF enables protected secure virtual machines (SVMs). Like other TEEs, PEF verifies the SVM prior to execution. PEF uses a Trusted Platform Module (TPM), secure boot, and trusted boot, as well as newly introduced architectural changes for Power ISA systems. Exploiting these architectural changes requires new firmware, the Protected Execution Ultravisor. PEF is supported in the latest version of the POWER9 chip. PEF demonstrates that combining access control for isolation with cryptography for confidentiality is an effective approach to confidential computing. We particularly focus on how our design (i) balances access control and cryptography, (ii) maximizes the use of existing security components, and (iii) simplifies the management of the SVM life cycle. Finally, we evaluate the performance of SVMs in comparison to normal virtual machines on OpenPOWER systems.

DOI: 10.1145/3447786.3456243


Accelerating graph sampling for graph machine learning using GPUs

Authors: Jangda, Abhinav and Polisetty, Sandeep and Guha, Arjun and Serafini, Marco
Keywords: No keywords

Abstract

Representation learning algorithms automatically learn the features of data. Several representation learning algorithms for graph data, such as DeepWalk, node2vec, and GraphSAGE, sample the graph to produce mini-batches that are suitable for training a DNN. However, sampling time can be a significant fraction of training time, and existing systems do not efficiently parallelize sampling. Sampling is an “embarrassingly parallel” problem and may appear to lend itself to GPU acceleration, but the irregularity of graphs makes it hard to use GPU resources effectively. This paper presents NextDoor, a system designed to effectively perform graph sampling on GPUs. NextDoor employs a new approach to graph sampling that we call transit-parallelism, which allows load balancing and caching of edges. NextDoor provides end-users with a high-level abstraction for writing a variety of graph sampling algorithms. We implement several graph sampling applications and show that NextDoor runs them orders of magnitude faster than existing systems.
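
Transit-parallelism is easiest to see for random walks: at each step, all walks whose frontier is the same "transit" vertex are grouped so that vertex's adjacency list is processed together (on a GPU, by one thread block). The CPU sketch below shows only that grouping step; it is plain Python, not NextDoor's GPU kernels.

    import random
    from collections import defaultdict

    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

    def one_step(walks):
        """Advance every walk by one hop, grouped by transit vertex."""
        by_transit = defaultdict(list)
        for wid, path in enumerate(walks):
            by_transit[path[-1]].append(wid)      # group walks by frontier vertex
        for transit, wids in by_transit.items():
            neighbors = graph[transit]            # adjacency list touched once per group
            for wid in wids:
                walks[wid].append(random.choice(neighbors))

    random.seed(0)
    walks = [[0], [0], [3], [2]]
    for _ in range(3):
        one_step(walks)
    print(walks)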

DOI: 10.1145/3447786.3456244


RubberBand: cloud-based hyperparameter tuning

Authors: Misra, Ujval and Liaw, Richard and Dunlap, Lisa and Bhardwaj, Romil and Kandasamy, Kirthevasan and Gonzalez, Joseph E. and Stoica, Ion and Tumanov, Alexey
Keywords: distributed machine learning, hyperparameter optimization

Abstract

Hyperparameter tuning is essential to achieving state-of-the-art accuracy in machine learning (ML), but requires substantial compute resources. Existing systems primarily focus on effectively allocating resources for a hyperparameter tuning job under fixed resource constraints. We show that the available parallelism in such jobs changes dynamically over the course of execution and, therefore, presents an opportunity to leverage the elasticity of the cloud. In particular, we address the problem of minimizing the financial cost of executing a hyperparameter tuning job, subject to a time constraint. We present RubberBand—the first framework for cost-efficient, elastic execution of hyperparameter tuning jobs in the cloud. RubberBand utilizes performance instrumentation and cloud pricing to model job completion time and cost prior to runtime, and generates a cost-efficient, elastic resource allocation plan. RubberBand is able to efficiently execute this plan and realize a cost reduction of up to 2x compared to static allocation baselines.

DOI: 10.1145/3447786.3456245


Bridging the performance gap for copy-based garbage collectors atop non-volatile memory

Authors: Yang, Yanfei and Wu, Mingyu and Chen, Haibo and Zang, Binyu
Keywords: garbage collection, java virtual machine, non-volatile memory

Abstract

Non-volatile memory (NVM) is expected to revolutionize the memory hierarchy with not only non-volatility but also large capacity and power efficiency. Memory-intensive applications, which are often written in managed languages like Java, would run atop NVM for better cost-efficiency. Unfortunately, such applications may suffer from performance slowdown due to the unmanaged performance gap between DRAM and NVM. This paper studies the performance of a series of Java applications atop NVM and uncovers that copy-based garbage collection (GC), the mainstream GC algorithm, is an NVM-unfriendly component of the JVM. GC becomes a severe performance bottleneck, especially when memory resources are scarce. To this end, this paper analyzes the memory behavior of copy-based GC and uncovers that its inappropriate use of NVM bandwidth is the main reason for its performance slowdown. This paper thus proposes two NVM-aware optimizations, write cache and header map, to effectively manage the limited NVM bandwidth. It further improves GC performance with hardware instructions such as non-temporal memory accesses and prefetching. We have implemented the optimizations on two mainstream copy-based garbage collectors in OpenJDK. Evaluation with various memory-intensive applications shows that our optimizations can improve GC time, application execution time, and application tail latency by up to 2.69×.

DOI: 10.1145/3447786.3456246


Seastar: vertex-centric programming for graph neural networks

Authors: Wu, Yidi and Ma, Kaihao and Cai, Zhenkun and Jin, Tatiana and Li, Boyang and Zheng, Chenguang and Cheng, James and Yu, Fan
Keywords: deep learning systems, graph neural networks

Abstract

Graph neural networks (GNNs) have achieved breakthrough performance in graph analytics such as node classification, link prediction, and graph clustering. Many GNN training frameworks have been developed, but they are usually designed as a set of manually written, GNN-specific operators plugged into existing deep learning systems, which incurs high memory consumption, poor data locality, and a large semantic gap between algorithm design and implementation. This paper proposes the Seastar system, which presents a vertex-centric programming model for GNN training on GPUs and provides idiomatic Python constructs to enable easy development of novel homogeneous and heterogeneous GNN models. We also propose novel optimizations to produce highly efficient fused GPU kernels for the forward and backward passes in GNN training. Compared with the state-of-the-art GNN systems DGL and PyG, Seastar achieves better usability, up to 2 and 8 times less memory consumption, and 14 and 3 times faster execution, respectively.

DOI: 10.1145/3447786.3456247


Unikraft: fast, specialized unikernels the easy way

Authors: Kuenzer, Simon and Bădoiu, Vlad-Andrei, et al.
Keywords: No keywords

Abstract

Unikernels are famous for providing excellent performance in terms of boot times, throughput, and memory consumption, to name a few metrics. However, they are infamous for making it hard and extremely time-consuming to extract such performance, and for requiring significant engineering effort to port applications to them. We introduce Unikraft, a novel micro-library OS that (1) fully modularizes OS primitives so that it is easy to customize the unikernel and include only relevant components and (2) exposes a set of composable, performance-oriented APIs in order to make it easy for developers to obtain high performance. Our evaluation using off-the-shelf applications such as nginx, SQLite, and Redis shows that running them on Unikraft results in a 1.7x-2.7x performance improvement compared to Linux guests. In addition, Unikraft images for these apps are around 1MB, require less than 10MB of RAM to run, and boot in around 1ms on top of the VMM time (total boot time 3ms-40ms). Unikraft is a Linux Foundation open source project and can be found at www.unikraft.org.

DOI: 10.1145/3447786.3456248


Characterizing, exploiting, and detecting DMA code injection vulnerabilities in the presence of an IOMMU

Authors: Markuze, Alex and Vargaftik, Shay and Kupfer, Gil and Pismeny, Boris and Amit, Nadav and Morrison, Adam and Tsafrir, Dan
Keywords: No keywords

Abstract

Direct memory access (DMA) renders a system vulnerable to DMA attacks, in which I/O devices access memory regions not intended for their use. Hardware input-output memory management units (IOMMUs) can be used to provide protection. However, an IOMMU cannot prevent all DMA attacks because it only restricts DMA at page-level granularity, leading to sub-page vulnerabilities. Current DMA attacks rely on simple situations in which write access to a kernel pointer is obtained due to sub-page vulnerabilities and all other attack ingredients are available and reside on the same page. We show that DMA vulnerabilities are a deep-rooted issue and that it is often the kernel design that enables complex and multistage DMA attacks. This work presents a structured top-down approach to characterize, exploit, and detect them. To this end, we first categorize sub-page vulnerabilities into four types, providing insight into the structure of DMA vulnerabilities. We then identify a set of three vulnerability attributes that are sufficient to execute code injection attacks. We built analysis tools that detect these sub-page vulnerabilities and analyzed the Linux kernel. We found that 72% of the device drivers expose callback pointers that may be overwritten by a device to hijack the kernel control flow. Aided by our tools' output, we demonstrate novel code injection attacks on the Linux kernel; we refer to these as compound attacks. All previously reported attacks are single-step, with the vulnerability attributes present in a single page. In compound attacks, the vulnerability attributes are initially incomplete. However, we demonstrate that they can be obtained by carefully exploiting standard OS behavior.

DOI: 10.1145/3447786.3456249


Finding heterogeneous-unsafe configuration parameters in cloud systems

Authors: Ma, Sixiang and Zhou, Fang and Bond, Michael D. and Wang, Yang
Keywords: No keywords

Abstract

With the increasing prevalence of heterogeneous hardware and the increasing need for online reconfiguration, there is growing demand for heterogeneous configurations. However, allowing different nodes to have different configurations may cause errors when these nodes communicate, even if the configuration of each node uses valid values. To test which configuration parameters are unsafe when configured in a heterogeneous manner, this work reuses existing unit tests but runs them with heterogeneous configurations. To address the challenge that unit tests often share the configuration across different nodes, we incorporate several heuristics to accurately map configuration objects to nodes. To address the challenge that there are too many tests to run, we (1) “pre-run” unit tests to determine effective unit tests for each configuration parameter and (2) introduce pooled testing to test several parameters together. Our evaluation finds 41 heterogeneous-unsafe configuration parameters in Flink, HBase, HDFS, MapReduce, and YARN. We further propose suggestions and workarounds to make a subset of these parameters heterogeneous-safe.

DOI: 10.1145/3447786.3456250


Tahoe: tree structure-aware high performance inference engine for decision tree ensemble on GPU

Authors: Xie, Zhen and Dong, Wenqian and Liu, Jiawen and Liu, Hang and Li, Dong
Keywords: decision tree ensemble, decision tree inference, performance model, tree structure

Abstract

Decision trees are widely used and often assembled into forests to boost prediction accuracy. However, using decision trees for inference on GPUs is challenging because of irregular memory access patterns and imbalanced workloads across threads. This paper proposes Tahoe, a tree structure-aware, high-performance inference engine for decision tree ensembles. Tahoe rearranges tree nodes to enable efficient and coalesced memory accesses; Tahoe also rearranges trees, such that trees with similar structures are grouped together in memory and assigned to threads in a balanced way. Besides memory access efficiency, we introduce a set of inference strategies, each of which uses shared memory differently and has different implications for reduction overhead. We introduce performance models to guide the selection of inference strategies for arbitrary forests and data sets. Tahoe consistently outperforms the state-of-the-art industry-quality library FIL by 3.82x, 2.59x, and 2.75x on three generations of NVIDIA GPUs (Kepler, Pascal, and Volta), respectively.
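
The node-rearrangement idea starts from a flat, array-encoded tree: traversal becomes index arithmetic over contiguous arrays rather than pointer chasing, which is what makes coalesced GPU accesses possible in the first place. The CPU sketch below shows that encoding for a single toy tree; the GPU-specific grouping of structurally similar trees is not shown.

    # Parallel arrays encode one small decision tree; -1 in 'feature' marks a leaf.
    feature   = [0,   1,   -1, -1, -1]
    threshold = [0.5, 0.3,  0,  0,  0]
    left      = [1,   3,    0,  0,  0]
    right     = [2,   4,    0,  0,  0]
    value     = [0.0, 0.0, 1.0, 0.2, 0.8]

    def predict(x):
        node = 0
        while feature[node] != -1:
            node = left[node] if x[feature[node]] <= threshold[node] else right[node]
        return value[node]

    print(predict([0.4, 0.2]))   # root -> left -> left leaf: 0.2
    print(predict([0.9, 0.0]))   # root -> right leaf: 1.0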

DOI: 10.1145/3447786.3456251


Understanding and dealing with hard faults in persistent memory systems

Authors: Choi, Brian and Burns, Randal and Huang, Peng
Keywords: No keywords

Abstract

The advent of Persistent Memory (PM) devices enables systems to actively persist information at low cost, including program state traditionally kept in volatile memory. However, this trend poses a reliability challenge in which multiple classes of soft faults that go away after restart in traditional systems turn into hard (recurring) faults in PM systems. In this paper, we first characterize this rising problem with an empirical study of 28 real-world bugs. We analyze how they cause hard faults in PM systems. We then propose Arthas, a tool to effectively recover PM systems from hard faults. Arthas checkpoints PM states via fine-grained versioning and uses program slicing of fault instructions to revert problematic PM states to good versions. We evaluate Arthas on 12 real-world hard faults from five large PM systems. Arthas successfully recovers the systems in all cases while discarding 10× less data.

DOI: 10.1145/3447786.3456252


Tesseract: distributed, general graph pattern mining on evolving graphs

Authors: Bindschaedler, Laurent and Malicevic, Jasmina and Lepers, Baptiste and Goel, Ashvin and Zwaenepoel, Willy
Keywords: differential, distributed, dynamic, evolving, graph, graph store, incremental, mining, pattern, processing, streaming, subgraph matching, temporal, tesseract

Abstract

Tesseract is the first distributed system for executing general graph mining algorithms on evolving graphs. Tesseract scales out by decomposing a stream of graph updates into per-update mining tasks and dynamically assigning these tasks to a set of distributed workers. We present a novel approach to change detection that efficiently determines the exact modifications to the algorithm’s output for each update to the input graph. We use a disaggregated, multiversioned graph store to allow workers to process updates independently, without producing duplicates. Moreover, Tesseract provides interactive mining insights for complex applications using an incremental aggregation API. Finally, we implement and evaluate Tesseract and demonstrate that it achieves orders-of-magnitude improvements over state-of-the-art systems.

DOI: 10.1145/3447786.3456253


Profiling dataflow systems on multiple abstraction levels

Authors: Beischl, Alexander and Kersten, Timo and Bandle, Maximilian and Giceva, Jana and Neumann, Thomas
Keywords: dataflow systems, profiling, query compilation

Abstract

Dataflow graphs are a popular abstraction for describing computation, used in many systems for high-level optimization. For execution, dataflow graphs are lowered and optimized through layers of program representations down to machine instructions. Unfortunately, profiling such systems is cumbersome, as today's profilers present results merely at instruction and function granularity. This obfuscates the connection between profiles and high-level constructs such as operators and pipelines, making the interpretation of profiles an exercise in puzzling and deduction. In this paper, we show how to profile compiling dataflow systems at higher abstraction levels. Our approach tracks the code generation process and aggregates profiling data to any abstraction level. This bridges the semantic gap to match the engineer's current information need and even creates a comprehensible way to report timing information within profiling data. We have evaluated this approach within our compiling DBMS Umbra, showing that the approach is generally applicable to compiling dataflow systems and can be implemented with high accuracy and reasonable overhead.

DOI: 10.1145/3447786.3456254


