EuroSys 2020

Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning

Authors: Chaudhary, Shubham and Ramjee, Ramachandran and Sivathanu, Muthian and Kwatra, Nipun and Viswanatha, Srinidhi
Keywords: No keywords

Abstract

We present Gandivafair, a distributed, fair share scheduler that balances the conflicting goals of efficiency and fairness in GPU clusters for deep learning training (DLT). Gandivafair provides performance isolation between users, enabling multiple users to share a single cluster and thus maximizing cluster efficiency. Gandivafair is the first scheduler that allocates cluster-wide GPU time fairly among active users. Gandivafair achieves efficiency and fairness despite cluster heterogeneity. Data centers host a mix of GPU generations because of the rapid pace at which newer and faster GPUs are released. As the newer generations face higher demand from users, older GPU generations suffer poor utilization, thus reducing cluster efficiency. Gandivafair profiles the variable marginal utility of newer GPUs across various jobs, and transparently incentivizes users to use older GPUs through a novel resource trading mechanism that maximizes cluster efficiency without affecting the fairness guarantees of any user. With a prototype implementation and evaluation in a heterogeneous 200-GPU cluster, we show that Gandivafair achieves both fairness and efficiency under realistic multi-user workloads.

DOI: 10.1145/3342195.3387555


Experiences of landing machine learning onto market-scale mobile malware detection

Authors: Gong, Liangyi and Li, Zhenhua and Qian, Feng and Zhang, Zifan and Chen, Qi Alfred and Qian, Zhiyun and Lin, Hao and Liu, Yunhao
Keywords: No keywords

Abstract

App markets, being crucial and critical for today’s mobile ecosystem, have also become a natural malware delivery channel since they actually “lend credibility” to malicious apps. In the past decade, machine learning (ML) techniques have been explored for automated, robust malware detection. Unfortunately, to date, we have yet to see an ML-based malware detection solution deployed at market scales. To better understand the real-world challenges, we conduct a collaborative study with a major Android app market (T-Market) offering us large-scale ground-truth data. Our study shows that the key to successfully developing such systems is manifold, including feature selection/engineering, app analysis speed, developer engagement, and model evolution. Failure in any of the above aspects would lead to the “wooden barrel effect” of the entire system. We discuss our careful design choices as well as our first-hand deployment experiences in building such an ML-powered malware detection system. We implement our design and examine its effectiveness in the T-Market for over one year, using a single commodity server to vet ~ 10K apps every day. The evaluation results show that this design achieves an overall precision of 98% and recall of 96% with an average per-app scan time of 1.3 minutes.

DOI: 10.1145/3342195.3387530


Provable multicore schedulers with Ipanema: application to work conservation

Authors: Lepers, Baptiste and Gouicem, Redha and Carver, Damien and Lozi, Jean-Pierre and Palix, Nicolas and Aponte, Maria-Virginia and Zwaenepoel, Willy and Sopena, Julien and Lawall, Julia and Muller, Gilles
Keywords: scheduling, formal verification, Linux

Abstract

Recent research and bug reports have shown that work conservation, the property that a core is idle only if no other core is overloaded, is not guaranteed by Linux’s CFS or FreeBSD’s ULE multicore schedulers. Indeed, multicore schedulers are challenging to specify and verify: they must operate under stringent performance requirements, while handling very large numbers of concurrent operations on threads. As a consequence, the verification of correctness properties of schedulers has not yet been considered. In this paper, we propose an approach, based on a domain-specific language and theorem provers, for developing schedulers with provable properties. We introduce the notion of concurrent work conservation (CWC), a relaxed definition of work conservation that can be achieved in a concurrent system where threads can be created, unblocked and blocked concurrently with other scheduling events. We implement several scheduling policies, inspired by CFS and ULE. We show that our schedulers obtain the same level of performance as production schedulers, while satisfying concurrent work conservation.

DOI: 10.1145/3342195.3387544


Delegation sketch: a parallel design with support for fast and accurate concurrent operations

Authors: Stylianopoulos, Charalampos and Walulya, Ivan and Almgren, Magnus and Landsiedel, Olaf and Papatriantafilou, Marina
Keywords: No keywords

Abstract

Sketches are data structures designed to answer approximate queries by trading memory overhead for accuracy guarantees. More specifically, sketches efficiently summarize large, high-rate streams of data and quickly answer queries on these summaries. In order to support such high throughput rates on modern architectures, parallelization and support for fast queries play a central role, especially when monitoring unpredictable data that can change rapidly, as in, e.g., network monitoring for large-scale denial-of-service attacks. However, most existing parallel sketch designs have focused either on a high insertion rate or on a high query rate, and fail to support cases where these operations are concurrent. In this work we examine the trade-off between query and insertion efficiency and propose Delegation Sketch, a parallelization design for sketch-based data structures that efficiently supports concurrent insertions and queries. Delegation Sketch introduces a domain splitting scheme that uses multiple, parallel sketches to ensure all occurrences of a key fall into the same sketch. We complement the design with synchronization mechanisms that facilitate delegation of insertions and queries among threads, enabling it to process streams at higher rates, even in the presence of concurrent queries. We thoroughly evaluate Delegation Sketch across multiple dimensions (accuracy, scalability, query rate and input skew) on two massively parallel platforms (including a NUMA architecture) using both synthetic and real data. We show that Delegation Sketch achieves 2.5X to 4X higher throughput, depending on the rate of concurrent queries, than the best performing alternative, while maintaining better accuracy at the same memory cost.

DOI: 10.1145/3342195.3387542
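
Example (not from the paper): to make the domain-splitting idea concrete, here is a minimal Python sketch in which every occurrence of a key is routed to the same sub-sketch, so a query needs to consult only one of them. The count-min structure and the routing function are illustrative assumptions, not Delegation Sketch’s actual lock-free design.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: d rows of w counters."""
    def __init__(self, w=1024, d=4):
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]

    def _indexes(self, key):
        for i in range(self.d):
            h = hashlib.blake2b(key.encode(), salt=bytes([i])).digest()
            yield i, int.from_bytes(h[:8], "big") % self.w

    def insert(self, key):
        for i, j in self._indexes(key):
            self.rows[i][j] += 1

    def query(self, key):
        # Count-min returns the minimum over d counters (an overestimate).
        return min(self.rows[i][j] for i, j in self._indexes(key))

class DelegationSketch:
    """Domain splitting: all occurrences of a key land in one sub-sketch."""
    def __init__(self, num_sketches=8):
        self.sketches = [CountMinSketch() for _ in range(num_sketches)]

    def _owner(self, key):
        # hash() is stable within one process, which is all we need here.
        return hash(key) % len(self.sketches)

    def insert(self, key):
        self.sketches[self._owner(key)].insert(key)

    def query(self, key):
        # Only the owning sub-sketch must be consulted.
        return self.sketches[self._owner(key)].query(key)
```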


Persistent memory and the rise of universal constructions

Authors: Correia, Andreia and Felber, Pascal and Ramalhete, Pedro
Keywords: No keywords

Abstract

Non-Volatile Main Memory (NVMM) has brought forth the need for data structures that are not only concurrent but also resilient to non-corrupting failures. Until now, persistent transactional memory libraries (PTMs) have focused on providing correct recovery from non-corrupting failures without memory leaks. Most PTMs that provide concurrent access do so with blocking progress. The main focus of this paper is to design practical PTMs with wait-free progress based on universal constructions. We first present CX-PUC, the first bounded wait-free persistent universal construction requiring no annotation of the underlying sequential data structure. CX-PUC is an adaptation to persistence of CX, a recently proposed universal construction. We next introduce CX-PTM, a PTM that achieves better throughput and supports transactions over multiple data structure instances, at the price of requiring annotation of the loads and stores in the data structure—as is commonplace in software transactional memory. Finally, we propose a new generic construction, Redo-PTM, based on a finite number of replicas and Herlihy’s wait-free consensus, which uses physical instead of logical logging. By exploiting its capability of providing wait-free ACID transactions, we have used Redo-PTM to implement the world’s first persistent key-value store with bounded wait-free progress.

DOI: 10.1145/3342195.3387515


Design of a symbolically executable embedded hypervisor

Authors: Nordholz, Jan
Keywords: symbolic execution, hypervisor, embedded, SAT solving

Abstract

Hypervisor implementations such as XMHF, Nova, PROSPER, prplHypervisor, the various L4 descendants, as well as KVM and Xen offer mechanisms for dynamic startup and reconfiguration, including the allocation, delegation and destruction of objects and resources at runtime. Some use cases such as cloud computing depend on this dynamicity, yet its inclusion also renders the state space intractable to simulation-based verification tools. On the other hand, system architectures for embedded devices are often fixed in the number and properties of isolated tasks; therefore, a much simpler, less dynamic hypervisor design would suffice. We close this design gap by presenting Phidias, a new hypervisor consisting of a minimal runtime codebase that is almost devoid of dynamicity, and a comprehensive compile-time configuration framework. We then leverage this lack of dynamic components to non-interactively verify the validity of certain invariants. Specifically, we verify hypervisor integrity by subjecting the compiled hypervisor binary to our own symbolic execution engine. Finally, we discuss our results, point out possible improvements, and hint at unexplored characteristics of a static hypervisor design.

DOI: 10.1145/3342195.3387516


Autarky: closing controlled channels with self-paging enclaves

Authors: Orenbach, Meni and Baumann, Andrew and Silberstein, Mark
Keywords: No keywords

Abstract

As the first widely-deployed secure enclave hardware, Intel SGX shows promise as a practical basis for confidential cloud computing. However, side channels remain SGX’s greatest security weakness. In particular, the “controlled-channel attack” on enclave page faults exploits a longstanding architectural side channel and still lacks effective mitigation. We propose Autarky: a set of minor, backward-compatible modifications to the SGX ISA that hide an enclave’s page access trace from the host and give the enclave full control over its page faults. A trusted library OS implements an enclave self-paging policy. We prototype Autarky on current SGX hardware and the Graphene library OS, implementing three paging schemes: a fast software oblivious RAM system made practical by leveraging the proposed ISA, a novel page cluster abstraction for application-aware secure self-paging, and a rate-limiting paging mechanism for unmodified binaries. Overall, Autarky provides a comprehensive defense against controlled-channel attacks that supports efficient secure demand paging and adds no overhead in page-fault-free execution.

DOI: 10.1145/3342195.3387541


Scalable range locks for scalable address spaces and beyond

Authors: Kogan, Alex and Dice, Dave and Issa, Shady
Keywords: semaphores, scalable synchronization, reader-writer locks, parallel file systems, lock-less, linux kernel

Abstract

Range locks are a synchronization construct designed to give multiple threads (or processes) concurrent access to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing interest in the Linux kernel community, which seeks to alleviate bottlenecks in the virtual memory management subsystem. The existing implementation of range locks in the kernel, however, uses an internal spin lock to protect the underlying tree structure that keeps track of acquired and requested ranges. This spin lock becomes a point of contention on its own when the range lock is frequently acquired. Furthermore, where and exactly how specific (refined) ranges can be locked remains an open question. In this paper, we make two independent but related contributions. First, we propose an alternative approach for building range locks based on linked lists. The lists are easy to maintain in a lock-less fashion, and in fact, our range locks do not use any internal locks in the common case. Second, we show how the range of the lock can be refined in the mprotect operation through a speculative mechanism. This refinement, in turn, allows concurrent execution of mprotect operations on non-overlapping memory regions. We implement our new algorithms and demonstrate their effectiveness in user-space and kernel-space, achieving up to 9X speedup compared to the stock version of the Linux kernel. Beyond the virtual memory management subsystem, we discuss other applications of range locks in parallel software. As a concrete example, we show how range locks can be used to facilitate the design of scalable concurrent data structures, such as skip lists.

DOI: 10.1145/3342195.3387533
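
For readers unfamiliar with the construct, the sketch below shows the range-lock interface in Python: acquire(start, end) blocks only while an overlapping range is held, so operations on disjoint ranges proceed in parallel. This toy version guards its list with an internal mutex, which is precisely what the paper’s lock-less list design avoids.

```python
import threading

class RangeLock:
    """Toy list-based range lock (interface sketch only)."""
    def __init__(self):
        self._held = []                  # (start, end) ranges currently locked
        self._cond = threading.Condition()

    def _overlaps(self, s, e):
        return any(s < he and hs < e for hs, he in self._held)

    def acquire(self, start, end):
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()

# Two threads can hold [0, 100) and [100, 200) at the same time,
# e.g., concurrent mprotect-style operations on disjoint regions.
```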


Avoiding scheduler subversion using scheduler-cooperative locks

Authors: Patel, Yuvraj and Yang, Leon and Arulraj, Leo and Arpaci-Dusseau, Andrea C. and Arpaci-Dusseau, Remzi H. and Swift, Michael M.
Keywords: No keywords

Abstract

We introduce the scheduler subversion problem, where lock usage patterns determine which thread runs, thereby subverting CPU scheduling goals. To mitigate this problem, we introduce Scheduler-Cooperative Locks (SCLs), a new family of locking primitives that controls lock usage and thus aligns with system-wide scheduling goals; our initial work focuses on proportional share schedulers. Unlike existing locks, SCLs provide an equal (or proportional) time window called lock opportunity within which each thread can acquire the lock. We design and implement three different scheduler-cooperative locks that work well with proportional-share schedulers: a user-level mutex lock (u-SCL), a reader-writer lock (RW-SCL), and a simplified kernel implementation (k-SCL). We demonstrate the effectiveness of SCLs in two user-space applications (UpScaleDB and KyotoCabinet) and the Linux kernel. In all three cases, regardless of lock usage patterns, SCLs ensure that each thread receives proportional lock allocations that match those of the CPU scheduler. Using microbenchmarks, we show that SCLs are efficient and achieve high performance with minimal overhead under extreme workloads.

DOI: 10.1145/3342195.3387521
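
A rough illustration of the “lock opportunity” idea in Python: the wrapper tracks how long each thread has held the lock and makes a thread that is over its fair share back off before re-acquiring. This is only a sketch of the accounting principle; u-SCL’s actual algorithm (and its proportional, rather than equal, shares) is more sophisticated.

```python
import collections
import threading
import time

class FairUsageLock:
    """Sketch: penalize threads whose cumulative hold time exceeds the average."""
    def __init__(self):
        self._lock = threading.Lock()
        self._usage = collections.defaultdict(float)   # thread id -> seconds held

    def acquire(self):
        me = threading.get_ident()
        with self._lock:                               # snapshot usage safely
            usage = dict(self._usage)
        excess = 0.0
        if usage:
            avg = sum(usage.values()) / len(usage)
            excess = usage.get(me, 0.0) - avg
        if excess > 0:
            time.sleep(excess)       # cede lock opportunity to other threads
        self._lock.acquire()
        self._t0 = time.monotonic()  # only the holder touches _t0

    def release(self):
        self._usage[threading.get_ident()] += time.monotonic() - self._t0
        self._lock.release()
```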


Statically inferring performance properties of software configurations

Authors: Li, Chi and Wang, Shu and Hoffmann, Henry and Lu, Shan
Keywords: static analysis, software configuration, performance, distributed systems

Abstract

Modern software systems often have a huge number of configurations whose performance properties are poorly documented. Unfortunately, obtaining a good understanding of these performance properties is a prerequisite for performance tuning. This paper explores a new approach to discovering performance properties of system configurations: static program analysis. We present a taxonomy of how a configuration might affect performance through program dependencies. Guided by this taxonomy, we design LearnConf, a static analysis tool that identifies which configurations affect what type of performance and how. Our evaluation, which considers hundreds of configurations in four widely used distributed systems, demonstrates that LearnConf can accurately and efficiently identify many configurations’ performance properties, and help performance tuning.

DOI: 10.1145/3342195.3387520


A Linux in unikernel clothing

Authors: Kuo, Hsuan-Chi and Williams, Dan and Koller, Ricardo and Mohan, Sibin
Keywords: No keywords

Abstract

Unikernels leverage library OS architectures to run isolated workloads on the cloud. They have garnered attention in part due to their promised performance characteristics such as small image size, fast boot time, low memory footprint and application performance. However, those that aimed at generality fall short of the application compatibility, robustness and, more importantly, community that is available for Linux. In this paper, we describe and evaluate Lupine Linux, a standard Linux system that—through kernel configuration specialization and system call overhead elimination—achieves unikernel-like performance, in fact outperforming at least one reference unikernel in all of the above dimensions. At the same time, Lupine can run any application (since it is Linux) when faced with more general workloads, whereas many unikernels simply crash. We demonstrate a graceful degradation of unikernel-like performance properties.

DOI: 10.1145/3342195.3387526


Subway: minimizing data transfer during out-of-GPU-memory graph processing

Authors: Sabet, Amir Hossein Nodehi and Zhao, Zhijia and Gupta, Rajiv
Keywords: No keywords

Abstract

In many graph-based applications, the graphs tend to grow, imposing a great challenge for GPU-based graph processing. When the graph size exceeds the device memory capacity (i.e., GPU memory oversubscription), the performance of graph processing often degrades dramatically due to the sheer amount of data transfer between CPU and GPU. To reduce the volume of data transfer, existing approaches track the activeness of graph partitions and only load the ones that need to be processed. In fact, recent advances in unified memory implement this optimization implicitly by loading memory pages on demand. However, either way, the benefits are limited by the coarse-granularity activeness tracking - each loaded partition or memory page may still carry a large fraction of inactive edges. In this work, we present, to the best of our knowledge, the first solution that loads only the active edges of the graph into GPU memory. To achieve this, we design a fast subgraph generation algorithm with a simple yet efficient subgraph representation and a GPU-accelerated implementation. They allow the subgraph generation to be applied in almost every iteration of vertex-centric graph processing. Furthermore, we bring asynchrony to the subgraph processing, delaying the synchronization between a subgraph in GPU memory and the rest of the graph in CPU memory. This can safely reduce the need to generate and load subgraphs for many common graph algorithms. Our prototyped system, Subway (subgraph processing with asynchrony), yields over 4X speedup on average compared with existing out-of-GPU-memory solutions and the unified-memory-based approach, based on an evaluation with six common graph algorithms.

DOI: 10.1145/3342195.3387537
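
A simplified Python sketch of the core step, building a compact subgraph that contains only the edges leaving active vertices (the data actually worth transferring to the GPU). Subway’s real representation and its GPU-accelerated generation are more involved; this just shows the filtering idea on a CSR graph.

```python
def extract_active_subgraph(row_ptr, col_idx, active):
    """Collect only the edges of active vertices from a CSR graph."""
    sub_src, sub_dst = [], []
    for v in range(len(row_ptr) - 1):
        if not active[v]:
            continue                    # inactive vertices contribute nothing
        for e in range(row_ptr[v], row_ptr[v + 1]):
            sub_src.append(v)
            sub_dst.append(col_idx[e])
    return sub_src, sub_dst

# 3 vertices with edges 0->1, 0->2, 1->2; only vertex 0 is active.
row_ptr, col_idx = [0, 2, 3, 3], [1, 2, 2]
print(extract_active_subgraph(row_ptr, col_idx, [True, False, False]))
# -> ([0, 0], [1, 2]): only vertex 0's edges would be copied to the GPU
```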


Peregrine: a pattern-aware graph mining system

Authors: Jamshidi, Kasra and Mahadasa, Rakesh and Vora, Keval
Keywords: No keywords

Abstract

Graph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. General purpose graph mining systems provide a generic runtime to explore subgraph structures of interest with the help of user-defined functions that guide the overall exploration process. However, the state-of-the-art graph mining systems remain largely oblivious to the shape (or pattern) of the subgraphs that they mine. This causes them to: (a) explore unnecessary subgraphs; (b) perform expensive computations on the explored subgraphs; and (c) hold intermediate partial subgraphs in memory; all of which affect their overall performance. Furthermore, their programming models are often tied to their underlying exploration strategies, which makes it difficult for domain users to express complex mining tasks. In this paper, we develop Peregrine, a pattern-aware graph mining system that directly explores the subgraphs of interest while avoiding exploration of unnecessary subgraphs, and simultaneously bypassing expensive computations throughout the mining process. We design a pattern-based programming model that treats graph patterns as first-class constructs and enables Peregrine to extract the semantics of patterns, which it uses to guide its exploration. Our evaluation shows that Peregrine outperforms state-of-the-art distributed and single-machine graph mining systems, and scales to complex mining tasks on larger graphs, while retaining simplicity and expressivity with its ‘pattern-first’ programming approach.

DOI: 10.1145/3342195.3387548


Can far memory improve job throughput?

Authors: Amaro, Emmanuel and Branner-Augmon, Christopher and Luo, Zhihong and Ousterhout, Amy and Aguilera, Marcos K. and Panda, Aurojit and Ratnasamy, Sylvia and Shenker, Scott
Keywords: No keywords

Abstract

As memory requirements grow, and advances in memory technology slow, the availability of sufficient main memory is increasingly the bottleneck in large compute clusters. One solution to this is memory disaggregation, where jobs can remotely access memory on other servers, or far memory. This paper first presents faster swapping mechanisms and a far memory-aware cluster scheduler that make it possible to support far memory at rack scale. Then, it examines the conditions under which this use of far memory can increase job throughput. We find that while far memory is not a panacea, for memory-intensive workloads it can provide performance improvements on the order of 10% or more even without changing the total amount of memory available.

DOI: 10.1145/3342195.3387522


A fault-tolerance shim for serverless computing

Authors: Sreekanti, Vikram and Wu, Chenggang and Chhatrapati, Saurav and Gonzalez, Joseph E. and Hellerstein, Joseph M. and Faleiro, Jose M.
Keywords: No keywords

Abstract

Serverless computing has grown in popularity in recent years, with an increasing number of applications being built on Functions-as-a-Service (FaaS) platforms. By default, FaaS platforms support retry-based fault tolerance, but this is insufficient for programs that modify shared state, as they can unwittingly persist partial sets of updates in case of failures. To address this challenge, we would like atomic visibility of the updates made by a FaaS application. In this paper, we present aft, an atomic fault tolerance shim for serverless applications. aft interposes between a commodity FaaS platform and a storage engine and ensures atomic visibility of updates by enforcing the read atomic isolation guarantee. aft supports new protocols to guarantee read atomic isolation in the serverless setting. We demonstrate that aft introduces minimal overhead relative to existing storage engines and scales smoothly to thousands of requests per second, while preventing a significant number of consistency anomalies.

DOI: 10.1145/3342195.3387535


Autopilot: workload autoscaling at Google

Authors: Rzadca, Krzysztof and Findeisen, Pawel and Swiderski, Jacek and Zych, Przemyslaw and Broniek, Przemyslaw and Kusmierek, Jarek and Nowak, Pawel and Strack, Beata and Witusowski, Piotr and Hand, Steven and Wilkes, John
Keywords: No keywords

Abstract

In many public and private Cloud systems, users need to specify a limit for the amount of resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits might be throttled or killed, resulting in delayed or dropped end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource wastage. To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack - the difference between the limit and the actual resource usage - while minimizing the risk that a task is killed with an out-of-memory (OOM) error or has its performance degraded because of CPU throttling. Autopilot uses machine learning algorithms applied to historical data about prior executions of a job, plus a set of finely-tuned heuristics, to walk this line. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually-managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10. Despite its advantages, ensuring that Autopilot was widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google’s fleet-wide resource usage.

DOI: 10.1145/3342195.3387524
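
Reading the slack percentages as a fraction of the limit (the abstract defines slack as the gap between limit and actual usage; normalizing by the limit is our assumption), the numbers work out as follows:

```latex
\[
\text{slack} = \frac{\text{limit} - \text{usage}}{\text{limit}}
\]
% Worked example with the abstract's figures: for a job using ~54 units on
% average, a manually-set limit of 100 units gives slack (100-54)/100 = 46%,
% while an Autopilot-chosen limit of ~70 units gives slack (70-54)/70 ~ 23%.
```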


Meerkat: multicore-scalable replicated transactions following the zero-coordination principle

Authors: Szekeres, Adriana and Whittaker, Michael and Li, Jialin and Sharma, Naveen Kr. and Krishnamurthy, Arvind and Ports, Dan R. K. and Zhang, Irene
Keywords: No keywords

Abstract

Traditionally, the high cost of network communication between servers has hidden the impact of cross-core coordination in replicated systems. However, new technologies, like kernel-bypass networking and faster network links, have exposed hidden bottlenecks in distributed systems. This paper explores how to build multicore-scalable, replicated storage systems. We introduce a new guideline for their design, called the Zero-Coordination Principle. We use this principle to design a new multicore-scalable, in-memory, replicated key-value store, called Meerkat. Unlike existing systems, Meerkat eliminates all cross-core and cross-replica coordination, both of which pose a scalability bottleneck. Our experiments found that Meerkat is able to scale up to 80 hyper-threads and execute 8.3 million transactions per second. Meerkat represents a 12X improvement over state-of-the-art, fault-tolerant, in-memory, transactional storage systems built using leader-based replication and a shared transaction log.

DOI: 10.1145/3342195.3387529


MPTEE: bringing flexible and efficient memory protection to Intel SGX

Authors: Zhao, Wenjia and Lu, Kangjie and Qi, Yong and Qi, Saiyu
Keywords: No keywords

Abstract

Intel Software Guard Extensions (SGX), a hardware-based Trusted Execution Environment (TEE), has become a promising solution to stopping critical threats such as insider attacks and remote exploits. SGX has recently drawn extensive research in two directions—using it to protect the confidentiality and integrity of sensitive data, and protecting itself from attacks. Both the applications and defense mechanisms of SGX have a fundamental need—flexible memory protection that updates memory-page permissions dynamically and enforces the least-privilege principle. Unfortunately, SGX does not provide such a memory-protection mechanism due to the lack of hardware support and the untrustedness of operating systems. This paper proposes MPTEE, a memory-protection mechanism that provides flexible and efficient enforcement of memory-page permissions in SGX. The enforcement relies on our elastic cross-region bound check technique, which uses only three bound registers but provides six memory permissions. To defend MPTEE against potential attacks, we further develop an efficient mechanism that exploits the in-place bound-check technique to ensure the integrity of the memory protection. With MPTEE, developers can enhance the protection of data and code in SGX enclaves and readily enforce the least-privilege principle, such as Execute-no-Read memory. We have implemented MPTEE and extensively evaluated its effectiveness, utility, and performance. The results show that MPTEE incurs a performance overhead of only 2%–8%, and is effective in ensuring memory protection and in defending against potential attacks.

DOI: 10.1145/3342195.3387536


Rhythm: component-distinguishable workload deployment in datacenters

Authors: Zhao, Laiping and Yang, Yanan and Zhang, Kaixuan and Zhou, Xiaobo and Qiu, Tie and Li, Keqiu and Bao, Yungang
Keywords: No keywords

Abstract

Cloud service providers improve resource utilization by co-locating latency-critical (LC) workloads with best-effort batch (BE) jobs in datacenters. However, they usually treat an LC workload as a whole when allocating resources to BE jobs and neglect the different features of the components of an LC workload. This kind of coarse-grained co-location method leaves significant room for improvement in resource utilization. Based on the observation that different LC components tolerate interference differently, we propose a new abstraction called Servpod, a collection of LC parts that are deployed together on the same physical machine, and show its merits for building a fine-grained co-location framework. The key idea is to differentiate the BE throughput launched with each LC Servpod, i.e., a Servpod with high interference tolerance can be deployed along with more BE jobs. Based on Servpods, we present Rhythm, a co-location controller that maximizes resource utilization while guaranteeing the LC service’s tail-latency requirement. It quantifies the interference tolerance of each Servpod through an analysis of its tail-latency contribution. We evaluate Rhythm using LC services in the form of containerized processes and microservices, and find that it can improve system throughput by 31.7%, CPU utilization by 26.2%, and memory-bandwidth utilization by 34% while guaranteeing the SLA (service level agreement).

DOI: 10.1145/3342195.3387534


Improving resource utilization by timely fine-grained scheduling

Authors: Jin, Tatiana and Cai, Zhenkun and Li, Boyang and Zheng, Chengguang and Jiang, Guanxian and Cheng, James
Keywords: resource utilization, distributed computing

Abstract

A monotask is a unit of work that uses only a single type of resource (e.g., CPU, network, disk I/O). While monotasks were primarily introduced as a means to reason about job performance, in this paper we show that this fine-grained, resource-oriented abstraction can be leveraged by job schedulers to maximize cluster resource utilization. Although recent cluster schedulers have significantly improved resource allocation, the utilization of the allocated resources is often not high due to inaccurate resource requests. In particular, we show that existing scheduling mechanisms are ineffective for handling jobs with dynamic resource usage, which is common in real workloads, and propose a resource negotiation mechanism between job schedulers and executors that makes use of monotasks. We design a new framework, called Ursa, which enables the scheduler to capture accurate resource demands dynamically from the execution runtime and to provide timely, fine-grained resource allocation based on monotasks. Ursa also enables high utilization of the allocated resources by the execution runtime. We show by experiments that Ursa is able to improve cluster resource utilization, which effectively translates into improved makespan and average job completion time (JCT).

DOI: 10.1145/3342195.3387551


AniFilter: parallel and failure-atomic cuckoo filter for non-volatile memories

Authors: Oh, Hyungjun and Cho, Bongki and Kim, Changdae and Park, Heejin and Seo, Jiwon
Keywords: No keywords

Abstract

Approximate Membership Query (AMQ) data structures are widely used in databases, storage systems, and other domains. Recent advances in Non-Volatile Memory (NVM) technologies have made byte-addressable, high-performance persistent memories possible. This paper presents an optimized persistent AMQ for NVM, called AniFilter (AF). Based on the Cuckoo Filter (CF), AF improves insertion throughput on NVM with Spillable Buckets and Lookahead Eviction, and lookup throughput with Bucket Primacy. To analyze the effect of our optimizations, we design a probabilistic model to estimate the performance of CF and AF. For failure atomicity, AF writes a minimal amount of log data to maintain its consistency. We evaluated AF and four other AMQs - CF, Morton Filter (MF), Rank-and-Select Quotient Filter (RSQF), and Bloom Filter (BF) - on NVM, using Intel Optane DC Persistent Memory and Quartz emulation. AF and CF are generally much faster than the other filters for sequential runs. However, at high load factors CF’s insertion throughput becomes prohibitively low due to the eviction overhead. With our optimizations, AF’s insertion throughput is the fastest even at high load factors. In parallel evaluation, AF’s performance is substantially higher than CF’s for both insertion and lookup. Our optimizations reduce bandwidth consumption, making AF’s parallel performance much faster than CF’s on bandwidth-limited NVM. For parallel insertion AF is up to 10.7X faster than CF (2.6X faster on average), and for parallel lookup AF is up to 1.2X faster (1.1X faster on average).

DOI: 10.1145/3342195.3387528
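
To ground the terminology, below is a toy Python cuckoo filter extended with a per-bucket overflow slot, one plausible reading of what a “spillable bucket” buys: a full bucket can absorb one more fingerprint instead of immediately starting an eviction chain. This is an illustrative guess at the flavor of the optimization, not AniFilter’s actual NVM layout or algorithm.

```python
import random

class TinyCuckooFilter:
    """Toy cuckoo filter with a per-bucket 'spill' slot (illustrative only).
    num_buckets must be a power of two for the XOR alternate-index trick."""
    BUCKET_SLOTS, MAX_KICKS = 4, 500

    def __init__(self, num_buckets=1024):
        self.n = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.spill = [None] * num_buckets         # one overflow slot per bucket

    def _fp(self, key):
        return (hash(("fp", key)) & 0xFF) or 1    # non-zero 8-bit fingerprint

    def _alt(self, i, fp):
        return (i ^ hash(("idx", fp))) % self.n   # partial-key cuckoo hashing

    def insert(self, key):
        fp = self._fp(key)
        i1 = hash(key) % self.n
        i2 = self._alt(i1, fp)
        for i in (i1, i2):                        # normal cuckoo insertion
            if len(self.buckets[i]) < self.BUCKET_SLOTS:
                self.buckets[i].append(fp)
                return True
        for i in (i1, i2):                        # "spill" before evicting
            if self.spill[i] is None:
                self.spill[i] = fp
                return True
        i = random.choice((i1, i2))               # fall back to eviction chain
        for _ in range(self.MAX_KICKS):
            j = random.randrange(self.BUCKET_SLOTS)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.BUCKET_SLOTS:
                self.buckets[i].append(fp)
                return True
        return False                              # table is too full

    def lookup(self, key):
        fp = self._fp(key)
        i1 = hash(key) % self.n
        i2 = self._alt(i1, fp)
        return any(fp in self.buckets[i] or self.spill[i] == fp
                   for i in (i1, i2))
```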


Balancing storage efficiency and data confidentiality with tunable encrypted deduplication

Authors: Li, Jingwei and Yang, Zuoru and Ren, Yanjing and Lee, Patrick P. C. and Zhang, Xiaosong
Keywords: No keywords

Abstract

Conventional encrypted deduplication approaches retain the deduplication capability on duplicate chunks after encryption by always deriving the key for encryption/decryption from the chunk content, but such a deterministic nature causes information leakage due to frequency analysis. We present TED, a tunable encrypted deduplication primitive that provides a tunable mechanism for balancing the tradeoff between storage efficiency and data confidentiality. The core idea of TED is that its key derivation is based on not only the chunk content but also the number of duplicate chunk copies, such that duplicate chunks are encrypted by distinct keys in a controlled manner. In particular, TED allows users to configure a storage blowup factor, under which the information leakage quantified by an information-theoretic measure is minimized for any input workload. We implement an encrypted deduplication prototype TEDStore to realize TED in networked environments. Evaluation on real-world file system snapshots shows that TED effectively balances the trade-off between storage efficiency and data confidentiality, with small performance overhead.

DOI: 10.1145/3342195.3387531
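
A minimal Python sketch of the key-derivation idea stated in the abstract: the key depends on both the chunk content and how many copies of that chunk have been seen, so duplicates are spread over several keys in a controlled way. The exact bucketing rule below (a new key every t copies) and the local counter are illustrative assumptions, not TED’s actual scheme or key-manager protocol.

```python
import hashlib
from collections import defaultdict

class TunableKeyDerivation:
    """Derive chunk keys from content + duplicate count (sketch)."""
    def __init__(self, t=4):
        self.t = t                          # tunable: copies sharing one key
        self.copies = defaultdict(int)      # content digest -> copies seen

    def derive_key(self, chunk: bytes) -> bytes:
        digest = hashlib.sha256(chunk).digest()
        self.copies[digest] += 1
        bucket = (self.copies[digest] - 1) // self.t
        # Same content and same bucket -> same key, so dedup still works
        # within a bucket; later copies land in new buckets -> distinct
        # ciphertexts, which blunts frequency analysis.
        return hashlib.sha256(digest + bucket.to_bytes(8, "big")).digest()

kd = TunableKeyDerivation(t=2)
keys = [kd.derive_key(b"hello") for _ in range(5)]
print(len(set(keys)))   # -> 3: copies 1-2, 3-4 and 5 use three distinct keys
```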


Kollaps: decentralized and dynamic topology emulation

Authors: Gouveia, Paulo and Neves, João and Segarra, Carlos and Liechti, Luca and Issa, Shady and Schiavoni, Valerio and Matos, Miguel
Keywords: experimental reproducibility, emulation, dynamic network topology, distributed systems, containers

Abstract

The performance and behavior of large-scale distributed applications is highly influenced by network properties such as latency, bandwidth, packet loss, and jitter. For instance, an engineer might need to answer questions such as: What is the impact of an increase in network latency on application response time? How does moving a cluster between geographical regions affect application throughput? What is the impact of network dynamics on application stability? Currently, answering these questions in a systematic and reproducible way is very hard due to the variability of and lack of control over the underlying network. Unfortunately, state-of-the-art network emulation or testbed environments do not scale beyond a single machine or small cluster (e.g., Mininet), are focused exclusively on the control plane (e.g., CrystalNet), or lack support for network dynamics (e.g., Emulab). In this paper, we address these limitations with Kollaps, a fully distributed network emulator. Kollaps hinges on two key observations. First, from an application’s perspective, what matters are the emergent end-to-end properties (e.g., latency, bandwidth, packet loss, and jitter) rather than the internal state of the routers and switches leading to those properties. This premise allows us to build a simpler, dynamically adaptable emulation model that does not require maintaining the full network state. Second, this simplified model is amenable to being maintained in a fully decentralized way, allowing the emulation to scale with the number of machines required by the application. Kollaps is fully decentralized, agnostic of the application language and transport protocol, scales to thousands of processes, and is accurate when compared against a bare-metal deployment or state-of-the-art approaches that emulate the full state of the network. We showcase how Kollaps can accurately reproduce results from the literature and predict the behaviour of a complex unmodified distributed key-value store (i.e., Cassandra) under different deployments.

DOI: 10.1145/3342195.3387540


State-machine replication for planet-scale systems

Authors: Enes, Vitor and Baquero, Carlos and Rezende, Tuanir França and Gotsman, Alexey and Perrin, Matthieu and Sutra, Pierre
Keywords: geo-replication, fault tolerance, consensus

Abstract

Online applications now routinely replicate their data at multiple sites around the world. In this paper we present Atlas, the first state-machine replication protocol tailored for such planet-scale systems. Atlas does not rely on a distinguished leader, so clients enjoy the same quality of service independently of their geographical locations. Furthermore, client-perceived latency improves as we add sites closer to clients. To achieve this, Atlas minimizes the size of its quorums using an observation that concurrent data center failures are rare. It also processes a high percentage of accesses in a single round trip, even when these conflict. We experimentally demonstrate that Atlas consistently outperforms state-of-the-art protocols in planet-scale scenarios. In particular, Atlas is up to two times faster than Flexible Paxos with identical failure assumptions, and more than doubles the performance of Egalitarian Paxos in the YCSB benchmark.

DOI: 10.1145/3342195.3387543
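
The quorum-size observation can be made concrete. In the Atlas paper, quorums have size ⌊n/2⌋ + f, where f is the number of concurrent site failures tolerated, so quorum size grows with f rather than with the number of sites:

```latex
\[
|Q| = \left\lfloor \tfrac{n}{2} \right\rfloor + f
\]
% Example: with n = 5 sites and f = 1, |Q| = 2 + 1 = 3; growing the
% deployment to n = 13 sites with the same f = 1 gives |Q| = 7, i.e.,
% just over half the sites. Adding sites near clients can thus lower
% latency without a proportional increase in quorum cost.
```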


HovercRaft: achieving scalability and fault-tolerance for microsecond-scale datacenter services

Authors: Kogias, Marios and Bugnion, Edouard
Keywords: No keywords

Abstract

Cloud platform services must simultaneously be scalable, meet low tail-latency service-level objectives, and be resilient to a combination of software, hardware, and network failures. Replication plays a fundamental role in meeting both the scalability and the fault-tolerance requirements, but is subject to opposing demands: (1) scalability is typically achieved by relaxing consistency; (2) fault-tolerance is typically achieved through the consistent replication of state machines. Adding nodes to a system can therefore either increase performance at the expense of consistency, or increase resiliency at the expense of performance. We propose HovercRaft, a new approach by which adding nodes increases both the resilience and the performance of general-purpose state-machine replication. We achieve this through an extension of the Raft protocol that carefully eliminates CPU and I/O bottlenecks and load-balances requests. Our implementation uses state-of-the-art kernel-bypass techniques, datacenter transport protocols, and in-network programmability to deliver up to 1 million operations/second for clusters of up to 9 nodes, linear speedup over an unreplicated configuration for selected workloads, and a 4X speedup for the YCSB-E benchmark running on Redis over an unreplicated deployment.

DOI: 10.1145/3342195.3387545


RAIDP: replication with intra-disk parity

Authors: Rosenfeld, Eitan and Zuck, Aviad and Amit, Nadav and Factor, Michael and Tsafrir, Dan
Keywords: No keywords

Abstract

Distributed storage systems often triplicate data to reduce the risk of permanent data loss, thereby tolerating at least two simultaneous disk failures at the price of 2/3 of the capacity. To reduce this price, some systems utilize erasure coding. But this optimization is usually applied only to cold data, because erasure coding might hinder performance for warm data. We propose RAIDP—a new point in the distributed storage design space between replication and erasure coding. RAIDP maintains only two replicas, rather than three or more. It increases durability by utilizing small disk “add-ons” for storing intra-disk erasure codes that are local to the server but fail independently from the disk. By carefully laying out the data, the add-ons allow RAIDP to recover from simultaneous disk failures (add-ons can be stacked to withstand an arbitrary number of failures). RAIDP retains much of the benefit of replication, trading off some performance and availability for substantially reduced storage requirements, networking overheads, and their related costs. We implement RAIDP in HDFS, which triplicates by default. We show that baseline RAIDP achieves performance close to that of HDFS with only two replicas, and performs within 21% of the default triplicating HDFS with an update-oriented variant, while halving the storage and networking overheads and providing similar durability.

DOI: 10.1145/3342195.3387546
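
The parity mechanics behind an intra-disk “add-on” can be sketched in a few lines of Python: a group of data blocks is protected by their bytewise XOR, so any single lost block of the group can be rebuilt from the parity plus the survivors. The grouping and placement here are illustrative assumptions; RAIDP’s layout algorithm is the paper’s contribution.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"\x01\x02", b"\x0f\x00", b"\xff\xaa"]    # blocks on one disk
parity = xor_blocks(data)                         # stored in the add-on

# Reconstruct block 1 after it is lost, from the parity and the survivors:
recovered = xor_blocks([parity, data[0], data[2]])
assert recovered == data[1]
```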


EvenDB: optimizing key-value storage for spatial locality

Authors: Gilad, Eran and Bortnikov, Edward and Braginsky, Anastasia and Gottesman, Yonatan and Hillel, Eshcar and Keidar, Idit and Moscovici, Nurit and Shahout, Rana
Keywords: No keywords

Abstract

Applications of key-value (KV) storage often exhibit high spatial locality, such as when many data items have identical composite key prefixes. This prevalent access pattern is underused by the ubiquitous LSM design underlying today’s high-throughput KV-stores. We present EvenDB, a general-purpose persistent KV-store optimized for spatially-local workloads. EvenDB combines spatial data partitioning with LSM-like batch I/O. It achieves high throughput, ensures consistency under multi-threaded access, and reduces write amplification. In experiments with real-world data from a large analytics platform, EvenDB outperforms the state-of-the-art. E.g., on a 256GB production dataset, EvenDB ingests data 4.4X faster than RocksDB and reduces write amplification by nearly 4X. In traditional YCSB workloads lacking spatial locality, EvenDB is on par with RocksDB and significantly better than other open-source solutions we explored.

DOI: 10.1145/3342195.3387523


Accessible near-storage computing with FPGAs

Authors: Schmid, Robert and Plauth, Max and Wenzel, Lukas and Eberhardt, Felix and Polze, Andreas
Keywords: programming model, near-storage computing, near-data processing, data-flow paradigm, FPGA

Abstract

Data transfers impose a major bottleneck in heterogeneous system architectures. As a mitigation strategy, compute resources can be introduced in places where data occurs naturally. The increased diversity of compute resources in turn affects programming models and the practicalities of software development for near-data compute kernels, and raises the question of how those resources can be made accessible to users and applications. We introduce the Metal FS framework to improve the accessibility of FPGA-based near-storage accelerators: Firstly, we present a near-storage-compute-aware file system that enables self-contained, reusable compute kernels to operate on the granularity of file data streams. Secondly, we provide an integrated build process for FPGA overlay images that starts with the acquisition of compute kernels through a package manager and ultimately allows near-storage compute pipelines to be configured dynamically from them. Thirdly, we integrate the framework into Linux as a file system driver and repurpose Unix pipes as a well-known operating system primitive to orchestrate near-storage compute pipelines.

DOI: 10.1145/3342195.3387557


StRoM: smart remote memory

Authors: Sidler, David and Wang, Zeke and Chiosa, Monica and Kulkarni, Amit and Alonso, Gustavo
Keywords: No keywords

Abstract

Big data applications often incur large costs in I/O, data transfer and copying overhead, especially when operating in cloud environments. Since most such computations are distributed, data processing operations offloaded to the network card (NIC) could potentially reduce the data movement overhead by enabling near-data processing at several points of a distributed system. Following this idea, in this paper we present StRoM, a programmable, FPGA-based RoCE v2 NIC supporting the offloading of application level kernels. These kernels can be used to perform memory access operations directly from the NIC such as traversal of remote data structures as well as filtering or aggregation over RDMA data streams on either the sending or the receiving side. StRoM bypasses the CPU entirely and extends the semantics of RDMA to enable multi-step data access operations and in-network processing of RDMA streams. We demonstrate the versatility and potential of StRoM with four different kernels extending one-sided RDMA commands: 1) Traversal of remote data structures through pointer chasing, 2) Consistent retrieval of remote data blocks, 3) Data shuffling on the NIC by partitioning incoming data to different memory regions or CPU cores, and 4) Cardinality estimation on data streams.

DOI: 10.1145/3342195.3387519


Borg: the next generation

Authors: Tirmazi, Muhammad and Barker, Adam and Deng, Nan and Haque, Md E. and Qin, Zhijing Gene and Hand, Steven and Harchol-Balter, Mor and Wilkes, John
Keywords: data centers, cloud computing

Abstract

This paper analyzes a newly-published trace that covers 8 different Borg [35] clusters for the month of May 2019. The trace enables researchers to explore how scheduling works in large-scale production compute clusters. We highlight how Borg has evolved and perform a longitudinal comparison of the newly-published 2019 trace against the 2011 trace, which has been highly cited within the research community. Our findings show that Borg features such as alloc sets are used for resource-heavy workloads; automatic vertical scaling is effective; job dependencies account for much of the high failure rates reported by prior studies; the workload arrival rate has increased, as has the use of resource over-commitment; the workload mix has changed: jobs have migrated from the free tier into the best-effort batch tier; the workload exhibits an extremely heavy-tailed distribution where the top 1% of jobs consume over 99% of resources; and there is a great deal of variation between different clusters.

DOI: 10.1145/3342195.3387517


AlloX: compute allocation in hybrid clusters

Authors: Le, Tan N. and Sun, Xiao and Chowdhury, Mosharaf and Liu, Zhenhua
Keywords: No keywords

Abstract

Modern deep learning frameworks support a variety of hardware, including CPUs, GPUs, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources - each with a different rate of computation - to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by integrating it with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.

DOI: 10.1145/3342195.3387547
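
The min-cost bipartite matching at AlloX’s core can be illustrated with the classic construction for minimizing total completion time on unrelated machines: each column is a (device, position-from-the-end-of-queue) slot, and placing a job k-th from the end costs k times its runtime on that device. Below is a sketch using SciPy; the runtimes and the omission of AlloX’s fairness mechanism are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Estimated runtimes (minutes) of 3 jobs on 2 interchangeable devices.
runtime = np.array([[10.0, 2.0],    # job 0: big GPU speedup
                    [ 4.0, 3.0],    # job 1: barely benefits from the GPU
                    [ 6.0, 1.0]])   # job 2: big GPU speedup
n_jobs, n_dev = runtime.shape

# Column block k holds "run k-th from the end of the queue" slots,
# costing k * runtime on each device.
cost = np.hstack([k * runtime for k in range(1, n_jobs + 1)])

rows, cols = linear_sum_assignment(cost)     # min-cost bipartite matching
for j, c in zip(rows, cols):
    dev, pos = c % n_dev, c // n_dev + 1
    print(f"job {j} -> {'GPU' if dev else 'CPU'}, position {pos} from queue end")
```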


SEUSS: skip redundant paths to make serverless fast

Authors: Cadden, James and Unger, Thomas and Awad, Yara and Dong, Han and Krieger, Orran and Appavoo, Jonathan
Keywords: No keywords

Abstract

This paper presents a system-level method for achieving the rapid deployment and high-density caching of serverless functions in a FaaS environment. For reduced start times, functions are deployed from unikernel snapshots, bypassing expensive initialization steps. To reduce the memory footprint of snapshots we apply page-level sharing across the entire software stack that is required to run a function. We demonstrate the effects of our techniques by replacing Linux on the compute node of a FaaS platform architecture. With our prototype OS, the deployment time of a function drops from 100s of milliseconds to under 10 ms. Platform throughput improves by 51x on a workload composed entirely of new functions. We are able to cache over 50,000 function instances in memory as opposed to 3,000 using standard OS techniques. In combination, these improvements give the FaaS platform a new ability to handle large-scale bursts of requests.

DOI: 10.1145/3342195.3392698


CSI: inferring mobile ABR video adaptation behavior under HTTPS and QUIC

Authors: Xu, Shichang and Sen, Subhabrata and Mao, Z. Morley
Keywords: No keywords

Abstract

Mobile video streaming services have widely adopted Adaptive Bitrate (ABR) streaming to dynamically adapt the streaming quality to variable network conditions. A wide range of third-party entities such as network providers and testing services need to understand such adaptation behavior for purposes such as QoE monitoring and network management. The traditional approach involved conducting test runs and analyzing the HTTP-level information from the associated network traffic to understand the adaptation behavior under different network conditions. However, end-to-end traffic encryption protocols such as HTTPS and QUIC are being increasingly used by streaming services, hindering such traditional traffic analysis approaches. To address this, we develop CSI (Chunk Sequence Inferencer), a general system that enables third parties to conduct active measurements and infer mobile ABR video adaptation behavior based on the packet size and timing information still available in encrypted traffic. We perform extensive evaluations and demonstrate that CSI achieves high inference accuracy for video encodings of popular streaming services covering various ABR system designs. As an illustration, for a popular mobile video service, we show that CSI can effectively help understand the video QoE implications of network traffic shaping policies and develop optimized policies, even in the presence of encryption.

DOI: 10.1145/3342195.3387558


Mousse: a system for selective symbolic execution of programs with untamed environments

Authors: Liu, Yingtong and Hung, Hsin-Wei and Sani, Ardalan Amiri
Keywords: selective symbolic execution, program environment, program analysis

Abstract

Selective symbolic execution (SSE) is a powerful program analysis technique for exploring multiple execution paths of a program. However, it faces a challenge in analyzing programs with environments that can be neither modeled nor virtualized. Examples include OS services managing I/O devices, software frameworks for accelerators, and specialized applications. We introduce Mousse, a system for analyzing such programs using SSE. Mousse uses novel solutions to overcome the above challenge. These include a novel process-level SSE design, environment-aware concurrent execution, and distributed execution of program paths. We use Mousse to comprehensively analyze five OS services in three smartphones. We perform bug and vulnerability detection, taint analysis, and performance profiling. Our evaluation shows that Mousse outperforms alternative solutions in terms of performance and coverage.

DOI: 10.1145/3342195.3387556


Don’t shoot down TLB shootdowns!

Authors: Amit, Nadav and Tai, Amy and Wei, Michael
Keywords: No keywords

Abstract

Translation Lookaside Buffers (TLBs) are critical for building performant virtual memory systems. Because most processors do not provide coherence for TLB mappings, TLB shootdowns provide a software mechanism that invokes inter-processor interrupts (IPIs) to synchronize TLBs. TLB shootdowns are expensive, so recent work has aimed to reduce the frequency of shootdowns through techniques such as batching. We show that aggressive batching can cause correctness issues and that addressing them can obviate the benefits of batching. Instead, our work takes a different approach that focuses both on improving the performance of TLB shootdowns and on carefully selecting where to avoid shootdowns. We introduce four general techniques to improve shootdown performance: (1) concurrently flushing initiator and remote TLBs, (2) early acknowledgement from remote cores, (3) cacheline consolidation of kernel data structures to reduce cacheline contention, and (4) in-context flushing of userspace entries to address the overheads introduced by Spectre and Meltdown mitigations. We also identify that TLB flushing can be avoided when handling copy-on-write (CoW) faults and that some TLB shootdowns can be batched in certain system calls. Overall, we show that our approach results in significant speedups without sacrificing safety and correctness in both microbenchmarks and real-world applications.

DOI: 10.1145/3342195.3387518


BinRec: dynamic binary lifting and recompilation

Authors: Altinay, Anil and Nash, Joseph and Kroes, Taddeus and Rajasekaran, Prabhu and Zhou, Dixin and Dabrowski, Adrian and Gens, David and Na, Yeoul and Volckaert, Stijn and Giuffrida, Cristiano and Bos, Herbert and Franz, Michael
Keywords: No keywords

Abstract

Binary lifting and recompilation allow a wide range of install-time program transformations, such as security hardening, deobfuscation, and reoptimization. Existing binary lifting tools are based on static disassembly and thus have to rely on heuristics to disassemble binaries. In this paper, we present BinRec, a new approach to heuristic-free binary recompilation which lifts dynamic traces of a binary to a compiler-level intermediate representation (IR) and lowers the IR back to a “recovered” binary. This enables BinRec to apply rich program transformations, such as compiler-based optimization passes, on top of the recovered representation. We identify and address a number of challenges in binary lifting, including unique challenges posed by our dynamic approach. In contrast to existing frameworks, our dynamic frontend can accurately disassemble and lift binaries without heuristics, and we can successfully recover obfuscated code and all SPEC INT 2006 benchmarks including C++ applications. We evaluate BinRec in three application domains: i) binary reoptimization, ii) deobfuscation (by recovering partial program semantics from virtualization-obfuscated code), and iii) binary hardening (by applying existing compiler-level passes such as AddressSanitizer and SafeStack on binary code).

DOI: 10.1145/3342195.3387550


An HTM-based update-side synchronization for RCU on NUMA systems

Authors: Park, SeongJae and McKenney, Paul E. and Dufour, Laurent and Yeom, Heon Y.
Keywords: transactional memory, synchronization, parallel programming, operating systems, RCU

Abstract

Read-copy update (RCU) can provide ideal scalability for read-mostly workloads, but some believe that it provides only poor performance for updates. This belief is due to the lack of RCU-centric update synchronization mechanisms. RCU instead works with a range of update-side mechanisms, such as locking. In fact, many developers embrace simplicity by using global locking. Logging, hardware transactional memory, or fine-grained locking can provide better scalability, but each of these approaches has limitations, such as imposing overhead on readers or poor scalability on non-uniform memory access (NUMA) systems, mainly due to their lack of NUMA-aware design principles. This paper introduces an RCU extension (RCX) that provides highly scalable RCU updates on NUMA systems while retaining RCU’s read-side benefits. RCX is a software-based synchronization mechanism combining hardware transactional memory (HTM) and traditional locking based on our NUMA-aware design principles for RCU. Microbenchmarks on a NUMA system with 144 hardware threads show RCX has up to 22.6 times better performance and up to 145 times lower HTM abort rates compared to a state-of-the-art RCU/HTM combination. To demonstrate the effectiveness and applicability of RCX, we have applied it to parallelize parts of the Linux kernel memory management system and an in-memory database system. The optimized kernel and database show up to 24 and 17 times better performance, respectively, compared to the original versions.

DOI: 10.1145/3342195.3387527


Keystone: an open framework for architecting trusted execution environments

Authors: Lee, Dayeol and Kohlbrenner, David and Shinde, Shweta and Asanović, Krste and Song, Dawn
Keywords: trusted execution environment, side-channel attack, secure enclave, open source, memory isolation, hardware root of trust, hardware enclave, RISC-V

Abstract

Trusted execution environments (TEEs) see rising use in devices from embedded sensors to cloud servers and encompass a range of cost, power constraints, and security threat model choices. On the other hand, each of the current vendor-specific TEEs makes a fixed set of trade-offs with little room for customization. We present Keystone—the first open-source framework for building customized TEEs. Keystone uses simple abstractions provided by the hardware such as memory isolation and a programmable layer underneath untrusted components (e.g., OS). We build reusable TEE core primitives from these abstractions while allowing platform-specific modifications and flexible feature choices. We showcase how Keystone-based TEEs run on unmodified RISC-V hardware and demonstrate the strengths of our design in terms of security, TCB size, execution of a range of benchmarks, applications, kernels, and deployment models.

DOI: 10.1145/3342195.3387532
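
To make the "reusable primitives" idea concrete, here is a hypothetical host-side lifecycle in C. Every type and function below is invented for illustration; it is not the actual Keystone SDK (which is C++), only a sketch of the layering the abstract describes: untrusted host, security monitor, per-enclave runtime, application.

```c
/* Hypothetical interface sketch, NOT Keystone's real API. */
#include <stddef.h>

typedef struct enclave enclave_t;   /* opaque handle, hypothetical */

/* The security monitor isolates a physical memory region for the enclave;
 * the runtime loaded under the eapp supplies only the features this
 * platform needs (paging, syscall handling, ...), keeping the TCB small. */
enclave_t *enclave_create(const char *eapp_binary,
                          const char *runtime_binary,
                          size_t isolated_mem_bytes);
int  enclave_run(enclave_t *e);      /* enter the enclave; returns its exit code */
void enclave_destroy(enclave_t *e);  /* wipe and release the isolated region */

int demo(void)
{
    enclave_t *e = enclave_create("hello-eapp", "minimal-runtime", 1 << 24);
    if (!e) return 1;
    int ret = enclave_run(e);
    enclave_destroy(e);
    return ret;
}
```

Swapping `"minimal-runtime"` for a richer one is the kind of platform-specific customization a fixed vendor TEE does not allow.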


Oblivious coopetitive analytics using hardware enclaves

作者: Dave, Ankur and Leung, Chester and Popa, Raluca Ada and Gonzalez, Joseph E. and Stoica, Ion
关键词: No keywords

Abstract

Coopetitive analytics refers to cooperation among competing parties to run queries over their joint data. Regulatory, business, and liability concerns prevent these organizations from sharing their sensitive data in plaintext. We propose Oblivious Coopetitive Queries (OCQ), an efficient, general framework for oblivious coopetitive analytics using hardware enclaves. OCQ builds on Opaque, a Spark-based framework for secure distributed analytics, to execute coopetitive queries using hardware enclaves in a decentralized manner. Its query planner chooses how and where to execute each relational operator so as to prevent data leakage through side channels such as memory access patterns, network traffic statistics, and cardinality, while minimizing overhead. We implemented OCQ as an extension to Apache Spark SQL. We find that OCQ is up to 9.9x faster than Opaque, a state-of-the-art secure analytics framework that outsources all data and computation to an enclave-enabled cloud, and up to 219x faster than implementing the same analytics with AgMPC, a state-of-the-art secure multi-party computation framework.

DOI: 10.1145/3342195.3387552
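
One low-level ingredient behind hiding cardinality is worth seeing in code: an oblivious operator always produces (and transmits) a fixed, publicly known number of rows, regardless of the real result size. A small C sketch, assuming an invented row layout and bound; OCQ's actual planner and operators are far more involved.

```c
/* Pad an operator's output to a public worst-case size so that result
 * cardinality does not leak through memory or network traces. */
#include <string.h>

#define MAX_ROWS 4096   /* public upper bound, independent of the data */

typedef struct { int key; int val; int is_dummy; } row_t;

void pad_output(const row_t *real, int n, row_t out[MAX_ROWS])
{
    for (int i = 0; i < MAX_ROWS; i++) {
        if (i < n) {
            out[i] = real[i];
            out[i].is_dummy = 0;
        } else {                       /* fill with dummy rows; the receiver */
            memset(&out[i], 0, sizeof out[i]);   /* filters them obliviously */
            out[i].is_dummy = 1;
        }
    }
}
```

A fully oblivious implementation would also avoid data-dependent branches inside the loop; the sketch only illustrates the fixed-size output.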


Accelerating winograd convolutions using symbolic computation and meta-programming

作者: Mazaheri, Arya and Beringer, Tim and Moskewicz, Matthew and Wolf, Felix and Jannesari, Ali
关键词: winograd convolution, symbolic computation, meta-programming, deep learning

Abstract

Convolution operations are essential constituents of convolutional neural networks. Their efficient and performance-portable implementation demands tremendous programming effort and fine-tuning. Winograd’s minimal filtering algorithm is a well-known method to reduce the computational complexity of convolution operations. Unfortunately, existing implementations of this algorithm are either vendor-specific or hard-coded to support a small subset of convolutions, thus limiting their versatility and performance portability. In this paper, we propose a novel method to optimize Winograd convolutions based on symbolic computation. Taking advantage of meta-programming and auto-tuning, we further introduce a system to automate the generation of efficient and portable Winograd convolution code for various GPUs. We show that our optimization technique can effectively exploit repetitive patterns, enabling us to reduce the number of arithmetic operations by up to 62% without compromising numerical stability. Moreover, our experiments demonstrate that we can generate efficient kernels with runtimes close to those of deep-learning libraries, requiring only minimal programming effort, which confirms the performance portability of our approach.

DOI: 10.1145/3342195.3387549
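
To see where the arithmetic savings come from, consider the smallest case, F(2,3): two outputs of a 1-D convolution with a 3-tap filter in four multiplications instead of six, where the filter-side transform can be precomputed and reused across tiles. This repetitive, symbolically derivable structure is exactly what such optimizers exploit. A self-contained C version of the classic transform (the generated GPU kernels are, of course, far more elaborate):

```c
#include <stdio.h>

/* Winograd F(2,3): 4 multiplications instead of 6 for two outputs. */
void winograd_f2x3(const float d[4], const float g[3], float y[2])
{
    /* filter transform: computable once per filter and reused */
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];

    /* input transform + the 4 element-wise multiplies */
    float m0 = (d[0] - d[2]) * u0;
    float m1 = (d[1] + d[2]) * u1;
    float m2 = (d[2] - d[1]) * u2;
    float m3 = (d[1] - d[3]) * u3;

    /* output transform */
    y[0] = m0 + m1 + m2;   /* = d0*g0 + d1*g1 + d2*g2 */
    y[1] = m1 - m2 - m3;   /* = d1*g0 + d2*g1 + d3*g2 */
}

int main(void)
{
    const float d[4] = { 1, 2, 3, 4 }, g[3] = { 1, 0, -1 };
    float y[2];
    winograd_f2x3(d, g, y);
    printf("winograd: [%g, %g], direct: [%g, %g]\n", y[0], y[1],
           d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
           d[1]*g[0] + d[2]*g[1] + d[3]*g[2]);
    return 0;
}
```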


Env2Vec: accelerating VNF testing with deep learning

作者: Piao, Guangyuan and Nicholson, Patrick K. and Lugones, Diego
关键词: No keywords

Abstract

The adoption of fast-paced practices for developing virtual network functions (VNFs) allows for continuous software delivery and creates a market advantage for network operators. This adoption, however, is problematic for testing engineers who need to assure, within shorter development cycles, the quality of highly-configurable product releases running on heterogeneous clouds. Machine learning (ML) can accelerate testing workflows by detecting performance issues in new software builds. However, the overhead of maintaining separate models for all combinations of build types, network configurations, and other stack parameters can quickly become prohibitive and make the application of ML infeasible. We propose Env2Vec, a deep learning architecture that combines contextual features with historical resource usage and characterizes the various stack parameters that influence test execution within an embedding space, which allows it to generalize model predictions to previously unseen environments. We integrate a single ML model into the testing workflow to automatically debug errors and pinpoint performance bottlenecks. Results obtained with real testing data show an accuracy between 86.2% and 100%, while reducing the false-alarm rate by 20.9%-38.1% when reporting performance issues, compared to state-of-the-art approaches.

DOI: 10.1145/3342195.3387525
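
The core trick is representational: instead of one model per environment, each categorical stack parameter is mapped to a learned embedding vector that is combined with historical resource usage and fed to a single model. A minimal C sketch of inference under that scheme; dimensions, the linear scorer, and all names are assumptions for illustration, not the paper's network.

```c
#include <stdio.h>

#define EMB_DIM  8   /* size of each learned environment embedding */
#define N_USAGE  4   /* e.g., mean CPU, memory, disk, network usage */
#define N_ENVS   3   /* known build/config combinations */

/* trained offline; zeroed here only so the sketch compiles */
static const float embedding_table[N_ENVS][EMB_DIM] = { { 0 } };
static const float weights[EMB_DIM + N_USAGE] = { 0 };

/* Concatenate the environment's embedding with its historical resource
 * usage, then score with a single shared model (a linear stand-in here). */
float anomaly_score(int env_id, const float usage[N_USAGE])
{
    float x[EMB_DIM + N_USAGE], s = 0.0f;
    for (int i = 0; i < EMB_DIM; i++) x[i] = embedding_table[env_id][i];
    for (int i = 0; i < N_USAGE; i++) x[EMB_DIM + i] = usage[i];
    for (int i = 0; i < EMB_DIM + N_USAGE; i++) s += weights[i] * x[i];
    return s;   /* higher = more likely a performance issue */
}

int main(void)
{
    const float usage[N_USAGE] = { 0.7f, 0.5f, 0.1f, 0.3f };
    printf("score: %f\n", anomaly_score(1, usage));
    return 0;
}
```

Because unseen environments can still land near similar ones in the embedding space, predictions generalize without training a new model per combination.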


PLASMA: programmable elasticity for stateful cloud computing applications

作者: Sang, Bo and Roman, Pierre-Louis and Eugster, Patrick and Lu, Hui and Ravi, Srivatsan and Petri, Gustavo
关键词: No keywords

Abstract

Developers are always on the lookout for simple solutions to manage their applications on cloud platforms. Major cloud providers already offer automatic elasticity management solutions (e.g., AWS Lambda, Azure Durable Functions) to users. However, many cloud applications are stateful — while executing, functions need to share their state with others. Providing elasticity for such stateful functions is much more challenging, as a deployment/elasticity decision for a stateful entity can strongly affect others in ways that are hard to predict without any application knowledge. Existing solutions either only support stateless applications (e.g., AWS Lambda) or provide only limited elasticity management (e.g., Azure Durable Functions) to stateful applications. PLASMA (Programmable Elasticity for Stateful Cloud Computing Applications) is a programming framework for elastic stateful cloud applications. It includes (1) an elasticity programming language, a second “level” of programming (complementing the main application programming language) for describing elasticity behavior, and (2) a novel semantics-aware elasticity management runtime that tracks program execution and acts upon application features as suggested by the elasticity behavior. We have implemented 10+ applications with PLASMA. Extensive evaluation on Amazon AWS shows that PLASMA significantly improves their efficiency, e.g., achieving the same performance as a vanilla setup with 25% fewer resources, or improving performance by 40% compared to the default setup.

DOI: 10.1145/3342195.3387553
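
The "second level of programming" is the interesting part: elasticity behavior is written separately from application logic and interpreted by the runtime. A hypothetical C rendering of one such rule; PLASMA defines its own language, so every type and name below is invented purely to illustrate the shape of the idea.

```c
/* Hypothetical elasticity rule for a stateful entity (e.g., an actor class).
 * The runtime samples per-entity metrics and asks the rule what to do. */
typedef struct {
    double cpu_util;    /* recent CPU utilization of this entity, 0..1 */
    int    queue_len;   /* pending requests */
    int    replicas;    /* current replica count */
} entity_stats_t;

typedef enum { KEEP, SCALE_OUT, SCALE_IN } elasticity_action_t;

elasticity_action_t cart_rule(const entity_stats_t *s)
{
    if (s->queue_len > 100 && s->cpu_util > 0.8)
        return SCALE_OUT;   /* add a replica; the runtime migrates state */
    if (s->replicas > 1 && s->cpu_util < 0.2)
        return SCALE_IN;    /* consolidate to free resources */
    return KEEP;
}
```

Because the runtime is semantics-aware, it can act on such hints without the application developer hand-coding migration or replication logic.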


Analyzing system performance with probabilistic performance annotations

作者: Rogora, Daniele and Carzaniga, Antonio and Diwan, Amer and Hauswirth, Matthias and Soulé, Robert
关键词: performance analysis, performance, instrumentation, dynamic analysis

Abstract

To understand, debug, and predict the performance of complex software systems, we develop the concept of probabilistic performance annotations. In essence, we annotate components (e.g., methods) with a relation between a measurable performance metric, such as running time, and one or more features of the input or the state of that component. We use two forms of regression analysis: regression trees and mixture models. Such relations can capture non-trivial behaviors beyond the more classic algorithmic complexity of a component. We present a method to derive such annotations automatically by generalizing observed measurements. We illustrate the use of our approach on three complex systems—the ownCloud distributed storage service; the MySQL database system; and the x264 video encoder library and application—producing non-trivial characterizations of their performance. Notably, we isolate a performance regression and identify the root cause of a second performance bug in MySQL.

DOI: 10.1145/3342195.3387554
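
A derived annotation is easiest to grasp once materialized: a regression tree splits the input-feature space into regimes, and each leaf carries a distribution of the measured metric rather than a single constant. Below is a C sketch of what such an annotation could look like for a hypothetical `encode_frame` component; the split point and numbers are invented for illustration.

```c
#include <stdio.h>

typedef struct { double mean_us; double stddev_us; } perf_leaf_t;

/* Probabilistic annotation for encode_frame(pixels): two regression-tree
 * leaves relating an input feature (frame size) to running time. */
perf_leaf_t encode_frame_annotation(long pixels)
{
    if (pixels <= 640L * 480L)                     /* small-frame regime */
        return (perf_leaf_t){ .mean_us = 1200.0, .stddev_us = 150.0 };
    return (perf_leaf_t){ .mean_us = 0.004 * pixels,   /* linear regime */
                          .stddev_us = 0.001 * pixels };
}

int main(void)
{
    perf_leaf_t p = encode_frame_annotation(1920L * 1080L);
    printf("expected %.0f us (stddev %.0f us)\n", p.mean_us, p.stddev_us);
    return 0;
}
```

An observed run far outside the leaf's distribution is exactly the kind of deviation that flags a regression.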



