DAMYSUS: streamlined BFT consensus leveraging trusted components
Authors: Decouchant, Jérémie and Kozhaya, David and Rahli, Vincent and Yu, Jiangshan
Keywords: consensus, fault tolerance, trusted component
Abstract
Recently, streamlined Byzantine Fault Tolerant (BFT) consensus protocols, such as HotStuff, have been proposed as a means to circumvent the inefficient view-changes of traditional BFT protocols, such as PBFT. Several works have detailed trusted components, and BFT protocols that leverage them to tolerate a minority of faulty nodes and use a reduced number of communication rounds. Inspired by these works, we identify two basic trusted services, respectively called the Checker and Accumulator services, which can be leveraged by streamlined protocols. Based on these services, we design Damysus, a streamlined protocol that improves upon HotStuff’s resilience and uses fewer communication rounds. In addition, we show how the Checker and Accumulator services can be adapted to develop Chained-Damysus, a chained version of Damysus where operations are pipelined for efficiency. We prove the correctness of Damysus and Chained-Damysus, and evaluate their performance, showcasing their superiority compared to previous protocols.
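To make the flavor of such trusted services concrete, below is a minimal Python sketch of a trusted, equivocation-preventing certifier in the spirit of an Accumulator/Checker component; the class name, interface, and HMAC-based signing are illustrative assumptions, not Damysus's actual design.

```python
import hashlib
import hmac

class TrustedCertifier:
    """Illustrative trusted component: certifies at most one value per view.

    A hypothetical sketch of the kind of service streamlined BFT protocols can
    rely on to cut rounds and tolerate a minority of faults; NOT the actual
    Damysus Checker/Accumulator interface.
    """

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._last_view = -1  # monotonically increasing view number

    def certify(self, view: int, digest: bytes) -> bytes:
        # Refusing to certify the same or an older view is what prevents a
        # compromised replica from equivocating (sending conflicting values).
        if view <= self._last_view:
            raise ValueError("view already certified: possible equivocation")
        self._last_view = view
        msg = view.to_bytes(8, "big") + digest
        return hmac.new(self._key, msg, hashlib.sha256).digest()

# Usage: a replica asks its trusted component to vouch for a proposal.
certifier = TrustedCertifier(b"per-replica secret provisioned inside the TEE")
proof = certifier.certify(view=1, digest=hashlib.sha256(b"block-1").digest())
print(proof.hex())
```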
State machine replication scalability made simple
Authors: Stathakopoulou, Chrysoula and Pavlovic, Matej and Vukolić, Marko
Keywords: Byzantine fault tolerance, consensus, scalability, state machine replication
Abstract
Consensus, state machine replication (SMR) and total order broadcast (TOB) protocols are notorious for being poorly scalable with the number of participating nodes. Despite the recent race to reduce overall message complexity of leader-driven SMR/TOB protocols, scalability remains poor and the throughput is typically inversely proportional to the number of nodes. We present Insanely Scalable State Machine Replication, a generic construction to turn leader-driven protocols into scalable multi-leader ones. For our scalable SMR construction we use a novel primitive called Sequenced (Total Order) Broadcast (SB) which we wrap around PBFT, HotStuff and Raft leader-driven protocols to make them scale. Our construction is general enough to accommodate most leader-driven ordering protocols (BFT or CFT) and make them scale. Our implementation improves the peak throughput of PBFT, HotStuff, and Raft by 37x, 56x, and 55x, respectively, at a scale of 128 nodes.
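As a rough illustration of how a multi-leader construction can partition the ordering work, here is a small Python sketch that rotates request buckets among leaders each epoch; the bucket/epoch scheme and names are assumptions for illustration, not the exact Sequenced Broadcast algorithm.

```python
def assign_buckets(num_leaders: int, num_buckets: int, epoch: int) -> dict:
    """Toy multi-leader partitioning: each leader orders a disjoint set of
    request buckets, and the assignment rotates every epoch so no bucket is
    stuck behind a slow or faulty leader forever."""
    assignment = {leader: [] for leader in range(num_leaders)}
    for bucket in range(num_buckets):
        leader = (bucket + epoch) % num_leaders
        assignment[leader].append(bucket)
    return assignment

# Each leader then drives its own instance of an ordering protocol (e.g. PBFT,
# HotStuff, or Raft) over its buckets; the global log interleaves the results.
print(assign_buckets(num_leaders=4, num_buckets=8, epoch=0))
print(assign_buckets(num_leaders=4, num_buckets=8, epoch=1))  # rotated
```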
Narwhal and Tusk: a DAG-based mempool and efficient BFT consensus
Authors: Danezis, George and Kokoris-Kogias, Lefteris and Sonnino, Alberto and Spiegelman, Alexander
Keywords: Byzantine fault tolerant, consensus protocol
Abstract
We propose separating the task of reliable transaction dissemination from transaction ordering, to enable high-performance Byzantine fault-tolerant quorum-based consensus. We design and evaluate a mempool protocol, Narwhal, specializing in high-throughput reliable dissemination and storage of causal histories of transactions. Narwhal tolerates an asynchronous network and maintains high performance despite failures. Narwhal is designed to easily scale out using multiple workers at each validator, and we demonstrate that there is no foreseeable limit to the throughput we can achieve. Composing Narwhal with a partially synchronous consensus protocol (Narwhal-HotStuff) yields significantly better throughput even in the presence of faults or intermittent loss of liveness due to asynchrony. However, loss of liveness can result in higher latency. To achieve overall good performance when faults occur, we design Tusk, a zero-message overhead asynchronous consensus protocol, to work with Narwhal. We demonstrate its high performance under a variety of configurations and faults. As a summary of results, on a WAN, Narwhal-HotStuff achieves over 130,000 tx/sec at less than 2-sec latency compared with 1,800 tx/sec at 1-sec latency for HotStuff. Additional workers increase throughput linearly to 600,000 tx/sec without any latency increase. Tusk achieves 160,000 tx/sec with about 3 seconds latency. Under faults, both protocols maintain high throughput, but Narwhal-HotStuff suffers from increased latency.
Building an efficient key-value store in a flexible address space
Authors: Chen, Chen and Zhong, Wenshao and Wu, Xingbo
Keywords: address space, key-value store, storage
Abstract
Data management applications store their data using structured files in which data are usually sorted to serve indexing and queries. However, in-place insertions and removals of data are not naturally supported in a file’s address space. To avoid repeatedly rewriting existing data in a sorted file to admit changes in place, applications usually employ extra layers of indirections, such as mapping tables and logs, to admit changes out of place. However, this approach leads to increased access cost and excessive complexity. This paper presents a novel storage abstraction that provides a flexible address space, where in-place updates of arbitrary-sized data, such as insertions and removals, can be performed efficiently. With these mechanisms, applications can manage sorted data in a linear address space with minimal complexity. Extensive evaluations show that a key-value store built on top of it can achieve high performance and efficiency with a simple implementation.
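To illustrate what a flexible address space buys an application, here is a toy Python emulation of a byte space that admits in-place insertions and removals of arbitrary-sized ranges; the API names are hypothetical, and the real system achieves this without copying data, unlike this sketch.

```python
class FlexibleAddressSpace:
    """Toy emulation of a linear address space supporting in-place inserts and
    removals, so a sorted file can admit changes without being rewritten.
    Illustrative only; not the paper's actual interface or implementation."""

    def __init__(self):
        self._buf = bytearray()

    def insert(self, offset: int, data: bytes) -> None:
        # Bytes after `offset` logically shift right to make room.
        self._buf[offset:offset] = data

    def remove(self, offset: int, length: int) -> None:
        del self._buf[offset:offset + length]

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._buf[offset:offset + length])

# A sorted run stays sorted after an in-place insertion in the middle.
space = FlexibleAddressSpace()
space.insert(0, b"aaa;ccc;")
space.insert(4, b"bbb;")  # admit "bbb" between "aaa" and "ccc"
assert space.read(0, 12) == b"aaa;bbb;ccc;"
print(space.read(0, 12))
```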
Rolis: a software approach to efficiently replicating multi-core transactions
Authors: Shen, Weihai and Khanna, Ansh and Angel, Sebastian and Sen, Siddhartha and Mu, Shuai
Keywords: concurrency, distributed systems, multicore
Abstract
This paper presents Rolis, a new speedy and fault-tolerant replicated multi-core transactional database system. Rolis’s aim is to mask the high cost of replication by ensuring that cores are always doing useful work and not waiting for each other or for other replicas. Rolis achieves this by not mixing the multi-core concurrency control with multi-machine replication, as is traditionally done by systems that use Paxos to replicate the transaction commit protocol. Instead, Rolis takes an “execute-replicate-replay” approach. Rolis first speculatively executes the transaction on the leader machine, and then replicates the per-thread transaction log to the followers using a novel protocol that leverages independent Paxos instances to avoid coordination, while still allowing followers to safely replay. The execution, replication, and replay are carefully designed to be scalable and have nearly zero coordination overhead across cores. Our evaluation shows that Rolis can achieve 1.03M TPS (transactions per second) on the TPC-C workload, using a 3-replica setup where each server has 32 cores. This throughput result is orders of magnitude higher than traditional software approaches we tested (e.g., 2PL), and is comparable to state-of-the-art, fault-tolerant, in-memory storage systems built using kernel bypass and advanced networking hardware, even though Rolis runs on commodity machines.
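The following Python sketch illustrates the per-thread logging idea behind an execute-replicate-replay design: each worker thread appends to its own log, which would be replicated through its own independent consensus instance, so threads never coordinate on the replication path. Names and structure are illustrative assumptions, not Rolis's implementation.

```python
from collections import defaultdict

class PerThreadLogs:
    """Toy model of per-thread transaction logs on the leader: one ordered log
    per worker thread, each shipped through an independent Paxos-like instance
    (the replication itself is omitted here)."""

    def __init__(self) -> None:
        self.logs = defaultdict(list)      # thread id -> [(seq, txn), ...]
        self.next_seq = defaultdict(int)

    def append(self, tid: int, txn: str):
        self.next_seq[tid] += 1
        entry = (self.next_seq[tid], txn)
        self.logs[tid].append(entry)       # in a real system: replicate this entry
        return entry

def replay(logs):
    # Followers replay each thread's log independently; any cross-thread order
    # needed for correctness would come from dependencies recorded at execution
    # time, which this sketch omits.
    return {tid: [txn for _, txn in entries] for tid, entries in logs.items()}

leader = PerThreadLogs()
leader.append(0, "T1: debit(A, 10)")
leader.append(1, "T2: credit(B, 10)")
print(replay(leader.logs))
```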
Tebis: index shipping for efficient replication in LSM key-value stores
Authors: Vardoulakis, Michalis and Saloustros, Giorgos and González-Férez, Pilar and Bilas, Angelos
Keywords: B+ tree, LSM tree, RDMA, flash, key-value stores
Abstract
Key-value (KV) stores based on LSM trees have become a foundational layer in the storage stack of datacenters and cloud services. Current approaches for achieving reliability and availability favor reducing network traffic and send only new KV pairs to replicas. As a result, they perform costly compactions to reorganize data in both the primary and backup nodes, which increases device I/O traffic and CPU overhead, and eventually hurts overall system performance. In this paper we describe Tebis, an efficient LSM-based KV store that reduces I/O amplification and CPU overhead for maintaining the replica index. We use a primary-backup replication scheme that performs compactions only on the primary nodes and sends pre-built indexes to backup nodes, avoiding all compactions in backup nodes. Our approach includes an efficient mechanism to deal with pointer translation across nodes in the pre-built region index. Our results show that Tebis reduces pressure on backup nodes compared to performing full compactions: throughput is increased by 1.1–1.48×.
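The pointer-translation step mentioned above can be pictured with a small Python sketch: pointers recorded as offsets on the primary's device must be rewritten into the offsets where the backup placed the same data before the shipped index is usable locally. This is a purely illustrative model, not Tebis's actual mechanism or data layout.

```python
def translate_index(primary_index: dict, offset_map: dict) -> dict:
    """Rewrite an index whose values are primary-device offsets into an index
    whose values are the corresponding backup-device offsets."""
    return {key: offset_map[off] for key, off in primary_index.items()}

# Index built by a compaction on the primary: key -> offset on primary device.
primary_index = {"k001": 4096, "k002": 8192, "k003": 12288}
# Mapping recorded while the backup stored the received data segments.
offset_map = {4096: 120_000, 8192: 124_096, 12288: 128_192}

backup_index = translate_index(primary_index, offset_map)
print(backup_index)  # usable on the backup without re-running the compaction
```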
Sharing is caring: secure and efficient shared memory support for MVEEs
Authors: Vinck, Jonas and Abrath, Bert and Coppens, Bart and Voulimeneas, Alexios and De Sutter, Bjorn and Volckaert, Stijn
Keywords: OS, security, shared memory
Abstract
Multi-Variant Execution Environments (MVEEs) are a powerful tool for protecting legacy software against memory corruption attacks. MVEEs employ software diversity to run multiple variants of the same program in lockstep, whilst providing them with the same inputs and comparing their behavior. Well-constructed variants will behave equivalently under normal operating conditions but diverge when under attack. The MVEE detects these divergences and takes action before compromised variants can damage the host system. Existing MVEEs replicate inputs at the system call boundary, and therefore do not support programs that use shared-memory IPC with other processes, since shared memory pages can be read from and written to directly without system calls. We analyzed modern applications, ranging from web servers, over media players, to browsers, and observed that they rely heavily on shared memory, in some cases for their basic functioning and in other cases for enabling more advanced functionality. It follows that modern applications cannot enjoy the security provided by MVEEs unless those MVEEs support shared-memory IPC. This paper first identifies the requirements for supporting shared-memory IPC in an MVEE. We propose a design that involves techniques to identify and instrument accesses to shared memory pages, as well as techniques to replicate I/O through shared-memory IPC. We implemented these techniques in a prototype MVEE and report our findings through an evaluation of a range of benchmark programs. Our contributions enable the use of MVEEs on a far wider range of programs than previously supported. By overcoming one of the major remaining limitations of MVEEs, our contributions can help to bolster their real-world adoption.
Hardening binaries against more memory errors
Authors: Duck, Gregory J. and Zhang, Yuntong and Yap, Roland H. C.
Keywords: binary hardening, binary instrumentation, buffer overflows, low-fat pointers, memory errors, memory safety, redzones, static binary rewriting, use-after-free
Abstract
Memory errors, such as buffer overflows and use-after-free, remain the root cause of many security vulnerabilities in modern software. The use of closed source software further exacerbates the problem, as source-based memory error mitigation cannot be applied. While many memory error detection tools exist, most are based on a single error detection methodology with resulting known limitations, such as incomplete memory error detection (redzones) or false error detections (low-fat pointers). In this paper we introduce RedFat, a memory error hardening tool for stripped binaries that is fast, practical and scalable. The core idea behind RedFat is to combine complementary error detection methodologies—redzones and low-fat pointers—in order to detect more memory errors than can be detected by each individual methodology alone. However, complementary error detection also inherits the limitations of each approach, such as false error detections from low-fat pointers. To mitigate this, we introduce a profile-based analysis that automatically determines the strongest memory error protection possible without negative side effects. We implement RedFat on top of a scalable binary rewriting framework, and demonstrate low overheads compared to the current state-of-the-art. We show RedFat to be language agnostic on C/C++/Fortran binaries with minimal requirements, and to work with stripped binaries for both position-independent and position-dependent code. We also show that the RedFat instrumentation can scale to very large and complex binaries, such as Google Chrome.
PKRU-safe: automatically locking down the heap between safe and unsafe languages
Authors: Kirth, Paul and Dickerson, Mitchel and Crane, Stephen and Larsen, Per and Dabrowski, Adrian and Gens, David and Na, Yeoul and Volckaert, Stijn and Franz, Michael
Keywords: MPK, compartmentalization, compilers, security
Abstract
After more than twenty-five years of research, memory safety violations remain one of the major causes of security vulnerabilities in real-world software. Memory-safe languages, like Rust, have demonstrated that compiler technology can assist developers in writing efficient low-level code without the risk of memory corruption. However, many memory-safe languages still have to interface with unsafe code to some extent, which opens up the possibility for attackers to exploit memory-corruption vulnerabilities in the unsafe part of the system and subvert the safety guarantees provided by the memory-safe language. In this paper, we present PKRU-Safe, an automated method for enforcing the principle of least privilege on unsafe components in mixed-language environments. PKRU-Safe ensures that unsafe (external) code cannot corrupt or otherwise abuse memory used exclusively by the safe-language components. Our approach is automated using traditional compiler infrastructure to limit memory accesses for developer-designated components efficiently. PKRU-Safe does not require any modifications to the program’s original data flows or execution model. It can be adopted by projects containing legacy code with minimal effort, requiring only a small number of changes to a project’s build files and dependencies, and a few lines of annotations for each untrusted library. We apply PKRU-Safe to Servo, one of the largest Rust projects with approximately two million lines of Rust code (including dependencies) to automatically partition and protect the browser’s heap from its JavaScript engine written in unsafe C/C++. Our detailed evaluation shows that PKRU-Safe is able to thwart real-world exploits, often without measurable overhead, and with a mean overhead under 11.55% in our most pessimistic benchmark suite. As the method is language agnostic and major prototype components operate directly on LLVM IR, applying our techniques to other languages is straightforward.
KASLR in the age of MicroVMs
Authors: Holmes, Benjamin and Waterman, Jason and Williams, Dan
Keywords: KASLR, MicroVM, operating systems, security, virtual machines
Abstract
Address space layout randomization (ASLR) is a widely used component of computer security aimed at preventing code reuse and/or data-only attacks. Modern kernels utilize kernel ASLR (KASLR) and finer-grained forms, such as functional granular KASLR (FGKASLR), but do so as part of an inefficient bootstrapping process we call bootstrap self-randomization. Meanwhile, under increasing pressure to optimize their boot times, microVM architectures such as AWS Firecracker have resorted to eliminating bootstrapping steps, particularly decompression and relocation from the guest kernel boot process, leaving them without KASLR. In this paper, we present in-monitor KASLR, in which the virtual machine monitor efficiently implements KASLR for the guest kernel by skipping the expensive kernel self-relocation steps. We prototype in-monitor KASLR and FGKASLR in the open-source Firecracker virtual machine monitor demonstrating, on a microVM configured kernel, boot times 22% and 16% faster than bootstrapped KASLR and FGKASLR methods, respectively. We also show the low overhead of in-monitor KASLR, with only 4% (2 ms) increase in boot times on average compared to a kernel without KASLR. We also discuss the implications and future opportunities for in-monitor approaches.
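A minimal sketch of the in-monitor idea, in Python for brevity: the virtual machine monitor, rather than the guest kernel, picks a random aligned base address and would then load the already uncompressed kernel there, skipping guest-side decompression and self-relocation. The constants and function name are made up for illustration and do not reflect Firecracker's code.

```python
import secrets

def pick_kaslr_base(min_base: int, max_base: int, align: int) -> int:
    """Choose a random, properly aligned kernel load address in
    [min_base, max_base). In an in-monitor design the VMM does this once per
    boot and fixes up boot parameters/relocation entries itself."""
    slots = (max_base - min_base) // align
    return min_base + secrets.randbelow(slots) * align

base = pick_kaslr_base(min_base=0x0100_0000, max_base=0x4000_0000, align=0x20_0000)
print(hex(base))  # the monitor would place the kernel image at this base
```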
Nyx-net: network fuzzing with incremental snapshots
Authors: Schumilo, Sergej and Aschermann, Cornelius and Jemmett, Andrea and Abbasi, Ali and Holz, Thorsten
Keywords: fuzzing, software security, testing
Abstract
Coverage-guided fuzz testing (“fuzzing”) has become mainstream and we have observed substantial progress in this research area recently. However, it is still challenging to efficiently test network services with existing coverage-guided fuzzing methods. In this paper, we introduce the design and implementation of Nyx-Net, a novel snapshot-based fuzzing approach that can successfully fuzz a wide range of targets spanning servers, clients, games, and even Firefox’s Inter-Process Communication (IPC) interface. Compared to state-of-the-art methods, Nyx-Net improves test throughput by up to 300x and coverage found by up to 70%. Additionally, Nyx-Net is able to find crashes in two of ProFuzzBench’s targets that no other fuzzer found previously. When using Nyx-Net to play the game Super Mario, Nyx-Net shows speedups of 10–30x compared to existing work. Moreover, Nyx-Net is able to find previously unknown bugs in servers such as Lighttpd, clients such as the MySQL client, and even Firefox’s IPC mechanism—demonstrating the strength and versatility of the proposed approach. Lastly, our prototype implementation was awarded a $20,000 bug bounty for enabling fuzzing on previously unfuzzable code in Firefox and solving a long-standing problem at Mozilla.
DeepRest: deep resource estimation for interactive microservices
Authors: Chow, Ka-Ho and Deshpande, Umesh and Seshadri, Sangeetha and Liu, Ling
Keywords: API, cloud computing, cyberattacks, machine learning, microservices, neural networks, resource estimation
Abstract
Interactive microservices expose API endpoints to be invoked by users. For such applications, precisely estimating the resources required to serve specific API traffic is challenging. This is because an API request can interact with different components and consume different resources for each component. The notion of API traffic is vital to application owners since the API endpoints often reflect business logic, e.g., a customer transaction. The existing systems that simply rely on historical resource utilization are not API-aware and thus cannot estimate the resource requirement accurately. This paper presents DeepRest, a deep learning-driven resource estimation system. DeepRest formulates resource estimation as a function of API traffic and learns the causality between user interactions and resource utilization directly in a production environment. Our evaluation shows that DeepRest can estimate resource requirements with over 90% accuracy, even if the API traffic to be estimated has never been observed before (e.g., 3× the previously observed traffic volume).
Unicorn: reasoning about configurable system performance through the lens of causality
Authors: Iqbal, Md Shahriar and Krishna, Rahul and Javidian, Mohammad Ali and Ray, Baishakhi and Jamshidi, Pooyan
Keywords: causal inference, configurable systems, counterfactual reasoning, performance debugging, performance modeling, performance optimization
Abstract
Modern computer systems are highly configurable, with the total variability space sometimes larger than the number of atoms in the universe. Understanding and reasoning about the performance behavior of highly configurable systems, over a vast and variable space, is challenging. State-of-the-art methods for performance modeling and analysis rely on predictive machine learning models; as a result, they (i) become unreliable in unseen environments (e.g., different hardware or workloads), and (ii) may produce incorrect explanations. To tackle this, we propose a new method, called Unicorn, which (i) captures intricate interactions between configuration options across the software-hardware stack and (ii) describes how such interactions can impact performance variations via causal inference. We evaluated Unicorn on six highly configurable systems, including three on-device machine learning systems, a video encoder, a database management system, and a data analytics pipeline. The experimental results indicate that Unicorn outperforms state-of-the-art performance debugging and optimization methods in finding effective repairs for performance faults and finding configurations with near-optimal performance. Further, unlike the existing methods, the learned causal performance models reliably predict performance for new environments.
Multi-objective congestion control
Authors: Ma, Yiqing and Tian, Han and Liao, Xudong and Zhang, Junxue and Wang, Weiyan and Chen, Kai and Jin, Xin
Keywords: congestion control, multi-objective, reinforcement learning
Abstract
Decades of research on Internet congestion control (CC) have produced a plethora of algorithms that optimize for different performance objectives. Applications face the challenge of choosing the most suitable algorithm based on their needs, and it takes tremendous effort and expertise to customize CC algorithms when new demands emerge. In this paper, we explore a basic question: can we design a single CC algorithm to satisfy different objectives? We propose MOCC, the first multi-objective congestion control algorithm that attempts to address this question. The core of MOCC is a novel multi-objective reinforcement learning framework for CC that automatically learns the correlations between different application requirements and the corresponding optimal control policies. Under this framework, MOCC further applies transfer learning to transfer the knowledge from past experience to new applications, quickly adapting itself to a new objective even if it is unforeseen. We provide both user-space and kernel-space implementations of MOCC. Real-world Internet experiments and extensive simulations show that MOCC supports multiple objectives well, competing with or outperforming the best existing CC algorithms on each individual objective, and quickly adapts to new application objectives in 288 seconds (14.2× faster than learning from scratch).
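To show how a single learned policy can serve different objectives, here is a toy Python reward function that scalarizes throughput, latency, and loss with per-application preference weights; the exact functional form is an assumption for illustration, not MOCC's published reward.

```python
def scalarized_reward(throughput: float, latency: float, loss_rate: float,
                      weights: tuple) -> float:
    """Combine normalized performance signals into one reward using
    application-specific weights (throughput weight, latency weight,
    loss weight). Higher is better."""
    w_tput, w_lat, w_loss = weights
    return w_tput * throughput - w_lat * latency - w_loss * loss_rate

# The same network state scores differently under different objectives:
state = dict(throughput=0.9, latency=0.4, loss_rate=0.01)
bulk_transfer = (1.0, 0.1, 2.0)   # cares mostly about throughput
video_call    = (0.3, 1.0, 5.0)   # cares mostly about delay and loss
print(scalarized_reward(**state, weights=bulk_transfer))
print(scalarized_reward(**state, weights=video_call))
```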
Hybrid anomaly detection and prioritization for network logs at cloud scale
Authors: Ohana, David and Wassermann, Bruno and Dupuis, Nicolas and Kolodner, Elliot and Raichstein, Eran and Malka, Michal
Keywords: AIOps, anomaly detection, cloud computing, deep learning, log analysis, machine learning, reliability
Abstract
Monitoring the health of large-scale systems requires significant manual effort, usually through the continuous curation of alerting rules based on keywords, thresholds and regular expressions, which might generate a flood of mostly irrelevant alerts and obscure the actual information operators would like to see. Existing approaches try to improve the observability of systems by intelligently detecting anomalous situations. Such solutions surface anomalies that are statistically significant, but may not represent events that reliability engineers consider relevant. We propose ADEPTUS, a practical approach for detection of relevant health issues in an established system. ADEPTUS combines statistics and unsupervised learning to detect anomalies with supervised learning and heuristics to determine which of the detected anomalies are likely to be relevant to the Site Reliability Engineers (SREs). ADEPTUS overcomes the labor-intensive prerequisite of obtaining anomaly labels for supervised learning by automatically extracting information from historic alerts and incident tickets. We leverage ADEPTUS for observability in the network infrastructure of IBM Cloud. We perform an extensive real-world evaluation on 10 months of logs generated by tens of thousands of network devices across 11 data centers and demonstrate that ADEPTUS achieves higher alerting accuracy than the rule-based log alerting solution, curated by domain experts, used by SREs daily.
Performance evolution of mitigating transient execution attacks
Authors: Behrens, Jonathan and Belay, Adam and Kaashoek, M. Frans
Keywords: meltdown, spectre, speculative execution, transient execution attack
Abstract
Today’s applications pay a performance penalty for mitigations to protect against transient execution attacks such as Meltdown [32] and Spectre [25]. Such a reduction in performance directly translates to higher operating costs and degraded user experience. This paper measures the performance impact of these mitigations across a range of processors from multiple vendors and across several security boundaries to identify trends over successive generations of processors and to attribute how much of the overall slowdown is caused by each individual mitigation. We find that overheads for operating system intensive workloads have declined by as much as 10×.
You shall not (by)pass! practical, secure, and fast PKU-based sandboxing
Authors: Voulimeneas, Alexios and Vinck, Jonas and Mechelinck, Ruben and Volckaert, Stijn
Keywords: PKU, in-process isolation, sandboxing, security
Abstract
Memory Protection Keys for Userspace (PKU) is a recent hardware feature that allows programs to assign virtual memory pages to protection domains, and to change domain access permissions using inexpensive, unprivileged instructions. Several in-process memory isolation approaches leverage this feature to prevent untrusted code from accessing sensitive program state and data. Typically, PKU-based isolation schemes need to be used in conjunction with mitigations such as CFI because untrusted code, when compromised, can otherwise bypass the PKU access permissions using unprivileged instructions or operating system APIs. Recently, researchers proposed fully self-contained PKU-based memory isolation schemes that do not rely on other mitigations. These systems use exploit-proof call gates to transfer control between trusted and untrusted code, as well as a sandbox that prevents tampering with the PKU infrastructure from untrusted code. In this paper, we show that these solutions are not complete. We first develop two proof-of-concept attacks against a state-of-the-art PKU-based memory isolation scheme. We then present Cerberus, a PKU-based sandboxing framework that can overcome limitations of existing sandboxes. We apply Cerberus to several memory isolation schemes, and show that it is practical, efficient, and secure.
Verified programs can party: optimizing kernel extensions via post-verification merging
Authors: Kuo, Hsuan-Chi and Chen, Kai-Hsun and Lu, Yicheng and Williams, Dan and Mohan, Sibin and Xu, Tianyin
Keywords: BPF, eBPF, indirect jump, kernel extension, retpoline, spectre, transient attack
Abstract
Operating system (OS) extensions are more popular than ever. For example, Linux BPF is marketed as a “superpower” that allows user programs to be downloaded into the kernel, verified to be safe, and executed at kernel hook points. As a result, BPF extensions have high performance and are often placed on performance-critical paths for tracing and filtering. However, although BPF extension programs execute in a shared kernel environment and are already individually verified, they are often executed independently in chains. We observe that this chain pattern has large performance overhead, due to indirect jumps penalized by security mitigations (e.g., Spectre mitigations), loops, and memory accesses. In this paper, we argue for a separation of concerns. We propose to decouple the execution of BPF extensions from their verification requirements—BPF extension programs can be collectively optimized after each BPF extension program is individually verified and loaded into the shared kernel. We present KFuse, a framework that dynamically and automatically merges chains of BPF programs by transforming indirect jumps into direct jumps, unrolling loops, and saving memory accesses, without loss of security or flexibility. KFuse can merge BPF programs that are (1) installed by multiple principals, (2) maintained to be modular and separate, (3) installed at different points in time, and (4) split into smaller, verifiable programs via BPF tail calls. KFuse demonstrates an 85% performance improvement of BPF chain execution and a 7% application performance improvement over existing BPF use cases (systemd’s Seccomp BPF filters). It achieves more significant benefits for longer chains.
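The merging idea can be pictured with a toy Python sketch in which each verified program ends with a tail call to the next, and fusing simply replaces those indirect transfers with straight-line fallthrough; instructions are symbolic strings here, not real BPF bytecode, and this is not KFuse's actual rewriting pass.

```python
def fuse_programs(programs: list) -> list:
    """Concatenate already-verified programs into one straight-line program,
    dropping the tail-call instructions that previously chained them via
    indirect jumps (the jumps penalized by Spectre mitigations)."""
    fused = []
    for program in programs:
        for insn in program:
            if insn == "TAIL_CALL_NEXT":
                continue  # replaced by direct fallthrough into the next program
            fused.append(insn)
    return fused

prog_a = ["load pkt.proto", "jump_if_not TCP -> drop", "TAIL_CALL_NEXT"]
prog_b = ["load pkt.dport", "jump_if_not 443 -> drop", "accept"]
print(fuse_programs([prog_a, prog_b]))
```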
Minimum viable device drivers for ARM TrustZone
Authors: Guo, Liwei and Lin, Felix Xiaozhu
Keywords: arm TrustZone, device drivers, operating systems, security
Abstract
While TrustZone can isolate IO hardware, it lacks drivers for modern IO devices. Rather than porting drivers, we propose a novel approach to deriving minimum viable drivers: developers exercise a full driver and record the driver/device interactions; the processed recordings, dubbed driverlets, are replayed in the TEE at run time to access IO devices. Driverlets address two key challenges: correctness and expressiveness, for which they build on a key construct called the interaction template. The interaction template ensures faithful reproduction of recorded IO jobs (albeit on new IO data); it accepts dynamic input values; and it tolerates nondeterministic device behaviors. We demonstrate driverlets on a series of sophisticated devices, making them accessible to TrustZone for the first time to our knowledge. Our experiments show that driverlets are secure, easy to build, and incur acceptable overhead (1.4×).
OPEC: operation-based security isolation for bare-metal embedded systems
Authors: Zhou, Xia and Li, Jiaqi and Zhang, Wenlong and Zhou, Yajin and Shen, Wenbo and Ren, Kui
Keywords: hardware-assisted security, memory protection unit, security isolation
Abstract
Bare-metal embedded systems usually lack security isolation. Attackers can subvert the whole system with a single vulnerability. Previous research intends to enforce both privilege isolation (to run application code at the unprivileged level) and resource isolation for global variables and peripherals. However, it suffers from partition-time and execution-time over-privilege issues, due to the limited hardware resources (MPU regions) and the improper way it partitions a program. In this paper, we propose operation-based isolation for bare-metal embedded systems. An operation is a logically independent task composed of an entry function and all functions reachable from it. To solve the partition-time over-privilege issue, we utilize the global-variable shadowing technique to reduce the number of MPU regions needed to confine access to global variables. To mitigate the execution-time over-privilege issue, we split programs into code compartments (called operations) that contain only the functions necessary to perform specific tasks, thereby removing the resources needed by unnecessary functions. We implement a prototype called OPEC, which contains an LLVM-based compiler and a reference monitor. The compiler partitions a program and analyzes the resource dependency of each operation. With the hardware-supported privilege levels and MPU, the reference monitor is responsible for enforcing the privilege and resource isolation at runtime. Our evaluation shows that OPEC can achieve the security guarantees for privilege and resource isolation with negligible runtime overhead (average 0.23%), moderate Flash overhead (average 1.79%), and acceptable SRAM overhead (average 5.35%).
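The global-variable shadowing idea can be illustrated with a small Python sketch: the globals an operation may touch are laid out contiguously in a shadow area so that a single MPU region (one base/length pair) covers all of them, instead of one scarce region per scattered variable. Addresses, sizes, and names below are hypothetical.

```python
def build_shadow_region(globals_needed: list, sizes: dict, base: int):
    """Pack the globals needed by one operation into a contiguous shadow area
    and return (per-variable shadow addresses, single MPU region covering them).
    Purely illustrative; not OPEC's compiler pass."""
    shadow_layout = {}
    offset = 0
    for name in globals_needed:
        shadow_layout[name] = base + offset
        offset += sizes[name]
    mpu_region = (base, offset)  # one (base, length) region suffices
    return shadow_layout, mpu_region

sizes = {"sensor_buf": 256, "config": 32, "log_head": 4}
layout, region = build_shadow_region(["sensor_buf", "config"], sizes,
                                     base=0x2000_0000)
print(layout)
print("MPU region:", hex(region[0]), "len", region[1])
```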
LiteReconfig: cost and content aware reconfiguration of video object detection systems for mobile GPUs
Authors: Xu, Ran and Lee, Jayoung and Wang, Pengcheng and Bagchi, Saurabh and Li, Yin and Chaterji, Somali
Keywords: approximate computing, latency-sensitive analytics, mobile vision, object detection, reconfiguration, video analytics
Abstract
An adaptive video object detection system selects different execution paths at runtime, based on video content and available resources, so as to maximize accuracy under a target latency objective (e.g., 30 frames per second). Such a system is well suited to mobile devices with limited computing resources, which often run multiple contending applications. Existing solutions suffer from two major drawbacks. First, collecting feature values to decide on an execution branch is expensive. Second, there is a switching overhead for transitioning between branches, and this overhead depends on the transition pair. LiteReconfig, an efficient and adaptive video object detection framework, addresses these challenges. LiteReconfig features a cost-benefit analyzer to decide which features to use, and which execution branch to run, at inference time. Furthermore, LiteReconfig has a content-aware accuracy prediction model to select an execution branch tailored for frames in a video stream. We demonstrate that LiteReconfig achieves significantly higher accuracy than existing systems under a set of varying latency objectives, while maintaining up to 50 fps on an NVIDIA AGX Xavier board. Our code, with DOI, is available at https://doi.org/10.5281/zenodo.6345733.
Slashing the disaggregation tax in heterogeneous data centers with FractOS
Authors: Vilanova, Lluís
Keywords: capabilities, data center, distributed systems, operating systems, resource disaggregation
Abstract
Disaggregated heterogeneous data centers promise higher efficiency, lower total costs of ownership, and more flexibility for data-center operators. However, current software stacks can levy a high tax on application performance. Applications and OSes are designed for systems where local PCIe-connected devices are centrally managed by CPUs, but this centralization introduces unnecessary messages through the shared data-center network in a disaggregated system. We present FractOS, a distributed OS that is designed to minimize the network overheads of disaggregation in heterogeneous data centers. FractOS elevates devices to be first-class citizens, enabling direct peer-to-peer data transfers and task invocations among them, without centralized application and OS control. FractOS achieves this through: (1) new abstractions to express distributed applications across services and disaggregated devices, (2) new mechanisms that enable devices to securely interact with each other and other data-center services, (3) a distributed and isolated OS layer that implements these abstractions and mechanisms, and can run on host CPUs and SmartNICs. Our prototype shows that FractOS accelerates real-world heterogeneous applications by 47%, while reducing their network traffic by 3×.
OS scheduling with Nest: keeping tasks close together on warm cores
Authors: Lawall, Julia and Chhaya-Shailesh, Himadri and Lozi, Jean-Pierre and Lepers, Baptiste and Zwaenepoel, Willy and Muller, Gilles
Keywords: Linux kernel, scheduling
Abstract
To best support highly parallel applications, Linux’s CFS scheduler tends to spread tasks across the machine on task creation and wakeup. It has been observed, however, that in a server environment, such a strategy leads to tasks being unnecessarily placed on long-idle cores that are running at lower frequencies, reducing performance, and to tasks being unnecessarily distributed across sockets, consuming more energy. In this paper, we propose to exploit the principle of core reuse, by constructing a nest of cores to be used in priority for task scheduling, thus obtaining higher frequencies and using fewer sockets. We implement the Nest scheduler in the Linux kernel. While performance and energy usage are comparable to CFS for highly parallel applications, for a range of applications using fewer tasks than cores, Nest improves performance by 10%–2×.
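A minimal Python sketch of the core-reuse heuristic described above: prefer an idle core from the nest of recently used (warm, high-frequency) cores, and only expand the nest when every nest core is busy. This is a simplification for illustration, not the Nest patch to CFS.

```python
def pick_core(nest: list, idle_cores: set):
    """Return a core for a waking task: a warm nest core if one is idle,
    otherwise grow the nest with some other idle core (or None if all busy)."""
    for core in nest:
        if core in idle_cores:
            return core  # reuse a warm core running at a higher frequency
    extra = next(iter(idle_cores - set(nest)), None)
    if extra is not None:
        nest.append(extra)  # expand the nest only as a last resort
    return extra

nest = [2, 3]
print(pick_core(nest, idle_cores={1, 3, 5}))  # -> 3: warm core reused
print(pick_core(nest, idle_cores={1, 5}))     # -> nest grows with a new core
print(nest)
```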
Kite: lightweight critical service domains
Authors: Mehrab, A K M Fazla and Nikolaev, Ruslan and Ravindran, Binoy
Keywords: Xen, hypervisor, unikernel, virtual machine
Abstract
Converged multi-level secure (MLS) systems, such as Qubes OS or SecureView, heavily rely on virtualization and service virtual machines (VMs). Traditionally, driver domains - isolated VMs that run device drivers - and daemon VMs use full-blown general-purpose OSs. Specialized lightweight OSs, known as unikernels, would seem a better fit for these roles. Surprisingly, to this day, driver domains can only be built from Linux. We discuss how unikernels can be beneficial in this context - they improve security and isolation, reduce memory overheads, and simplify software configuration and deployment. We specifically propose to use unikernels that borrow device drivers from existing general-purpose OSs. We present Kite, which implements network and storage unikernel-based VMs that serve two essential classes of devices. We compare our approach against Linux using a number of typical micro- and macrobenchmarks used for networking and storage. Our approach achieves performance similar to that of Linux. However, we demonstrate that the number of system calls and ROP gadgets can be greatly reduced with our approach compared to Linux. We also demonstrate that our approach has resilience to an array of CVEs (e.g., CVE-2021-35039, CVE-2016-4963, and CVE-2013-2072), smaller image size, and improved startup time. Finally, unikernelizing the remaining (non-driver) service VMs is feasible, as evidenced by our unikernelized DHCP server.
Fleche: an efficient GPU embedding cache for personalized recommendations
Authors: Xie, Minhui and Lu, Youyou and Lin, Jiazhen and Wang, Qing and Gao, Jian and Ren, Kai and Shu, Jiwu
Keywords: GPU cache, deep learning recommendation models, embedding lookup, memory management
Abstract
Deep learning based models have dominated current production recommendation systems. However, the gap between CPU-side DRAM data accessing and GPU processing still impedes their inference performance. A GPU-resident cache can bridge this gap, but we find that existing systems leave unexploited the benefits of caching the embedding table, a huge sparse structure, on the GPU. In this paper, we present Fleche, a holistic cache scheme with detailed designs for efficient GPU-resident embedding caching. Fleche (1) uses one cache backend for all embedding tables to improve the total cache utilization, and (2) merges small kernel calls into one unitary call to reduce the overhead of kernel maintenance (e.g., kernel launching and synchronizing). Furthermore, we carefully design the cache query workflow for finer-grain parallelism. Evaluations with real-world datasets show that, compared with the prior art, Fleche significantly improves the throughput of the embedding layer by 2.0–5.4×.
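The "one cache backend for all tables" idea can be sketched in a few lines of Python: entries from every embedding table share a single cache keyed by (table id, row id), so hot tables can use capacity that would otherwise sit idle in per-table caches. A host-side LRU stands in for GPU memory; this is illustrative, not Fleche's kernel-level design.

```python
from collections import OrderedDict

class UnifiedEmbeddingCache:
    """Single cache backend shared by all embedding tables, keyed by
    (table_id, row_id). LRU eviction approximates a GPU-resident cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()

    def lookup(self, table_id: int, row_id: int, fetch_fn):
        key = (table_id, row_id)
        if key in self._entries:
            self._entries.move_to_end(key)        # hit: refresh recency
            return self._entries[key]
        vector = fetch_fn(table_id, row_id)       # miss: fetch from host DRAM
        self._entries[key] = vector
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)     # evict least recently used
        return vector

cache = UnifiedEmbeddingCache(capacity=2)
fetch = lambda t, r: [float(t), float(r)]          # stand-in for a DRAM lookup
print(cache.lookup(0, 7, fetch), cache.lookup(1, 3, fetch), cache.lookup(0, 7, fetch))
```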
GNNLab: a factored system for sample-based GNN training over GPUs
Authors: Yang, Jianbang and Tang, Dahai and Song, Xiaoniu and Wang, Lei and Yin, Qiang and Chen, Rong and Yu, Wenyuan and Zhou, Jingren
Keywords: caching policy, graph neural networks, sample-based GNN training
Abstract
We propose GNNLab, a sample-based GNN training system in a single-machine multi-GPU setup. GNNLab adopts a factored design for multiple GPUs, where each GPU is dedicated to the task of graph sampling or model training. It accelerates both tasks by eliminating GPU memory contention. To balance GPU workloads, GNNLab applies a global queue to bridge GPUs asynchronously and adopts a simple yet effective method to adaptively allocate GPUs for different tasks. GNNLab further lets GPUs temporarily switch tasks to avoid idle waiting. Furthermore, GNNLab proposes a new pre-sampling based caching policy that takes both sampling algorithms and GNN datasets into account, and shows efficient and robust caching performance. Evaluations on three representative GNN models and four real-life graphs show that GNNLab outperforms the state-of-the-art GNN systems DGL and PyG by up to 9.1×.
Out-of-order backprop: an effective scheduling technique for deep learning
Authors: Oh, Hyungjun and Lee, Junyeol and Kim, Hyeongju and Seo, Jiwon
Keywords: deep learning systems
Abstract
Neural network training requires a large amount of computation and thus GPUs are often used for acceleration. While they improve performance, GPUs are underutilized during training. This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training. By exploiting the dependencies of gradient computations, ooo backprop enables their executions to be reordered to make the most of the GPU resources. We show that GPU utilization in single- and multi-GPU training can be commonly improved by applying ooo backprop and prioritizing critical operations. We propose three scheduling algorithms based on ooo backprop. For single-GPU training, we schedule with multi-stream ooo computation to mask the kernel launch overhead. In data-parallel training, we reorder the gradient computations to maximize the overlapping of computation and parameter communication; in pipeline-parallel training, we prioritize critical gradient computations to reduce pipeline stalls. We evaluate our optimizations with twelve neural networks and five public datasets. Compared to the respective state-of-the-art training systems, our algorithms improve the training throughput by 1.03–1.58×.
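The reordering idea can be illustrated with a symbolic Python schedule: for each layer, the input-gradient computation (which the next backward step depends on) is issued first, while weight-gradient computations are deferred and interleaved with gradient communication. This is a toy schedule under those assumptions, not the system's actual scheduler.

```python
def ooo_backprop_schedule(layers: list) -> list:
    """Build a symbolic backward-pass schedule: critical-path input gradients
    first, deferred weight gradients afterwards, each immediately followed by
    its (overlappable) gradient all-reduce in data-parallel training."""
    schedule = []
    deferred = []
    for layer in reversed(layers):
        schedule.append(f"compute dX[{layer}]")   # needed by the next layer
        deferred.append(layer)                    # dW is off the critical path
    for layer in deferred:
        schedule.append(f"compute dW[{layer}]")
        schedule.append(f"all-reduce dW[{layer}]  (overlaps remaining compute)")
    return schedule

for step in ooo_backprop_schedule(["layer1", "layer2", "layer3"]):
    print(step)
```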
D3: a dynamic deadline-driven approach for building autonomous vehicles
Authors: Gog, Ionel and Kalra, Sukrit and Schafhalter, Peter and Gonzalez, Joseph E. and Stoica, Ion
Keywords: No keywords
Abstract
Autonomous vehicles (AVs) must drive across a variety of challenging environments that impose continuously-varying deadlines and runtime-accuracy tradeoffs on their software pipelines. A deadline-driven execution of such AV pipelines requires a new class of systems that enable the computation to maximize accuracy under dynamically-varying deadlines. Designing these systems presents interesting challenges that arise from combining ease-of-development of AV pipelines with deadline specification and enforcement mechanisms. Our work addresses these challenges through D3 (Dynamic Deadline-Driven), a novel execution model that centralizes deadline management and allows applications to adjust their computation by modeling missed deadlines as exceptions. Further, we design and implement ERDOS, an open-source realization of D3 for AV pipelines that exposes fine-grained execution events to applications, and provides mechanisms to speculatively execute computation and enforce deadlines between an arbitrary set of events. Finally, we address the crucial lack of AV benchmarks through our state-of-the-art open-source AV pipeline, Pylot, which works seamlessly across simulators and real AVs. We evaluate the efficacy of D3 and ERDOS by driving Pylot across challenging driving scenarios spanning 50 km, and observe a 68% reduction in collisions as compared to prior execution models.
Varuna: scalable, low-cost training of massive deep learning models
Authors: Athlur, Sanjith and Saran, Nitika and Sivathanu, Muthian and Ramjee, Ramachandran and Kwatra, Nipun
Keywords: distributed systems, large scale DNN training, systems for machine learning
Abstract
Systems for training massive deep learning models (billions of parameters) today assume and require specialized “hyperclusters”: hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NVLink and InfiniBand. Besides being expensive, such dependence on hyperclusters and custom high-speed interconnects limits the size of such clusters, creating (a) scalability limits on job parallelism and (b) resource fragmentation across hyperclusters. In this paper, we present Varuna, a new system that enables training massive deep learning models on commodity networking. Varuna makes thrifty use of networking resources and automatically configures the user’s training job to efficiently use any given set of resources. Therefore, Varuna is able to leverage “low-priority” VMs that cost about 5x less than dedicated GPUs, thus significantly reducing the cost of training massive models. We demonstrate the efficacy of Varuna by training massive models, including a 200 billion parameter model, on 5x cheaper “spot VMs”, while maintaining high training throughput. Varuna improves end-to-end training time for language models like BERT and GPT-2 by up to 18x compared to other model-parallel approaches and by up to 26% compared to other pipeline-parallel approaches on commodity VMs. The code for Varuna is available at https://github.com/microsoft/varuna.
Characterizing the performance of Intel Optane persistent memory: a close look at its on-DIMM buffering
Authors: Xiang, Lingfeng and Zhao, Xingsheng and Rao, Jia and Jiang, Song and Jiang, Hong
Keywords: Optane DCPMM, performance characterization, persistent memory
Abstract
We present a comprehensive and in-depth study of Intel Optane DC persistent memory (DCPMM). Our focus is on exploring the internal design of Optane’s on-DIMM read-write buffering and its impacts on application-perceived performance, read and write amplification, the overhead of different types of persists, and the tradeoffs between persistency models. While our measurements confirm the results of the existing profiling studies, we have new discoveries and offer new insights. Notably, we find that reads and writes are managed differently in separate on-DIMM read and write buffers. Comparable in size, the two buffers serve distinct purposes. The read buffer offers higher concurrency and effective on-DIMM prefetching, leading to high read bandwidth and superior sequential performance. However, it does not help hide media access latency. In contrast, the write buffer offers limited concurrency but is a critical stage in a pipeline that supports asynchronous writes in the DDR-T protocol. Surprisingly, in addition to write coalescing, the write buffer delivers write latency that is lower than read latency and consistent regardless of the working set size, the type of write, the access pattern, or the persistency model. Furthermore, we discover that the mismatch between cacheline access granularity and the 3D XPoint media access granularity negatively impacts the effectiveness of CPU cache prefetching and leads to wasted persistent memory bandwidth. Our proposition is to decouple read and write in the performance analysis and optimization of persistent programs. We present three case studies based on this insight and demonstrate considerable performance improvements. We verify the results on two generations of Optane DCPMM.