CGO 2024

A Tensor Algebra Compiler for Sparse Differentiation

Authors: Shaikhha, Amir and Huot, Mathieu and Hashemian, Shideh
Keywords: sparse tensor algebra, automatic differentiation, semi-ring dictionaries

Abstract

Sparse tensors are prevalent in many data-intensive applications. However, existing automatic differentiation (AD) frameworks are tailored towards dense tensors, which makes it a challenge to efficiently compute gradients through sparse tensor operations. This is due to irregular sparsity patterns that can result in substantial memory and computational overheads. We propose a novel framework that enables the efficient AD of sparse tensors. The key aspects of our work include a compilation pipeline leveraging two intermediate DSLs with AD-agnostic domain-specific optimizations followed by efficient C++ code generation. We showcase the effectiveness of our framework in terms of performance and scalability through extensive experimentation, outperforming state-of-the-art alternatives across a variety of synthetic and real-world datasets.
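
The semi-ring dictionary idea is easy to picture in miniature. The following sketch (a toy illustration of mine, not the paper's two-DSL compilation pipeline) differentiates a sparse dot product stored as Python dicts, showing why the gradient can preserve the input's sparsity:

```python
# Toy illustration (not the paper's compiler pipeline): differentiation over
# dictionary-backed sparse vectors, where the gradient keeps the sparsity
# pattern of the inputs instead of densifying.

def sparse_dot(x, y):
    """Dot product of two {index: value} sparse vectors."""
    return sum(v * y[i] for i, v in x.items() if i in y)

def grad_wrt_x(x, y):
    """d(x . y)/dx: just y restricted to the indices stored in x."""
    return {i: y[i] for i in x if i in y}

x = {0: 2.0, 5: 3.0}              # only indices 0 and 5 are materialized
y = {0: 4.0, 5: 1.0, 9: 7.0}
print(sparse_dot(x, y))           # 11.0
print(grad_wrt_x(x, y))           # {0: 4.0, 5: 1.0} -- still sparse
```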

DOI: 10.1109/CGO57630.2024.10444787


Energy-Aware Tile Size Selection for Affine Programs on GPUs

Authors: Jayaweera, Malith and Kong, Martin and Wang, Yanzhi and Kaeli, David
Keywords: loop tiling, energy optimization, affine transformations, GPUs

Abstract

Loop tiling is a high-order transformation used to increase data locality and performance. While previous work has considered its application to several domains and architectures, its potential impact on energy efficiency has been largely ignored. In this work, we present an Energy-Aware Tile Size Selection Scheme (EATSS) for affine programs targeting GPUs. We automatically derive non-linear integer formulations for affine programs and use the Z3 solver to find effective tile sizes that meet architectural resource constraints, while maximizing performance and minimizing energy consumption. Our approach builds on the insight that reducing the liveness of in-cache data, together with exploiting automatic power scaling, can lead to substantial gains in performance and energy efficiency. We evaluate EATSS on NVIDIA Xavier and GA100 GPUs, and report median performance-per-Watt improvement relative to PPCG on several affine kernels. On PolyBench kernels, we achieve 1.5×…
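
For a flavor of what a solver-based tile-size formulation looks like, here is a small sketch using the z3-solver Python package; the resource limits and objective below are invented placeholders, not EATSS's actual model:

```python
# Hedged sketch of solver-based tile-size selection with Z3 (pip: z3-solver).
# All constants are illustrative placeholders, not EATSS's formulation.
from z3 import Int, Optimize, sat

tx, ty = Int('tx'), Int('ty')
opt = Optimize()
opt.add(tx >= 1, ty >= 1, tx <= 64, ty <= 64)
opt.add(tx * ty <= 1024)          # threads-per-block limit
opt.add(4 * tx * ty <= 2048)      # assumed per-tile shared-memory budget (bytes)
opt.maximize(tx * ty)             # proxy objective: work per block
if opt.check() == sat:
    m = opt.model()
    print('tile =', m[tx], 'x', m[ty])
```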

DOI: 10.1109/CGO57630.2024.10444795


PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler

Authors: Consolaro, Gianpietro and Zhang, Zhen and Razanajato, Harenome and Lossing, Nelson and Tchoulak, Nassim and Susungi, Adilla and Alves, Artur Cesar Araujo and Zhang, Renwei and Barthou, Denis and Ancourt, Corinne and Bastoul, Cedric
Keywords: polyhedral optimization, polyhedral scheduling, configurability, flexibility

Abstract

Polyhedral techniques have been widely used for automatic code optimization in low-level compilers and higher-level processes. Loop optimization is central to this technique, and several polyhedral schedulers like Feautrier, Pluto, isl and Tensor Scheduler have been proposed, each of them targeting a different architecture, parallelism model, or application scenario. The need for scenario-specific optimization is growing due to the heterogeneity of architectures. One of the most critical cases is represented by NPUs (Neural Processing Units) used for AI, which may require loop optimization with different objectives. Another factor to be considered is the framework or compiler in which polyhedral optimization takes place. Different scenarios, depending on the target architecture, compilation environment, and application domain, may require different kinds of optimization to best exploit the architecture feature set. We introduce a new configurable polyhedral scheduler, PolyTOPS, that can be adjusted to various scenarios with straightforward, high-level configurations. This scheduler allows the creation of diverse scheduling strategies that can be both scenario-specific (like state-of-the-art schedulers) and kernel-specific, breaking the concept of a one-size-fits-all scheduler approach. PolyTOPS has been used with isl and CLooG as code generators and has been integrated in the MindSpore AKG deep learning compiler. Experimental results in different scenarios show good performance: a geomean speedup of 7.66x on MindSpore (for the NPU Ascend architecture) hybrid custom operators over isl scheduling, and a geomean speedup of up to 1.80x on PolyBench on different multicore architectures over Pluto scheduling. Finally, some comparisons with different state-of-the-art tools are presented in the PolyMage scenario.

DOI: 10.1109/CGO57630.2024.10444791


AskIt: Unified Programming Interface for Programming with Large Language Models

Authors: Okuda, Katsumi and Amarasinghe, Saman
Keywords: domain specific language, code generation, large language model, software engineering, artificial intelligence

Abstract

Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks, from text summarization to code generation. While these abilities open up novel avenues in software design and crafting, their incorporation presents substantial challenges. Developers face decisions regarding the use of LLMs for directly performing tasks within applications as well as for generating and executing code to accomplish these tasks. Moreover, effective prompt design becomes a critical concern, given the necessity of extracting data from natural language outputs. To address these complexities, this paper introduces AskIt, a domain-specific language (DSL) specifically designed for LLMs. AskIt simplifies LLM integration by providing a unified interface that not only allows for direct task execution using LLMs but also supports the entire cycle of code generation and execution. This dual capability is achieved through (1) type-guided output control, (2) template-based function definitions, and (3) prompt generation for both usage modes. Our evaluations underscore AskIt’s effectiveness. Across 50 tasks, AskIt generated concise prompts, achieving a 16.14% reduction in prompt length compared to benchmarks. Additionally, by enabling a seamless transition between using LLMs directly in applications and for generating code, AskIt achieved significant efficiency improvements, as observed in our GSM8K benchmark experiments. The implementations of AskIt in TypeScript and Python are available at https://github.com/katsumiok/ts-askit and https://github.com/katsumiok/pyaskit, respectively.
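
To illustrate the template-based style of LLM programming that AskIt enables, here is a deliberately hypothetical sketch; `llm_complete` is a stand-in for any chat-completion client, and the `define` helper is my invention, not pyaskit's actual API:

```python
# Hypothetical sketch of template-based, type-guided LLM functions in the
# spirit of AskIt. `llm_complete` and `define` are stand-ins, not pyaskit's API.
import json

def llm_complete(prompt: str) -> str:
    raise NotImplementedError('plug in an LLM client here')

def define(template: str, output_type: type):
    """Turn a natural-language template into a callable with typed output."""
    def call(**kwargs):
        prompt = (template.format(**kwargs) +
                  f'\nAnswer only with JSON parsable as {output_type.__name__}.')
        return output_type(json.loads(llm_complete(prompt)))
    return call

days_in_weeks = define('How many days are there in {n} weeks?', int)
# days_in_weeks(n=3) would prompt the model and parse the reply as an int.
```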

DOI: 10.1109/CGO57630.2024.10444830


Revealing Compiler Heuristics through Automated Discovery and Optimization

Authors: Seeker, Volker and Cummins, Chris and Cole, Murray and Franke, Björn et al.
Keywords: search methodologies, compiler optimization, differential testing

Abstract

Tuning compiler heuristics and parameters is well known to improve optimization outcomes dramatically. Prior works have tuned command line flags and a few expert-identified heuristics. However, there are an unknown number of heuristics buried, unmarked and unexposed, inside the compiler as a consequence of decades of development without auto-tuning being foremost in the minds of developers. Many may not even have been considered heuristics by the developers who wrote them. The result is that auto-tuning search and machine learning can optimize only a tiny fraction of what could be possible if all heuristics were available to tune. Manually discovering all of these heuristics hidden among millions of lines of code and exposing them to auto-tuning tools is a Herculean task that is simply not practical. What is needed is a method of automatically finding these heuristics to extract every last drop of potential optimization. In this work, we propose Heureka, a framework that automatically identifies potential heuristics in the compiler that are highly profitable optimization targets and then automatically finds available tuning parameters for those heuristics with minimal human involvement. Our work is based on the following key insight: when modifying the output of a heuristic within an acceptable value range, the calling code using that output will still function correctly and produce semantically correct results. Building on that, we automatically manipulate the output of potential heuristic code in the compiler and decide, using a differential testing approach, whether we have found a heuristic or not. During output manipulation, we also explore acceptable value ranges of the targeted code. Heuristics identified in this way can then be tuned to optimize an objective function. We used Heureka to search for heuristics among eight thousand functions from the LLVM optimization passes, which is about 2% of all available functions. We then use identified heuristics to tune the compilation of 38 applications from the NAS and Polybench benchmark suites. Compared to an -Oz baseline, we reduce binary sizes by up to 11.6% considering single heuristics only, and up to 19.5% when stacking the effects of multiple identified tuning targets and applying a random search with minimal search effort. Generalizing from existing analysis results, Heureka needs, on average, a little under an hour on a single machine to identify relevant heuristic targets for a previously unseen application.
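
The key insight lends itself to a toy demonstration: perturb a suspected heuristic's output and check, differential-testing style, that semantics are preserved while cost metrics vary. Everything below is an invented miniature, not Heureka itself:

```python
# Invented miniature of Heureka's key insight: nudging a heuristic's output
# within an acceptable range keeps results correct but changes cost metrics.

def inline_threshold(callee_size: int) -> int:   # a toy "hidden heuristic"
    return 225 if callee_size < 100 else 0

def compile_program(program, threshold_fn):
    """Stand-in compiler: 'inline' the calls the heuristic approves of."""
    inlined = [c for c in program if threshold_fn(c) > 0]
    return sorted(program), len(program) + len(inlined)  # (semantics, size)

program = [20, 80, 150, 300]
ref_out, _ = compile_program(program, inline_threshold)
for delta in (-300, 0, 300):                     # explore the value range
    out, size = compile_program(program, lambda c: inline_threshold(c) + delta)
    # same semantics, different size => we found a tunable heuristic
    print(delta, out == ref_out, size)
```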

DOI: 10.1109/CGO57630.2024.10444847


SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly

Authors: Armengol-Estapé, Jordi et al.
Keywords: decompilation, neural decompilation, transformer, language models, type inference

Abstract

Decompilation is a well-studied area with numerous high-quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. Nevertheless, to date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence Transformer trained over real-world code and augmented with a type inference engine. We utilize a novel tokenizer, dropout-free regularization, and type inference to generate programs that are more readable and accurate than standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types and, unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 ExeBench functions on two ISAs and at two optimization levels. SLaDe is up to 6×…

DOI: 10.1109/CGO57630.2024.10444788


TapeFlow: Streaming Gradient Tapes in Automatic Differentiation

Authors: Hakimi, Milad and Shriraman, Arrvindh
Keywords: automatic differentiation, gradients, streaming algorithms, back propagation

Abstract

Computing gradients is a crucial task in many domains, including machine learning, physics simulations, and scientific computing. Automatic differentiation (AD) computes gradients for arbitrary imperative code. In reverse mode AD, an auxiliary structure, the tape, is used to transfer intermediary values required for gradient computation. The challenge is how to organize the tape in the memory hierarchy, since it has a high reuse distance, lacks temporal locality, and inflates the working set by 2–4×…
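
As background, a gradient tape can be sketched in a few lines; TapeFlow's contribution concerns how such a tape is streamed through the memory hierarchy, not this basic recording mechanism:

```python
# Minimal reverse-mode AD tape (background illustration only).

class Var:
    def __init__(self, value, tape):
        self.value, self.grad, self.tape = value, 0.0, tape
    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # record how to push gradients back through this multiply
        self.tape.append(lambda: (self._acc(out.grad * other.value),
                                  other._acc(out.grad * self.value)))
        return out
    def _acc(self, g):
        self.grad += g

tape = []
x, y = Var(3.0, tape), Var(4.0, tape)
z = x * y
z.grad = 1.0
for entry in reversed(tape):      # replay the tape backwards
    entry()
print(x.grad, y.grad)             # 4.0 3.0
```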

DOI: 10.1109/CGO57630.2024.10444805


A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

Authors: Jangda, Abhinav and Maleki, Saeed and Dehnavi, Maryam Mehri and Musuvathi, Madan and Saarikivi, Olli
Keywords: CUDA, GPU, generalized matrix multiplication, convolution, fine-grained synchronization, machine learning

Abstract

Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs) by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles is usually higher than the number of execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU. To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes.
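
A rough CPU-side analogue of tile-level synchronization can be written with threading primitives; the real cuSync operates on CUDA tiles, so this is only a sketch of the idea:

```python
# CPU-side sketch of tile-level synchronization: a consumer tile waits only on
# the producer tile it reads, not on the whole producer kernel.
import threading

NUM_TILES = 8
tile_done = [threading.Event() for _ in range(NUM_TILES)]
producer_out = [None] * NUM_TILES

def producer(t):
    producer_out[t] = t * t           # stand-in for one tile's worth of work
    tile_done[t].set()                # publish: this tile is ready

def consumer(t):
    tile_done[t].wait()               # fine-grained: wait for one tile only
    print('consumer tile', t, 'reads', producer_out[t])

threads = [threading.Thread(target=fn, args=(t,))
           for t in range(NUM_TILES) for fn in (consumer, producer)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```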

DOI: 10.1109/CGO57630.2024.10444873


Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs

Authors: Murtovi, Alnis and Georgakoudis, Giorgis and Parasyris, Konstantinos and Liao, Chunhua and Laguna, Ignacio and Steffen, Bernhard
Keywords: compiler, code duplication, LLVM, GPU

Abstract

Compilers use a wide range of advanced optimizations to improve the quality of the machine code they generate. In most cases, compiler optimizations rely on precise analyses to be able to perform the optimizations. However, whenever a control-flow merge is performed, information is lost as it is no longer possible to reason precisely about the program. One existing solution to this issue is code duplication, which involves duplicating instructions from merge blocks to their predecessors. This paper introduces a novel and more aggressive approach to code duplication, grounded in loop unrolling and control-flow unmerging, which enables subsequent optimizations that cannot be unlocked by applying only one of these transformations. We implemented our approach inside LLVM and evaluated its performance on a collection of GPU benchmarks in CUDA. Our results demonstrate that, even when faced with branch divergence, which complicates code duplication across multiple branches and increases the associated cost, our optimization technique achieves performance improvements of up to 81%.

DOI: 10.1109/CGO57630.2024.10444819


Retargeting and Respecializing GPU Workloads for Performance Portability

Authors: Ivanov, Ivan R. and Zinenko, Oleksandr and Domke, Jens and Endo, Toshio and Moses, William S.
Keywords: No keywords

Abstract

In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that understands the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower costs has led to a significant diversification of architecture designs, even from the same vendor. This creates the need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when a program can be seamlessly executed on a different architecture, it may suffer a performance penalty due to not being sized appropriately for the available hardware resources such as fast memory and registers, let alone not using newer advanced features of the architecture. We propose a new approach to improving the performance of (legacy) CUDA programs for modern machines by automatically adjusting the amount of work each parallel thread does, and the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are able to also target AMD GPUs by performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of target GPUs. Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 27% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.

DOI: 10.1109/CGO57630.2024.10444828


Seer: Predictive Runtime Kernel Selection for Irregular Problems

Authors: Swann, Ryan and Osama, Muhammad and Sangaiah, Karthik and Mahmud, Jalal
Keywords: GPU, kernel predictor, load balancing, sparse linear algebra

Abstract

Modern GPUs are designed for regular problems and suffer from load imbalance when processing irregular data. Prior to our work, a domain expert had to select the best kernel to map fine-grained irregular parallelism to a GPU. We instead propose Seer, an abstraction for producing a simple, reproducible, and understandable decision-tree selector model which performs runtime kernel selection for irregular workloads. To showcase our framework, we conduct a case study on Sparse Matrix Vector Multiplication (SpMV), in which Seer predicts the best strategy for a given dataset with an improvement of 2×…
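
A Seer-style selector can be prototyped with an off-the-shelf decision tree; the features and labels below are made up for illustration:

```python
# Sketch of a decision-tree kernel selector; features and labels are invented.
from sklearn.tree import DecisionTreeClassifier

# features per matrix: [rows, avg nnz per row, row-length variance]
X = [[1e6, 4, 1], [1e6, 200, 5], [1e4, 8, 900], [1e5, 300, 800]]
y = ['thread-per-row', 'vector-per-row', 'merge-based', 'vector-per-row']

selector = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(selector.predict([[5e5, 6, 2]])[0])   # cheap prediction at runtime
```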

DOI: 10.1109/CGO57630.2024.10444812


AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators

Authors: Agostini, Nicolas Bohm and Haris, Jude and Gibson, Perry and Jayaweera, Malith and Rubin, Norm and Tumeo, Antonino and Abellán, José L. et al.
Keywords: MLIR, AXI, compilers, code generation

Abstract

This paper addresses the need for automatic and efficient generation of host driver code for arbitrary custom AXI-based accelerators targeting linear algebra algorithms, an important workload in various applications, including machine learning and scientific computing. While existing tools have focused on automating accelerator prototyping, little attention has been paid to the host-accelerator interaction. This paper introduces AXI4MLIR, an extension of the MLIR compiler framework designed to facilitate the automated generation of host-accelerator driver code. With new MLIR attributes and transformations, AXI4MLIR empowers users to specify accelerator features (including their instructions) and communication patterns and to exploit the host memory hierarchy. We demonstrate AXI4MLIR’s versatility across different types of accelerators and problems, showcasing significant CPU cache reference reductions (up to 56%) and up to a 1.65×…

DOI: 10.1109/CGO57630.2024.10444801


Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code

Authors: Zhu, Mingzheng and Fu, Hao and Wu, Jun and Zhang, Chi and Xie, Wei and Li, Xiang-Yang
Keywords: surface code, compilation, execution time

Abstract

As the leading candidate among quantum error correction codes, surface code suffers from significant overhead, such as execution time. Reducing a circuit’s execution time not only enhances its execution efficiency but also improves fidelity. However, finding the shortest execution time is NP-hard. In this work, we study the surface code mapping and scheduling problem. To reduce the execution time of a quantum circuit, we first introduce two novel metrics, Circuit Parallelism Degree and Chip Communication Capacity, to quantitatively characterize quantum circuits and chips. Then, we propose a resource-adaptive mapping and scheduling method, named Ecmas, with customized initialization of chip resources for each circuit. Ecmas can dramatically reduce the execution time in both the double-defect and lattice-surgery models. Furthermore, we provide an additional version, Ecmas-ReSu, for settings with sufficient qubits, which is performance-guaranteed and more efficient. Extensive numerical tests on practical datasets show that Ecmas outperforms the state-of-the-art methods, reducing the execution time by 51.5% on average for the double-defect model. Ecmas reaches the optimal result in most benchmarks, reducing the execution time by up to 13.9% for the lattice-surgery model.

DOI: 10.1109/CGO57630.2024.10444874


PresCount: Effective Register Allocation for Bank Conflict Reduction

Authors: Guan, Xiaofeng and Zhou, Hao and Bao, Guoqing and Li, Handong and Zhu, Liang and Yao, Jianguo
Keywords: compiler, register allocation, bank conflict

Abstract

Modern processors with large multi-banked register files often rely on hardware solutions to resolve bank conflicts efficiently. However, these hardware-based methods, while flexible, can incur runtime penalties and restrict the exploration of optimized hardware designs. In contrast, compiler-based methods for register bank assignment avoid runtime overhead. However, incorporating bank assignment into the complex register allocation process presents significant challenges, leading existing methods to adopt conservative approaches to avoid potential side effects. This paper introduces the novel register allocation method PresCount, which enhances the coloring strategy for the Register Conflict Graph (RCG) and incorporates a bank pressure tracking mechanism to improve performance. The integrated register bank assigner in PresCount effectively reduces bank conflicts, achieving remarkable reductions of 43.28% and 27.76%, respectively, compared to existing methods on platforms with rich register banks and limited register budgets, as demonstrated on SPECfp and CNN-KERNEL benchmarks. Furthermore, a subgroup splitting technique is introduced to facilitate register allocation under the bank-subgroup register file design, specifically our Domain-Specific Architecture (DSA) for AI computing. This technique demonstrates an impressive 99.85% reduction in bank conflicts for domain-specific kernel functions. By addressing the challenges of bank conflicts in register allocation, the proposed PresCount method showcases significant improvements in performance and efficiency for platforms with different register configurations and domain-specific workloads, allowing for more flexible exploration of optimized hardware designs.
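
A toy version of bank-pressure tracking (far simpler than PresCount's RCG coloring) might greedily place each live range in the least-conflicting bank:

```python
# Toy bank-pressure heuristic: place each live range in the bank where it
# overlaps the fewest already-assigned registers.
NUM_BANKS = 4
live = {'a': (0, 5), 'b': (1, 3), 'c': (2, 6), 'd': (4, 8), 'e': (5, 9)}

def overlaps(r1, r2):
    return r1[0] < r2[1] and r2[0] < r1[1]

assignment = {}
for reg, rng in sorted(live.items(), key=lambda kv: kv[1]):
    pressure = [sum(1 for other, bank in assignment.items()
                    if bank == b and overlaps(rng, live[other]))
                for b in range(NUM_BANKS)]
    assignment[reg] = pressure.index(min(pressure))
print(assignment)   # {'a': 0, 'b': 1, 'c': 2, 'd': 1, 'e': 0}
```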

DOI: 10.1109/CGO57630.2024.10444841


Tackling the Matrix Multiplication Micro-kernel Generation with Exo

Authors: Castelló, Adrián et al.
Keywords: code generation, high performance, Exo, linear algebra, micro-kernels

Abstract

The optimization of matrix multiplication (GEMM) has been a pressing need for decades. This operation is considered the flagship of current linear algebra libraries such as BLIS, OpenBLAS, or Intel oneAPI because of its widespread use in a large variety of scientific applications. GEMM is usually implemented following the GotoBLAS philosophy, which tiles the GEMM operands and uses a series of nested loops for performance improvement. These approaches extract the maximum computational power of the architectures through small pieces of hardware-oriented, high-performance code called micro-kernels. However, this approach forces developers to generate, with non-negligible effort, a dedicated micro-kernel for each new hardware target. In this work, we present a step-by-step procedure for generating micro-kernels with the Exo compiler that perform close to (or even better than) manually developed micro-kernels written with intrinsic functions or assembly language. Our solution also improves the portability of the generated code, since a hardware target is fully specified by a concise library-based description of its instructions.
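
The GotoBLAS-style micro-kernel that Exo generates in optimized form has a simple loop-nest skeleton; a plain-Python rendition (illustrative only) looks like this:

```python
# Plain-Python skeleton of a GotoBLAS-style micro-kernel: an MR x NR block of
# C stays in an accumulator ("registers") while streaming the K dimension.
import numpy as np

def micro_kernel(A_panel, B_panel, C_block, MR, NR, K):
    acc = [[0.0] * NR for _ in range(MR)]      # register-resident accumulator
    for k in range(K):
        for i in range(MR):
            for j in range(NR):
                acc[i][j] += A_panel[i][k] * B_panel[k][j]
    for i in range(MR):
        for j in range(NR):
            C_block[i][j] += acc[i][j]

MR, NR, K = 4, 4, 64
A = np.random.rand(MR, K); B = np.random.rand(K, NR); C = np.zeros((MR, NR))
micro_kernel(A, B, C, MR, NR, K)
print(np.allclose(C, A @ B))                   # True
```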

DOI: 10.1109/CGO57630.2024.10444883


One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution

Authors: Cicolini, Luisa and Carloni, Filippo and Santambrogio, Marco D. and Conficconi, Davide
Keywords: pattern matching, multi-level compilation, regular expressions, automata merging, parallel execution

Abstract

Regular Expression (RE) matching is crucial to identify strings exhibiting certain morphological properties in a data stream, which makes it paramount in contexts such as deep packet inspection in computer security and genome analysis in bioinformatics. Yet, due to their intrinsic data-dependence characteristics, REs represent a complex computational kernel, and numerous solutions investigate pattern-matching efficiency in different directions. However, most of them lack a comprehensive ruleset optimization approach to truly push pattern-matching performance when considering multiple REs together. Exploiting the morphological similarities of REs within the same dataset reduces the memory needed to store the patterns and drastically improves dataset-matching throughput. Based on this observation, we propose the Multi-RE Finite State Automaton (MFSA), which extends the Finite State Automaton (FSA) model to improve RE parallelization by leveraging similarities within a specific application ruleset. We design a multi-level compilation framework to manage RE merging and optimization to produce MFSA(s). Furthermore, we extend the iNFAnt algorithm for MFSA execution with the novel iMFAnt engine. Our evaluation investigates the impact of MFSA size reduction and the execution throughput compared with that of multiple FSAs in both single- and multi-threaded configurations. This approach shows an average 71.95% compression in terms of states, introducing limited compilation time overhead. Besides, the best iMFAnt achieves a geomean 5.99×…

DOI: 10.1109/CGO57630.2024.10444810


Whose Baseline Compiler Is It Anyway?

Authors: Titzer, Ben L.
Keywords: compilers, JITs, single-pass, baseline, compilation time, tradeoff, instrumentation, webassembly

Abstract

Compilers face an intrinsic tradeoff between compilation speed and code quality. The tradeoff is particularly stark in a dynamic setting where JIT compilation time contributes to application runtime. Many systems now employ multiple compilation tiers, where one tier offers fast compile speed while another has much slower compile speed but produces higher quality code. With proper heuristics on when to use each, the overall performance is better than using either compiler in isolation. At the introduction of WebAssembly into the Web platform in 2017, most engines employed optimizing compilers and pre-compiled entire modules before execution. Yet since that time, all Web engines have introduced new “baseline” compiler tiers for Wasm to improve startup time. Further, many new non-web engines have appeared, some of which also employ simple compilers. In this paper, we demystify single-pass compilers for Wasm, explaining their internal algorithms and tradeoffs, as well as providing a detailed empirical study of those employed in production. We show the design of a new single-pass compiler for a research Wasm engine that integrates with an in-place interpreter and host garbage collector using value tags, while also supporting flexible instrumentation. In experiments, we measure the effectiveness of optimizations targeting value tags and find, somewhat surprisingly, that the runtime overhead can be reduced to near zero. We also assess the relative compile speed and execution time of six baseline compilers and place these baseline compilers in a two-dimensional tradeoff space with other execution tiers for Wasm.

DOI: 10.1109/CGO57630.2024.10444855


Enabling Fine-Grained Incremental Builds by Making Compiler Stateful

Authors: Han, Ruobing and Zhao, Jisheng and Kim, Hyesoon
Keywords: No keywords

Abstract

Incremental builds are commonly employed in software development, involving minor changes to existing source code that is then frequently recompiled. Speeding up incremental builds not only enhances the software development workflow but also improves CI/CD systems by enabling faster verification steps. Current solutions for incremental builds primarily rely on build systems that analyze file dependencies to avoid unnecessary recompilation of unchanged files. However, for the files that do undergo changes, these build systems simply invoke compilers to recompile them from scratch. This approach reveals a fundamental asymmetry in the system: while build systems operate in a stateful manner, compilers are stateless. As a result, incremental builds are applied only at a coarse-grained level, focusing on entire source files, rather than at a more fine-grained level that considers individual code sections. In this paper, we propose an innovative approach for enabling fine-grained incremental builds by introducing statefulness into compilers. Under this paradigm, the compiler leverages its profiling history to expedite the compilation process of modified source files, thereby reducing overall build time. Specifically, the stateful compiler retains dormant information about compiler passes executed in previous builds and uses this data to bypass dormant passes during subsequent incremental compilations. We also outline the essential changes needed to transform conventional stateless compilers into stateful ones. For practical evaluation, we modify the Clang compiler to adopt a stateful architecture and evaluate its performance on real-world C++ projects. Our comparative study indicates that the stateful version outperforms the standard Clang compiler in incremental builds, accelerating the end-to-end build process by an average of 6.72%.
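
The dormant-pass idea can be sketched in a few lines: key each function body by a hash, remember which passes did nothing, and skip them on the next recompilation. Pass names and behaviour below are invented:

```python
# Invented miniature of the dormant-pass idea: remember, per source hash,
# which passes changed nothing last build, and skip them next time.
import hashlib

dormant_history = {}   # a real stateful compiler would persist this on disk

def run_pass(name, ir):
    new_ir = ir.replace('x + 0', 'x') if name == 'simplify' else ir
    return new_ir, new_ir != ir            # (result, was the pass active?)

def compile_fn(ir, passes=('simplify', 'inline', 'dce')):
    key = hashlib.sha256(ir.encode()).hexdigest()
    skip, active = dormant_history.get(key, set()), set()
    for p in passes:
        if p in skip:
            continue                       # bypass passes known to be dormant
        ir, changed = run_pass(p, ir)
        if changed:
            active.add(p)
    dormant_history[key] = set(passes) - active
    return ir

print(compile_fn('ret x + 0'))   # full build: every pass runs
print(compile_fn('ret x + 0'))   # incremental: 'inline' and 'dce' skipped
```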

DOI: 10.1109/CGO57630.2024.10444865


Compile-Time Analysis of Compiler Frameworks for Query Compilation

Authors: Engelke, Alexis and Schwarz, Tobias
Keywords: fast compilation, LLVM, JIT compilation, query compilation

Abstract

Low compilation times are highly important in the context of just-in-time (JIT) compilation. This applies not only to language runtimes for Java, WebAssembly, or JavaScript, but is also crucial for database systems that employ query compilation as the primary measure for achieving high throughput in combination with low query execution time. We present a performance comparison and detailed analysis of the compile times of the JIT compilation back-ends provided by GCC, LLVM, Cranelift, and a single-pass compiler in the context of database queries. Our results show that LLVM achieves the highest execution performance, but can compile substantially faster when tuning for low compilation time. Cranelift achieves run-time performance similar to unoptimized LLVM, but compiles just 20–35% faster and is outperformed by the single-pass compiler, which compiles code 16x faster than Cranelift at similar execution performance.

DOI: 10.1109/CGO57630.2024.10444856


DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications

Authors: Cui, Jinku and Zhao, Qidong and Hao, Yueming and Liu, Xu
Keywords: No keywords

Abstract

Python has become an increasingly popular programming language, especially in the areas of data analytics and machine learning. Many modern Python packages employ a multi-layer design: the Python layer manages various packages and expresses high-level algorithms; the native layer is written in C/C++/Fortran/CUDA for efficient computation. Typically, each layer manages its own computation and memory and exposes APIs for cross-layer interactions. Without holistic optimization, performance inefficiencies can exist at the boundary between layers. In this paper, we develop DrPy, a novel profiler that pinpoints such memory inefficiencies across layers in Python applications. Unlike existing tools, DrPy takes a hybrid and fine-grained approach to track memory objects and their usage in both the Python and native layers. DrPy correlates the behavior of memory objects across layers and builds an object flow graph to pinpoint memory inefficiencies. In addition, DrPy captures rich information associated with object flow graphs, such as call paths and source code attribution, to guide intuitive code optimization. Guided by DrPy, we are able to optimize many Python applications with non-trivial performance improvements. Many optimization patches have been validated by application developers and committed to application repositories.

DOI: 10.1109/CGO57630.2024.10444862


SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems

Authors: Reymond, Hugo and Béchennec, Jean-Luc et al.
Keywords: embedded systems, intermittent computing, memory management, checkpointing

Abstract

Battery-free devices enable sensing in hard-to-access locations, opening up new opportunities in various fields such as healthcare, space, or civil engineering. Such devices harvest ambient energy and store it in a capacitor. Due to the unpredictable nature of the harvested energy, a power failure can occur at any time, resulting in a loss of all non-persistent information (e.g., processor registers, data stored in volatile memory). Checkpointing volatile data in non-volatile memory allows the system to recover after a power failure, but raises two issues: (i) spatial and temporal placement of checkpoints; (ii) memory allocation of variables between volatile and non-volatile memory, with the overall objective of using energy as efficiently as possible. While many techniques rely on the developer to address these issues, we present Schematic, a compiler technique that automates checkpoint placement and memory allocation to minimize overall energy consumption. Schematic ensures that programs will eventually terminate (the forward-progress property). Moreover, checkpoint placement and memory allocation adapt to the size of the energy buffer and the capacity of volatile memory. Schematic takes advantage of volatile memory (VM) to reduce the energy consumed by automatically placing the most-used variables in VM. We tested Schematic in different experimental settings (size of volatile memory and capacitor) and results show an average energy reduction of 51% compared to related techniques.
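
The placement problem can be illustrated with a greedy sketch over a straight-line sequence of blocks with assumed energy costs; Schematic's actual compile-time analysis is far more involved:

```python
# Greedy sketch with assumed energy costs: checkpoint before any block that
# would exhaust the capacitor, so a power failure can never undo more work
# than one recharge can redo (forward progress).
CAPACITOR_BUDGET = 10.0     # energy available after a full recharge
CHECKPOINT_COST = 2.0       # energy to persist volatile state

blocks = [('init', 3.0), ('sense', 4.0), ('fft', 5.0), ('tx', 3.0)]

spent, placement = CHECKPOINT_COST, []   # reserve energy for the next ckpt
for name, cost in blocks:
    if spent + cost + CHECKPOINT_COST > CAPACITOR_BUDGET:
        placement.append(name)           # checkpoint *before* this block
        spent = CHECKPOINT_COST + cost
    else:
        spent += cost
print(placement)                         # ['sense', 'fft', 'tx']
```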

DOI: 10.1109/CGO57630.2024.10444789


Latent Idiom Recognition for a Minimalist Functional Array Language Using Equality Saturation

Authors: Van der Cruysse, Jonathan and Dubach, Christophe
Keywords: equality saturation, functional programming, array programming, pattern matching, libraries

Abstract

Accelerating programs is typically done by recognizing code idioms matching high-performance libraries or hardware interfaces. However, recognizing such idioms automatically is challenging. The idiom recognition machinery is difficult to write and requires expert knowledge. In addition, slight variations in the input program might hide the idiom and defeat the recognizer. This paper advocates for the use of a minimalist functional array language supporting a small, but expressive, set of operators. The minimalist design leads to a tiny set of rewrite rules, which encode the language semantics. Crucially, the same minimalist language is also used to encode idioms. This removes the need for hand-crafted analysis passes, or for having to learn a complex domain-specific language to define the idioms. Coupled with equality saturation, this approach is able to match the core functions from the BLAS and PyTorch libraries on a set of computational kernels. Compared to reference C kernel implementations, the approach produces a geometric mean speedup of 1.46×…

DOI: 10.1109/CGO57630.2024.10444879


BEC: Bit-Level Static Analysis for Reliability against Soft Errors

Authors: Ko, Yousun and Burgstaller, Bernd
Keywords: static analysis, abstract interpretation, soft errors, fault injection pruning, instruction scheduling, LLVM, RISC-V

Abstract

Soft errors are a type of transient digital signal corruption that occurs in digital hardware components such as the internal flip-flops of CPU pipelines, the register file, memory cells, and even internal communication buses. Soft errors are caused by environmental radioactivity, magnetic interference, lasers, and temperature fluctuations, either unintentionally or as part of a deliberate attempt to compromise a system and expose confidential data. We propose a bit-level error coalescing (BEC) static program analysis and its two use cases to understand and improve program reliability against soft errors. The BEC analysis tracks each bit corruption in the register file and classifies the effect of the corruption by its semantics at compile time. The usefulness of the proposed analysis is demonstrated in two scenarios: fault injection campaign pruning, and reliability-aware program transformation. Experimental results show that bit-level analysis pruned up to 30.04% of exhaustive fault injection campaigns (13.71% on average) without loss of accuracy. Program vulnerability was reduced by up to 13.11% (4.94% on average) through bit-level vulnerability-aware instruction scheduling. The analysis has been implemented within LLVM and evaluated on the RISC-V architecture. To the best of our knowledge, the proposed BEC analysis is the first bit-level compiler analysis for program reliability against soft errors. The proposed method is generic and not limited to a specific computer architecture.
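
A tiny example of the bit-level classification idea: bits that a subsequent and-mask discards cannot affect the outcome, so fault injections into them can be pruned. This is my simplification, not the BEC analysis itself:

```python
# Simplified bit-level classification: bits removed by an and-mask are dead,
# so fault-injection experiments targeting them can be pruned.
def masked_out_bits(and_mask: int, width: int = 32):
    """Bit positions whose corruption is squashed by `reg & and_mask`."""
    return [b for b in range(width) if not (and_mask >> b) & 1]

# r2 = r1 & 0xFF  ->  only the low 8 bits of r1 can matter afterwards
dead = masked_out_bits(0xFF)
print(len(dead), 'of 32 bit-flip experiments can be pruned')   # 24 of 32
```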

DOI: 10.1109/CGO57630.2024.10444844


Boosting the Performance of Multi-Solver IFDS Algorithms with Flow-Sensitivity Optimizations

Authors: Li, Haofeng and Lu, Jie and Meng, Haining and Cao, Liqing and Li, Lian and Gao, Lin
Keywords: multi-solver IFDS

Abstract

IFDS (Inter-procedural, Finite, Distributive, Subset) algorithms are widely used to solve a broad range of analysis problems. In particular, many interesting problems are formulated as multi-solver IFDS problems, which expect multiple interleaved IFDS solvers to work together. For instance, taint analysis requires two IFDS solvers: a forward solver to propagate tainted data-flow facts, and a backward solver to resolve alias relations at the same time. For such problems, a large number of additional data-flow facts needs to be introduced for flow-sensitivity. This often leads to poor performance and scalability, as evident in our experiments and previous work. In this paper, we propose a novel approach to reduce the number of introduced additional data-flow facts while preserving flow-sensitivity and soundness. We have developed a new taint analysis tool, SADroid, and evaluated it on 1,228 open-source Android apps. Evaluation results show that SADroid significantly outperforms FlowDroid (the state-of-the-art multi-solver IFDS taint analysis tool) without affecting precision or soundness: run-time performance is sped up by up to 17.89X and memory usage is reduced by up to 9X.

DOI: 10.1109/CGO57630.2024.10444884


Representing Data Collections in an SSA Form

Authors: McMichen, Tommy and Greiner, Nathan and Zhong, Peter and Sossai, Federico and Patel, Atmn and Campanoni, Simone
Keywords: compilers, intermediate representation, optimization

Abstract

Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., via alias analysis, and is further hampered when attempting heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MemOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MemOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MemOIR compilers can perform element-level analysis on data collections, enabling static analysis of the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MemOIR compiler performs field elision and dead field elimination, reducing the peak memory usage of mcf by 20.8%.
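
The SSA-for-collections flavor can be imitated with immutable versions, where every mutation defines a new collection value; this is an illustration of the concept, not MemOIR's IR:

```python
# Illustration of SSA-style collection versions (not MemOIR's actual IR):
# each mutation defines a new value, so the collection's state is known
# at every program point.
from dataclasses import dataclass

@dataclass(frozen=True)
class Seq:
    items: tuple = ()
    def insert(self, i, v):
        # defines a *new* version instead of mutating in place
        return Seq(self.items[:i] + (v,) + self.items[i:])

s0 = Seq()
s1 = s0.insert(0, 'a')    # s1 = insert(s0, 0, 'a')
s2 = s1.insert(1, 'b')    # s2 = insert(s1, 1, 'b')
print(s0.items, s1.items, s2.items)   # every version stays analyzable
```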

DOI: 10.1109/CGO57630.2024.10444817


Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation

Authors: He, Wenlei and Yu, Hongtao and Wang, Lei and Oh, Taewook
Keywords: profile guided optimization, feedback directed optimization, sampling, instrumentation, context-sensitive profiling, compiler

Abstract

The ever-increasing scale of modern data centers demands more effective optimizations, as even a small percentage of performance improvement can result in a significant reduction in data-center cost and environmental footprint. However, the diverse set of workloads running in data centers also challenges the scalability of optimization solutions. Profile-guided optimization (PGO) is a promising technique for improving application performance. Sampling-based PGO is widely used in data-center applications due to its low operational overhead, but its performance gains are not as substantial as those of its instrumentation-based counterpart. The high operational overhead of instrumentation-based PGO, on the other hand, hinders its large-scale adoption, despite its superior performance gains. In this paper, we propose CSSPGO, a context-sensitive sampling-based PGO framework with pseudo-instrumentation. CSSPGO offers a more balanced solution that pushes sampling-based PGO performance closer to instrumentation-based PGO while maintaining minimal operational overhead. It leverages pseudo-instrumentation to improve profile quality without incurring the overhead of traditional instrumentation. It also enriches the profile with context sensitivity to aid more effective optimizations through a novel profiling methodology using synchronized LBR and stack sampling. CSSPGO is now used to optimize over 75% of Meta’s data-center CPU cycles. Our evaluation with production workloads demonstrates 1%-5% performance improvement on top of state-of-the-art sampling-based PGO.

DOI: 10.1109/CGO57630.2024.10444807


Compiler Testing with Relaxed Memory Models

Authors: Geeson, Luke and Smith, Lee
Keywords: No keywords

Abstract

Finding bugs is key to the correctness of compilers in wide use today. If the behaviour of a compiled program, as allowed by its architecture memory model, is not a behaviour of the source program under its source model, then there is a bug. This holds for all programs, but we focus on concurrency bugs that occur only with two or more threads of execution. We focus on testing techniques that detect such bugs in C/C++ compilers. We seek a testing technique that automatically covers concurrency bugs up to fixed bounds on program sizes and that scales to find bugs in compiled programs with many lines of code. Otherwise, a testing technique can miss bugs. Unfortunately, the state-of-the-art techniques are yet to satisfy all of these properties. We present Téléchat…

DOI: 10.1109/CGO57630.2024.10444836


High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM

Authors: Fan, Yuyou and Regehr, John
Keywords: No keywords

Abstract

It is very difficult to thoroughly test a compiler, and as a consequence it is common for released versions of production compilers to contain bugs that cause them to crash and to emit incorrect object code. We created alive-mutate, a mutation-based fuzzing tool that takes test cases written by humans and randomly modifies them, based on the hypothesis that while compiler developers are fundamentally good at writing tests, they also tend to miss corner cases. Alive-mutate is integrated with the Alive2 translation validation tool for LLVM, which is useful because it checks the behavior of optimizations for all possible values of input variables. Alive-mutate is also integrated with the LLVM middle-end, allowing it to perform mutations, optimizations, and formal verification of the optimizations all within a single program—avoiding numerous sources of overhead. Alive-mutate’s fuzzing throughput is 12x higher, on average, than a fuzzing workflow that runs mutation, optimization, and formal verification in separate processes. So far we have used alive-mutate to find and report 33 previously unknown bugs in LLVM.

DOI: 10.1109/CGO57630.2024.10444854


EasyTracker: A Python Library for Controlling and Inspecting Program Execution

Authors: Barollet, Théo et al.
Keywords: teaching, visualization, debug, instrumentation

Abstract

Learning to program involves building a mental representation of how a machine executes instructions and stores data in memory. To help students, teachers often use visual representations to illustrate the execution of programs or particular concepts in their lectures. As a famous example, teachers often represent references/pointers with arrows pointing to objects or memory locations. While these visual representations are mostly hand-drawn, there is a tendency to supplement them with tools. However, building such a tool from scratch requires considerable effort and deep debugging expertise, while existing tools are difficult to adapt to different contexts. This article presents EasyTracker, a Python library targeting teachers who are not debugging experts. By providing ways of controlling the execution and inspecting the state of programs, EasyTracker simplifies the development of tools that generate tuned visual representations from the controlled execution of a program. The controlled program can be written either in Python, C, or assembly languages. To ease the development of visualization tools working for programs in different languages and to allow the building of web-based tools, EasyTracker provides a language-agnostic and serializable representation of the state of a running program.

DOI: 10.1109/CGO57630.2024.10444823


OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis

Authors: Guo, Yuxin and Chadwick, Alex W. and Erdős, Márton et al.
Keywords: No keywords

Abstract

Despite decades of improvement in compiler technology, it remains necessary to profile applications to improve performance. Existing profiling tools typically either sample hardware performance counters or instrument the program with extra instructions to analyze its execution. Both techniques are valuable, with different strengths and weaknesses, but do not always correctly identify optimization opportunities. We present OptiWISE, a profiling tool that runs the program twice: once with low-overhead sampling to accurately measure performance, and once with instrumentation to accurately capture control flow and execution counts. OptiWISE then combines this information to give a highly detailed per-instruction CPI metric by computing the ratio of samples to execution counts, as well as aggregated information such as costs per loop, source-code line, or function. We evaluate OptiWISE to show it has an overhead of 8.1×…
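
The core arithmetic of the combined metric is simple: divide cycle samples from the sampling run by execution counts from the instrumented run. With made-up numbers:

```python
# OptiWISE's combining step in miniature: CPI = cycle samples / execution
# counts, per instruction address. All numbers are invented.
cycles = {'0x400a10': 4_000_000, '0x400a14': 900_000}    # from sampling run
counts = {'0x400a10': 1_000_000, '0x400a14': 1_000_000}  # from instrumented run

for pc in cycles:
    print(f'{pc}: CPI ~ {cycles[pc] / counts[pc]:.2f}')
# 0x400a10: CPI ~ 4.00  <- likely stalling (e.g., a cache-missing load)
# 0x400a14: CPI ~ 0.90
```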

DOI: 10.1109/CGO57630.2024.10444771


EasyView: Bringing Performance Profiles into Integrated Development Environments

Authors: Zhao, Qidong and Chabbi, Milind and Liu, Xu
Keywords: profiling, software optimization, performance measurement, visualization, tools

Abstract

Dynamic program performance analysis (also known as profiling) is well known for its powerful capabilities in identifying performance inefficiencies in software packages. Although a large number of profiling techniques have been developed in academia and industry, very few of them are widely used by software developers in their regular development activities. There are three major reasons. First, profiling tools (also known as profilers) are disjoint from coding environments such as IDEs and editors; frequently switching focus between them significantly complicates the software development cycle. Second, mastering various tools to interpret their analysis results requires substantial effort; even worse, many tools have their own graphical user interface (GUI) designs for data presentation, which steepens the learning curve. Third, most existing profilers expose few interfaces to support user-defined analysis, which makes the tools less customizable to diverse user demands. We develop EasyView, a general solution that integrates the interpretation and visualization of various profiling results into coding environments, bringing software developers closer to profilers during the code development cycle. The novelty of EasyView lies in its significant improvement of profiler usability. EasyView not only provides deep insights to support intuitive analysis and optimization in a simple interface, but also enhances user experience in using profilers effectively and efficiently in IDEs. Our evaluation shows that EasyView is able to support various profilers for different languages and provide unique insights into performance inefficiencies in different domains. Our user studies show that EasyView can largely improve the usability of profilers in software development cycles by facilitating performance debugging efforts.

DOI: 10.1109/CGO57630.2024.10444840


Experiences Building an MLIR-Based SYCL Compiler

Authors: Tiotto, Ettore and Pérez, Víctor et al.
Keywords: SYCL, MLIR, compiler, optimization, heterogeneous programming

Abstract

Similar to other programming models, compilers for SYCL, the open programming model for heterogeneous computing based on C++, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations, and the inability to reason about host and device code simultaneously, present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, makes it possible to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation and how these could be addressed in the future.

DOI: 10.1109/CGO57630.2024.10444866


Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach

Authors: Saioc, Georgian-Vlad and Shirchenko, Dmitriy and Chabbi, Milind
Keywords: go, channel, memory leak, concurrency, goroutine, leakprof, message-passing

Abstract

Go is a modern programming language gaining popularity in enterprise microservice systems. Concurrency is a first-class citizen in Go, with lightweight “goroutines” as the building blocks of concurrent execution. Go advocates message passing to communicate and synchronize among goroutines. Improper use of message passing in Go can result in a “partial deadlock” (interchangeably called a goroutine leak), a subtle concurrency bug where a blocked sender (receiver) never finds a corresponding receiver (sender), causing the blocked goroutine to leak memory via its call stack and the objects reachable from that stack. In this paper, we systematically study the prevalence of message passing and the resulting partial deadlocks in ≈75 million lines of Uber’s Go monorepo hosting ≈2500 microservices. We develop two lightweight, dynamic analysis tools, Goleak and LeakProf, designed to identify partial deadlocks. Goleak detects partial deadlocks during unit testing and prevents the introduction of new bugs. Conversely, LeakProf uses goroutine profiles obtained from services deployed in production to pinpoint intricate bugs arising from complex control flow, unexplored interleavings, or the absence of test coverage. We share our experience and insights deploying these tools in developer workflows in a large industrial setting. Using Goleak we unearthed 857 pre-existing goroutine leaks in legacy code and prevented the introduction of ≈260 new leaks over a one-year period. Using LeakProf we found 24 and fixed 21 goroutine leaks, which resulted in up to a 34% speedup and 9.2×…

DOI: 10.1109/CGO57630.2024.10444835


A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules

Authors: Jiang, Jinhu and Liang, Chaoyi and Dong, Rongchao and Yang, Zhaohui and Zhou, Zhongjun and Wang, Wenwen and Yew, Pen-Chung and Zhang, Weihua
Keywords: No keywords

Abstract

System-level emulators have been used extensively for the design, debugging, and evaluation of system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or a different native OS, using the same or a different instruction-set architecture. For such system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has been shown to improve DBT performance significantly, with much higher quality translated code. However, it has only been used for user-level emulation, not system-level emulation. In applying this approach directly to QEMU for system-level emulation, we find it actually causes an unexpected performance degradation of 5% on average. By analyzing its main culprits in more detail, we find that the learning-based approach will by default use host registers to maintain the guest CPU states, which include condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system states in the host registers for the guest, the host, and QEMU during and between context switches can cause undue overheads if not handled carefully. Such cases include emulating system-level instructions, address translation, and interrupts, which require the use of QEMU’s helper functions. To achieve the intended performance improvement through the better-quality code generated by the learning-based approach, we propose several optimization techniques: reducing the overhead incurred in each context switch, reducing the number of needed context switches, and better code scheduling to eliminate context switches. Our experimental results show that such optimizations can achieve an average speedup of 1.36X over QEMU 6.1 on SPEC CINT2006 and 1.15X on real-world applications in system emulation mode.

DOI: 10.1109/CGO57630.2024.10444850


Instruction Scheduling for the GPU on the GPU

Authors: Shobaki, Ghassan and Muyan-Özçelik, Pınar et al.
Keywords: parallel compiler optimization, instruction scheduling, ant colony optimization (ACO), GPU computing, multi-objective optimization

Abstract

In this paper, we show how to use the GPU to parallelize a precise instruction scheduling algorithm that is based on Ant Colony Optimization (ACO). ACO is a nature-inspired intelligent-search technique that has been used to compute precise solutions to NP-hard problems in operations research (OR). Such intelligent-search techniques were not used in the past to solve NP-hard compiler optimization problems, because they require substantially more computation than the heuristic techniques used in production compilers. In this work, we show that parallelizing such a compute-intensive technique on the GPU makes using it in compilation reasonably practical. The register-pressure-aware instruction scheduling problem addressed in this work is a multi-objective optimization problem that is significantly more complex than the problems that were previously solved using parallel ACO on the GPU. We describe a number of techniques that we have developed to efficiently parallelize an ACO algorithm for solving this multi-objective optimization problem on the GPU. The target processor is also a GPU. Our experimental evaluation shows that parallel ACO-based scheduling on the GPU runs up to 27 times faster than sequential ACO-based scheduling on the CPU, and this leads to reducing the total compile time of the rocPRIM benchmarks by 21%. ACO-based scheduling improves the execution-speed of the compiled benchmarks by up to 74% relative to AMD’s production scheduler. To the best of our knowledge, our work is the first successful attempt to parallelize a compiler optimization algorithm on the GPU.
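
For intuition, a minimal (CPU-side, sequential) ACO list scheduler over a four-instruction DAG might look as follows; the paper's contribution is running this kind of search efficiently on the GPU for a much harder multi-objective problem:

```python
# Minimal sequential ACO list scheduler over a toy DAG (illustration only).
import random

deps = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}
latency = {'a': 1, 'b': 3, 'c': 1, 'd': 1}
pher = {(p, i): 1.0 for p in list(deps) + ['start'] for i in deps}

def construct():
    done, order, prev = set(), [], 'start'
    while len(order) < len(deps):
        ready = [i for i in deps
                 if i not in done and all(d in done for d in deps[i])]
        pick = random.choices(ready, [pher[(prev, i)] for i in ready])[0]
        done.add(pick); order.append(pick); prev = pick
    return order

def makespan(order):              # single-issue, in-order cost model
    finish, t = {}, 0
    for i in order:
        start = max([t] + [finish[d] for d in deps[i]])
        finish[i] = start + latency[i]
        t = start + 1
    return max(finish.values())

best = None
for _ in range(200):              # each iteration is one "ant"
    order = construct()
    span = makespan(order)
    if best is None or span < best[0]:
        best = (span, order)
    prev = 'start'
    for i in order:               # deposit pheromone, weighted by quality
        pher[(prev, i)] += 1.0 / span
        prev = i
print(best)                       # e.g. (5, ['a', 'b', 'c', 'd'])
```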

DOI: 10.1109/CGO57630.2024.10444869


JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication

Authors: Fu, Qiang and Rolinger, Thomas B. and Huang, H. Howie
Keywords: SpMM, just-in-time instruction generation, performance profiling, performance optimization

Abstract

Achieving high performance for Sparse Matrix-Matrix Multiplication (SpMM) has received increasing research attention, especially on multi-core CPUs, due to the large input data sizes in applications such as graph neural networks (GNNs). Most existing solutions for SpMM computation follow the ahead-of-time (AOT) compilation approach, which compiles a program entirely before it is executed. AOT compilation for SpMM faces three key limitations: unnecessary memory access, additional branch overhead, and redundant instructions. These limitations stem from the fact that crucial information pertaining to SpMM is not known until runtime. In this paper, we propose JitSpMM, a just-in-time (JIT) assembly code generation framework to accelerate SpMM computation on multi-core CPUs with SIMD extensions. First, JitSpMM integrates the JIT assembly code generation technique into three widely-used workload division methods for SpMM to achieve balanced workload distribution among CPU threads. Next, with the availability of runtime information, JitSpMM employs a novel technique, coarse-grain column merging, to maximize instruction-level parallelism by unrolling the performance-critical loop. Furthermore, JitSpMM intelligently allocates registers to cache frequently accessed data to minimize memory accesses, and employs selected SIMD instructions to enhance arithmetic throughput. We conduct a performance evaluation of JitSpMM and compare it with two AOT baselines. The first involves existing SpMM implementations compiled using the Intel icc compiler with auto-vectorization. The second utilizes the highly-optimized SpMM routine provided by Intel MKL. Our results show that JitSpMM provides an average improvement of 3.8×…
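
One of the runtime-specialized workload divisions, row splitting over CSR, is easy to show in NumPy/SciPy form (a reference sketch, not JitSpMM's generated assembly):

```python
# Reference sketch of CSR row-splitting for SpMM with NumPy/SciPy.
import numpy as np
from scipy.sparse import random as sprand

A = sprand(1000, 1000, density=0.01, format='csr')
B = np.random.rand(1000, 16)
C = np.zeros((1000, 16))

num_threads = 4
bounds = np.linspace(0, A.shape[0], num_threads + 1, dtype=int)
for t in range(num_threads):              # each chunk would go to one thread
    for row in range(bounds[t], bounds[t + 1]):
        for idx in range(A.indptr[row], A.indptr[row + 1]):
            C[row] += A.data[idx] * B[A.indices[idx]]

print(np.allclose(C, A @ B))              # True
```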

DOI: 10.1109/CGO57630.2024.10444827
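
The JIT idea is easiest to see in miniature: once the sparsity pattern is known at runtime, the inner loop over a row's nonzeros can be emitted as straight-line code with column indices and values baked in, which removes exactly the kind of branch and redundant-instruction overhead the abstract attributes to AOT compilation. The sketch below is a toy Python analogue that generates Python source rather than x86 assembly; `jit_spmm` and the tiny CSR example are hypothetical, and the paper's coarse-grain column merging, register allocation, and SIMD selection are not modeled.

```python
import numpy as np

def jit_spmm(indptr, indices, data):
    """Emit a kernel specialized to one CSR sparsity pattern: a toy
    stand-in for JitSpMM's runtime assembly generation."""
    lines = ["def kernel(B, C):"]
    for row in range(len(indptr) - 1):
        lo, hi = indptr[row], indptr[row + 1]
        if lo == hi:
            continue                  # empty row: emit nothing at all
        # Fully unrolled inner loop: indices and values are constants in
        # the generated code, so no branch on the nonzero count remains.
        terms = " + ".join(f"{data[k]!r} * B[{indices[k]}]"
                           for k in range(lo, hi))
        lines.append(f"    C[{row}] = {terms}")
    ns = {}
    exec("\n".join(lines), ns)        # compile the specialized kernel
    return ns["kernel"]

# CSR form of the sparse matrix A = [[2, 0], [1, 3]]
indptr, indices, data = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]
B = np.array([[1.0, 2.0], [4.0, 5.0]])   # dense right-hand-side matrix
C = np.zeros((2, 2))
jit_spmm(indptr, indices, data)(B, C)    # C = A @ B, row by row
print(C)                                 # [[ 2.  4.] [13. 17.]]
```

The generated `kernel` contains one branch-free statement per nonempty row; JitSpMM performs the analogous specialization at the assembly level, where it can additionally pin hot values in registers and choose SIMD instructions.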


oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation

作者: Li, Jianhui and Qin, Zhennan and Mei, Yijie and Cui, Jingze and Song, Yunfei and Chen, Ciyong and Zhang, Yifei and Du, Longsheng and Cheng, Xianhang and Jin, Baihui and Zhang, Yan and Ye, Jason and Lin, Eric and Lavery, Dan
关键词: deep learning compiler, code generation and optimization, high-performance library

Abstract

With the rapid development of deep learning models and hardware support for dense computing, deep learning (DL) workload characteristics have changed significantly, from a few hot spots on compute-intensive operations to a broad range of operations scattered across the models. Accelerating a few compute-intensive operations using expert-tuned implementations of primitives does not fully exploit the performance potential of AI hardware. Various efforts have been made to compile a full deep neural network (DNN) graph. One of the biggest challenges is to achieve high-performance tensor compilation by generating expert-level performance code for the dense compute-intensive operations and applying compilation optimizations at the scope of the DNN computation graph across multiple compute-intensive operations. We present oneDNN Graph Compiler, a tensor compiler that employs a hybrid approach, using techniques from both compiler optimization and expert-tuned kernels for high-performance code generation of the deep neural network graph. oneDNN Graph Compiler addresses unique optimization challenges in the deep learning domain, such as low-precision computation, aggressive fusion of graph operations, optimization for static tensor shapes and memory layout, constant weight optimization, and memory buffer reuse. Experimental results demonstrate significant performance gains over existing tensor compilers and primitive libraries for performance-critical DNN computation graphs and end-to-end models on Intel® Xeon® Scalable Processors.

DOI: 10.1109/CGO57630.2024.10444871
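
To make the hybrid approach concrete, the sketch below pairs the two ingredients in toy form: a graph rewrite that fuses a matmul followed by a relu into a single node (the compiler side), and a kernel table whose fused entry computes both operations in one pass without materializing the intermediate (the expert-tuned-kernel side). The IR, the `fuse_matmul_relu` pass, and the numpy "kernels" are all hypothetical illustrations, not oneDNN Graph Compiler's actual interfaces.

```python
import numpy as np

# Toy graph IR: node = (op, inputs); an input is an env name or the
# index of an earlier node. A hypothetical stand-in for the real IR.
graph = [("matmul", ("x", "w")), ("relu", (0,)), ("matmul", (1, "w2"))]

def fuse_matmul_relu(graph):
    """Rewrite matmul -> relu pairs into one fused node so a single
    expert-tuned kernel handles the pair (aggressive operator fusion)."""
    fused, remap, i = [], {}, 0
    while i < len(graph):
        op, ins = graph[i]
        ins = tuple(remap.get(x, x) if isinstance(x, int) else x for x in ins)
        if op == "matmul" and i + 1 < len(graph) and graph[i + 1] == ("relu", (i,)):
            fused.append(("matmul_relu", ins))
            remap[i] = remap[i + 1] = len(fused) - 1
            i += 2
        else:
            fused.append((op, ins))
            remap[i] = len(fused) - 1
            i += 1
    return fused

# Kernel table: numpy stand-ins for hand-optimized primitives.
kernels = {
    "matmul": lambda a, b: a @ b,
    "relu":   lambda a: np.maximum(a, 0.0),
    "matmul_relu": lambda a, b: np.maximum(a @ b, 0.0),  # fused: no temp buffer
}

def run(graph, env):
    """Interpret the graph by dispatching each node to its kernel."""
    vals = []
    for op, ins in graph:
        args = [env[x] if isinstance(x, str) else vals[x] for x in ins]
        vals.append(kernels[op](*args))
    return vals[-1]

env = {"x": np.random.randn(4, 8), "w": np.random.randn(8, 8),
       "w2": np.random.randn(8, 2)}
print(run(fuse_matmul_relu(graph), env).shape)   # (4, 2)
```

The fused graph computes the same result as the original three-node graph while dispatching two kernels instead of three and never allocating the matmul's intermediate buffer, a miniature version of the aggressive fusion and buffer reuse the abstract lists.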




目录
  • A Tensor Algebra Compiler for Sparse Differentiation
  • Energy-Aware Tile Size Selection for Affine Programs on GPUs
  • PolyTOPS: Reconfigurable and Flexible Polyhedral Scheduler
  • AskIt: Unified Programming Interface for Programming with Large Language Models
  • Revealing Compiler Heuristics through Automated Discovery and Optimization
  • SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
  • TapeFlow: Streaming Gradient Tapes in Automatic Differentiation
  • A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
  • Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs
  • Retargeting and Respecializing GPU Workloads for Performance Portability
  • Seer: Predictive Runtime Kernel Selection for Irregular Problems
  • AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
  • Ecmas: Efficient Circuit Mapping and Scheduling for Surface Code
  • PresCount: Effective Register Allocation for Bank Conflict Reduction
  • Tackling the Matrix Multiplication Micro-kernel Generation with Exo
  • One Automaton to Rule Them All: Beyond Multiple Regular Expressions Execution
  • Whose Baseline Compiler Is It Anyway?
  • Enabling Fine-Grained Incremental Builds by Making Compiler Stateful
  • Compile-Time Analysis of Compiler Frameworks for Query Compilation
  • DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications
  • SCHEMATIC: Compile-Time Checkpoint Placement and Memory Allocation for Intermittent Systems
  • Latent Idiom Recognition for a Minimalist Functional Array Language Using Equality Saturation
  • BEC: Bit-Level Static Analysis for Reliability against Soft Errors
  • Boosting the Performance of Multi-Solver IFDS Algorithms with Flow-Sensitivity Optimizations
  • Representing Data Collections in an SSA Form
  • Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation
  • Compiler Testing with Relaxed Memory Models
  • High-Throughput, Formal-Methods-Assisted Fuzzing for LLVM
  • EasyTracker: A Python Library for Controlling and Inspecting Program Execution
  • OptiWISE: Combining Sampling and Instrumentation for Granular CPI Analysis
  • EasyView: Bringing Performance Profiles into Integrated Development Environments
  • Experiences Building an MLIR-Based SYCL Compiler
  • Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach
  • A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
  • Instruction Scheduling for the GPU on the GPU
  • JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
  • oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation