CGO 2022 | 逸翎清晗🌈

A compiler framework for optimizing dynamic parallelism on GPUs

Authors: Olabi, Mhd Ghaith and Gómez-Luna, Juan, et al.
Keywords: none

Abstract

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization. To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child threads in the parent thread otherwise. Coarsening involves executing the work of multiple thread blocks by a single coarsened block to amortize the common work across them. Aggregation involves combining multiple child grids into a single aggregated grid. Thresholding is sometimes applied manually by programmers in the context of dynamic parallelism. We automate it in the compiler and discuss the challenges associated with doing so. Coarsening is sometimes applied as an optimization in other contexts. We propose to apply coarsening in the context of dynamic parallelism and automate it in the compiler as well. Aggregation has been automated in the compiler by prior work. We enhance aggregation by proposing a new aggregation technique that uses multi-block granularity. We also integrate these three optimizations into an open-source compiler framework to simplify the process of optimizing dynamic parallelism code. Our evaluation shows that our compiler framework improves the performance of applications with nested parallelism by a geometric mean of 43.0X over applications that use dynamic parallelism, 8.7X over applications that do not use dynamic parallelism, and 3.6X over applications that use dynamic parallelism with aggregation alone as proposed in prior work.
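
A minimal CUDA C++ sketch of the thresholding pattern described above; the kernel and function names, the CSR-style data layout, and the threshold value are illustrative assumptions, not taken from the paper. It needs nvcc with relocatable device code (-rdc=true) on an architecture that supports dynamic parallelism.

```cpp
#include <cstdio>

__device__ void process(const float *row, int i) {
    // placeholder for the per-element child work
    printf("%f\n", row[i] * 2.0f);
}

// Child grid: one thread per element of a single row.
__global__ void child_kernel(const float *row, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) process(row, i);
}

// Parent grid: one thread per row of a CSR-like structure.
__global__ void parent_kernel(const float *data, const int *offsets, int num_rows) {
    const int THRESHOLD = 128;   // tuning parameter (assumed value)
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    int begin = offsets[row];
    int n = offsets[row + 1] - begin;
    if (n > THRESHOLD) {
        // Enough nested work: launch a child grid dynamically.
        child_kernel<<<(n + 255) / 256, 256>>>(data + begin, n);
    } else {
        // Little nested work: serialize it in the parent thread instead.
        for (int i = 0; i < n; ++i) process(data + begin, i);
    }
}
```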

DOI: 10.1109/CGO53902.2022.9741284


Automatic horizontal fusion for GPU kernels

Authors: Li, Ao and Zheng, Bojian and Pekhimenko, Gennady and Long, Fan
Keywords: GPGPU, code generation, optimization

Abstract

We present automatic horizontal fusion, a novel optimization technique that complements the standard kernel fusion techniques for GPU programs. Unlike standard fusion, whose goal is to eliminate intermediate data round trips, our horizontal fusion technique aims to increase thread-level parallelism to hide instruction latencies. We also present HFuse, a new source-to-source CUDA compiler that implements automatic horizontal fusion. Our experimental results show that horizontal fusion can speed up the running time by 2.5%-60.8%. Our results reveal that horizontal fusion is especially beneficial for fusing kernels with instructions that require different kinds of GPU resources (e.g., a memory-intensive kernel and a compute-intensive kernel).
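
A hand-written illustration of what horizontal fusion does to two kernels (the kernel bodies are assumed examples; this is not HFuse's actual output): the threads of each block are split between the two original kernel bodies, so instructions with different resource profiles can overlap on the same SM.

```cpp
// Original kernels: one memory-bound, one compute-bound (assumed bodies).
__global__ void kernel_a(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;                              // mostly memory traffic
}

__global__ void kernel_b(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = y[i];
        for (int k = 0; k < 64; ++k) v = v * v + 1e-6f;   // mostly ALU work
        y[i] = v;
    }
}

// Horizontally fused kernel: the lower half of each block executes kernel_a's
// body, the upper half executes kernel_b's body.
__global__ void fused(float *x, float *y, int n, int half) {
    if (threadIdx.x < half) {
        int i = blockIdx.x * half + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    } else {
        int i = blockIdx.x * half + (threadIdx.x - half);
        if (i < n) {
            float v = y[i];
            for (int k = 0; k < 64; ++k) v = v * v + 1e-6f;
            y[i] = v;
        }
    }
}
```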

DOI: 10.1109/CGO53902.2022.9741270


DARM: control-flow melding for SIMT thread divergence reduction

Authors: Saumya, Charitha and Sundararajah, Kirshanthan and Kulkarni, Milind
Keywords: GPGPUs, compiler optimizations, control-flow divergence

Abstract

GPGPUs use the Single-Instruction-Multiple-Thread (SIMT) execution model, where a group of threads (a wavefront or warp) executes instructions in lockstep. When threads in a group encounter a branching instruction, not all threads in the group take the same path, a phenomenon known as control-flow divergence. Control-flow divergence causes performance degradation because both paths of the branch must be executed one after the other. Prior research has primarily addressed this issue through architectural modifications. We observe that certain GPGPU kernels with control-flow divergence have similar control-flow structures with similar instructions on both sides of a branch. This structure can be exploited to reduce control-flow divergence by melding the two sides of the branch, allowing threads to reconverge early. In this work, we present DARM, a compiler analysis and transformation framework that can meld divergent control-flow structures with similar instruction sequences. We show that DARM can reduce the performance degradation from control-flow divergence.
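
A hand-written analogue of the melding idea (assumed code; DARM itself transforms LLVM IR): when both sides of a divergent branch perform the same load-multiply-store sequence and differ only in operands, the sequence can be executed once with per-thread operand selection, so the warp no longer serializes the two paths.

```cpp
// Divergent form: both sides load, multiply, and store, differing only in operands.
__global__ void divergent(float *out, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) out[i] = a[i] * 2.0f;
    else            out[i] = b[i] * 3.0f;
}

// Melded form: the common sequence executes once, with operands selected per thread.
__global__ void melded(float *out, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool even = (i % 2 == 0);
    const float *src = even ? a : b;
    float scale      = even ? 2.0f : 3.0f;
    out[i] = src[i] * scale;
}
```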

DOI: 10.1109/CGO53902.2022.9741285


Efficient execution of OpenMP on GPUs

Authors: Huber, Joseph and Cornelius, Melanie and Georgakoudis, Giorgis and Tian, Shilei and Diaz, Jose M Monsalve and Dinel, Kuter and Chapman, Barbara and Doerfert, Johannes
Keywords: GPU, LLVM, OpenMP, offloading, optimization

Abstract

OpenMP is the preferred choice for CPU parallelism in High-Performance-Computing (HPC) applications written in C, C++, or Fortran. As HPC systems became heterogeneous, OpenMP introduced support for accelerator offloading via the target directive. This allowed porting existing (CPU) code onto GPUs, including well-established CPU parallelism paradigms. However, there are architectural differences between CPU and GPU execution which make common patterns, like forking and joining threads, single-threaded execution, or sharing of local (stack) variables, in general costly on the latter. So far it was left to the user to identify and avoid non-efficient code patterns, most commonly by writing their OpenMP offloading codes in a kernel-language style which resembles CUDA more than it does traditional OpenMP. In this work we present OpenMP-aware program analyses and optimizations that allow efficient execution of the generic, CPU-centric parallelism model provided by OpenMP on GPUs. Our implementation in LLVM/Clang maps various common OpenMP patterns found in real-world applications efficiently to the GPU. As static analysis is inherently limited, we provide actionable and informative feedback to the user about the performed and missed optimizations, together with ways for the user to annotate the program for better results. Our extensive evaluation using several HPC proxy applications shows significantly improved GPU kernel times and reductions in resource requirements, such as GPU registers.
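
An example of the kind of CPU-centric pattern the paper targets (a sketch with assumed names): a target region containing single-threaded set-up code followed by a nested parallel loop. Naively offloaded, such code pays for generic fork/join and variable-sharing machinery on the GPU; the analyses described above aim to make it run efficiently. It would be compiled with an offloading-capable compiler, e.g. clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda.

```cpp
// scale() is an illustrative function; p, n, factor are assumed parameters.
void scale(double *p, int n, double factor) {
    #pragma omp target map(tofrom: p[0:n])
    {
        double f = factor * 0.5;       // runs single-threaded on the device
        #pragma omp parallel for       // forked device threads do the real work
        for (int i = 0; i < n; ++i)
            p[i] *= f;
    }
}
```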

DOI: 10.1109/CGO53902.2022.9741290


GraphIt to CUDA compiler in 2021 LOC: a case for high-performance DSL implementation via staging with BuilDSL

Authors: Brahmakshatriya, Ajay and Amarasinghe, Saman
Keywords: code-generation, data-flow analysis, domain-specific-languages, multi-stage programming

Abstract

Domain-Specific Languages (DSLs) provide the optimum balance between generalization and specialization that is crucial to getting the best performance for a particular domain. DSLs like Halide and GraphIt and their rich scheduling languages allow users to generate an implementation best suited for the algorithm and input. DSLs also provide the right abstraction for generating code for diverse architectures like GPUs, CPUs, and hardware accelerators. DSL compilers are massive, typically spanning tens of thousands of lines of code, and need a frontend, some analysis and transformation passes, and target-specific code generation. These implementations usually require a great deal of compiler knowledge, and domain experts cannot prototype DSLs without getting compiler experts involved. Using multi-stage programming in a high-level language like Scala, OCaml, or C++ is a great solution because it provides an easy-to-use frontend and automatic code generation abilities. The DSL writers typically implement their abstraction as a library in the multi-stage programming language and use it to generate specialized code by providing partial inputs. This solves the problem only partially because DSLs like GraphIt have shown that several domain-specific analyses and transformations need to be performed to get the best performance. Special care has to be taken when targeting massively parallel architectures like GPUs, where factors like load balancing, warp divergence, and coalesced memory accesses play a critical role. In this paper, we demonstrate how to build an end-to-end DSL compiler framework and a graph DSL using multi-stage programming in C++. We show how the staged types can be extended to perform domain-specific data flow and control flow analyses and transformations. We also show how our generated CUDA code matches the performance of the code generated from the state-of-the-art graph DSL, GraphIt. We achieve all this in a very small fraction (8.4%) of the code size required to implement the traditional DSL compiler.
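
A toy flavor of the staging idea (purely illustrative; this is not BuilDSL's API): a library type overloads operators so that running an ordinary C++ function over it emits specialized code text instead of computing values. Real staging frameworks generate full kernels and apply analyses on the staged program, but the embedding mechanism is the same in spirit.

```cpp
#include <iostream>
#include <string>

// A "staged" expression records the operations applied to it as source text.
struct Expr {
    std::string code;
};
Expr operator+(const Expr &a, const Expr &b) { return {"(" + a.code + " + " + b.code + ")"}; }
Expr operator*(const Expr &a, const Expr &b) { return {"(" + a.code + " * " + b.code + ")"}; }

// The algorithm is written once against Expr; running it generates code.
Expr saxpy(Expr a, Expr x, Expr y) { return a * x + y; }

int main() {
    Expr body = saxpy({"alpha"}, {"x[i]"}, {"y[i]"});
    std::cout << "out[i] = " << body.code << ";\n";   // out[i] = ((alpha * x[i]) + y[i]);
}
```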

DOI: 10.1109/CGO53902.2022.9741280


A compiler for sound floating-point computations using affine arithmetic

Authors: Rivera, Joao and Franchetti, Franz and Püschel, Markus
Keywords: affine arithmetic, floating-point arithmetic, guaranteed computations, source-to-source compiler

Abstract

Floating-point arithmetic is extensively used in scientific and engineering applications to approximate real arithmetic. Unfortunately, floating-point arithmetic is not a sound implementation of real arithmetic, i.e., it may produce different results, does not provide error guarantees, and the errors can become arbitrarily large. In this paper, we introduce SafeGen, a source-to-source compiler that rewrites a given C program using floating-point arithmetic to an efficient C program performing the same computation soundly, i.e., it returns an error bound that is guaranteed to contain the correct result of the program if it had been executed in real arithmetic. Equivalently, it gives a precision certificate on the number of correct bits in the result. SafeGen uses affine arithmetic (AA) that keeps accuracy high compared to interval arithmetic by preserving linear correlations between variables. To mitigate its high cost, SafeGen combines a novel form of static analysis to identify these correlations with a flexible policy-based approach for their selection. SafeGen supports SIMD intrinsics in the input and can output SIMD-optimized code. Our results show that SafeGen-generated code is 30–70 times faster than manually rewritten code using AA libraries. Equivalently, SafeGen can offer many more bits of certified accuracy within a reduced time budget.
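
A minimal affine-arithmetic sketch in C++ (illustrative only, not SafeGen's implementation): each value is represented as x0 + Σ xi·ei with noise symbols ei in [-1, 1], so linear operations preserve correlations that plain interval arithmetic loses; for example, x - x evaluates to exactly zero below. A sound implementation would additionally bound the rounding error of each operation.

```cpp
#include <cmath>
#include <cstdio>
#include <map>

struct Affine {
    double center;
    std::map<int, double> terms;   // noise-symbol id -> coefficient

    double radius() const {
        double r = 0;
        for (const auto &[id, c] : terms) r += std::fabs(c);
        return r;
    }
};

// Addition preserves correlations exactly: shared noise symbols combine.
Affine add(const Affine &a, const Affine &b) {
    Affine r{a.center + b.center, a.terms};
    for (const auto &[id, c] : b.terms) r.terms[id] += c;
    return r;
}

int main() {
    Affine x{1.0, {{1, 0.5}}};           // x in [0.5, 1.5]
    Affine minus_x{-1.0, {{1, -0.5}}};   // -x, sharing noise symbol 1
    Affine s = add(x, minus_x);          // exactly 0: the correlation is kept
    std::printf("[%g, %g]\n", s.center - s.radius(), s.center + s.radius());
}
```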

DOI: 10.1109/CGO53902.2022.9741286


Aggregate update problem for multi-clocked dataflow languages

Authors: Kallwies, Hannes and Leucker, Martin and Scheffel, Torben and Schmitz, Malte and Thoma, Daniel
Keywords: none

Abstract

Dataflow languages, like functional languages, have immutable semantics, which is often implemented by copying values. A common compiler optimization known from functional languages involves analyzing which data structures can be modified in place instead of being copied. This paper presents a novel algorithm for this so-called Aggregate Update Problem for multi-clocked dataflow languages, i.e., those that allow streams to have events at disjoint timestamps, such as Lucid, Lustre, and Signal. Unrestricted multi-clocked languages require a static triggering analysis of how events, and hence data values, are read, written, and replicated. We use TeSSLa as a generic stream transformation language with a small set of operators to develop our ideas. We implemented the solution in a TeSSLa compiler targeting the Java VM via Scala code generation, which combines persistent data structures with mutable data structures for those data values that allow in-place editing. Our empirical evaluation shows considerable speedup for use cases where queues, maps, or sets are the dominant data structures.

DOI: 10.1109/CGO53902.2022.9741275


CompilerGym: robust, performant compiler optimization environments for AI research

Authors: Cummins, Chris and Wasti, Bram and Guo, Jiadong and Cui, Brandon and Ansel, Jason and Gomez, Sahir and Jain, Somya and Liu, Jia and Teytaud, Olivier and Steiner, Benoit and Tian, Yuandong and Leather, Hugh
Keywords: none

Abstract

Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What is needed is an easy, reusable experimental infrastructure for real world compiler optimization tasks that can serve as a common benchmark for comparing techniques, and as a platform to accelerate progress in the field. We introduce CompilerGym, a set of environments for real world compiler optimization tasks, and a toolkit for exposing new optimization tasks to compiler researchers. CompilerGym enables anyone to experiment on production compiler optimization problems through an easy-to-use package, regardless of their experience with compilers. We build upon the popular OpenAI Gym interface enabling researchers to interact with compilers using Python and a familiar API. We describe the CompilerGym architecture and implementation, characterize the optimization spaces and computational efficiencies of three included compiler environments, and provide extensive empirical evaluations. Compared to prior works, CompilerGym offers larger datasets and optimization spaces, is 27X more computationally efficient, is fault-tolerant, and capable of detecting reproducibility bugs in the underlying compilers. In making it easy for anyone to experiment with compilers, irrespective of their background, we aim to accelerate progress in the AI and compiler research domains.

DOI: 10.1109/CGO53902.2022.9741258


PALMED: throughput characterization for superscalar architectures

Authors: Derumigny, Nicolas and Bastian, Théophile, et al.
Keywords: code selection, compiler, performance debugging, performance model, port mapping, superscalar architecture, throughput

Abstract

In a superscalar architecture, the scheduler dynamically assigns micro-operations (μOPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into μOPs and lists for each μOP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeatedly executed as the core component of a loop. This paper introduces a dual equivalent representation: the resource mapping of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port mapping is an important but difficult problem. Building a resource mapping is a more tractable problem and provides a simpler and equivalent model. This paper describes Palmed, a tool that automatically builds a resource mapping for pipelined, superscalar, out-of-order CPU architectures. Palmed does not require hardware performance counters and relies solely on runtime measurements. We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic blocks from the compiled binaries of the SPEC CPU 2017 benchmarks. We compared the throughput predicted by existing machine models to that produced by Palmed, and found comparable accuracy to state-of-the-art tools, achieving a sub-10% mean squared error rate on this workload on Intel's Skylake microarchitecture.
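
A toy illustration of how a resource mapping yields a throughput estimate (the representation and the numbers are assumptions, not Palmed's model): each instruction consumes some amount of each abstract resource, and the steady-state cost of repeating a basic block is bounded by its most contended resource.

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using ResourceUse = std::map<std::string, double>;   // resource -> cycles used

double block_cycles_per_iteration(const std::vector<ResourceUse> &block) {
    std::map<std::string, double> pressure;
    for (const auto &inst : block)
        for (const auto &[res, use] : inst) pressure[res] += use;
    double cycles = 0;
    for (const auto &[res, p] : pressure) cycles = std::max(cycles, p);
    return cycles;   // the bottleneck resource determines the repetition rate
}

int main() {
    // Hypothetical mapping: two adds on a port-like resource P01, one load on P23.
    std::vector<ResourceUse> block = {{{"P01", 0.5}}, {{"P01", 0.5}}, {{"P23", 1.0}}};
    std::printf("%.2f cycles/iteration\n", block_cycles_per_iteration(block));
}
```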

DOI: 10.1109/CGO53902.2022.9741289


SRTuner: effective compiler optimization customization by exposing synergistic relations

Authors: Park, Sunghyun and Latifi, Salar and Park, Yongjun and Behroozi, Armand and Jeon, Byungsoo and Mahlke, Scott
Keywords: auto-tuning, compiler, optimization

Abstract

Despite ceaseless efforts, the extremely large and complex optimization space makes even state-of-the-art compilers fail to deliver the most performant setting that can fully utilize the underlying hardware. Although this inefficiency suggests an opportunity for tuning, it has been challenging for prior tuning methods to consider the complex interactions between optimizations and maximize tuning quality while handling local optima efficiently. To tackle this problem, we suggest an intelligent auto-tuning strategy, called SRTuner, which searches for the best optimization setting by exposing important optimization interactions and directly using them to focus on promising subspaces. To reveal high-impact inter-optimization relations, SRTuner proposes a multistage structure and a distribution-based estimation method that approximates the impact of an optimization effectively. In addition, to handle local optima efficiently, our technique defines optimization decisions as a series of multi-armed bandit problems to formulate the exploration-exploitation dilemma. SRTuner is evaluated with three representative compilers from various domains on different target hardware: GCC (a traditional C/C++ compiler) on CPU, TVM (a domain-specific machine learning compiler) on GPU, and OpenCL compilers (kernel compilers for heterogeneous computing) on both CPU and GPU. Results show that SRTuner accelerates target executions by 1.24X, 2.03X, and 34.4X compared to the highest level of optimization provided by each compiler, and outperforms state-of-the-art works by 1.04X-1.14X. As a byproduct of our unique tuning strategy, SRTuner can offer synergistic optimizations for each workload, which allows it in part to identify why it outperformed current compilers. With this information, we are able to find important optimizations that each compiler misused and demonstrate how this information can benefit future tuning strategies.

DOI: 10.1109/CGO53902.2022.9741263


Recovering container class types in C++ binaries

Authors: Wang, Xudong and Xu, Xuezheng and Li, Qingan and Yuan, Mengting and Xue, Jingling
Keywords: binary code analysis, containers, template classes, type inference

Abstract

We present Tiara, a novel approach to recovering container classes in C++ binaries. Given a variable address in a C++ binary, Tiara first applies a new type-relevant slicing algorithm incorporated with a decay function, Tslice, to obtain an inter-procedural forward slice of instructions expressed as a CFG to summarize how the variable is used in the binary (as our primary contribution). Tiara then makes use of a GCN (Graph Convolutional Network) to learn and predict the container type for the variable (as our secondary contribution). According to our evaluation, Tiara can advance the state of the art in inferring commonly used container types in a set of eight large real-world COTS C++ binaries efficiently (in terms of the overall analysis time) and effectively (in terms of precision, recall and F1 score).

DOI: 10.1109/CGO53902.2022.9741274


Automatic generation of debug headers through BlackBox equivalence checking

Authors: Kurhe, Vaibhav Kiran and Karia, Pratik and Gupta, Shubhani and Rose, Abhishek and Bansal, Sorav
Keywords: none

Abstract

Modern compiler optimization pipelines are large and complex, and it is rather cumbersome and error-prone for compiler developers to preserve debugging information across optimization passes. An optimization can add, remove, or reorder code and variables, which makes it difficult to associate the generated code statements and values with the source code statements and values. Moreover, recent proposals for automatic generation of optimizations (e.g., through search algorithms) have not previously considered the preservation of debugging information. We demonstrate the application of a blackbox equivalence checker to automatically populate the debugging information in the debug headers of optimized executables compiled from C programs. A blackbox equivalence checker can automatically compute equivalence proofs between the original source code and the optimized executable code without knowledge of the exact transformations performed by the compiler/optimizer. We present an algorithm that uses these formal equivalence proofs to improve the executable's debugging headers. We evaluate this approach on benchmarks derived from the Testsuite of Vectorizing Compilers (TSVC) compiled through three different compilers: GCC, Clang/LLVM, and ICC. We demonstrate significant improvements in the debuggability of the optimized executable code in these experiments. The benefits of these improvements can be transparently realized through any standard debugger, such as GDB, to debug the updated executable.

DOI: 10.1109/CGO53902.2022.9741273


Gadgets splicing: dynamic binary transformation for precise rewriting

Authors: Tian, Linan and Shi, Yangyang and Chen, Liwei and Yang, Yanqi and Shi, Gang
Keywords: none

Abstract

Many systems and applications depend on binary rewriting technology to analyze and retrofit software binaries when source code is not available, including binary instrumentation, profiling, and security policy reinforcement. However, investigations have found that many static binary rewriters still fail to accurately transform all legal instructions in binaries. Dynamic binary rewriters allow for accuracy, but their coverage and rewriting efficiency are limited. Therefore, existing binary rewriting technology cannot meet all the needs of binary rewriting. In this paper, we present GRIN, a novel binary rewriting tool that allows for high-precision instruction identification. In GRIN, we propose a gadget-based entry address analysis technique. It identifies the entry addresses of basic blocks in the binary by gathering and executing the basic blocks involved in computing those entry addresses. By traversing from these entries as the new entries of the program, we guarantee the correctness of the identified instructions. We have implemented a prototype of GRIN and evaluated it on SPEC2006 and the whole set of GNU Coreutils. We demonstrate that the precision of GRIN is improved to 99.92% compared to current state-of-the-art techniques.

DOI: 10.1109/CGO53902.2022.9741259


Lambda the ultimate SSA: optimizing functional programs in SSA

Authors: Bhat, Siddharth and Grosser, Tobias
Keywords: functional programming, optimizing compilers

Abstract

Static Single Assignment (SSA) is the workhorse of modern optimizing compilers for imperative programming languages. However, functional languages have been slow to adopt SSA and prefer to use intermediate representations based on minimal lambda calculi due to SSA’s inability to express higher-order constructs. We exploit a new SSA construct — regions — in order to express functional optimizations via classical SSA-based reasoning. Region optimization currently relies on ad-hoc analyses and transformations on imperative programs. These ad-hoc transformations are sufficient for imperative languages as regions are used in a limited fashion. In contrast, we use regions pervasively to model sub-expressions in our functional IR. This motivates us to systematize region optimizations. We extend classical SSA reasoning to regions for functional-style analyses and transformations. We implement a new SSA+regions based backend for LEAN4, a theorem prover that implements a purely functional, dependently typed programming language. Our backend is feature-complete and handles all constructs of LEAN4’s functional intermediate representation λrc within the SSA framework. We evaluate our proposed region optimizations by optimizing λrc within an SSA+regions based framework implemented in MLIR and demonstrating performance parity with the current LEAN4 backend. We believe our work will pave the way for a unified optimization framework capable of representing, analyzing, and optimizing both functional and imperative languages.

DOI: 10.1109/CGO53902.2022.9741279


NOELLE offers empowering LLVM extensions

Authors: Matni, Angelo and Deiana, Enrico Armenio and Su, Yian and Gross, Lukas and Ghosh, Souradip and Apostolakis, Sotiris and Xu, Ziyang and Tan, Zujun and Chaturvedi, Ishita and Homerding, Brian and McMichen, Tommy and August, David I. and Campanoni, Simone
Keywords: none

Abstract

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.

DOI: 10.1109/CGO53902.2022.9741276


Hecate: performance-aware scale optimization for homomorphic encryption compiler

Authors: Lee, Yongwoo and Heo, Seonyeong and Cheon, Seonyoung and Jeong, Shinnung and Kim, Changsu and Kim, Eunkyung and Lee, Dongyoon and Kim, Hanjun
Keywords: compiler, deep learning, homomorphic encryption, privacy-preserving machine learning

Abstract

Despite the benefit of Fully Homomorphic Encryption (FHE) that supports encrypted computation, writing an efficient FHE application is challenging due to magnitude scale management. Each FHE operation increases the scale of a ciphertext, and leaving scales high harms the performance of subsequent FHE operations. Thus, rescaling ciphertexts is essential to optimizing an FHE application, but since FHE requires programmers to match the rescaling levels of the operands of each FHE operation, programmers must plan rescaling with the entire FHE application in mind. Although recently proposed FHE compilers reduce the programming burden by automatically manipulating ciphertext scales, they fail to fully optimize the FHE application because they greedily rescale ciphertexts without considering the performance impact throughout the entire application. This work proposes Hecate, a new FHE compiler framework that optimizes the scales of ciphertexts reflecting their rescaling levels and performance impact. With a new type system that embeds the scale and rescaling level, and a new rescaling operation called downscale, Hecate creates various scale management plans, analyzes their expected performance, and finds the optimal rescaling points throughout the entire FHE application. This work implements Hecate on top of the MLIR framework with a Python frontend and shows that Hecate achieves a 27% speedup over the state-of-the-art approach for various FHE applications.
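
A toy model of the scale bookkeeping the abstract describes (illustrative types and numbers, not Hecate's implementation): in CKKS-style FHE, multiplying two ciphertexts multiplies their scales, and a rescale divides the scale by a fixed factor while consuming one level, which is why the placement of rescales affects both precision and cost.

```cpp
#include <algorithm>
#include <cstdio>

struct Ct { double scale; int level; };   // toy ciphertext metadata

Ct mul(Ct a, Ct b)             { return {a.scale * b.scale, std::min(a.level, b.level)}; }
Ct rescale(Ct a, double delta) { return {a.scale / delta, a.level - 1}; }

int main() {
    const double delta = 1e6;     // scaling factor (assumed value)
    Ct x{delta, 3}, y{delta, 3};
    Ct xy = mul(x, y);            // scale jumps to delta^2
    Ct r  = rescale(xy, delta);   // back to delta, at the cost of one level
    std::printf("scale=%.0f level=%d\n", r.scale, r.level);
}
```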

DOI: 10.1109/CGO53902.2022.9741265


Unified compilation for lossless compression and sparse computing

Authors: Donenfeld, Daniel and Chou, Stephen and Amarasinghe, Saman
Keywords: compressed domain processing, lossless compression, sparse tensor algebra

Abstract

This paper shows how to extend sparse tensor algebra compilers to support lossless compression techniques, including variants of run-length encoding and Lempel-Ziv compression. We develop new abstractions to represent losslessly compressed data as a generalized form of sparse tensors, with repetitions of values (which are compressed out in storage) represented by non-scalar, dynamic fill values. We then show how a compiler can use these abstractions to emit efficient code that computes on losslessly compressed data. By unifying lossless compression with sparse tensor algebra, our technique is able to generate code that computes with both losslessly compressed data and sparse data, as well as generate code that computes directly on compressed data without needing to first decompress it. Our evaluation shows our technique generates efficient image and video processing kernels that compute on losslessly compressed data. We find that the generated kernels are up to 16.3X faster than equivalent dense kernels generated by TACO, a tensor algebra compiler, and up to 16.1X faster than OpenCV, a widely used image processing library.
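
A minimal sketch of computing directly on losslessly compressed data (hand-written C++, not code the compiler described above would emit): two run-length-encoded vectors are added by merging runs, without first decompressing to dense form. Adjacent equal-valued output runs are not re-coalesced in this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Run { int length; double value; };

std::vector<Run> add_rle(const std::vector<Run> &a, const std::vector<Run> &b) {
    std::vector<Run> out;
    std::size_t i = 0, j = 0;
    int ai = 0, bj = 0;   // elements already consumed from the current runs
    while (i < a.size() && j < b.size()) {
        int len = std::min(a[i].length - ai, b[j].length - bj);
        out.push_back({len, a[i].value + b[j].value});   // one output run per overlap
        ai += len; bj += len;
        if (ai == a[i].length) { ++i; ai = 0; }
        if (bj == b[j].length) { ++j; bj = 0; }
    }
    return out;
}

int main() {
    std::vector<Run> a = {{5, 1.0}, {3, 2.0}};   // 1 1 1 1 1 2 2 2
    std::vector<Run> b = {{4, 0.5}, {4, 0.5}};   // 0.5 repeated 8 times
    for (const auto &r : add_rle(a, b)) std::printf("%d x %.1f\n", r.length, r.value);
}
```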

DOI: 10.1109/CGO53902.2022.9741282


Loop rolling for code size reduction

Authors: Rocha, Rodrigo C. O. and Petoumenos, Pavlos and Franke, Björn, et al.
Keywords: LLVM, code-size reduction, compiler optimization, loop optimization, loop rerolling

Abstract

Code size is critical for resource-constrained devices, where memory and storage are limited. Compilers, therefore, should offer optimizations aimed at code reduction. One such optimization is loop rerolling, which transforms a partially unrolled loop into a fully rolled one. However, existing techniques are limited and rarely applicable to real-world programs. They are incapable of handling partial rerolling or straight-line code. In this paper, we propose RoLAG, a novel code-size optimization that creates loops out of straight-line code. It identifies isomorphic code by aligning SSA graphs in a bottom-up fashion. The aligned code is later rolled into a loop. In addition, we propose several optimizations that increase the amount of aligned code by identifying specific patterns of code. Finally, an analysis is used to estimate the profitability of the rolled loop before deciding which version should be kept in the code. Our evaluation of RoLAG on full programs from MiBench and SPEC 2017 shows absolute reductions of up to 88 KB, while LLVM's technique is hardly distinguishable from the baseline with no rerolling. Finally, our results show that RoLAG is highly applicable to real-world code extracted from popular GitHub repositories. RoLAG is triggered several orders of magnitude more often than LLVM's rerolling, resulting in meaningful reductions on real-world functions.
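
The shape of the transformation, shown on hand-written C++ (RoLAG itself works on LLVM IR by aligning SSA graphs): four isomorphic straight-line statements are rolled into a loop, trading a little loop overhead for smaller code.

```cpp
// Before: partially unrolled straight-line code.
void unrolled(float *out, const float *in) {
    out[0] = in[0] * 2.0f + 1.0f;
    out[1] = in[1] * 2.0f + 1.0f;
    out[2] = in[2] * 2.0f + 1.0f;
    out[3] = in[3] * 2.0f + 1.0f;
}

// After: the aligned statements rolled into a loop, same behavior, less code.
void rolled(float *out, const float *in) {
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] * 2.0f + 1.0f;
}
```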

DOI: 10.1109/CGO53902.2022.9741256


Solving PBQP-based register allocation using deep reinforcement learning

Authors: Kim, Minsu and Park, Jeong-Keun and Moon, Soo-Mook
Keywords: none

Abstract

Irregularly structured registers are hard to abstract and allocate. Partitioned Boolean quadratic programming (PBQP) is a useful abstraction to represent complex register constraints, even those in the highly irregular processors of automated test equipment (ATE) for DRAM memory chips. The PBQP problem is NP-hard, requiring a heuristic solution. If no spill is allowed, as in ATE, however, we have to enumerate more to find a solution rather than approximate, since a spill means a total compilation failure. We propose solving the PBQP problem with deep reinforcement learning (Deep-RL), more specifically, a model-based approach using Monte Carlo tree search and a deep neural network, as used in AlphaZero, a proven Deep-RL technology. Through elaborate training with random PBQP graphs, our Deep-RL solver can cut the search space sharply, making an enumeration-based solution more affordable. Furthermore, by employing backtracking with a proper coloring order, Deep-RL can find a solution with modestly trained neural networks and even less search space. Our experiments show that Deep-RL can successfully find a solution for 10 product-level ATE programs while searching far fewer (e.g., 1/3,500) states than the previous PBQP enumeration solver. Also, when applied to C programs in llvm-test-suite for regular CPUs, it achieves performance competitive with the existing PBQP register allocator in LLVM.
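
A minimal PBQP register-allocation instance in C++ (an illustrative encoding, not the paper's solver): each node, i.e. virtual register, carries a cost vector over its allocation choices, each edge carries a cost matrix over pairs of choices, and a solution's cost is the sum of the selected entries. An effectively infinite entry encodes a forbidden combination, such as two interfering virtual registers sharing a physical register.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;
struct Edge { int u, v; Mat cost; };

const double INF = 1e30;   // forbidden combination

double solution_cost(const std::vector<Vec> &node_costs,
                     const std::vector<Edge> &edges,
                     const std::vector<int> &choice) {
    double c = 0;
    for (std::size_t v = 0; v < node_costs.size(); ++v) c += node_costs[v][choice[v]];
    for (const auto &e : edges) c += e.cost[choice[e.u]][choice[e.v]];
    return c;
}

int main() {
    // Two virtual registers, two physical registers r0/r1, no spill option.
    std::vector<Vec> nodes = {{0, 0}, {0, 1}};                 // node 1 slightly prefers r0
    std::vector<Edge> edges = {{0, 1, {{INF, 0}, {0, INF}}}};  // interference: must differ
    std::printf("cost(r0,r1)=%.0f  cost(r0,r0)=%.0f\n",
                solution_cost(nodes, edges, {0, 1}),
                solution_cost(nodes, edges, {0, 0}));
}
```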

DOI: 10.1109/CGO53902.2022.9741272


F3M: fast focused function merging

Authors: Stirling, Sean and Rocha, Rodrigo C. O. and Hazelwood, Kim and Leather, Hugh and O’Boyle, Michael and Petoumenos, Pavlos
Keywords: LLVM, code-size reduction, compiler optimization, function merging

Abstract

From IoT devices to datacenters, code size is important, motivating ongoing research in binary reduction. A key technique is the merging of similar functions to reduce code redundancy. Success, however, depends on accurately identifying functions that can be profitably merged. Attempting to merge all function pairs is prohibitively expensive. Current approaches, therefore, employ summaries to estimate similarity. However, these summaries often give little information about how well two functions will merge. To make things worse, they rely on an exhaustive search across all summaries, which is impractical for real-world programs. In this work, we propose a new technique for matching similar functions. We use a hash-based approach that better captures code similarity and, at the same time, significantly reduces the search space by focusing on the most promising candidates. Experimental results show that our similarity metric has a better correlation with merging profitability. This improves the average code size reduction by 6 percentage points, while it reduces the overhead of function merging by 1.8x on average and by as much as 597x for large applications. Faster merging and reduced code size to compile at later stages mean that our approach introduces little to no compile-time overhead, while in many cases it makes compilation faster by up to 30%.
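
The flavor of the underlying function-merging transformation, shown as hand-written C++ (the hash-based fingerprinting that F3M adds to find such pairs is not shown, and the names are illustrative): two nearly identical functions become one parameterized body plus thin wrappers, so only a single copy of the loop remains in the binary.

```cpp
// Shared body: the differing constant becomes a parameter.
static int sum_scaled(const int *a, int n, int s) {
    int acc = 0;
    for (int i = 0; i < n; ++i) acc += a[i] * s;
    return acc;
}

// Thin wrappers preserve the original interfaces.
int sum2(const int *a, int n) { return sum_scaled(a, n, 2); }   // was: acc += a[i] * 2
int sum3(const int *a, int n) { return sum_scaled(a, n, 3); }   // was: acc += a[i] * 3
```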

DOI: 10.1109/CGO53902.2022.9741269


Sound, precise, and fast abstract interpretation with tristate numbers

Authors: Vishwanathan, Harishankar and Shachnai, Matan and Narayana, Srinivas and Nagarakatte, Santosh
Keywords: abstract domains, eBPF, kernel extensions, program verification, static analysis

Abstract

Extended Berkeley Packet Filter (BPF) is a language and run-time system that allows non-superusers to extend the Linux and Windows operating systems by downloading user code into the kernel. To ensure that user code is safe to run in kernel context, BPF relies on a static analyzer that proves properties about the code, such as bounded memory access and the absence of operations that crash. The BPF static analyzer checks safety using abstract interpretation with several abstract domains. Among these, the domain of tnums (tristate numbers) is a key domain used to reason about the bitwise uncertainty in program values. This paper formally specifies the tnum abstract domain and its arithmetic operators. We provide the first proofs of soundness and optimality of the abstract arithmetic operators for tnum addition and subtraction used in the BPF analyzer. Further, we describe a novel sound algorithm for multiplication of tnums that is more precise and efficient (runs 33% faster on average) than the Linux kernel’s algorithm. Our tnum multiplication is now merged in the Linux kernel.
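
A small C++ sketch of the tnum (tristate number) domain: value holds the bits known to be 1, mask marks the unknown bits. The addition below follows the structure of the Linux kernel's tnum_add, reproduced from memory, so treat it as illustrative rather than authoritative.

```cpp
#include <cstdint>
#include <cstdio>

struct Tnum { std::uint64_t value, mask; };

Tnum tnum_add(Tnum a, Tnum b) {
    std::uint64_t sm = a.mask + b.mask;
    std::uint64_t sv = a.value + b.value;
    std::uint64_t sigma = sm + sv;
    std::uint64_t chi = sigma ^ sv;            // positions where carries may differ
    std::uint64_t mu = chi | a.mask | b.mask;  // all possibly-unknown result bits
    return {sv & ~mu, mu};
}

int main() {
    Tnum a{0x4, 0x1};   // 0b10?: either 4 or 5
    Tnum b{0x2, 0x0};   // exactly 2
    Tnum s = tnum_add(a, b);   // either 6 or 7: value=0x6, mask=0x1
    std::printf("value=0x%llx mask=0x%llx\n",
                (unsigned long long)s.value, (unsigned long long)s.mask);
}
```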

DOI: 10.1109/CGO53902.2022.9741267


M3V: multi-modal multi-view context embedding for repair operator prediction

Authors: Xu, Xuezheng and Wang, Xudong and Xue, Jingling
Keywords: GNN, program embedding, program repair

Abstract

We address the problem of finding context embeddings for faulty locations to allow a learning-based APR tool to learn and predict the repair operators used at the faulty locations. We introduce M3V, a new multi-modal multi-view context embedding approach, which represents the context of a faulty location in two modalities: (1) texts that capture its signature in a natural language using the tree-LSTM model, and (2) graphs that capture its structure with two views, data and control dependences, using the GNN model. We then fuse these two modalities to learn a probabilistic classifier from correct code that, once given a faulty location, produces a probability distribution over a set of repair operators. We have evaluated M3V against the state-of-the-art context embedding approaches in repairing two common types of bugs in Java, null pointer exceptions (NPE) and index out of bounds (OOB). Trained and tested with 75,673 code samples from 20 real-world projects, a learning-based APR tool can predict repair operators more effectively with our context embeddings in repairing NPE bugs, achieving higher accuracies (11% – 41%) and higher F1 scores (16% – 143%). For OOB bugs, these improvements are 9% – 30% and 15% – 79%, respectively.

DOI: 10.1109/CGO53902.2022.9741261


Enabling near real-time NLU-driven natural language programming through dynamic grammar graph-based translation

Authors: Nan, Zifan and Shen, Xipeng and Guan, Hui
Keywords: dynamic programming, natural language programming, program synthesis

Abstract

Recently, natural language (NL)-based program synthesis has drawn increasing interest. Conventional methods that depend on predefined domain-specific rules suffer from a lack of robustness and generality. Recent efforts that adopt deep learning to map queries to code require a large number of labeled examples, making them inapplicable to domains where labeled examples are scarce. Although a third alternative, the natural language understanding (NLU)-driven approach, addresses these problems, its long response time hinders adoption in practice, especially in interactive scenarios. This paper presents a solution to enable near real-time NLU-driven NL programming. The solution features a new algorithm, dynamic grammar graph-based translation (DGGT), for identifying the best grammar tree for a query via dynamic programming. It also introduces two new optimizations, grammar-based pruning and orphan node relocation, to further reduce the search space and address the special complexities of queries. Evaluations on two domains, text editing and program source code analysis, show that the DGGT algorithm and the optimizations shorten the response time of a state-of-the-art NLU-driven synthesizer by up to 1887X (25-133X on average) while improving accuracy by 2-12%.

DOI: 10.1109/CGO53902.2022.9741262


SPNC: an open-source MLIR-based compiler for fast sum-product network inference on CPUs and GPUs

Authors: Sommer, Lukas and Axenie, Cristian and Koch, Andreas
Keywords: CPU, GPU, LLVM, MLIR, machine learning, sum-product networks

Abstract

Sum-Product Networks (SPNs) are an alternative to the widely used Neural Networks (NNs) for machine learning. SPNs can not only reason about (un)certainty by qualifying their output with a probability; they also allow fast (tractable) inference by having run-times that are just linear w.r.t. the network size. We present SPNC, the first tool flow for generating fast native code for SPN inference on both CPUs and GPUs, including the use of vectorized/SIMD execution. To this end, we add two SPN-specific dialects to the MLIR framework and discuss their lowering towards the execution targets. We evaluate our approach on two applications, for which we consider performance, scaling to very large SPNs, and compile- vs. execution-time trade-offs. In this manner, we achieve multiple orders of magnitude in speed-ups over existing SPN support libraries.

DOI: 10.1109/CGO53902.2022.9741277


Distill: domain-specific compilation for cognitive models

Authors: Veselý et al.
Keywords: JIT compilers, Python, cognitive models, domain-specific compilation, human brain

Abstract

Computational models of cognition enable a better understanding of the human brain and behavior, psychiatric and neurological illnesses, clinical interventions to treat illnesses, and also offer a path towards human-like artificial intelligence. Cognitive models are also, however, laborious to develop, requiring composition of many types of computational tasks, and suffer from poor performance as they are generally designed using high-level languages like Python. In this work, we present Distill, a domain-specific compilation tool to accelerate cognitive models while continuing to offer cognitive scientists the ability to develop their models in flexible high-level languages. Distill uses domain-specific knowledge to compile Python-based cognitive models into LLVM IR, carefully stripping away features like dynamic typing and memory management that add performance overheads without being necessary for the underlying computation of the models. The net effect is an average of 27X performance improvement in model execution over state-of-the-art techniques using Pyston and PyPy. Distill also repurposes classical compiler data flow analyses to reveal properties about data flow in cognitive models that are useful to cognitive scientists. Distill is publicly available, integrated in the PsyNeuLink cognitive modeling environment, and is already being used by researchers in the brain sciences.

DOI: 10.1109/CGO53902.2022.9741278


Optimizing GPU deep learning operators with polyhedral scheduling constraint injection

Authors: Bastoul, Cedric and Zhang, Zhen and Razanajato, Harenome and Lossing, Nelson and Susungi, Adilla and de Juan, Javier and Filhol, Etienne and Jarry, Baptiste and Consolaro, Gianpietro and Zhang, Renwei
Keywords: polyhedral model, scheduling, vectorization

Abstract

Automatic parallel code generation from high-level abstractions, such as those manipulated by artificial intelligence and deep learning (AI/DL) frameworks, relies heavily on compiler techniques for automatic parallelization and optimization. Many recent advances rely on the polyhedral framework for this task because of its ability to model and apply a wide range of loop transformations. However, modeling the complexity of the target architecture, and building cost models precise enough to decide on the best transformation, is in general out of reach for a framework based on linear/affine constraints. In this work, we propose to decouple the polyhedral framework into linear and non-linear components. We introduce the constraint tree abstraction, which may be generated by a non-linear optimizer and injected into the polyhedral optimization process to build better solutions. We present how to benefit from such a mechanism to generate efficient code for GPUs in the context of AI/DL operators. Our constraint injection drives the polyhedral scheduler towards efficient solutions for load/store vectorization, relying on both memory coalescing and vector types. We implemented our scheduler supporting constraint injection and our constraint construction system within a production AI/DL framework. Experiments on well-known neural networks show the efficiency of this approach with respect to state-of-the-art polyhedral scheduling for GPUs.

DOI: 10.1109/CGO53902.2022.9741260


Comprehensive accelerator-dataflow co-design optimization for convolutional neural networks

Authors: Vaidya, Miheer and Sukumaran-Rajam, Aravind and Rountev, Atanas and Sadayappan, P.
Keywords: none

Abstract

The design space of possible schedules for mapping a Convolutional Neural Network layer onto a spatial accelerator array, referred to as the dataflow, is enormous. The co-design of key architectural parameters (such as the number of processing elements and the sizes of register files and scratchpad memories) along with the dataflow to optimize the implementation of one or more CNN stages makes the design space explosively larger. Several recent efforts have addressed the design-space exploration problem for CNN accelerators via heuristics or limited search strategies. In this paper we develop the first optimization approach that uses analytical modeling and the solution of constrained nonlinear optimization problems for comprehensive algorithm-architecture co-design optimization. Using the Timeloop accelerator modeling framework, we demonstrate that the new optimization methodology can enable significant improvements over prior accelerator designs for both energy minimization and performance maximization.

DOI: 10.1109/CGO53902.2022.9741281


