Data-flow analysis and optimization for data coherence in heterogeneous architectures

https://doi.org/10.1016/j.jpdc.2019.04.004

Highlights

  • DCA: a pair of data-flow analyses that identify how the CPU and GPU access shared variables.

  • DCO: creates shared buffers between CPU/GPU and inserts calls to keep data coherence.

  • A technique that tries to remove data offloading during GPU computation.

  • Speed-up of up to 8.87x on representative benchmarks on integrated and discrete GPUs.

Abstract

Although heterogeneous computing has enabled developers to achieve impressive program speed-ups, the cost of moving and keeping data coherent between host and device may easily eliminate any performance gains achieved by acceleration. To deal with this problem, this paper introduces DCA: a pair of data-flow analyses that determine how variables are used by the host/device at each program point. It also introduces DCO, a code optimization technique that uses DCA information to: (a) allocate OpenCL shared buffers between host and devices; and (b) insert appropriate OpenCL function calls at program points so as to minimize the number of data coherence operations. We have used the AClang compiler to measure the impact of DCA and DCO when generating code from the Parboil, Polybench and Rodinia benchmarks for a set of discrete/integrated GPUs. The experimental results showed speed-ups of up to 5.25x (average of 1.39x) on an ARM Mali-T880 and up to 8.87x (average of 1.66x) on an NVIDIA Pascal Titan X.

Introduction

With the advent of heterogeneous computing, many parallel programming models have emerged seeking to leverage the performance of sequential code by offloading computation kernels from a host machine (e.g. CPU) to an acceleration device (e.g. GPU). Computation offloading is typically achieved by annotating program fragments (e.g. hot loops) so that their execution is mapped to dedicated hardware such as GPUs, APUs, and FPGAs, among others. Most of these models use source-code annotation standards like OpenACC and OpenMP, or specialized languages and libraries such as CUDA and OpenCL. While they differ in the way the kernel code is written, all such models require data to be offloaded to the device and the result of the computation brought back to the host.

The task of offloading a kernel to an acceleration device can have a large impact on overall program performance, particularly when the time required to move data in/out of the device approaches the time needed to perform the actual computation [5], [8]. This is a common problem in discrete GPUs (e.g. the NVIDIA Pascal Titan X), where host and device do not share the same memory and data has to move through an interface card (e.g. PCIe). On the other hand, even if host and device share the same memory, as in the case of integrated GPUs (e.g. ARM Mali), coherence must be assured for shared data so as to avoid inconsistency between the computations on both sides. Integrated GPUs are commonly found in architectures with limited memory resources and stringent design constraints, such as smartphones.

Although there have been a number of efforts to automatically address data coherence across host–device boundaries [1], [27], no universal hardware coherence protocol has yet been standardized for heterogeneous systems. The exception is a recent interconnect architecture called NVLink [7], proposed by NVIDIA, which uses CUDA 8 to enable automatic coherence between CPU and GPU data by means of a page-fault mechanism [6].

Outside the NVIDIA/CUDA domain, host–device coherence has no hardware support and needs to be addressed in software by the programmer/compiler. In the case of integrated GPUs, coherence is performed in software by means of specific map/unmap function calls that copy variables modified by the device/host back to the host/device, thus squashing any stale copies of the data that the other side might be holding. This technique can also be applied to discrete GPUs when OpenCL kernels are used. For example, NVIDIA recommends using map/unmap as a memory-allocation best practice when programming their GPUs with OpenCL [18]. This paper addresses the problem of data-coherence optimization for integrated and discrete GPUs by applying the techniques described above.
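
To make the mechanism concrete, the fragment below is a minimal sketch (not taken from the paper) of how a host program launches a kernel and then maps a shared OpenCL buffer back to read its result; the queue, kernel and buffer are assumed to have been created elsewhere, and error handling is kept to a minimum.

    /* Sketch: keeping a CPU/GPU shared buffer coherent with OpenCL map/unmap.
     * 'queue', 'kernel' and 'buf' are assumed to be created elsewhere. */
    #include <CL/cl.h>

    void run_and_read(cl_command_queue queue, cl_kernel kernel,
                      cl_mem buf, size_t nbytes, size_t global_size)
    {
        cl_int err;

        /* Launch the kernel that writes into 'buf' on the device. */
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global_size, NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS) return;

        /* Map the buffer so the host sees the data produced by the GPU. */
        void *host_view = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                             0, nbytes, 0, NULL, NULL, &err);
        if (err != CL_SUCCESS) return;

        /* ... host code reads host_view here ... */

        /* Return ownership to the device before the next kernel launch. */
        clEnqueueUnmapMemObject(queue, buf, host_view, 0, NULL, NULL);
    }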

Coherence function calls are typically inserted by the programmer, using functions from specialized libraries (e.g. OpenCL), or by a compiler that naively inserts such calls at the entry/exit of the kernels. It is important to highlight that a non-optimal insertion of map/unmap calls can result in unnecessary coherence operations, thus hurting overall program performance. Hence, optimizing compilers targeting heterogeneous systems should provide code optimization techniques capable of performing the following tasks: (a) identify variables that can be allocated in a memory space shared between CPU and GPU, when using integrated GPUs; and (b) minimize host/device data movement through an optimized insertion of map/unmap calls, so as to keep the data used by host and device coherent.

Finding the best locations to insert map/unmap calls into source code can be cast as the Data Coherence Optimization (DCO) problem and involves: (1) identifying the blocks of code where shared variables are used by different devices (e.g. CPU or GPU); and (2) inserting map/unmap calls so as to minimize the number of data coherence operations between host and devices. Since coherence and data offloading both impact program performance, these problems are inter-dependent and should be addressed together.

The Control-Flow Graph (CFG) of Fig. 1a–b illustrates the need for DCO. In Fig. 1a, basic block B0 dispatches and executes kernel KernelGPU, which modifies a shared array A. To keep A coherent with the CPU host, a non-expert programmer, or a naive compiler, could insert a map(A) call at the end of B0, as shown in Fig. 1a. This ensures that the copy of A modified by KernelGPU is made coherent as soon as the kernel finishes, which is what is needed when the flow of execution goes through B1. On the other hand, if execution takes the program through B2, array A is not accessed and the cost of the coherence operation becomes pure overhead. To avoid that, the programmer or compiler should instead have inserted the map call on the edge that connects B0 to B1, rather than at the end of B0 (see the code sketch after the contributions list below). To address this problem, this paper makes the following contributions:

  • 1.

    Data Coherence Analysis (DCA), a pair of data-flow analyses: (a) Memory Usage Analysis (MUA), which determines which kind of operation is performed on a given variable; and (b) Device Memory Analysis (DMA), which tracks which side, host or device, performs that operation.

  • 2.

    Data Coherence Optimization (DCO), a technique that uses DCA to insert OpenCL map and unmap function calls at program points so as to minimize the number of data coherence operations required between host and device.

  • 3.

    Shared Buffer Allocation (SBA), an optimization that detects which data is allocated by the CPU and offloaded to/from an integrated GPU, and translates, whenever possible, CPU allocation calls (e.g. malloc) into OpenCL buffer-creation calls; the goal is to maximize the usage of the CPU/GPU shared memory, thus minimizing the need for coherence operations.
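
Returning to the scenario of Fig. 1, the pseudo-C sketch below (purely illustrative; launch, map, use, other_work and cond are hypothetical names) contrasts the naive placement of map(A) at the end of B0 with the placement on the B0→B1 edge that DCO aims for.

    /* Naive placement: the coherence cost is paid on every path out of B0. */
    launch(KernelGPU, A);      /* B0: the GPU kernel writes A           */
    map(A);                    /* coherence performed on BOTH paths     */
    if (cond) {
        use(A);                /* B1: the host actually reads A         */
    } else {
        other_work();          /* B2: A is never touched                */
    }

    /* Optimized placement (what DCO aims for): coherence only where needed. */
    launch(KernelGPU, A);      /* B0                                    */
    if (cond) {
        map(A);                /* only on the edge B0 -> B1             */
        use(A);                /* B1                                    */
    } else {
        other_work();          /* B2: no coherence overhead             */
    }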

The first two contributions in the list above have been published in an earlier version of this work [28]. The current version extends that preliminary report with the idea of shared buffer allocation, plus an extensive experimental evaluation. In particular, we now provide a comparative analysis of how DCO performs when applied to a newer integrated GPU with more computational power than the device we had available in 2017. In these experiments, we compared the results of the ARM Mali-T880 with its successor, the ARM Mali-G71. Furthermore, we discuss the impact of DCO when running on discrete GPUs. To this end, we evaluated our ideas on an NVIDIA Pascal Titan X, showing how DCO improves computations on discrete GPUs through the use of pinned memory.

The rest of the paper is organized as follows. Section 2 details the costs of data offloading and coherence operations in a typical heterogeneous platform. Section 3 presents an overview of the AClang compiler, an LLVM-based tool capable of automatically translating OpenMP 4.X annotated loops to OpenCL kernels. Section 4 introduces the two data-flow analyses that compose DCA (Data Coherence Analysis) and their mathematical formulation. Section 5 describes how DCA is used to design DCO (Data Coherence Optimization); DCO leverages DCA to maximize host/device shared memory usage while minimizing the need for coherence operations. The experimental evaluation is described in Section 6. Section 7 discusses related work and Section 8 concludes the paper.

Section snippets

Background

Heterogeneous computing has shown that specialized acceleration devices (e.g. GPUs) can provide significant performance improvements for a range of applications [24]. However, knowledge about the architecture of the targeted device is critical to reap the full benefits of its specialized hardware. For instance, programming a CPU/GPU platform is made difficult by the subtleties required for correct access to the memory shared between them. Fortunately, specialized high-level languages

The AClang compiler

AClang [21] is an LLVM/Clang-based compiler aimed at implementing the OpenMP Accelerator Model. It adds a new runtime library to LLVM/Clang that supports OpenMP offloading to devices like GPUs and FPGAs. Kernel functions are extracted from the OpenMP region and are dispatched as OpenCL or SPIR code to be loaded and executed by OpenCL drivers.
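
As a concrete illustration (not taken from the paper; the function and variable names are hypothetical), an OpenMP 4.X annotated loop of the kind AClang extracts into an OpenCL kernel might look like the fragment below.

    /* Illustrative OpenMP 4.X offloading region: AClang extracts the loop
     * into an OpenCL kernel and lets its runtime manage data movement. */
    void vec_add(float *a, float *b, float *c, int n)
    {
        #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }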

The AClang OpenCL runtime library has two main functionalities: (a) it hides the complexity of OpenCL code from the compiler; and (b) it provides a mapping

Data Coherence Analysis (DCA)

To perform the optimizations introduced in the previous section, we resort to a pair of inter-procedural data-flow analyses. We shall call the combination of these two techniques Data Coherence Analysis (DCA). This analysis gives us the information needed to insert map and unmap calls into programs.

The first of these data-flow algorithms is called Memory Usage Analysis (MUA), and the second is called Device Memory Analysis (DMA). Both algorithms propagate information in
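
The equations themselves are beyond this snippet, but as a rough illustration of what such a propagation might look like, the sketch below runs a simple forward pass to a fixed point over a CFG, tracking which side (host or GPU) may have last written each variable. The lattice, types and helper names are hypothetical simplifications and do not reproduce the paper's MUA/DMA formulation.

    /* Rough sketch of a forward data-flow propagation over basic blocks.
     * The per-variable state (which side may have written it) is a
     * simplification for illustration only. */
    #define MAX_VARS  32
    #define MAX_PREDS 4

    typedef enum { UNKNOWN = 0, HOST, GPU, BOTH } Side;

    typedef struct {
        int  npreds;
        int  preds[MAX_PREDS];            /* indices of predecessor blocks  */
        Side gen[MAX_VARS];               /* side writing the variable here */
        Side in[MAX_VARS], out[MAX_VARS]; /* data-flow facts                */
    } Block;

    /* Join facts arriving from different predecessors. */
    static Side join(Side a, Side b) {
        if (a == UNKNOWN) return b;
        if (b == UNKNOWN) return a;
        return (a == b) ? a : BOTH;       /* conflicting writers */
    }

    /* Iterate to a fixed point: OUT = GEN if the block writes, else IN. */
    void propagate(Block *blk, int nblocks, int nvars) {
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int b = 0; b < nblocks; b++) {
                for (int v = 0; v < nvars; v++) {
                    Side in = UNKNOWN;
                    for (int p = 0; p < blk[b].npreds; p++)
                        in = join(in, blk[blk[b].preds[p]].out[v]);
                    blk[b].in[v] = in;
                    Side out = (blk[b].gen[v] != UNKNOWN) ? blk[b].gen[v] : in;
                    if (out != blk[b].out[v]) {
                        blk[b].out[v] = out;
                        changed = 1;
                    }
                }
            }
        }
    }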

Data Coherence Optimization (DCO)

The AClang compiler leverages DCA to implement two optimization steps. The first, named Shared Buffer Allocation (SBA) (Section 5.1), seeks to maximize the usage of shared memory buffers between CPU and GPU. The second, which gives the optimization its name, Data Coherence Optimization (DCO), seeks to minimize the insertion of the map and unmap function calls required to maintain data coherence between CPU and GPU. All the transformations performed by these two steps occur
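
As a rough sketch of the SBA rewrite (illustrative only; the actual transformation is performed inside the compiler, and the context/queue handling is simplified), a host malloc can be replaced by an OpenCL buffer created with CL_MEM_ALLOC_HOST_PTR, so that host and integrated GPU work on the same storage and only map/unmap calls are needed for coherence.

    #include <CL/cl.h>

    /* Sketch of the SBA rewrite. 'ctx' and 'queue' are assumed to be a valid
     * OpenCL context and command queue created elsewhere. */
    float *allocate_shared(cl_context ctx, cl_command_queue queue,
                           size_t n, cl_mem *out_buf)
    {
        cl_int err;

        /* Before SBA the program would simply call malloc(n * sizeof(float)). */

        /* After SBA: a buffer backed by host-visible memory, shared with the GPU. */
        *out_buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                  n * sizeof(float), NULL, &err);
        if (err != CL_SUCCESS) return NULL;

        /* The host works on the same storage through a mapped pointer; before a
         * kernel launch the buffer is handed back with clEnqueueUnmapMemObject. */
        return (float *)clEnqueueMapBuffer(queue, *out_buf, CL_TRUE,
                                           CL_MAP_READ | CL_MAP_WRITE,
                                           0, n * sizeof(float),
                                           0, NULL, NULL, &err);
    }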

Experimental evaluation

AClang with DCO has been evaluated using three integrated CPU–GPU architectures: (a) a mobile Exynos 8895 with an ARM Mali-G71 MP20 GPU (32 × 850 MHz) running Android OS, v7.0 (Nougat); (b) a mobile Exynos 8890 Octa-core CPU (4 × 2.3 GHz Mongoose & 4 × 1.6 GHz Cortex-A53) integrated with an ARM Mali-T880 MP12 GPU (12 × 650 MHz) running Android OS, v6.0 (Marshmallow); and (c) a laptop with a 2.4 GHz dual-core Intel Core i5 processor integrated with an Intel Iris GPU with 40 execution units.

Related work

Previous work has shown that sharing host/device buffers in a shared-memory integrated CPU–GPU can considerably improve program performance when compared to separate CPU–GPU architectures. Nilakant et al. [12] showed that using shared buffers in an integrated CPU–GPU outperforms by 15% to 50% the same application running on a separate CPU–GPU architecture. Backes et al. [3] showed a 30% improvement in the overall execution time when running a real-time image processing application on an

Conclusion and future work

This paper described DCA, a set of data-flow analyses rooted in the observation that, by making variables used by both CPU and GPU shared, one can avoid unnecessary data offloading. Moreover, it proposed DCO, an optimization that allows variable buffer sharing between CPU and GPU. Preliminary results show that DCO indeed improves the speedup of applications with large datasets, complex algorithms, or medium-to-large kernel durations when running on integrated GPUs. As for future work, we

Acknowledgments

This work is supported by Samsung, Brazil (grant 4716.08) and FAPESP Center for Computational Engineering and Sciences, Brazil (grant 13/08293-7).


References (33)

  • Agarwal, Neha, et al.

    Selective GPU caches to eliminate CPU–GPU HW cache coherence

  • Tavares, André Luiz Camargos, et al.

    Parameterized construction of program representations for sparse dataflow analyses

  • Backes, Luna, et al.

    Experiences in speeding up computer vision applications on mobile computing platforms

  • Bastoul, C.

    Code generation in the polyhedral model is easier than you think

  • Che, Shuai, et al.

    Dymaxion: optimizing memory access patterns for heterogeneous systems

  • CUDA 8 Features Revealed. https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed, 2016. (Accessed: 02...
  • Foley, D., et al.

    Ultra-performance Pascal GPU and NVLink interconnect

    IEEE Micro

    (2017)
  • Fujii, Yusuke, et al.

    Data transfer matters for GPU computing

  • Gelado, Isaac, et al.

    An asymmetric distributed shared memory model for heterogeneous parallel systems

    ACM SIGARCH Comput. Archit. News

    (2010)
  • Jablin, Thomas B., et al.

    Dynamically managed data for CPU–GPU architectures

  • Knoop, Jens, et al.

    Lazy code motion

  • Nilakant, Karthik, et al.

    On the efficacy of APUs for heterogeneous graph computation

  • Knobe, Kathleen, et al.

    Array SSA form and its use in parallelization

  • Lee, Seyong, et al.

    OpenMP to GPGPU: a compiler framework for automatic translation and optimization

    ACM SIGPLAN Not.

    (2009)
  • Liao, C., et al.

    Early experiences with the OpenMP accelerator model

  • Mendonça, Gleison, et al.

    DawnCC: automatic annotation for data parallelism and offloading

    Rafael Sousa received the M.Sc. in computer science from the State University of Campinas UNICAMP, Brazil, in 2016. He is currently a Ph.D. student at UNICAMP. His research interests are in code optimization, heterogeneous computing, parallel computing/programming, and high performance computing.

    Marcio Pereira received the Ph.D. in computer science from the State University of Campinas UNICAMP, and University of Alberta, Canada. He is currently a Postdoctoral Researcher at the Institute of Computing, UNICAMP. He has proven experience in conducting technological innovation projects and new product development and IT services. Currently, his interests are in the fields of programming languages, dynamic compilation, virtual machines, code optimization, heterogeneous computing, and neural networks.

    Fernando Pereira received the Ph.D. in computer science from University of California, Los Angeles in 2008. He is currently an associate professor at the Department of Computer Science of the Federal University of Minas Gerais, where he coordinates the Compilers Lab. His research mission is to develop principles and techniques that help programmers to use efficiently and safely all the resources of the modern hardware. Together with his students, he published papers in important venues in the field of programming languages, and contributed to several open-source compilers, such as LLVM, GHC and Mozilla IonMonkey.

    Guido Araujo received the Ph.D. in electrical engineering from Princeton University in 1997. He is currently a full professor of computer science and engineering at UNICAMP. His current research interests are in the areas of code optimization, transactional memory, cloud computing, parallel programming, and heterogeneous computing, which are explored in close cooperation with the industry.
