Two-phase execution of binary applications on CPU/GPU machines

https://doi.org/10.1016/j.compeleceng.2014.02.002

Highlights

  • Proposes a two-phase virtual execution environment for executing binary applications on CPU/GPU architectures.

  • Presents an efficient way of extracting hot spots from sequential binary code.

  • Designs an efficient mechanism for mapping hot spots to GPUs.

Abstract

The high computational power of GPUs (Graphics Processing Units) makes them promising accelerators for general-purpose computing. However, the need for dedicated programming environments has made the usage of GPUs rather complicated, and a GPU cannot directly execute the binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases: the first extracts parallel hot spots from the sequential binary code; the second generates the hybrid executable (containing both CPU and GPU instructions) for execution. This virtual execution environment works well for any application that is run repeatedly. On a number of benchmarks, the CUDA (Compute Unified Device Architecture) code generated by GXBIT achieves about 63% of the performance of hand-tuned GPU code. GXBIT also achieves much better overall performance than the native platforms.

Introduction

As modern GPUs [1] have combined tremendous raw performance with ever-greater programmability [2], [3], [4], many researchers have employed them as co-processors to accelerate general-purpose applications. In many cases, running general-purpose applications on graphics hardware provides significant advantages over implementing them on traditional CPUs [5].

However, several issues must be resolved before GPUs can be employed to accelerate general-purpose applications:

(1) Source code rewriting: Existing environments for programming GPUs, such as CUDA [2] and Stream SDK [3], are based on dedicated programming models that differ from conventional languages such as C and C++. Programmers are therefore forced to rewrite the hot spots of the source code and transform them into a form that GPUs can execute, which is tedious even for a patient programmer.

(2) Source code partitioning: Constrained by their inherent features, not all parts of a conventional application are suitable for GPUs: only hot spots of the source code can be executed in parallel, and a nested loop with a data dependence in its loop body cannot be directly executed by a GPU either (see the sketch after this list).

(3) Coordinating CPU and GPU: Since a CPU/GPU based architecture is a heterogeneous platform, the CPU and GPU must be coordinated so that they work in harmony.

(4) Code incompatibility: Since GPU hardware evolves rapidly, code developed for one kind of GPU may be incompatible with other kinds, or with different generations of the same GPU. A much tougher problem than source code incompatibility is binary incompatibility [6]: an application statically compiled with a certain compiler may not work on a platform that lacks the specific GPU.

(5) Unavailability of source code: Much source code is withheld by commercial entities, so only the binary executable may be available; in other cases, the source code has been lost and only the binary executable remains. To provide a more universal service, this issue should also be considered.
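To make point (2) concrete, the sketch below (ours, not the paper's) contrasts a loop whose iterations are independent, and which therefore maps naturally to a CUDA kernel, with a loop whose carried dependence forces sequential execution. The names scale and prefix_sum are illustrative only.

#include <cuda_runtime.h>

// Independent iterations: each output element depends only on the inputs,
// so the loop body can run as one GPU thread per element.
__global__ void scale(const float *in, float *out, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];              // no cross-iteration dependence
}

// Carried dependence: iteration i reads the value written by iteration i-1,
// so the loop must stay sequential (or be restructured, e.g. as a parallel scan).
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; ++i)
        a[i] += a[i - 1];                       // a[i] depends on a[i-1]
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));  // unified memory keeps the demo short
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);  // parallel part on the GPU
    cudaDeviceSynchronize();
    prefix_sum(out, n);                         // sequential part stays on the CPU
    cudaFree(in); cudaFree(out);
    return 0;
}

Deciding automatically, from binary code alone, which of these two shapes a hot loop has is precisely the partitioning problem GXBIT must solve.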

To cope with the abovementioned issues, this paper proposes a virtual execution environment, GXBIT, for automatically and transparently executing general-purpose application binaries on CPU/GPU architectures. To construct a better environment and to support binary code, GXBIT employs a dynamic binary translator, CrossBit [7], as its foundation tool. However, the high overhead of dynamic binary translation conflicts with our acceleration objective. To resolve this problem, GXBIT introduces a two-phase execution model and performs the time-consuming tasks in the first phase, so that the performance of the second-phase execution can be improved. In summary, we make the following contributions: (1) a two-phase virtual execution environment for executing binary applications on CPU/GPU architectures; (2) an efficient way of extracting hot spots from sequential binary code; (3) an efficient mechanism for mapping hot spots to GPUs.
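The benefit of amortizing translation over repeated runs can be made explicit with a simple cost model; the following formulas are our illustration, not taken from the paper. Let $T_1$ be the one-time cost of the first phase (translation, profiling, and hot-spot extraction), $T_2$ the per-run cost of executing the hybrid executable, and $T_d$ the per-run cost of plain dynamic binary translation. Over $n$ runs,

\[
  T_{\mathrm{GXBIT}}(n) = T_1 + n\,T_2,
  \qquad
  T_{\mathrm{DBT}}(n) = n\,T_d,
\]

so the two-phase model wins whenever $n > T_1 / (T_d - T_2)$, assuming $T_d > T_2$, i.e., the hybrid executable outperforms pure dynamic translation per run. This is exactly why GXBIT targets applications that run repeatedly.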

The remainder of this paper is structured as follows. Section 2 reviews the related work. Section 3 overviews the background of the CUDA programming model and the dynamic binary translator CrossBit. Section 4 describes the overall architecture of GXBIT. Section 5 presents the implementation of GXBIT. Section 6 evaluates system performance. We conclude our paper in Section 7.

Section snippets

Related work

Many researchers are using GPUs as co-processors to accelerate their general-purpose applications. However, the need for dedicated programming environments has made the usage of GPUs rather complicated. To address this deficiency, several studies have been conducted. (1) Using a polyhedral compiler transformation framework: Baskaran et al. [11] develop a code transformation system that can automatically generate parallel CUDA code from sequential C input code. (2) Further improving

Background

This section provides an overview of the CUDA programming model and background information on the dynamic binary translator CrossBit, on which our GXBIT is based.
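As a concrete reminder of what the target of GXBIT's code generation looks like, the following minimal CUDA program (ours, not from the paper) shows the model's two key elements: a kernel marked __global__ that many GPU threads execute in parallel, and host code that manages device memory and launches the kernel over a grid of thread blocks.

#include <cstdio>
#include <cuda_runtime.h>

// Kernel: executed by many GPU threads in parallel, one element each.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Explicit host<->device transfers: the CPU and GPU have separate memories.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch configuration: a grid of blocks, each with 256 threads.
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);            // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}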

Overview of GXBIT

The overall workflow of GXBIT can be divided into two phases (as in Fig. 5). (1) In the first phase, GXBIT first employs a dynamic binary translator (with binary instrumentation) to collect the intermediate representation (IR) of the source binary as well as the profile information. Then, a static analyzer is triggered to extract the hot spots (with no data dependence) and related information. Finally, the intermediate code translator is called to translate the extracted hot spots to the form of

Implementation

This section describes the key techniques for implementing GXBIT, which include the intermediate representation (IR) of GXBIT, extracting the hot spots and related information, finding the hot spots with no data dependences, and the communication mechanisms between CPU and GPU. In this paper, we do not cover the process of translating the extracted hot spots into code that can be directly run on the underlying CPU platforms. For this information, the reader may refer to [20]. In that paper, it
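The excerpt does not show how GXBIT decides that a hot spot is free of data dependences, but a standard first-cut filter for affine array subscripts is the GCD test, sketched below as our illustration (C++17): for a write to A[a*i + b] and a read of A[c*i + d] in the same loop, a cross-iteration dependence can exist only if gcd(a, c) divides d - b.

#include <cstdio>
#include <numeric>   // std::gcd (C++17)

// Returns true if a dependence between A[a*i + b] and A[c*i + d] is possible.
bool mayDepend(int a, int b, int c, int d) {
    int g = std::gcd(a, c);
    if (g == 0) return b == d;      // both subscripts are constants
    return (d - b) % g == 0;        // divisibility is necessary for a solution
}

int main() {
    // A[2i] written, A[2i+1] read: gcd(2,2)=2 does not divide 1 -> independent.
    printf("A[2i] vs A[2i+1]: %s\n", mayDepend(2, 0, 2, 1) ? "maybe" : "independent");
    // A[i] written, A[i-1] read: gcd(1,1)=1 divides -1 -> dependence possible.
    printf("A[i]  vs A[i-1]:  %s\n", mayDepend(1, 0, 1, -1) ? "maybe" : "independent");
    return 0;
}

The test is conservative in one direction only: a "maybe" still requires a finer analysis, but an "independent" verdict safely admits the loop as a GPU candidate.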

Performance evaluation

In this section, we present a performance evaluation of GXBIT. Table 1 shows the hardware and software configurations of the experimental environment. We first compare the overall performance of GXBIT, the native platform, and the original version of CrossBit. Then, an experiment is carried out to inspect the effectiveness of the kernel functions generated by GXBIT. In these experiments, we use four applications: matrix multiplication and ConvolutionFFT2D are from CUDA
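The excerpt does not detail the measurement methodology, but kernel-level comparisons of this kind are typically made with CUDA events, as in the sketch below (ours; kernelUnderTest is a hypothetical stand-in for a generated kernel):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelUnderTest(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;   // stand-in workload
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));

    cudaEvent_t start, stop;                // events act as GPU-side timestamps
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernelUnderTest<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);             // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}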

Conclusions and future work

GXBIT is a virtual execution environment for automatically and transparently executing general-purpose binary applications on CPU/GPU based architectures. Under GXBIT, the control-intensive parts of the code are executed by the CPU, while the compute-intensive parts are executed by the GPU. Benefiting from the underlying technique of dynamic binary translation, GXBIT can take binary code as input. GXBIT achieves these goals by employing a two-phase execution model. By doing this, the potential

Acknowledgments

This work was supported by the academic and technical leader recruiting policy of Anhui University, the National Natural Science Foundation of China (Grant Nos. 61300169 and 60970107), the International Cooperation Program of China (Grant No. 2011DFA10850), the Major Program for Basic Research of Shanghai (Grant No. 10DZ1500200), and the Key Program for Basic Research of Shanghai (Grant No. 10511500102). We are grateful to Professor Yun Yang from Swinburne University of Technology in Australia for

References (20)

  • Erik Lindholm et al.

NVIDIA Tesla: a unified graphics and computing architecture

    IEEE Micro

    (2008)
  • NVIDIA. CUDA Programming Guide 1.1,...
  • Buck Ian, Foley Tim, Horn Daniel, Sugerman Jeremy, Fatahalian Kayvon, Houston Mike, Hanrahan Pat. Brook for GPUs:...
  • John E. Stone et al.

OpenCL: A parallel programming standard for heterogeneous computing systems

    Comput Sci Eng

    (2010)
  • John D. Owens et al.

    A survey of general-purpose computation on graphics hardware

    Comput Graph Forum

    (2007)
  • Clark Nathan, Blome Jason, Chu Michael, Mahlke Scott, Biles Stuart, Flautner Krisztian. An architecture framework for...
  • Yang Yindong, Guan Haibing, Zhu Erzhou, Yang Hongbo, Liu Bo. CrossBit: a multi-sources and multi-targets DBT. In: The...
  • Bastoul Cedric. Code generation in the polyhedral model is easier than you think. In: Proceedings of the 13th...
  • Bastoul C, Cohen A, Girbal S, Sharma S, Temam O. Putting polyhedral transformations to work. In: LCPC’16 International...
  • Bondhugula U, Hartono A, Ramanujam J, Sadayappan P. A practical automatic polyhedral parallelizer and locality...

Cited by (3)

  • Accelerating aerial image simulation using improved CPU/GPU collaborative computing

    2015, Computers and Electrical Engineering
    Citation Excerpt:

    There is a growing trend towards the institutional use of multiple computing resources (usually heterogeneous) as a sole computing resource [19]. Although CPU–GPU collaborative computing has been well studied [20–24], most of these studies ignore SIMD parallelism on the CPU. By introducing SIMD instructions, the computing power gap between the CPU and GPU can be narrowed considerably.

  • An optimized magnetostatic field solver on GPU using open computing language

    2017, Concurrency and Computation: Practice and Experience
  • A Deep Collaborative Computing Based SAR Raw Data Simulation on Multiple CPU/GPU Platform

    2017, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

Erzhou Zhu is currently a lecturer with School of Computer Science and Technology, Anhui University (Hefei, China). He received his Ph.D. degree in computer science from Shanghai Jiao Tong University (Shanghai, China) in 2012. His current research interests include, but are not limited to, program analysis, computer architecture, compiling technology, virtualization and cloud computing.

Ruhui Ma received his Ph.D. degree in computer science from Shanghai Jiao Tong University (Shanghai, China) in 2011. He is currently an Assistant Professor with the Faculty of Computer Science, Shanghai Jiao Tong University (Shanghai, China). His main research interests include virtual machines, computer architecture and compiling.

Yang Hou is a PH.D. candidate in UM-SJTU Joint Institute, Shanghai Jiao Tong University (Shanghai, China). His main research interests include virtual machines and computer architecture.

Yindong Yang received his Ph.D. degree in computer science from Shanghai Jiao Tong University (Shanghai, China) in 2012. His main research interests include virtual machines, computer architecture and compiling.

Feng Liu is currently a professor with School of Computer Science and Technology, Anhui University (Hefei, China). He received his Ph.D. degree in computer science from University of Science and Technology of China (Hefei, China) in 2003. His current research interests include computer architecture, parallel computing, and cloud computing.

Haibing Guan received his Ph.D. degree in computer science from TongJi University (Shanghai, China) in 1999. He is currently a Professor with the Faculty of Computer Science, Shanghai Jiao Tong University (Shanghai, China). His current research interests include, but are not limited to, computer architecture, compiling, virtualization and hardware/software co-design.

Reviews processed and recommended for publication to Editor-in-Chief by Associate Editor Dr. Jian Li.
