The Implementation of a High Performance GPGPU Compiler

International Journal of Parallel Programming

Abstract

In this paper we present our experience in developing an optimizing compiler for general-purpose computation on graphics processing units (GPGPU), built on Cetus, a source-to-source compiler infrastructure. The input to our compiler is a naive GPU kernel procedure: one that is functionally correct but written without any consideration for performance. The compiler applies a set of optimization techniques to the naive kernel and generates an optimized GPU kernel; it supports kernels that use either global memory or texture memory. A code transformation in the Cetus framework is called a pass, and we classify the passes used in our work into two categories: functional passes and optimization passes. The functional passes translate input kernels into the desired intermediate representation, which clearly represents memory access patterns and thread configurations. A series of optimization passes then improves the performance of the kernels by adapting them to the target GPGPU architecture. Our experiments show that the optimized code achieves very high performance, either exceeding or coming very close to that of highly fine-tuned libraries.
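The two-category pass structure described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Cetus API (Cetus itself is a Java framework). The names `KernelIR`, `collect_array_accesses`, `detect_memory_space`, `mark_for_tuning`, and `run_pipeline` are hypothetical, and the embedded matrix-multiply kernel is only an assumed example of a naive input, not one taken from the paper. The "optimization" pass is a placeholder that records a decision rather than rewriting code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class KernelIR:
    """Hypothetical stand-in for the intermediate representation of one kernel."""
    source: str
    facts: Dict[str, object] = field(default_factory=dict)  # filled in by passes
    log: List[str] = field(default_factory=list)            # order of applied passes


Pass = Callable[[KernelIR], None]


def collect_array_accesses(ir: KernelIR) -> None:
    """Functional pass (illustrative): record source lines that index an array."""
    ir.facts["accesses"] = [ln.strip() for ln in ir.source.splitlines() if "[" in ln]


def detect_memory_space(ir: KernelIR) -> None:
    """Functional pass (illustrative): note whether the kernel reads via textures."""
    ir.facts["memory"] = "texture" if "tex1Dfetch" in ir.source else "global"


def mark_for_tuning(ir: KernelIR) -> None:
    """Optimization pass (placeholder): a real pass would transform the kernel."""
    ir.facts["tuned_for"] = ir.facts.get("memory", "unknown")


def run_pipeline(ir: KernelIR, functional: List[Pass], optimization: List[Pass]) -> KernelIR:
    """Run all functional passes first, then all optimization passes."""
    for category, passes in (("functional", functional), ("optimization", optimization)):
        for p in passes:
            p(ir)
            ir.log.append(f"{category}:{p.__name__}")
    return ir


# Assumed example of a naive kernel: one output element per thread, no tuning.
naive = """
__global__ void mm(float *a, float *b, float *c, int w) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    float sum = 0;
    for (int k = 0; k < w; k++)
        sum += a[idy * w + k] * b[k * w + idx];
    c[idy * w + idx] = sum;
}
"""

result = run_pipeline(
    KernelIR(naive),
    functional=[collect_array_accesses, detect_memory_space],
    optimization=[mark_for_tuning],
)
print(result.log)
print(result.facts["memory"], len(result.facts["accesses"]))
```

Running the functional passes before any optimization pass mirrors the ordering implied by the abstract: optimization passes rely on the access-pattern and configuration facts that the functional passes make explicit.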

Acknowledgments

This work is supported by the National Science Foundation, CAREER award CCF-0968667.

Author information

Corresponding author

Correspondence to Yi Yang.

About this article

Cite this article

Yang, Y., Zhou, H. The Implementation of a High Performance GPGPU Compiler. Int J Parallel Prog 41, 768–781 (2013). https://doi.org/10.1007/s10766-012-0228-3

  • DOI: https://doi.org/10.1007/s10766-012-0228-3
