DOI: 10.1145/1735688.1735695
research-article

Compiling Python to a hybrid execution environment

Published: 14 March 2010

Abstract

A new compilation framework enables the execution of numerically intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. This compiler automatically computes the set of memory locations that need to be transferred to the GPU, and produces the correct mapping between the CPU and the GPU address spaces. Thus, the programming model implements a virtual shared address space. This framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C programming language, and jit4GPU, a just-in-time compiler from C to the AMD CAL interface. Experimental evaluation demonstrates that for some benchmarks the generated GPU code is 50 times faster than generated OpenMP code. The GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations in most cases.
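To make the idea of "computing the set of memory locations to transfer" and "mapping between address spaces" concrete, the following is a minimal sketch in Python. It is not the unPython or jit4GPU implementation, which the abstract does not describe in detail. It assumes affine array subscripts and loop bounds known at run time, and it uses a conservative bounding interval rather than an exact access set. The names `AffineAccess`, `touched_range`, `union_range`, and `cpu_to_gpu_index` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Range = Tuple[int, int]  # inclusive (lo, hi) range of flat array indices


@dataclass(frozen=True)
class AffineAccess:
    """Subscript of the form sum(coeffs[v] * v) + offset over loop variables (assumed form)."""
    coeffs: Dict[str, int]
    offset: int = 0


def touched_range(access: AffineAccess, bounds: Dict[str, Range]) -> Range:
    """Return a conservative inclusive range of indices the access may touch.

    bounds maps each loop variable to its inclusive (min, max) value.
    """
    lo = hi = access.offset
    for var, coeff in access.coeffs.items():
        vmin, vmax = bounds[var]
        # A non-negative coefficient is smallest at vmin and largest at vmax;
        # a negative coefficient is the other way around.
        lo += coeff * (vmin if coeff >= 0 else vmax)
        hi += coeff * (vmax if coeff >= 0 else vmin)
    return lo, hi


def union_range(a: Range, b: Range) -> Range:
    """Smallest single range covering both inputs (may over-approximate)."""
    return min(a[0], b[0]), max(a[1], b[1])


def cpu_to_gpu_index(cpu_index: int, region: Range) -> int:
    """Map a CPU array index to its offset inside the transferred GPU buffer."""
    lo, hi = region
    if not lo <= cpu_index <= hi:
        raise IndexError(f"index {cpu_index} is outside the transferred region {region}")
    return cpu_index - lo


# Stencil loop: for i in range(1, 99): b[i] = a[i - 1] + a[i + 1]
loop_bounds = {"i": (1, 98)}
left = touched_range(AffineAccess({"i": 1}, -1), loop_bounds)
right = touched_range(AffineAccess({"i": 1}, +1), loop_bounds)
region_a = union_range(left, right)  # part of `a` that would be copied to the GPU
```

In this sketch only the indices in `region_a` would be copied, and every subscript in the generated kernel would be rewritten through `cpu_to_gpu_index`, so the Python program keeps addressing the array as if CPU and GPU shared one address space. A bounding interval is the simplest possible choice; strided or multi-dimensional accesses would transfer fewer unused elements with a more precise descriptor.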

Published In

GPGPU-3: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
March 2010
124 pages
ISBN: 9781605589350
DOI: 10.1145/1735688
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

GPGPU-3

Acceptance Rates

Overall Acceptance Rate 57 of 129 submissions, 44%


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 1
Reflects downloads up to 16 Feb 2025

Cited By

  • (2024) APPy: Annotated Parallelism for Python on GPUs. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, pages 113-125. DOI: 10.1145/3640537.3641575. Online publication date: 17-Feb-2024
  • (2022) Py2Cy. Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 1950-1955. DOI: 10.1145/3520304.3534037. Online publication date: 9-Jul-2022
  • (2019) Better Late Than Never. Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, pages 207-218. DOI: 10.1145/3307681.3326604. Online publication date: 17-Jun-2019
  • (2016) VizGen. ACM Transactions on Graphics, 35:6, pages 1-13. DOI: 10.1145/2980179.2982403. Online publication date: 5-Dec-2016
  • (2014) Velociraptor. Proceedings of the 23rd international conference on Parallel architectures and compilation, pages 317-330. DOI: 10.1145/2628071.2628097. Online publication date: 24-Aug-2014
  • (2014) Just-in-time shape inference for array-based languages. Proceedings of ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, pages 50-55. DOI: 10.1145/2627373.2627382. Online publication date: 9-Jun-2014
  • (2014) Reducing Compiler-Inserted Instrumentation in Unified-Parallel-C Code Generation. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pages 270-277. DOI: 10.1109/SBAC-PAD.2014.34. Online publication date: 22-Oct-2014
  • (2014) Bohrium. Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 312-321. DOI: 10.1109/IPDPSW.2014.44. Online publication date: 19-May-2014
  • (2014) Transparent GPU Execution of NumPy Applications. Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 1002-1010. DOI: 10.1109/IPDPSW.2014.114. Online publication date: 19-May-2014
  • (2013) Hidp. Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 1-11. DOI: 10.1109/CGO.2013.6494994. Online publication date: 23-Feb-2013
