ABSTRACT
We present an OpenMP framework for Java that can exploit an available graphics card as an application accelerator. Managed languages such as Java and C# pose a challenge here because of their write-once-run-anywhere approach: it is impossible to know at compile time whether an accelerator or graphics card will be available at run time, or of which type.
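To illustrate the programming model, here is a minimal sketch of an OpenMP-annotated Java loop. The `//#omp` comment-directive syntax follows the JaMP convention of embedding OpenMP directives in Java comments; the exact spelling, class, and method names here are illustrative assumptions, not the framework's verified API.

```java
// Sketch of an OpenMP-style parallel loop in Java (JaMP-like syntax).
// Because the directive lives in a comment, the class still compiles
// and runs sequentially on any plain JVM.
public class VectorAdd {
    public static void add(float[] a, float[] b, float[] c) {
        //#omp parallel for
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }
}
```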
We present an execution model that dynamically analyzes the running environment to determine what hardware is attached. Based on the results, it rewrites the bytecode and generates the necessary GPGPU code on the fly.
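A hedged sketch of the dispatch idea, not the framework's actual implementation: probe at run time whether a CUDA binding is available (here by reflectively looking for JCuda, a library the paper mentions) and choose between a GPU path and a multi-threaded CPU fallback. Apart from the class name `jcuda.driver.JCudaDriver`, all names below are illustrative.

```java
import java.util.stream.IntStream;

// Illustrative run-time accelerator detection and dispatch. A real
// framework would additionally query the driver for attached devices
// and rewrite bytecode; this only shows the probe-then-dispatch shape.
public final class Dispatcher {
    // Probe once, at class-initialization time, whether a CUDA binding
    // is on the classpath.
    private static final boolean CUDA_PRESENT = probeCudaBinding();

    private static boolean probeCudaBinding() {
        try {
            Class.forName("jcuda.driver.JCudaDriver");
            return true;   // binding found; a device-count query would follow
        } catch (ClassNotFoundException e) {
            return false;  // no CUDA binding: fall back to CPU threads
        }
    }

    public static void add(float[] a, float[] b, float[] c) {
        if (CUDA_PRESENT) {
            // In the real framework, this is where the bytecode would be
            // rewritten and a CUDA kernel generated and launched (not sketched).
            System.out.println("CUDA binding detected; GPU path would run here.");
        }
        // CPU fallback: a plain multi-threaded loop.
        IntStream.range(0, c.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
    }
}
```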
Furthermore, we solve two additional problems caused by the combination of Java and CUDA. First, CUDA-capable hardware usually has little memory compared to main memory. Because Java is a pointer-free language, however, array data can transparently be stored in main memory and buffered in GPU memory. Second, CUDA requires data to be copied to and from the graphics card's memory explicitly. Since modern languages use many small objects, a naive approach would need many copy operations. This is exacerbated because Java implements multi-dimensional arrays as arrays-of-arrays, so each row of a matrix is a separate object that would have to be transferred individually. A clever copying technique and two new array packages allow for more efficient use of CUDA.
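The idea behind such an array package can be illustrated as follows; the class and its API are assumptions for illustration, not the paper's actual packages. Storing a matrix in one contiguous row-major buffer lets the runtime ship it to the GPU with a single copy instead of one copy per row.

```java
// Illustrative packed 2D array: one contiguous 1D buffer instead of
// Java's arrays-of-arrays. A single transfer then suffices for the
// whole matrix, and the runtime can buffer slices of it when GPU
// memory is scarce.
public final class FloatMatrix2D {
    private final float[] data;   // contiguous row-major storage
    private final int rows, cols;

    public FloatMatrix2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new float[rows * cols];
    }

    public float get(int r, int c)          { return data[r * cols + c]; }
    public void  set(int r, int c, float v) { data[r * cols + c] = v; }

    // Exposing the backing buffer lets a CUDA binding copy the whole
    // matrix in one operation.
    public float[] backingArray() { return data; }
}
```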