ABSTRACT
We present an OpenMP framework for Java that can exploit an available graphics card as an application accelerator. Managed languages such as Java and C# pose a challenge here because of their write-once-run-anywhere approach: it is impossible to know at compile time whether an accelerator or graphics card will be available at run time, or of which type.
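To illustrate the programming model, here is a minimal sketch of an OpenMP-annotated Java loop. The `//#omp` comment-directive syntax follows the JaMP convention of embedding OpenMP directives in Java comments; the exact spelling, class, and method names here are illustrative assumptions, not the framework's verified API.

```java
// Sketch of an OpenMP-style parallel loop in Java (JaMP-like syntax).
// Because the directive lives in a comment, the class still compiles
// and runs sequentially on any plain JVM.
public class VectorAdd {
    public static void add(float[] a, float[] b, float[] c) {
        //#omp parallel for
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }
}
```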
We present an execution model that dynamically analyzes the running environment to determine what hardware is attached. Based on the results, it rewrites the bytecode and generates the necessary GPGPU code on the fly.
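A hedged sketch of the dispatch idea, not the framework's actual implementation: probe at run time whether a CUDA binding is available (here by reflectively looking for JCuda, a library the paper mentions) and choose between a GPU path and a multi-threaded CPU fallback. Apart from the class name `jcuda.driver.JCudaDriver`, all names below are illustrative.

```java
import java.util.stream.IntStream;

// Illustrative run-time accelerator detection and dispatch. A real
// framework would additionally query the driver for attached devices
// and rewrite bytecode; this only shows the probe-then-dispatch shape.
public final class Dispatcher {
    // Probe once, at class-initialization time, whether a CUDA binding
    // is on the classpath.
    private static final boolean CUDA_PRESENT = probeCudaBinding();

    private static boolean probeCudaBinding() {
        try {
            Class.forName("jcuda.driver.JCudaDriver");
            return true;   // binding found; a device-count query would follow
        } catch (ClassNotFoundException e) {
            return false;  // no CUDA binding: fall back to CPU threads
        }
    }

    public static void add(float[] a, float[] b, float[] c) {
        if (CUDA_PRESENT) {
            // In the real framework, this is where the bytecode would be
            // rewritten and a CUDA kernel generated and launched (not sketched).
            System.out.println("CUDA binding detected; GPU path would run here.");
        }
        // CPU fallback: a plain multi-threaded loop.
        IntStream.range(0, c.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
    }
}
```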
Furthermore, we solve two additional problems caused by the combination of Java and CUDA. First, CUDA-capable hardware usually has little memory compared to main memory. Because Java is a pointer-free language, however, array data can transparently be stored in main memory and buffered in GPU memory. Second, CUDA requires data to be copied to and from the graphics card's memory explicitly. Since modern languages use many small objects, a naive approach would need many copy operations. This is exacerbated because Java implements multi-dimensional arrays as arrays-of-arrays, so each row of a matrix is a separate object that would have to be transferred individually. A clever copying technique and two new array packages allow for more efficient use of CUDA.
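The idea behind such an array package can be illustrated as follows; the class and its API are assumptions for illustration, not the paper's actual packages. Storing a matrix in one contiguous row-major buffer lets the runtime ship it to the GPU with a single copy instead of one copy per row.

```java
// Illustrative packed 2D array: one contiguous 1D buffer instead of
// Java's arrays-of-arrays. A single transfer then suffices for the
// whole matrix, and the runtime can buffer slices of it when GPU
// memory is scarce.
public final class FloatMatrix2D {
    private final float[] data;   // contiguous row-major storage
    private final int rows, cols;

    public FloatMatrix2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new float[rows * cols];
    }

    public float get(int r, int c)          { return data[r * cols + c]; }
    public void  set(int r, int c, float v) { data[r * cols + c] = v; }

    // Exposing the backing buffer lets a CUDA binding copy the whole
    // matrix in one operation.
    public float[] backingArray() { return data; }
}
```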