VAST: The Illusion of a Large Memory Space for GPUs

ABSTRACT
Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled the processing of large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, the amount of work that can be offloaded is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST automatically partitions the data-parallel workload into chunks, efficiently extracts the precise working set required by each chunk, rearranges the working set in contiguous memory space, and transforms the kernel to operate on the reorganized working set. With VAST, the programmer is responsible only for developing a data-parallel kernel in OpenCL, without concern for the physical memory limitations of individual GPUs. VAST transparently handles code generation under the constraints of the actual physical memory and improves the retargetability of OpenCL programs with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can process data of any size without program changes, achieving a 2.6x speedup over CPU execution and making the GPU a realistic alternative for large-data computation.
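To make the mechanism concrete, below is a minimal sketch, in plain C with the standard OpenCL host API, of the manual chunking boilerplate that a system like VAST automates. This is not the authors' implementation: the function name run_chunked, the assumption of a simple 1-D elementwise kernel (global id i reads in[i] and writes out[i]), and the externally chosen chunk size are all illustrative; error checks are elided for brevity.

/* Hypothetical sketch (not the VAST implementation): manually chunking a
 * 1-D data-parallel OpenCL kernel whose input exceeds GPU physical memory.
 * A system like VAST generates the equivalent of this loop automatically. */
#include <CL/cl.h>

static void run_chunked(cl_context ctx, cl_command_queue q, cl_kernel k,
                        const float *in, float *out,
                        size_t total, size_t chunk) /* chunk fits on device */
{
    cl_int err;
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                  chunk * sizeof(float), NULL, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  chunk * sizeof(float), NULL, &err);

    for (size_t base = 0; base < total; base += chunk) {
        size_t n = (total - base < chunk) ? total - base : chunk;

        /* Ship only this chunk's working set to the device. */
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n * sizeof(float),
                             in + base, 0, NULL, NULL);

        clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);

        /* Launch over this chunk only; the kernel indexes via get_global_id(0). */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

        /* Retrieve this chunk's results before the buffers are reused. */
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float),
                            out + base, 0, NULL, NULL);
    }
    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
}

Even in this simplest elementwise case, the programmer must size the chunk, split the transfers, and re-launch the kernel by hand. For kernels with non-trivial access patterns, the per-chunk working set is no longer a contiguous slice of the input, which is precisely the case the abstract describes VAST handling: extracting the exact working set, repacking it contiguously, and rewriting the kernel's address calculations to match.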