DOI: 10.1145/2628071.2628075

VAST: the illusion of a large memory space for GPUs

Published: 24 August 2014

ABSTRACT

Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled the processing of large data sets. With programming models such as OpenCL and CUDA, programmers are encouraged to offload as much data-parallel work as possible to GPUs in order to fully utilize the available resources. Unfortunately, the amount of work that can be offloaded is strictly limited by the size of the physical memory on a given GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST automatically partitions the data-parallel workload into chunks, efficiently extracts the precise working set required by each chunk, rearranges that working set into contiguous memory, and transforms the kernel to operate on the reorganized working set. With VAST, the programmer can develop a data-parallel kernel in OpenCL without concern for the physical memory limits of individual GPUs. VAST transparently handles the code generation needed to fit within the constraints of the actual physical memory, improving the retargetability of OpenCL programs at moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can compute data sets of any size without program changes, achieving a 2.6x speedup over CPU execution and making the GPU a realistic alternative for large-data computation.
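The core mechanism that VAST automates can be pictured with a small host-side example. The sketch below is not code from the paper; the kernel, buffer sizes, and structure are illustrative assumptions. It shows the manual chunked offload that VAST makes transparent: a data set larger than the GPU's physical memory is processed in GPU-sized pieces, copying each piece's working set onto the device, running the kernel on it, and copying the results back. Error handling is elided for brevity.

/* A minimal sketch (assumed example, not the VAST system) of manually
 * chunking a data set that exceeds GPU physical memory.
 * Compile with e.g.: gcc chunked.c -lOpenCL
 */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src =
    "__kernel void scale(__global float *d) {"
    "  size_t i = get_global_id(0);"
    "  d[i] = d[i] * 2.0f;"
    "}";

int main(void) {
    const size_t total = 1 << 26;   /* elements in the full data set      */
    const size_t chunk = 1 << 22;   /* elements that fit in device memory */
    float *host = malloc(total * sizeof(float));
    for (size_t i = 0; i < total; ++i) host[i] = (float)i;

    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale", NULL);

    /* Device buffer sized to one chunk, not to the whole data set. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                chunk * sizeof(float), NULL, NULL);

    /* Process one GPU-sized chunk at a time: copy the chunk's working
     * set in, launch the kernel over it, copy the results back out.
     * VAST performs this partitioning, working-set extraction, and
     * kernel rewriting automatically and transparently. */
    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? total - off : chunk;
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                             host + off, 0, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                            host + off, 0, NULL, NULL);
    }
    clFinish(q);
    printf("host[3] = %f\n", host[3]);  /* expect 6.0 */
    free(host);
    return 0;
}

For a simple element-wise kernel like this, the working set of a chunk is just the corresponding slice of the buffer. The hard cases the paper targets are kernels with non-contiguous or indirect access patterns, where the precise working set of each chunk must be extracted and gathered into contiguous device memory before launch, and the kernel must be transformed to index into the rearranged layout.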


Published in

PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation
August 2014
514 pages
ISBN: 9781450328098
DOI: 10.1145/2628071
Copyright © 2014 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

PACT '14 paper acceptance rate: 54 of 144 submissions, 38%. Overall acceptance rate: 121 of 471 submissions, 26%.
