DOI: 10.1145/2628071.2628075

VAST: the illusion of a large memory space for GPUs

Published: 24 August 2014

Abstract

Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled the processing of large data sets. With programming models such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory of the target GPU, VAST automatically partitions the data-parallel workload into chunks, efficiently extracts the precise working set required by each chunk, rearranges the working set in contiguous memory space, and transforms the kernel to operate on the reorganized working set. With VAST, the programmer can develop a data-parallel kernel in OpenCL without concern for the physical memory limitations of individual GPUs. VAST transparently handles code generation under the constraints of the actual physical memory and improves the retargetability of OpenCL programs with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can compute data of any size without program changes, achieving a 2.6x speedup over CPU execution and making it a realistic alternative for large-data computation.
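To make the abstract's pipeline concrete, the following is a minimal sketch of the core idea behind VAST-style chunked offloading, not the paper's implementation: when the input exceeds device memory, split the data-parallel workload into chunks sized to the available physical memory, extract each chunk's working set, run the kernel on it, and gather the results. All names here (DEVICE_MEM_ELEMS, run_kernel, offload_in_chunks) are hypothetical, and plain Python lists stand in for OpenCL buffers and kernel launches.

```python
# Hypothetical device capacity: pretend the GPU holds only 8 elements at once.
DEVICE_MEM_ELEMS = 8

def run_kernel(chunk):
    # Stand-in for an OpenCL kernel launch: square each element of the chunk.
    return [x * x for x in chunk]

def offload_in_chunks(data, device_mem=DEVICE_MEM_ELEMS):
    """Process `data` of any size on a device that holds `device_mem` elements.

    Mirrors the VAST steps described in the abstract: partition the workload,
    extract each chunk's working set, run the kernel, and collect the output.
    """
    out = []
    for start in range(0, len(data), device_mem):
        working_set = data[start:start + device_mem]  # precise working set for this chunk
        out.extend(run_kernel(working_set))           # "transfer", launch, and read back
    return out

result = offload_in_chunks(list(range(20)))
print(result)
```

The real system does far more (it rewrites the kernel to index into the reorganized, contiguous working set and manages transfers transparently), but the sketch shows why the programmer-visible data size can exceed the device's physical memory.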



Published In

PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation
August 2014
514 pages
ISBN:9781450328098
DOI:10.1145/2628071
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. gpu
  2. optimization
  3. virtual memory

Qualifiers

  • Research-article

Conference

PACT '14
Sponsor:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

PACT '14 Paper Acceptance Rate 54 of 144 submissions, 38%;
Overall Acceptance Rate 121 of 471 submissions, 26%


Cited By

  • (2024)IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous ComputingIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.342901035:10(1796-1809)Online publication date: Oct-2024
  • (2024)RT-Swap: Addressing GPU Memory Bottlenecks for Real-Time Multi-DNN Inference2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS)10.1109/RTAS61025.2024.00037(373-385)Online publication date: 13-May-2024
  • (2024)GrOUT: Transparent Scale-Out to Overcome UVM's Oversubscription Slowdowns2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW63119.2024.00132(696-705)Online publication date: 27-May-2024
  • (2023)An Intelligent Framework for Oversubscription Management in CPU-GPU Unified MemoryJournal of Grid Computing10.1007/s10723-023-09646-121:1Online publication date: 14-Feb-2023
  • (2021)The art of balanceProceedings of the VLDB Endowment10.14778/3476311.347637814:12(2999-3013)Online publication date: 28-Oct-2021
  • (2021)TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware SchedulingACM Transactions on Architecture and Code Optimization10.1145/349121819:1(1-23)Online publication date: 6-Dec-2021
  • (2021)IRIS: A Portable Runtime System Exploiting Multiple Heterogeneous Programming Systems2021 IEEE High Performance Extreme Computing Conference (HPEC)10.1109/HPEC49654.2021.9622873(1-8)Online publication date: 20-Sep-2021
  • (2020)Automated Partitioning of Data-Parallel Kernels using Polyhedral CompilationWorkshop Proceedings of the 49th International Conference on Parallel Processing10.1145/3409390.3409403(1-10)Online publication date: 17-Aug-2020
  • (2020)Batch-Aware Unified Memory Management in GPUs for Irregular WorkloadsProceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3373376.3378529(1357-1370)Online publication date: 9-Mar-2020
  • (2020)GPU swap-aware schedulerProceedings of the 35th Annual ACM Symposium on Applied Computing10.1145/3341105.3373866(1222-1227)Online publication date: 30-Mar-2020
