VAST: The Illusion of a Large Memory Space for GPUs

ABSTRACT
Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled the processing of large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data-parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, the amount of work that can be offloaded is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST automatically partitions the data-parallel workload into chunks, efficiently extracts the precise working set required by each chunk, rearranges the working set in contiguous memory space, and transforms the kernel to operate on the reorganized working set. With VAST, the programmer is responsible only for developing a data-parallel kernel in OpenCL, without concern for the physical memory limitations of individual GPUs. VAST transparently handles code generation under the constraints of the actual physical memory and improves the retargetability of OpenCL programs with moderate overhead. Experiments demonstrate that a real GPU, an NVIDIA GTX 760 with 2 GB of memory, can process data of any size without program changes, achieving a 2.6x speedup over CPU execution and making the GPU a realistic alternative for large-data computation.
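To make the mechanism concrete, below is a minimal sketch, in plain C with the standard OpenCL host API, of the manual chunking boilerplate that a system like VAST automates. This is not the authors' implementation: the function name run_chunked, the assumption of a simple 1-D elementwise kernel (global id i reads in[i] and writes out[i]), and the externally chosen chunk size are all illustrative; error checks are elided for brevity.

/* Hypothetical sketch (not the VAST implementation): manually chunking a
 * 1-D data-parallel OpenCL kernel whose input exceeds GPU physical memory.
 * A system like VAST generates the equivalent of this loop automatically. */
#include <CL/cl.h>

static void run_chunked(cl_context ctx, cl_command_queue q, cl_kernel k,
                        const float *in, float *out,
                        size_t total, size_t chunk) /* chunk fits on device */
{
    cl_int err;
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                  chunk * sizeof(float), NULL, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  chunk * sizeof(float), NULL, &err);

    for (size_t base = 0; base < total; base += chunk) {
        size_t n = (total - base < chunk) ? total - base : chunk;

        /* Ship only this chunk's working set to the device. */
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n * sizeof(float),
                             in + base, 0, NULL, NULL);

        clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);

        /* Launch over this chunk only; the kernel indexes via get_global_id(0). */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);

        /* Retrieve this chunk's results before the buffers are reused. */
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float),
                            out + base, 0, NULL, NULL);
    }
    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
}

Even in this simplest elementwise case, the programmer must size the chunk, split the transfers, and re-launch the kernel by hand. For kernels with non-trivial access patterns, the per-chunk working set is no longer a contiguous slice of the input, which is precisely the case the abstract describes VAST handling: extracting the exact working set, repacking it contiguously, and rewriting the kernel's address calculations to match.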