DOI: 10.1145/3148055.3148059

Victream: Computing Framework for Out-of-Core Processing on Multiple GPUs

Published: 05 December 2017

Abstract

In GPU-based data-parallel computing, processing large data requires multiple GPUs per machine to reach acceptable execution performance, and computing frameworks based on a directed acyclic graph (DAG) have made it practical to exploit multiple computing resources. However, the performance of these frameworks degrades under out-of-core processing, which often arises when large data are processed on GPUs with limited memory capacity: the GPU data input/output (I/O) required to swap data between host memory and GPU memory during execution of a user DAG typically becomes the performance bottleneck. A computing framework called Victream is proposed to overcome this drawback. Its novel scheduler combines two methods to minimize the total amount of GPU I/O spent on data swapping. First, it performs locality-aware scheduling: when selecting the next task, it picks the one that requires the least data swapping, reusing as much of the data already resident in GPU memory as possible. Second, it extends locality-aware scheduling so that GPUs can prefetch data. Prefetching data that have been swapped out of a GPU makes efficient use of the bottleneck GPU I/O resources; however, prefetching the input data of future tasks requires that those tasks be scheduled in advance, so the Victream scheduler schedules future tasks in the way that minimizes the amount of swapping I/O while enabling prefetching. Evaluation of a Victream prototype showed that it outperforms conventional frameworks by up to 117%.

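The scheduling idea in the abstract can be illustrated with a short sketch. The Python below is a minimal, hypothetical illustration of locality-aware task selection over a DAG, with a marker for where a prefetch of future tasks' inputs would be issued. It is not Victream's actual API: every name here (Task, swap_in_bytes, pick_next_task, schedule) is an assumption introduced for this example, and GPU memory capacity and eviction are deliberately elided.

```python
# Minimal sketch of locality-aware DAG scheduling with a prefetch hook.
# All names are hypothetical, not Victream's API; GPU memory capacity
# and the eviction policy are elided for brevity.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    name: str
    inputs: frozenset                                   # data blocks read
    output: str                                         # data block written
    deps: frozenset = field(default_factory=frozenset)  # predecessor tasks

def swap_in_bytes(task, resident, sizes):
    """Bytes that must be swapped host -> GPU before `task` can run,
    i.e. the sizes of its inputs not already resident in GPU memory."""
    return sum(sizes[b] for b in task.inputs if b not in resident)

def pick_next_task(ready, resident, sizes):
    """Locality-aware choice: among ready tasks, run the one whose
    missing inputs require the least swapping (maximum data reuse)."""
    return min(ready, key=lambda t: swap_in_bytes(t, resident, sizes))

def schedule(tasks, sizes):
    """Greedy locality-aware ordering of a task DAG. A real framework
    would, at the marked point, also issue asynchronous prefetches so
    host->GPU I/O overlaps with the currently executing task."""
    done, resident, order = set(), set(), []
    pending = {t.name: t for t in tasks}
    while pending:
        ready = [t for t in pending.values() if t.deps <= done]
        nxt = pick_next_task(ready, resident, sizes)
        order.append(nxt.name)
        # <- here: prefetch the inputs of the next scheduled task(s)
        #    while `nxt` executes on the GPU
        resident |= nxt.inputs | {nxt.output}
        done.add(nxt.name)
        del pending[nxt.name]
    return order

# Tiny example: after `a` runs, block "A" is resident, so `b` (which only
# needs "A") is preferred over `c` (which also needs the non-resident "C").
sizes = {"A": 4, "B": 4, "C": 8, "D": 8}
a = Task("a", frozenset(), "A")
b = Task("b", frozenset({"A"}), "B", frozenset({"a"}))
c = Task("c", frozenset({"A", "C"}), "D", frozenset({"a"}))
print(schedule([a, b, c], sizes))  # ['a', 'b', 'c']
```

Under these assumptions, the greedy choice alone captures only the first of the two methods; the paper's second contribution is deciding several future tasks ahead of time so that the prefetch at the marked point can be planned to minimize total swapping I/O.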

Cited By

  • A Unified Programming Model for Heterogeneous Computing with CPU and Accelerator Technologies. In 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), October 2019, pp. 1-4. DOI: 10.1109/CISP-BMEI48845.2019.8965946

Published In

BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
December 2017
288 pages
ISBN:9781450355490
DOI:10.1145/3148055

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. middleware
  2. operating systems
  3. scheduling

Qualifiers

  • Research-article

Conference

UCC '17

Acceptance Rates

BDCAT '17 Paper Acceptance Rate: 27 of 93 submissions, 29%
Overall Acceptance Rate: 27 of 93 submissions, 29%
