DOI: 10.1145/3148055.3148059

Victream: Computing Framework for Out-of-Core Processing on Multiple GPUs

Published: 05 December 2017

Abstract

In GPU-based data-parallel computing, processing large data requires multiple GPUs per machine to reach acceptable execution performance, and computing frameworks based on a directed acyclic graph (DAG) have made it practical to exploit multiple computing resources. However, the performance of these frameworks degrades under out-of-core processing, which often arises when large data are processed on GPUs with limited memory capacity: the GPU data input/output (I/O) required to swap data between host memory and GPU memory during execution of a user DAG typically becomes the performance bottleneck. A computing framework called Victream is proposed to overcome this drawback. Its novel scheduler combines two methods to minimize the total amount of GPU I/O spent on data swapping. First, it performs locality-aware scheduling: when selecting the next task, it picks the one that requires the least data swapping, reusing as much of the data already resident in GPU memory as possible. Second, it extends locality-aware scheduling so that GPUs can prefetch data. Prefetching data that have been swapped out of a GPU makes efficient use of the bottleneck GPU I/O resources; however, prefetching the input data of future tasks requires that those tasks be scheduled in advance, so the Victream scheduler schedules future tasks in the way that minimizes the amount of swapping I/O while enabling prefetching. Evaluation of a Victream prototype showed that it outperforms conventional frameworks by up to 117%.

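The scheduling idea in the abstract can be illustrated with a short sketch. The Python below is a minimal, hypothetical illustration of locality-aware task selection over a DAG, with a marker for where a prefetch of future tasks' inputs would be issued. It is not Victream's actual API: every name here (Task, swap_in_bytes, pick_next_task, schedule) is an assumption introduced for this example, and GPU memory capacity and eviction are deliberately elided.

```python
# Minimal sketch of locality-aware DAG scheduling with a prefetch hook.
# All names are hypothetical, not Victream's API; GPU memory capacity
# and the eviction policy are elided for brevity.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    name: str
    inputs: frozenset                                   # data blocks read
    output: str                                         # data block written
    deps: frozenset = field(default_factory=frozenset)  # predecessor tasks

def swap_in_bytes(task, resident, sizes):
    """Bytes that must be swapped host -> GPU before `task` can run,
    i.e. the sizes of its inputs not already resident in GPU memory."""
    return sum(sizes[b] for b in task.inputs if b not in resident)

def pick_next_task(ready, resident, sizes):
    """Locality-aware choice: among ready tasks, run the one whose
    missing inputs require the least swapping (maximum data reuse)."""
    return min(ready, key=lambda t: swap_in_bytes(t, resident, sizes))

def schedule(tasks, sizes):
    """Greedy locality-aware ordering of a task DAG. A real framework
    would, at the marked point, also issue asynchronous prefetches so
    host->GPU I/O overlaps with the currently executing task."""
    done, resident, order = set(), set(), []
    pending = {t.name: t for t in tasks}
    while pending:
        ready = [t for t in pending.values() if t.deps <= done]
        nxt = pick_next_task(ready, resident, sizes)
        order.append(nxt.name)
        # <- here: prefetch the inputs of the next scheduled task(s)
        #    while `nxt` executes on the GPU
        resident |= nxt.inputs | {nxt.output}
        done.add(nxt.name)
        del pending[nxt.name]
    return order

# Tiny example: after `a` runs, block "A" is resident, so `b` (which only
# needs "A") is preferred over `c` (which also needs the non-resident "C").
sizes = {"A": 4, "B": 4, "C": 8, "D": 8}
a = Task("a", frozenset(), "A")
b = Task("b", frozenset({"A"}), "B", frozenset({"a"}))
c = Task("c", frozenset({"A", "C"}), "D", frozenset({"a"}))
print(schedule([a, b, c], sizes))  # ['a', 'b', 'c']
```

Under these assumptions, the greedy choice alone captures only the first of the two methods; the paper's second contribution is deciding several future tasks ahead of time so that the prefetch at the marked point can be planned to minimize total swapping I/O.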

Cited By

  • A Unified Programming Model for Heterogeneous Computing with CPU and Accelerator Technologies. In 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), October 2019, pp. 1-4. DOI: 10.1109/CISP-BMEI48845.2019.8965946

Published In

BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
December 2017
288 pages
ISBN:9781450355490
DOI:10.1145/3148055

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. middleware
  2. operating systems
  3. scheduling

Qualifiers

  • Research-article

Conference

UCC '17

Acceptance Rates

BDCAT '17 Paper Acceptance Rate: 27 of 93 submissions, 29%
Overall Acceptance Rate: 27 of 93 submissions, 29%
