DOI: 10.1145/2628071.2628096

Memory scheduling towards high-throughput cooperative heterogeneous computing

Published: 24 August 2014

Abstract

Technology scaling enables the integration of the CPU and the GPU into a single chip for higher throughput and energy efficiency. In such a single-chip heterogeneous processor (SCHP), memory bandwidth is the most critical shared resource, requiring judicious management to maximize throughput. Previous studies on memory scheduling for SCHPs have focused on the scenario where separate applications run on the CPU and the GPU, which we denote as a multi-tasking scenario. However, another increasingly important usage scenario for SCHPs is cooperative heterogeneous computing, where a single parallel application is partitioned between the CPU and the GPU such that the overall throughput is maximized.
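As a rough illustration (not taken from the paper), cooperative heterogeneous computing can be thought of as splitting one data-parallel kernel's iteration space between the two devices by a tunable ratio; the function name, parameter names, and the 80/20 split below are hypothetical:

```python
# Hypothetical sketch: one kernel's iteration space is divided so the
# GPU and the CPU cooperate on a single application at the same time.
def partition(n_items, gpu_fraction):
    """Split [0, n_items) into a GPU chunk and a CPU chunk."""
    split = int(n_items * gpu_fraction)
    return range(0, split), range(split, n_items)

# With an 80/20 split, the GPU gets the first 800 iterations and the
# CPU the remaining 200; both chunks together cover the whole range.
gpu_part, cpu_part = partition(1000, gpu_fraction=0.8)
```

In practice the split ratio would be tuned (statically or at run time) so that both devices finish at roughly the same time.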
In previous studies of memory scheduling techniques for chip multiprocessors (CMPs) and SCHPs, the first-ready, first-come-first-serve (FR-FCFS) scheduling policy has been treated as a weak baseline because of its fairness problems. In a cooperative heterogeneous computing scenario, however, we first demonstrate that FR-FCFS actually offers nearly 10% higher throughput than two recently proposed memory scheduling techniques designed for a multi-tasking scenario. Second, based on our analysis of memory access characteristics in a cooperative heterogeneous computing scenario, we propose various optimization techniques that enhance row-buffer locality by 10%, reduce the service latency of CPU memory requests by 26%, and improve overall throughput by up to 8% compared to FR-FCFS.
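For readers unfamiliar with the baseline, the FR-FCFS policy can be sketched as follows: among pending requests, prefer those that hit the currently open DRAM row (first-ready), and break ties by age (first-come-first-serve). The request representation below is an illustrative assumption, not the paper's implementation:

```python
# Minimal sketch of the FR-FCFS memory scheduling policy.
from collections import namedtuple

Request = namedtuple("Request", ["arrival", "row"])

def fr_fcfs_pick(queue, open_row):
    """Pick the next request to service: row-buffer hits first
    (first-ready), then the oldest request (first-come-first-serve)."""
    hits = [r for r in queue if r.row == open_row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)

queue = [Request(0, row=5), Request(1, row=9), Request(2, row=9)]
# With row 9 open, the oldest row hit (arrival=1) is chosen over the
# even older row miss (arrival=0), maximizing row-buffer locality.
picked = fr_fcfs_pick(queue, open_row=9)
```

This hit-first preference is what makes FR-FCFS throughput-oriented but potentially unfair: a stream of row hits from one client can starve older requests from another.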



Published In

PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation
August 2014
514 pages
ISBN:9781450328098
DOI:10.1145/2628071
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. heterogeneous processor
  2. memory scheduling

Qualifiers

  • Research-article

Conference

PACT '14
Sponsor:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

PACT '14 paper acceptance rate: 54 of 144 submissions (38%)
Overall acceptance rate: 121 of 471 submissions (26%)

Cited By

  • (2024) "Accelerating CNN Training With Concurrent Execution of GPU and Processing-in-Memory," IEEE Access, vol. 12, pp. 160190-160204. DOI: 10.1109/ACCESS.2024.3488004
  • (2022) "Machine learning guided thermal management of Open Computing Language applications on CPU-GPU based embedded platforms," IET Computers & Digital Techniques, vol. 17, no. 1, pp. 20-28. DOI: 10.1049/cdt2.12050
  • (2021) "Power and Performance Evaluation of Memory-Intensive Applications," Energies, vol. 14, no. 14, art. 4089. DOI: 10.3390/en14144089
  • (2021) "DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks," IEEE Access, vol. 9, pp. 134457-134502. DOI: 10.1109/ACCESS.2021.3110993
  • (2020) "A memory scheduling strategy for eliminating memory access interference in heterogeneous system," The Journal of Supercomputing. DOI: 10.1007/s11227-019-03135-7
  • (2019) "Collaborative Adaptation for Energy-Efficient Heterogeneous Mobile SoCs," IEEE Transactions on Computers. DOI: 10.1109/TC.2019.2943855
  • (2019) "A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity," Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012
  • (2017) "Reliable mapping and partitioning of performance-constrained OpenCL applications on CPU-GPU MPSoCs," in Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia, pp. 78-83. DOI: 10.1145/3139315.3157088
  • (2017) "Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs," ACM Transactions on Embedded Computing Systems, vol. 16, no. 5s, pp. 1-22. DOI: 10.1145/3126548
  • (2015) "Memory Row Reuse Distance and its Role in Optimizing Application Performance," ACM SIGMETRICS Performance Evaluation Review, vol. 43, no. 1, pp. 137-149. DOI: 10.1145/2796314.2745867
