DOI: 10.1145/2628071.2628096

Memory scheduling towards high-throughput cooperative heterogeneous computing

Published: 24 August 2014

Abstract

Technology scaling enables the integration of the CPU and the GPU into a single chip for higher throughput and energy efficiency. In such a single-chip heterogeneous processor (SCHP), memory bandwidth is the most critical shared resource, requiring judicious management to maximize throughput. Previous studies on memory scheduling for SCHPs have focused on the scenario where separate applications run on the CPU and the GPU, which we denote as a multi-tasking scenario. However, another increasingly important usage scenario for SCHPs is cooperative heterogeneous computing, where a single parallel application is partitioned between the CPU and the GPU such that the overall throughput is maximized.
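As a rough illustration (not taken from the paper), cooperative heterogeneous computing can be thought of as splitting one data-parallel kernel's iteration space between the two devices by a tunable ratio; the function name, parameter names, and the 80/20 split below are hypothetical:

```python
# Hypothetical sketch: one kernel's iteration space is divided so the
# GPU and the CPU cooperate on a single application at the same time.
def partition(n_items, gpu_fraction):
    """Split [0, n_items) into a GPU chunk and a CPU chunk."""
    split = int(n_items * gpu_fraction)
    return range(0, split), range(split, n_items)

# With an 80/20 split, the GPU gets the first 800 iterations and the
# CPU the remaining 200; both chunks together cover the whole range.
gpu_part, cpu_part = partition(1000, gpu_fraction=0.8)
```

In practice the split ratio would be tuned (statically or at run time) so that both devices finish at roughly the same time.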
In previous studies of memory scheduling techniques for chip multiprocessors (CMPs) and SCHPs, the first-ready, first-come-first-serve (FR-FCFS) scheduling policy has been treated as a weak baseline because of its fairness problems. In a cooperative heterogeneous computing scenario, however, we first demonstrate that FR-FCFS actually offers nearly 10% higher throughput than two recently proposed memory scheduling techniques designed for a multi-tasking scenario. Second, based on our analysis of memory access characteristics in a cooperative heterogeneous computing scenario, we propose various optimization techniques that enhance row-buffer locality by 10%, reduce the service latency of CPU memory requests by 26%, and improve overall throughput by up to 8% compared to FR-FCFS.
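For readers unfamiliar with the baseline, the FR-FCFS policy can be sketched as follows: among pending requests, prefer those that hit the currently open DRAM row (first-ready), and break ties by age (first-come-first-serve). The request representation below is an illustrative assumption, not the paper's implementation:

```python
# Minimal sketch of the FR-FCFS memory scheduling policy.
from collections import namedtuple

Request = namedtuple("Request", ["arrival", "row"])

def fr_fcfs_pick(queue, open_row):
    """Pick the next request to service: row-buffer hits first
    (first-ready), then the oldest request (first-come-first-serve)."""
    hits = [r for r in queue if r.row == open_row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)

queue = [Request(0, row=5), Request(1, row=9), Request(2, row=9)]
# With row 9 open, the oldest row hit (arrival=1) is chosen over the
# even older row miss (arrival=0), maximizing row-buffer locality.
picked = fr_fcfs_pick(queue, open_row=9)
```

This hit-first preference is what makes FR-FCFS throughput-oriented but potentially unfair: a stream of row hits from one client can starve older requests from another.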



Published In

PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation
August 2014
514 pages
ISBN:9781450328098
DOI:10.1145/2628071
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. heterogeneous processor
  2. memory scheduling

Qualifiers

  • Research-article

Conference

PACT '14
Sponsor:
  • IFIP WG 10.3
  • SIGARCH
  • IEEE CS TCPP
  • IEEE CS TCAA

Acceptance Rates

PACT '14 paper acceptance rate: 54 of 144 submissions (38%)
Overall acceptance rate: 121 of 471 submissions (26%)

Cited By

  • (2024) "Accelerating CNN Training With Concurrent Execution of GPU and Processing-in-Memory," IEEE Access, vol. 12, pp. 160190-160204. DOI: 10.1109/ACCESS.2024.3488004
  • (2022) "Machine learning guided thermal management of Open Computing Language applications on CPU-GPU based embedded platforms," IET Computers & Digital Techniques, vol. 17, no. 1, pp. 20-28. DOI: 10.1049/cdt2.12050
  • (2021) "Power and Performance Evaluation of Memory-Intensive Applications," Energies, vol. 14, no. 14, art. 4089. DOI: 10.3390/en14144089
  • (2021) "DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks," IEEE Access, vol. 9, pp. 134457-134502. DOI: 10.1109/ACCESS.2021.3110993
  • (2020) "A memory scheduling strategy for eliminating memory access interference in heterogeneous system," The Journal of Supercomputing. DOI: 10.1007/s11227-019-03135-7
  • (2019) "Collaborative Adaptation for Energy-Efficient Heterogeneous Mobile SoCs," IEEE Transactions on Computers. DOI: 10.1109/TC.2019.2943855
  • (2019) "A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity," Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2018.11.012
  • (2017) "Reliable mapping and partitioning of performance-constrained OpenCL applications on CPU-GPU MPSoCs," in Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia, pp. 78-83. DOI: 10.1145/3139315.3157088
  • (2017) "Energy-Efficient Run-Time Mapping and Thread Partitioning of Concurrent OpenCL Applications on CPU-GPU MPSoCs," ACM Transactions on Embedded Computing Systems, vol. 16, no. 5s, pp. 1-22. DOI: 10.1145/3126548
  • (2015) "Memory Row Reuse Distance and its Role in Optimizing Application Performance," ACM SIGMETRICS Performance Evaluation Review, vol. 43, no. 1, pp. 137-149. DOI: 10.1145/2796314.2745867
