A load balancing technique for memory channels

Published: 01 October 2018

Abstract

The performance demands on memory systems continue to grow with the volume of data produced by emerging applications such as machine learning and big data analytics. As a result, High Bandwidth Memory (HBM) has been introduced in GPUs and throughput-oriented processors. HBM is a stack of multiple DRAM devices organized across a number of memory channels. Although HBM provides a large number of channels and high peak bandwidth, we observed that the channels are not evenly utilized: often only one or a few channels are highly congested, even after a hashing technique is applied to randomize the translated physical memory address.
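The abstract does not say which hash function is used. As an illustrative assumption only, the sketch below uses a simple XOR-folding hash to pick a channel, then shows that a particular access stride can still send every request to one channel. The function name `channel_index`, the 8-channel configuration, and the 64-byte line size are hypothetical choices for this example, not values taken from the paper.

```python
from collections import Counter


def channel_index(addr: int, num_channels: int = 8, line_bits: int = 6) -> int:
    """Fold the cache-line address into a channel index by XOR-ing bit groups.

    Assumed scheme for illustration; num_channels must be a power of two.
    """
    bits = num_channels.bit_length() - 1          # log2(num_channels)
    assert 1 << bits == num_channels, "num_channels must be a power of two"
    line = addr >> line_bits                      # drop the byte offset within a line
    idx = 0
    while line:
        idx ^= line & (num_channels - 1)          # XOR in the next group of bits
        line >>= bits
    return idx


# A toy stride of 9 cache lines (576 bytes): for the first 8 accesses, the two
# low bit groups of the line address are identical, so they cancel under XOR.
stride_hits = Counter(channel_index(k * 576) for k in range(8))

# Consecutive cache lines, by contrast, spread over all 8 channels.
linear_hits = Counter(channel_index(k * 64) for k in range(8))
```

Here `stride_hits` ends up with all eight requests on channel 0, while `linear_hits` has one request per channel. This is a contrived pattern, not a workload from the paper, but it shows how hashing alone may leave one channel congested.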
To solve this issue, we propose a cost-effective technique that improves load balancing across HBM channels. In the proposed memory system, a memory request destined for a busy channel can be migrated to a non-busy channel and serviced there. This request migration also reduces memory-controller stalls, because it effectively increases the depth of each controller's memory request queue. The improved load balancing of memory channels yields a 10.1% performance increase for GPGPU workloads.
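The queue-depth argument above can be sketched with a toy occupancy model. This is a minimal illustration under stated assumptions, not the paper's mechanism: the names `Channel`, `enqueue`, and `count_stalls` are hypothetical, the migration target is assumed to be the least-occupied channel, queues never drain, and the model ignores how a migrated request actually reaches its data.

```python
from collections import deque


class Channel:
    """One memory channel with a bounded request queue (assumed model)."""

    def __init__(self, depth: int):
        self.depth = depth
        self.queue = deque()

    def has_room(self) -> bool:
        return len(self.queue) < self.depth


def enqueue(channels, home, request, migrate=True):
    """Place a request; return the accepting channel index, or None on a stall."""
    if channels[home].has_room():
        channels[home].queue.append(request)
        return home
    if migrate:
        # Assumed policy: try the channel with the shortest queue.
        target = min(range(len(channels)), key=lambda i: len(channels[i].queue))
        if channels[target].has_room():
            channels[target].queue.append(request)
            return target
    return None  # every candidate queue is full: the controller stalls


def count_stalls(homes, num_channels, depth, migrate):
    """Count stalled requests for a sequence of home-channel indices."""
    channels = [Channel(depth) for _ in range(num_channels)]
    results = [enqueue(channels, h, n, migrate) for n, h in enumerate(homes)]
    return sum(r is None for r in results)


# Five back-to-back requests to channel 0, four channels, queue depth 2.
stalls_without = count_stalls([0] * 5, num_channels=4, depth=2, migrate=False)
stalls_with = count_stalls([0] * 5, num_channels=4, depth=2, migrate=True)
```

In this toy case `stalls_without` is 3 and `stalls_with` is 0: the idle queues of the other channels act as extra queue depth for the congested one, which is the effect the abstract describes.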

Cited By

  • (2021) Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 1-9. DOI: 10.1109/ICCAD51958.2021.9643473. Online publication date: 1-Nov-2021
  • (2021) HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads. High Performance Computing, 176-194. DOI: 10.1007/978-3-030-78713-4_10. Online publication date: 24-Jun-2021
  • (2020) Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units. Journal of Shanghai Jiaotong University (Science). DOI: 10.1007/s12204-020-2240-x. Online publication date: 27-Oct-2020

Published In

MEMSYS '18: Proceedings of the International Symposium on Memory Systems
October 2018
361 pages
ISBN: 9781450364751
DOI: 10.1145/3240302

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DRAM
  2. GPU
  3. HBM
  4. memory controller
  5. work stealing

Qualifiers

  • Research-article

Conference

MEMSYS '18: The International Symposium on Memory Systems
October 1-4, 2018
Alexandria, Virginia, USA
