research-article

A load balancing technique for memory channels

Authors:

Ronald G. Dreslinski,

Trevor Mudge

MEMSYS '18: Proceedings of the International Symposium on Memory Systems

Pages 55 - 66

https://doi.org/10.1145/3240302.3240306

Published: 01 October 2018

Abstract

The performance demands on memory systems continue to increase as emerging applications, such as machine learning and big data analytics, generate growing volumes of data. As a result, high bandwidth memory (HBM) has been introduced in GPUs and throughput-oriented processors. HBM is a stack of multiple DRAM devices accessed across a number of memory channels. Although HBM provides a large number of channels and high peak bandwidth, we observed that the channels are not evenly utilized: often only one or a few channels are highly congested, even after a hashing technique is applied to randomize the translated physical memory address.
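The imbalance described above can be illustrated with a minimal sketch. It assumes a generic XOR-folding channel hash, 8 channels, and a 128-byte access granularity; none of these parameters come from the paper, and the hash is a stand-in rather than the authors' function. The sketch shows that a hash which spreads sequential lines evenly can still map a particular access pattern onto a single channel.

```python
from collections import Counter

NUM_CHANNELS = 8                          # assumed channel count (power of two)
LINE_BITS = 7                             # assumed 128-byte access granularity
CH_BITS = NUM_CHANNELS.bit_length() - 1   # bits needed for a channel index

def xor_channel(addr: int) -> int:
    """Hypothetical hash: XOR-fold the bits above the line offset into a channel index."""
    x = addr >> LINE_BITS
    ch = 0
    while x:
        ch ^= x & (NUM_CHANNELS - 1)
        x >>= CH_BITS
    return ch

def channel_histogram(addrs):
    """Count how many requests land on each channel."""
    counts = Counter(xor_channel(a) for a in addrs)
    return [counts.get(c, 0) for c in range(NUM_CHANNELS)]

# 64 consecutive lines spread evenly over the channels.
sequential = [i << LINE_BITS for i in range(64)]
# A stride of 9 lines makes both 3-bit fields equal, so they cancel under XOR.
strided = [(9 * k) << LINE_BITS for k in range(8)]

print(channel_histogram(sequential))
print(channel_histogram(strided))
```

With these assumed parameters, the strided pattern sends every request to channel 0 while the other channels stay idle, which is the kind of congestion a load balancing scheme would need to relieve.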

To solve this issue, we propose a cost-effective technique to improve load balancing across HBM channels. In the proposed memory system, a memory request destined for a busy channel can be migrated to a non-busy channel and serviced there. Moreover, this request migration reduces memory controller stalls, because migration effectively increases the depth of the memory request queue in each memory controller. The improved load balancing of memory channels yields a 10.1% performance increase for GPGPU workloads.
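The queueing effect of request migration can be sketched as follows. This is a toy model under stated assumptions, not the authors' design: the class name `Channels`, the least-loaded target selection, and the queue depth are all hypothetical, and the model ignores how a migrated request actually reaches its data. It only shows why migration makes a full queue behave as if it were deeper.

```python
from collections import deque

class Channels:
    """Toy model of per-channel request queues (hypothetical, for illustration)."""

    def __init__(self, num_channels=4, queue_depth=2, migrate=True):
        self.queues = [deque() for _ in range(num_channels)]
        self.depth = queue_depth
        self.migrate = migrate
        self.stalls = 0      # requests that no queue could accept
        self.migrated = 0    # requests accepted by a non-home channel

    def enqueue(self, home, req):
        """Return the channel that accepted req, or None if the request stalls."""
        if len(self.queues[home]) < self.depth:
            self.queues[home].append(req)
            return home
        if self.migrate:
            # Assumed policy: pick the channel with the shortest queue.
            target = min(range(len(self.queues)),
                         key=lambda c: len(self.queues[c]))
            if len(self.queues[target]) < self.depth:
                self.queues[target].append(req)
                self.migrated += 1
                return target
        self.stalls += 1
        return None

# Six back-to-back requests that all hash to channel 0.
baseline = Channels(migrate=False)
base_result = [baseline.enqueue(0, r) for r in range(6)]

balanced = Channels(migrate=True)
bal_result = [balanced.enqueue(0, r) for r in range(6)]

print(baseline.stalls, balanced.stalls, balanced.migrated)
```

In this model the baseline stalls as soon as the home queue fills, whereas with migration the idle queues of the other channels absorb the burst, so the congested channel sees an effectively larger queue.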

Cited By

Asifuzzaman K, Abuelala M, Hassan M, Cazorla F. (2021). Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems. 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 1-9. Online publication date: 1-Nov-2021.
https://doi.org/10.1109/ICCAD51958.2021.9643473

Fotouhi P, Fariborz M, Proietti R, Lowe-Power J, Akella V, Yoo S. (2021). HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads. High Performance Computing, 176-194. Online publication date: 24-Jun-2021.
https://dl.acm.org/doi/10.1007/978-3-030-78713-4_10

Li B, Wei J, Guo W, Sun J. (2020). Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units. Journal of Shanghai Jiaotong University (Science). Online publication date: 27-Oct-2020.
https://doi.org/10.1007/s12204-020-2240-x

Index Terms

A load balancing technique for memory channels
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

MEMSYS '18: Proceedings of the International Symposium on Memory Systems

October 2018

361 pages

ISBN:9781450364751

DOI:10.1145/3240302

General Chair:
Bruce Jacob
University of Maryland

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MEMSYS '18

MEMSYS '18: The International Symposium on Memory Systems

October 1 - 4, 2018

Alexandria, Virginia, USA
