Research article, DOI: 10.1145/3123939.3124545

Fine-grained DRAM: energy-efficient DRAM for extreme bandwidth systems

Published: 14 October 2017

Abstract

Future GPUs and other high-performance throughput processors will require multiple TB/s of bandwidth to DRAM. Satisfying this bandwidth demand within an acceptable energy budget is a challenge in these extreme-bandwidth memory systems. We propose a new high-bandwidth DRAM architecture, Fine-Grained DRAM (FGDRAM), which improves bandwidth by 4× and improves the energy efficiency of DRAM by 2× relative to the highest-bandwidth, most energy-efficient contemporary DRAM, High Bandwidth Memory (HBM2). These benefits are in large measure achieved by partitioning the DRAM die into many independent units, called grains, each of which has a local, adjacent I/O. This approach allows the bandwidth of all the banks in the DRAM to be used simultaneously, eliminating the shared buses that interconnect the banks. Furthermore, the on-DRAM data movement energy is significantly reduced due to the much shorter wiring distance between the cell array and the local I/O. The FGDRAM architecture also lends itself to existing techniques for reducing the effective DRAM row size in an area-efficient manner, which cuts wasteful row-activation energy in applications with low locality. In addition, when FGDRAM is paired with a memory controller optimized to exploit the additional concurrency provided by the independent grains, it improves GPU system performance by 19% over an iso-bandwidth and iso-capacity future HBM baseline. Thus, this energy-efficient, high-bandwidth FGDRAM architecture addresses the needs of future extreme-bandwidth memory systems.
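To illustrate why energy per bit matters as bandwidth scales, the following is a minimal sketch of the power arithmetic implied by the 4× bandwidth and 2× energy-efficiency figures. The baseline values (1 TB/s and 4 pJ/bit) and the function name `dram_power_watts` are hypothetical placeholders chosen for round numbers; they are not taken from the paper.

```python
def dram_power_watts(bandwidth_tb_per_s: float, energy_pj_per_bit: float) -> float:
    """Power in watts needed to sustain a bandwidth at a given access energy.

    (TB/s * 1e12 B/TB * 8 bit/B) * (pJ/bit * 1e-12 J/pJ) = TB/s * 8 * pJ/bit,
    so the 1e12 and 1e-12 factors cancel.
    """
    return bandwidth_tb_per_s * 8 * energy_pj_per_bit


# Hypothetical baseline, NOT measured values from the paper.
BASE_BANDWIDTH_TB_PER_S = 1.0
BASE_ENERGY_PJ_PER_BIT = 4.0

# Baseline DRAM power at the assumed operating point.
baseline_power = dram_power_watts(BASE_BANDWIDTH_TB_PER_S, BASE_ENERGY_PJ_PER_BIT)

# 4x the bandwidth at unchanged energy per bit: power grows 4x.
naive_power = dram_power_watts(4 * BASE_BANDWIDTH_TB_PER_S, BASE_ENERGY_PJ_PER_BIT)

# 4x the bandwidth with 2x better energy efficiency (half the energy per bit).
scaled_power = dram_power_watts(4 * BASE_BANDWIDTH_TB_PER_S, BASE_ENERGY_PJ_PER_BIT / 2)

print(f"baseline: {baseline_power} W")
print(f"4x bandwidth, same pJ/bit: {naive_power} W")
print(f"4x bandwidth, half pJ/bit: {scaled_power} W")
```

Under these assumed numbers, quadrupling bandwidth at constant energy per bit quadruples DRAM power, while halving the energy per bit limits the increase to 2×. The ratios hold for any baseline, since power is linear in both inputs.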

Published In

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017
850 pages
ISBN:9781450349529
DOI:10.1145/3123939
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DRAM
  2. GPU
  3. energy-efficiency
  4. high bandwidth

Qualifiers

  • Research-article

Conference

MICRO-50

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 525
  • Downloads (Last 6 weeks): 66
Reflects downloads up to 16 Feb 2025

Cited By

  • (2024) Sectored DRAM: A Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3673653. Online publication date: 14-Jun-2024
  • (2024) FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration. ACM Transactions on Architecture and Code Optimization 21:2 (1-27). DOI: 10.1145/3649455. Online publication date: 21-May-2024
  • (2024) AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (103-119). DOI: 10.1145/3620665.3640422. Online publication date: 27-Apr-2024
  • (2024) A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (20-36). DOI: 10.1145/3620665.3640367. Online publication date: 27-Apr-2024
  • (2024) FLNA: Flexibly Accelerating Feature Learning Networks for Large-Scale Point Clouds With Efficient Dataflow Decoupling. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 32:4 (739-751). DOI: 10.1109/TVLSI.2024.3355126. Online publication date: 30-Jan-2024
  • (2024) Spartus: A 9.4 TOp/s FPGA-Based LSTM Accelerator Exploiting Spatio-Temporal Sparsity. IEEE Transactions on Neural Networks and Learning Systems 35:1 (1098-1112). DOI: 10.1109/TNNLS.2022.3180209. Online publication date: Jan-2024
  • (2024) ARTEMIS: A Mixed Analog-Stochastic In-DRAM Accelerator for Transformer Neural Networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:11 (3336-3347). DOI: 10.1109/TCAD.2024.3446719. Online publication date: Nov-2024
  • (2024) Hyena: Balancing Packing, Reuse, and Rotations for Encrypted Inference. 2024 IEEE Symposium on Security and Privacy (SP) (3091-3108). DOI: 10.1109/SP54263.2024.00107. Online publication date: 19-May-2024
  • (2024) CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure. 2024 International Symposium on Secure and Private Execution Environment Design (SEED) (119-130). DOI: 10.1109/SEED61283.2024.00022. Online publication date: 16-May-2024
  • (2024) The Environmental Cost of High Performance Computing System Simulation. 2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (289-292). DOI: 10.1109/PDP62718.2024.00048. Online publication date: 20-Mar-2024
