
Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory

Published: 31 July 2018

Abstract

Memory-intensive workloads are increasingly common on general-purpose graphics processing units (GPGPUs) and pose great challenges for GPGPU memory subsystem design. Meanwhile, with the recent development of non-volatile memory (NVM) technologies, hybrid memory combining DRAM and NVM achieves high performance, low power, and high density simultaneously, making it a promising main memory design for GPGPUs. In this article, we explore shared last-level cache management for GPGPUs with consideration of the underlying hybrid main memory. To improve overall memory subsystem performance, we exploit both the asymmetric read/write latency of the hybrid main memory architecture and the memory coalescing feature of GPGPUs. In particular, to reduce the average cost of L2 cache misses, we prioritize cache blocks from DRAM or NVM based on the observation that operations to the NVM portion of main memory have a large impact on system performance. The cache management scheme further integrates GPU memory coalescing and cache bypassing techniques to improve overall system performance. To minimize the impact of memory divergence among simultaneously executed groups of threads, we also propose a hybrid-main-memory- and warp-aware memory scheduling mechanism for GPGPUs. Experimental results show that, in the context of a hybrid main memory system, our proposed L2 cache management policy and memory scheduling mechanism improve performance by 15.69% on average for memory-intensive benchmarks (up to 29% at maximum) and reduce memory subsystem energy by 21.27% on average.
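The abstract does not spell out the replacement policy itself, but the core idea of prioritizing cache blocks by the cost of their backing memory can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's actual algorithm: the `CacheBlock` fields, the `eviction_cost` function, the relative cost constants, and the LRU tie-break are all hypothetical, chosen only to show why a dirty NVM-backed block makes a poor eviction victim.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative toy model (not the paper's policy): when choosing an eviction
# victim in an L2 set, prefer clean blocks and DRAM-backed blocks, because
# writing a dirty block back to NVM is far more expensive than to DRAM.

@dataclass
class CacheBlock:
    tag: int
    dirty: bool           # needs a write-back on eviction?
    backed_by_nvm: bool   # True if this block maps to the NVM partition
    lru_age: int          # higher = older (least recently used)

def eviction_cost(block: CacheBlock,
                  dram_write_cost: int = 1,
                  nvm_write_cost: int = 4) -> int:
    """Relative cost of evicting this block: clean blocks are free to drop;
    dirty blocks cost a write-back, which is assumed more expensive on NVM."""
    if not block.dirty:
        return 0
    return nvm_write_cost if block.backed_by_nvm else dram_write_cost

def pick_victim(cache_set: List[CacheBlock]) -> Optional[CacheBlock]:
    """Pick the block with the lowest (eviction_cost, -lru_age) rank:
    cheapest write-back first, ties broken toward the oldest block."""
    if not cache_set:
        return None
    return min(cache_set, key=lambda b: (eviction_cost(b), -b.lru_age))

# Example: the dirty NVM-backed block is passed over in favor of a clean
# block, even though the NVM block is the oldest in the set.
blocks = [
    CacheBlock(tag=0xA, dirty=True,  backed_by_nvm=True,  lru_age=9),
    CacheBlock(tag=0xB, dirty=False, backed_by_nvm=False, lru_age=3),
    CacheBlock(tag=0xC, dirty=True,  backed_by_nvm=False, lru_age=5),
]
victim = pick_victim(blocks)
```

Under these assumed costs, plain LRU would evict block `0xA` and pay an NVM write-back, whereas the cost-aware rank evicts the clean block `0xB` for free, which is the intuition behind prioritizing blocks by their backing memory.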


Cited By

  • (2024) Towards Abstraction of Heterogeneous Accelerators for HPC/AI Tasks in the Cloud. In 2024 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 151-159. DOI: 10.1109/CloudCom62794.2024.00013. Online publication date: 9 December 2024.
  • (2022) An AOA and Orientation Angle-Based Localization Algorithm for Passive RFID Tag Array. Wireless Communications & Mobile Computing, 2022. DOI: 10.1155/2022/7774166. Online publication date: 1 January 2022.
  • (2017) Effective Caching of Shortest Travel-Time Paths for Web Mapping Mashup Systems. In Web Information Systems Engineering - WISE 2017, 422-437. DOI: 10.1007/978-3-319-68783-4_29. Online publication date: 4 October 2017.

Published In

ACM Transactions on Embedded Computing Systems  Volume 17, Issue 4
July 2018
207 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3236463

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2018
Accepted: 01 May 2018
Revised: 01 January 2018
Received: 01 March 2017
Published in TECS Volume 17, Issue 4

Author Tags

  1. GPGPU
  2. NVM
  3. cache bypassing
  4. cache management
  5. hybrid memory
  6. memory scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key R&D Program of China
  • Shandong Provincial Natural Science Foundation
  • State Key Program of NSFC
  • Research and Application of Key Technology for Intelligent Dispatching and Security Early-Warning of Large Power Grid
  • Young Scholars Program of Shandong University
  • State Grid Corporation of China
  • Natural Science Foundation of China (NSFC)
