
Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory

Published: 31 July 2018

Abstract

Memory-intensive workloads are increasingly common on general-purpose graphics processing units (GPGPUs) and pose great challenges for GPGPU memory subsystem design. Meanwhile, with the recent development of non-volatile memory (NVM) technologies, hybrid memory combining DRAM and NVM achieves high performance, low power, and high density simultaneously, making it a promising main memory design for GPGPUs. In this article, we explore shared last-level cache management for GPGPUs with consideration of the underlying hybrid main memory. To improve overall memory subsystem performance, we exploit both the asymmetric read/write latency of the hybrid main memory architecture and the memory coalescing feature of GPGPUs. In particular, to reduce the average cost of L2 cache misses, we prioritize cache blocks from DRAM or NVM based on the observation that operations to the NVM portion of main memory have a large impact on system performance. The cache management scheme further integrates GPU memory coalescing and cache bypassing techniques to improve overall system performance. To minimize the impact of memory divergence among simultaneously executed groups of threads, we also propose a hybrid-main-memory- and warp-aware memory scheduling mechanism for GPGPUs. Experimental results show that, in the context of a hybrid main memory system, our proposed L2 cache management policy and memory scheduling mechanism improve performance by 15.69% on average for memory-intensive benchmarks (up to 29% at maximum) and reduce memory subsystem energy by 21.27% on average.
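The abstract does not spell out the replacement policy itself, but the core idea of prioritizing cache blocks by the cost of their backing memory can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's actual algorithm: the `CacheBlock` fields, the `eviction_cost` function, the relative cost constants, and the LRU tie-break are all hypothetical, chosen only to show why a dirty NVM-backed block makes a poor eviction victim.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative toy model (not the paper's policy): when choosing an eviction
# victim in an L2 set, prefer clean blocks and DRAM-backed blocks, because
# writing a dirty block back to NVM is far more expensive than to DRAM.

@dataclass
class CacheBlock:
    tag: int
    dirty: bool           # needs a write-back on eviction?
    backed_by_nvm: bool   # True if this block maps to the NVM partition
    lru_age: int          # higher = older (least recently used)

def eviction_cost(block: CacheBlock,
                  dram_write_cost: int = 1,
                  nvm_write_cost: int = 4) -> int:
    """Relative cost of evicting this block: clean blocks are free to drop;
    dirty blocks cost a write-back, which is assumed more expensive on NVM."""
    if not block.dirty:
        return 0
    return nvm_write_cost if block.backed_by_nvm else dram_write_cost

def pick_victim(cache_set: List[CacheBlock]) -> Optional[CacheBlock]:
    """Pick the block with the lowest (eviction_cost, -lru_age) rank:
    cheapest write-back first, ties broken toward the oldest block."""
    if not cache_set:
        return None
    return min(cache_set, key=lambda b: (eviction_cost(b), -b.lru_age))

# Example: the dirty NVM-backed block is passed over in favor of a clean
# block, even though the NVM block is the oldest in the set.
blocks = [
    CacheBlock(tag=0xA, dirty=True,  backed_by_nvm=True,  lru_age=9),
    CacheBlock(tag=0xB, dirty=False, backed_by_nvm=False, lru_age=3),
    CacheBlock(tag=0xC, dirty=True,  backed_by_nvm=False, lru_age=5),
]
victim = pick_victim(blocks)
```

Under these assumed costs, plain LRU would evict block `0xA` and pay an NVM write-back, whereas the cost-aware rank evicts the clean block `0xB` for free, which is the intuition behind prioritizing blocks by their backing memory.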


Cited By

  • (2024) Towards Abstraction of Heterogeneous Accelerators for HPC/AI Tasks in the Cloud. In 2024 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 151-159. DOI: 10.1109/CloudCom62794.2024.00013. Online publication date: 9 December 2024.
  • (2022) An AOA and Orientation Angle-Based Localization Algorithm for Passive RFID Tag Array. Wireless Communications & Mobile Computing, 2022. DOI: 10.1155/2022/7774166. Online publication date: 1 January 2022.
  • (2017) Effective Caching of Shortest Travel-Time Paths for Web Mapping Mashup Systems. In Web Information Systems Engineering - WISE 2017, 422-437. DOI: 10.1007/978-3-319-68783-4_29. Online publication date: 4 October 2017.

Published In

ACM Transactions on Embedded Computing Systems  Volume 17, Issue 4
July 2018
207 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3236463

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 July 2018
Accepted: 01 May 2018
Revised: 01 January 2018
Received: 01 March 2017
Published in TECS Volume 17, Issue 4

Author Tags

  1. GPGPU
  2. NVM
  3. cache bypassing
  4. cache management
  5. hybrid memory
  6. memory scheduling

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key R&D Program of China
  • Shandong Provincial Natural Science Foundation
  • State Key Program of NSFC
  • Research and Application of Key Technology for Intelligent Dispatching and Security Early-Warning of Large Power Grid
  • Young Scholars Program of Shandong University
  • State Grid Corporation of China
  • Natural Science Foundation of China (NSFC)
