ABSTRACT
The performance of a GPU’s external memory is increasingly critical, since a modern GPU runs thousands of concurrent threads that demand a huge volume of data. To utilize resources in the memory hierarchy more efficiently, GPUs employ a memory coalescing scheme that reduces the number of demand requests generated by a group of threads (i.e., a warp). However, memory coalescing does not work well for applications that exhibit irregular memory access patterns, so a single warp can generate multiple memory transactions. Since memory requests are serviced by different hierarchy levels and/or memory partitions, multiple outstanding requests from a single warp exhibit diverged fetch latencies. Because the execution time of a load warp is determined by its slowest memory transaction, memory latency divergence within a warp is a critical performance factor for load warps.
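The effect described above can be illustrated with a minimal sketch (the function names, the 128-byte line size, and the access patterns are illustrative assumptions, not details from the paper): coalescing merges a warp's 32 thread addresses into unique cache-line transactions, and the warp's load latency is bounded by the slowest of those transactions.

```python
# Illustrative sketch (not the paper's implementation): memory coalescing
# and warp-level latency divergence. Line size and patterns are assumptions.

def coalesce(addresses, line_size=128):
    """Merge a warp's thread addresses into unique cache-line transactions."""
    return sorted({addr // line_size for addr in addresses})

def warp_load_latency(transaction_latencies):
    """A load warp completes only when its slowest transaction returns."""
    return max(transaction_latencies)

# Regular (unit-stride) pattern: 32 threads fall into one 128-byte line.
regular = coalesce([tid * 4 for tid in range(32)])      # 1 transaction

# Irregular (scattered) pattern: each thread touches a different line.
irregular = coalesce([tid * 4096 for tid in range(32)])  # 32 transactions

print(len(regular), len(irregular))  # prints "1 32"
```

With one transaction the warp's latency is that transaction's latency; with 32 transactions serviced by different hierarchy levels and partitions, the warp stalls for the maximum of 32 diverged latencies.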
In this paper, we propose a warp-aware memory controller scheme, called Warped-MC, to mitigate memory latency divergence. Based on an in-depth analysis, we reveal that memory latency divergence within a warp is mainly caused by the GPU memory controllers. While the conventional FR-FCFS memory controller maximizes the effective bandwidth of DRAM channels, its scheduling policy can exacerbate the memory latency divergence of a warp. Warped-MC employs a warp-aware scheduling scheme to alleviate this divergence, thereby tackling the long tail of load warp execution time and improving the performance of memory-intensive applications. We implement Warped-MC on GPGPU-Sim configured as a modern GPU architecture, and our evaluation shows that Warped-MC improves the performance of memory-intensive applications by 8.9% on average, with a maximum of 45.8%.
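The contrast between bandwidth-first and warp-aware scheduling can be sketched as follows. This is a hedged illustration of the general idea, not the paper's exact algorithm: like FR-FCFS, it prefers row-buffer hits, but it breaks ties in favor of the warp with the fewest remaining pending requests, so that no single straggling request prolongs a nearly complete load warp. All names and the tie-break policy are assumptions for illustration.

```python
# Illustrative warp-aware request selection (hypothetical policy, not the
# paper's exact algorithm): prefer row-buffer hits (the "FR" part), then
# service the warp that is closest to completing its load.

from collections import Counter

def pick_request(queue, open_row):
    """queue: list of (warp_id, row) requests; open_row: the open DRAM row."""
    pending = Counter(warp for warp, _ in queue)   # outstanding requests per warp
    hits = [r for r in queue if r[1] == open_row]  # row-buffer hits first
    candidates = hits if hits else queue
    # Warp-aware tie-break: fewest remaining requests -> warp finishes soonest.
    return min(candidates, key=lambda r: pending[r[0]])

queue = [(0, 5), (1, 5), (1, 7), (1, 9)]  # warp 0 has a single pending request
print(pick_request(queue, open_row=5))    # prints "(0, 5)"
```

A pure FR-FCFS scheduler could equally pick warp 1's hit here, leaving warp 0 stalled on its last request; the warp-aware tie-break retires warp 0's load immediately without sacrificing the row-buffer hit.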
Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors