ABSTRACT
We closely examine GPU resource utilization when executing memory-intensive benchmarks. Our detailed analysis of GPU global memory accesses reveals that divergent loads can occlude the Load-Store units by quickly consuming MSHR entries. Such memory occlusion prevents other ready memory instructions from accessing the L1 data cache, eventually stalling the warp schedulers and degrading overall performance. We design memory Occlusion Aware Warp Scheduling (OAWS), which dynamically predicts the MSHR demand of divergent memory instructions and maximizes the number of concurrent warps such that their aggregate MSHR consumption stays within the MSHR capacity. The dynamic OAWS policy prevents memory occlusion and effectively leverages more MSHR entries for better GPU IPC. Experimental results show that the static and dynamic versions of OAWS achieve 36.7% and 73.1% performance improvement, respectively, over baseline GTO scheduling. In particular, dynamic OAWS outperforms MASCAR, CCWS, and SWL-Best by 70.1%, 57.8%, and 11.4%, respectively.
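The admission idea in the abstract can be illustrated with a minimal sketch: a warp is allowed to issue its load only while the aggregate predicted MSHR demand of admitted warps fits within the MSHR capacity, so divergent loads cannot exhaust the MSHRs and occlude the Load-Store unit. All names and numbers below (e.g. `MSHR_CAPACITY`, `predicted_demand`) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of MSHR-aware warp admission (not the paper's code).
# A warp's load issues only if its predicted MSHR demand fits in the
# remaining MSHR capacity; otherwise the warp is deferred this cycle.

MSHR_CAPACITY = 32  # assumed per-core MSHR entry count (illustrative)

def predicted_mshr_demand(warp):
    """Predicted MSHR entries the warp's next divergent load will consume.
    OAWS predicts this dynamically; here it is a stored estimate."""
    return warp["predicted_demand"]

def schedule(ready_warps):
    """Greedily admit warps in priority order while the aggregate predicted
    MSHR demand stays within capacity; return the admitted warps."""
    admitted, in_flight = [], 0
    for warp in ready_warps:
        demand = predicted_mshr_demand(warp)
        if in_flight + demand <= MSHR_CAPACITY:
            admitted.append(warp)
            in_flight += demand
        # Warps whose demand would overflow the MSHRs are deferred,
        # preventing memory occlusion of the Load-Store unit.
    return admitted

warps = [{"id": 0, "predicted_demand": 20},
         {"id": 1, "predicted_demand": 10},
         {"id": 2, "predicted_demand": 8},
         {"id": 3, "predicted_demand": 2}]
print([w["id"] for w in schedule(warps)])  # -> [0, 1, 3]
```

Note that warp 2 is skipped while warp 3 is admitted: deferral is per-warp, so a small later request can still use the remaining MSHR entries.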
REFERENCES
- W. Jia, K. A. Shaw, and M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors," in HPCA, 2014.
- X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, and W.-m. W. Hwu, "Adaptive Cache Bypass and Insertion for Many-core Accelerators," in MES, 2014.
- X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-m. W. Hwu, "Adaptive Cache Management for Energy-Efficient GPU Computing," in MICRO, 2014.
- B. Wang, W. Yu, X.-H. Sun, and X. Wang, "DaCache: Memory Divergence-Aware GPU Cache Management," in ICS, 2015.
- C. Li, S. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou, "Locality-Driven Dynamic GPU Cache Bypassing," in ICS, 2015.
- D. Li, Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processor. PhD thesis, University of Texas at Austin, May 2014.
- Z. Zheng, Z. Wang, and M. Lipasti, "Adaptive Cache and Concurrency Allocation on GPGPUs," Computer Architecture Letters, 2014.
- M. Khairy, M. Zahran, and A. G. Wassal, "Efficient Utilization of GPGPU Cache Hierarchy," in GPGPU, 2015.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in MICRO, 2012.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware Warp Scheduling," in MICRO, 2013.
- D. Kroft, "Lockup-free Instruction Fetch/Prefetch Cache Organization," in ISCA, 1981.
- A. E. Turner, On Replay and Hazards in Graphics Processing Units. PhD thesis, University of British Columbia, Oct. 2012.
- A. Sethia, D. A. Jamshidi, and S. A. Mahlke, "Mascar: Speeding up GPU Warps by Reducing Memory Pitstops," in HPCA, 2015.
- NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," 2009.
- E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, pp. 39--55, Mar. 2008.
- NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110," 2012.
- B. Coon, P. Mills, S. Oberman, and M. Siu, "Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators," Oct. 7, 2008. US Patent 7,434,032.
- P. Mills, J. Lindholm, B. Coon, G. Tarolli, and J. Burgess, "Scheduler in multi-threaded processor prioritizing instructions passing qualification rule," May 24, 2011. US Patent 7,949,855.
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-level Warp Scheduling," in MICRO, 2011.
- A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in ASPLOS, 2013.
- N. Brunie, S. Collange, and G. F. Diamos, "Simultaneous Branch and Warp Interweaving for Sustained GPU Performance," in ISCA, 2012.
- B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce Framework on Graphics Processors," in PACT, 2008.
- B. Wang, Z. Liu, X. Wang, and W. Yu, "Eliminating Intra-warp Conflict Misses in GPU," in DATE, 2015.
- W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in MICRO, 2007.
- C. Nugteren, G.-J. van den Braak, H. Corporaal, and H. Bal, "A Detailed GPU Cache Model Based on Reuse Distance Theory," in HPCA, 2014.
- A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in ISPASS, 2009.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IISWC, 2009.
- S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a High-Level Language Targeted to GPU Codes," in Innovative Parallel Computing, 2012.
- A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite," in GPGPU, 2010.
- J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," IMPACT Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.
- O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, "Neither More nor Less: Optimizing Thread-level Parallelism for GPGPUs," in PACT, 2013.
- M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y.-G. Cho, and S. Ryu, "Improving GPGPU Resource Utilization Through Alternative Thread Block Scheduling," in HPCA, 2014.
- W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in HPCA, 2011.
- M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors," in ISCA, 2011.
- Y. Yu, W. Xiao, X. He, H. Guo, Y. Wang, and X. Chen, "A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs," in ICS, 2015.
- W. Jia, K. A. Shaw, and M. Martonosi, "Characterizing and Improving the Use of Demand-fetched Caches in GPUs," in ICS, 2012.
- X. Xie, Y. Liang, G. Sun, and D. Chen, "An Efficient Compiler Framework for Cache Bypassing on GPUs," in ICCAD, 2013.
- N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum, "Improving Cache Management Policies Using Dynamic Reuse Distances," in MICRO, 2012.
Index Terms
- OAWS: Memory Occlusion Aware Warp Scheduling