DOI: 10.1145/2967938.2967947
Research Article · Public Access

OAWS: Memory Occlusion Aware Warp Scheduling

Published: 11 September 2016

ABSTRACT

We have closely examined GPU resource utilization when executing memory-intensive benchmarks. Our detailed analysis of GPU global memory accesses reveals that divergent loads can occlude the Load-Store units, quickly exhausting MSHR entries. Such memory occlusion prevents other ready memory instructions from accessing the L1 data cache, eventually stalling warp schedulers and degrading overall performance. We have designed memory Occlusion Aware Warp Scheduling (OAWS), which dynamically predicts the MSHR demand of divergent memory instructions and maximizes the number of concurrent warps such that their aggregate MSHR consumption stays within the MSHR capacity. Our dynamic OAWS policy prevents memory occlusion and effectively leverages MSHR entries for better GPU IPC. Experimental results show that the static and dynamic versions of OAWS achieve 36.7% and 73.1% performance improvement, respectively, compared to baseline GTO scheduling. In particular, dynamic OAWS outperforms MASCAR, CCWS, and SWL-Best by 70.1%, 57.8%, and 11.4%, respectively.
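The core idea described in the abstract — predict each divergent load's MSHR demand and admit a warp only if the aggregate predicted demand of in-flight loads fits within MSHR capacity — can be sketched as follows. This is a minimal illustrative model, not the paper's implementation; the class name, the per-PC prediction table, the pessimistic default of 32 entries (one per lane of a fully divergent warp), and the 64-entry capacity are all assumptions chosen for the sketch.

```python
class MshrAwareScheduler:
    """Illustrative sketch of occlusion-aware warp admission (not the
    paper's actual design). A warp's load may issue only if the predicted
    MSHR demand of all in-flight loads plus its own fits in the MSHRs."""

    def __init__(self, capacity=64):          # assumed: 64 MSHR entries per L1D
        self.capacity = capacity
        self.reserved = 0                     # predicted entries held by in-flight loads
        self.prediction = {}                  # per-PC predicted MSHR demand, refined online

    def predict(self, pc):
        # Pessimistic default: fully divergent load, one MSHR entry per lane.
        return self.prediction.get(pc, 32)

    def can_issue(self, pc):
        return self.reserved + self.predict(pc) <= self.capacity

    def issue(self, pc):
        need = self.predict(pc)
        self.reserved += need
        return need                           # caller hands this back on completion

    def complete(self, pc, reserved_entries, observed_misses):
        # Release the reservation and refine the predictor with the number
        # of MSHR entries the load actually consumed.
        self.reserved -= reserved_entries
        self.prediction[pc] = observed_misses

sched = MshrAwareScheduler(capacity=64)
r1 = sched.issue(0x100)                       # reserved: 32
r2 = sched.issue(0x200)                       # reserved: 64 (at capacity)
blocked = sched.can_issue(0x300)              # False: 64 + 32 would exceed 64
sched.complete(0x100, r1, observed_misses=4)  # reserved back to 32; predictor learns
ready = sched.can_issue(0x300)                # True: 32 + 32 fits
```

The dynamic flavor of OAWS refines predictions at runtime, whereas a static version would fix the per-instruction demand estimate up front; the sketch models the dynamic case by updating the per-PC table on completion.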

References

  1. W. Jia, K. A. Shaw, and M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors," in HPCA, 2014.
  2. X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, and W.-m. W. Hwu, "Adaptive Cache Bypass and Insertion for Many-core Accelerators," in MES, 2014.
  3. X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-m. W. Hwu, "Adaptive Cache Management for Energy-Efficient GPU Computing," in MICRO, 2014.
  4. B. Wang, W. Yu, X.-H. Sun, and X. Wang, "DaCache: Memory Divergence-Aware GPU Cache Management," in ICS, 2015.
  5. C. Li, S. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou, "Locality-Driven Dynamic GPU Cache Bypassing," in ICS, 2015.
  6. D. Li, Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors. PhD thesis, University of Texas at Austin, May 2014.
  7. Z. Zheng, Z. Wang, and M. Lipasti, "Adaptive Cache and Concurrency Allocation on GPGPUs," Computer Architecture Letters, 2014.
  8. M. Khairy, M. Zahran, and A. G. Wassal, "Efficient Utilization of GPGPU Cache Hierarchy," in GPGPU, 2015.
  9. T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in MICRO, 2012.
  10. T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware Warp Scheduling," in MICRO, 2013.
  11. D. Kroft, "Lockup-free Instruction Fetch/Prefetch Cache Organization," in ISCA, 1981.
  12. A. E. Turner, On Replay and Hazards in Graphics Processing Units. PhD thesis, University of British Columbia, Oct 2012.
  13. A. Sethia, D. A. Jamshidi, and S. A. Mahlke, "Mascar: Speeding up GPU Warps by Reducing Memory Pitstops," in HPCA, 2015.
  14. NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," 2009.
  15. E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, pp. 39--55, Mar. 2008.
  16. NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110," 2012.
  17. B. Coon, P. Mills, S. Oberman, and M. Siu, "Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators," US Patent 7,434,032, Oct. 7, 2008.
  18. P. Mills, J. Lindholm, B. Coon, G. Tarolli, and J. Burgess, "Scheduler in multi-threaded processor prioritizing instructions passing qualification rule," US Patent 7,949,855, May 24, 2011.
  19. V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-level Warp Scheduling," in MICRO, 2011.
  20. A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in ASPLOS, 2013.
  21. N. Brunie, S. Collange, and G. F. Diamos, "Simultaneous Branch and Warp Interweaving for Sustained GPU Performance," in ISCA, 2012.
  22. B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a MapReduce Framework on Graphics Processors," in PACT, 2008.
  23. B. Wang, Z. Liu, X. Wang, and W. Yu, "Eliminating Intra-warp Conflict Misses in GPU," in DATE, 2015.
  24. W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in MICRO, 2007.
  25. C. Nugteren, G.-J. van den Braak, H. Corporaal, and H. Bal, "A Detailed GPU Cache Model Based on Reuse Distance Theory," in HPCA, 2014.
  26. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in ISPASS, 2009.
  27. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IISWC, 2009.
  28. S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a High-Level Language Targeted to GPU Codes," in Innovative Parallel Computing, 2012.
  29. A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite," in GPGPU, 2010.
  30. J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," IMPACT Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.
  31. O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, "Neither More nor Less: Optimizing Thread-level Parallelism for GPGPUs," in PACT, 2013.
  32. M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y.-G. Cho, and S. Ryu, "Improving GPGPU Resource Utilization Through Alternative Thread Block Scheduling," in HPCA, 2014.
  33. W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in HPCA, 2011.
  34. M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors," in ISCA, 2011.
  35. Y. Yu, W. Xiao, X. He, H. Guo, Y. Wang, and X. Chen, "A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs," in ICS, 2015.
  36. W. Jia, K. A. Shaw, and M. Martonosi, "Characterizing and Improving the Use of Demand-fetched Caches in GPUs," in ICS, 2012.
  37. X. Xie, Y. Liang, G. Sun, and D. Chen, "An Efficient Compiler Framework for Cache Bypassing on GPUs," in ICCAD, 2013.
  38. N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum, "Improving Cache Management Policies Using Dynamic Reuse Distances," in MICRO, 2012.

Published in

  PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
  September 2016, 474 pages
  ISBN: 9781450341219
  DOI: 10.1145/2967938
  Copyright © 2016 ACM
  Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

  PACT '16 paper acceptance rate: 31 of 119 submissions (26%). Overall acceptance rate: 121 of 471 submissions (26%).
