ABSTRACT
We closely examine GPU resource utilization when executing memory-intensive benchmarks. Our detailed analysis of GPU global memory accesses reveals that divergent loads can occlude the Load-Store units by quickly consuming MSHR entries. Such memory occlusion prevents other ready memory instructions from accessing the L1 data cache, eventually stalling the warp schedulers and degrading overall performance. We design memory Occlusion Aware Warp Scheduling (OAWS), which dynamically predicts the MSHR demand of divergent memory instructions and maximizes the number of concurrent warps such that their aggregate MSHR consumption stays within the MSHR capacity. The dynamic OAWS policy prevents memory occlusion and effectively leverages more MSHR entries for better GPU IPC. Experimental results show that the static and dynamic versions of OAWS achieve 36.7% and 73.1% performance improvement, respectively, over baseline GTO scheduling. In particular, dynamic OAWS outperforms MASCAR, CCWS, and SWL-Best by 70.1%, 57.8%, and 11.4%, respectively.
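The admission idea in the abstract can be illustrated with a minimal sketch: a warp is allowed to issue its load only while the aggregate predicted MSHR demand of admitted warps fits within the MSHR capacity, so divergent loads cannot exhaust the MSHRs and occlude the Load-Store unit. All names and numbers below (e.g. `MSHR_CAPACITY`, `predicted_demand`) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of MSHR-aware warp admission (not the paper's code).
# A warp's load issues only if its predicted MSHR demand fits in the
# remaining MSHR capacity; otherwise the warp is deferred this cycle.

MSHR_CAPACITY = 32  # assumed per-core MSHR entry count (illustrative)

def predicted_mshr_demand(warp):
    """Predicted MSHR entries the warp's next divergent load will consume.
    OAWS predicts this dynamically; here it is a stored estimate."""
    return warp["predicted_demand"]

def schedule(ready_warps):
    """Greedily admit warps in priority order while the aggregate predicted
    MSHR demand stays within capacity; return the admitted warps."""
    admitted, in_flight = [], 0
    for warp in ready_warps:
        demand = predicted_mshr_demand(warp)
        if in_flight + demand <= MSHR_CAPACITY:
            admitted.append(warp)
            in_flight += demand
        # Warps whose demand would overflow the MSHRs are deferred,
        # preventing memory occlusion of the Load-Store unit.
    return admitted

warps = [{"id": 0, "predicted_demand": 20},
         {"id": 1, "predicted_demand": 10},
         {"id": 2, "predicted_demand": 8},
         {"id": 3, "predicted_demand": 2}]
print([w["id"] for w in schedule(warps)])  # -> [0, 1, 3]
```

Note that warp 2 is skipped while warp 3 is admitted: deferral is per-warp, so a small later request can still use the remaining MSHR entries.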
REFERENCES
- W. Jia, K. A. Shaw, and M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors," in HPCA, 2014.
- X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang, and W.-m. W. Hwu, "Adaptive Cache Bypass and Insertion for Many-core Accelerators," in MES, 2014.
- X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-m. W. Hwu, "Adaptive Cache Management for Energy-Efficient GPU Computing," in MICRO, 2014.
- B. Wang, W. Yu, X.-H. Sun, and X. Wang, "DaCache: Memory Divergence-Aware GPU Cache Management," in ICS, 2015.
- C. Li, S. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou, "Locality-Driven Dynamic GPU Cache Bypassing," in ICS, 2015.
- D. Li, Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processor. PhD thesis, University of Texas at Austin, May 2014.
- Z. Zheng, Z. Wang, and M. Lipasti, "Adaptive Cache and Concurrency Allocation on GPGPUs," Computer Architecture Letters, 2014.
- M. Khairy, M. Zahran, and A. G. Wassal, "Efficient Utilization of GPGPU Cache Hierarchy," in GPGPU, 2015.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in MICRO, 2012.
- T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Divergence-aware Warp Scheduling," in MICRO, 2013.
- D. Kroft, "Lockup-free Instruction Fetch/Prefetch Cache Organization," in ISCA, 1981.
- A. E. Turner, On Replay and Hazards in Graphics Processing Units. PhD thesis, University of British Columbia, Oct. 2012.
- A. Sethia, D. A. Jamshidi, and S. A. Mahlke, "Mascar: Speeding up GPU Warps by Reducing Memory Pitstops," in HPCA, 2015.
- NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," 2009.
- E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, pp. 39--55, Mar. 2008.
- NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110," 2012.
- B. Coon, P. Mills, S. Oberman, and M. Siu, "Tracking register usage during multithreaded processing using a scoreboard having separate memory regions and storing sequential register size indicators," Oct. 7, 2008. US Patent 7,434,032.
- P. Mills, J. Lindholm, B. Coon, G. Tarolli, and J. Burgess, "Scheduler in multi-threaded processor prioritizing instructions passing qualification rule," May 24, 2011. US Patent 7,949,855.
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt, "Improving GPU Performance via Large Warps and Two-level Warp Scheduling," in MICRO, 2011.
- A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," in ASPLOS, 2013.
- N. Brunie, S. Collange, and G. F. Diamos, "Simultaneous Branch and Warp Interweaving for Sustained GPU Performance," in ISCA, 2012.
- B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce Framework on Graphics Processors," in PACT, 2008.
- B. Wang, Z. Liu, X. Wang, and W. Yu, "Eliminating Intra-warp Conflict Misses in GPU," in DATE, 2015.
- W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," in MICRO, 2007.
- C. Nugteren, G.-J. van den Braak, H. Corporaal, and H. Bal, "A Detailed GPU Cache Model Based on Reuse Distance Theory," in HPCA, 2014.
- A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in ISPASS, 2009.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IISWC, 2009.
- S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a High-Level Language Targeted to GPU Codes," in Innovative Parallel Computing, 2012.
- A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite," in GPGPU, 2010.
- J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu, "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," IMPACT Technical Report IMPACT-12-01, University of Illinois at Urbana-Champaign, 2012.
- O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das, "Neither More nor Less: Optimizing Thread-level Parallelism for GPGPUs," in PACT, 2013.
- M. Lee, S. Song, J. Moon, J. Kim, W. Seo, Y.-G. Cho, and S. Ryu, "Improving GPGPU Resource Utilization Through Alternative Thread Block Scheduling," in HPCA, 2014.
- W. W. L. Fung and T. M. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," in HPCA, 2011.
- M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "Energy-Efficient Mechanisms for Managing Thread Context in Throughput Processors," in ISCA, 2011.
- Y. Yu, W. Xiao, X. He, H. Guo, Y. Wang, and X. Chen, "A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs," in ICS, 2015.
- W. Jia, K. A. Shaw, and M. Martonosi, "Characterizing and Improving the Use of Demand-fetched Caches in GPUs," in ICS, 2012.
- X. Xie, Y. Liang, G. Sun, and D. Chen, "An Efficient Compiler Framework for Cache Bypassing on GPUs," in ICCAD, 2013.
- N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum, "Improving Cache Management Policies Using Dynamic Reuse Distances," in MICRO, 2012.
Index Terms
- OAWS: Memory Occlusion Aware Warp Scheduling