Abstract
Warp scheduling policy has a significant impact on GPU performance, since the order in which warps are issued determines the degree of data cache locality. A greedy warp scheduling policy such as GTO outperforms a fair scheduling policy for many applications. However, when GTO is adopted, cache locality shared across multiple warps is underutilized, degrading overall performance. In this paper, we propose a dynamic selective warp scheduling technique that exploits the data locality of the workload. Inter-warp and intra-warp locality are determined from the access history of the L1 data cache, and the scheduling policy is adjusted dynamically, significantly improving performance and cache efficiency over both LRR and GTO. According to our experimental results, the proposed technique improves IPC by 19% and 3.8% over LRR and GTO, respectively.
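The selection mechanism described above can be sketched in simulator-style pseudocode. This is a minimal illustration only: the class name, counter scheme, and decision interval below are assumptions, since the abstract does not specify the paper's exact hardware counters or thresholds. The sketch classifies each L1 data cache hit as intra-warp reuse (the hitting warp last touched the line) or inter-warp reuse (another warp last touched it), and periodically selects LRR when inter-warp locality dominates and GTO otherwise.

```python
class SelectiveScheduler:
    """Hypothetical sketch: switch between LRR and GTO using
    L1D access-history (last-toucher) locality counters."""

    def __init__(self, num_warps, interval=1000):
        self.num_warps = num_warps
        self.interval = interval   # cycles between policy decisions (assumed)
        self.intra_hits = 0        # hits on lines last touched by the same warp
        self.inter_hits = 0        # hits on lines last touched by another warp
        self.policy = "GTO"        # greedy-then-oldest is the default
        self.rr_ptr = 0            # round-robin pointer used by LRR

    def record_l1d_hit(self, warp_id, last_toucher_warp):
        # Classify each L1D hit using the per-line access history.
        if warp_id == last_toucher_warp:
            self.intra_hits += 1
        else:
            self.inter_hits += 1

    def update_policy(self):
        # Inter-warp locality favours fair (LRR) scheduling, since rotating
        # warps lets them share cached lines; intra-warp locality favours
        # greedy (GTO) scheduling, which keeps one warp's lines resident.
        self.policy = "LRR" if self.inter_hits > self.intra_hits else "GTO"
        self.intra_hits = self.inter_hits = 0

    def pick_warp(self, ready_warps, current_warp):
        if self.policy == "GTO":
            # Greedy-then-oldest: keep issuing the current warp if it is
            # ready, otherwise fall back to the oldest ready warp.
            if current_warp in ready_warps:
                return current_warp
            return min(ready_warps)
        # LRR: rotate fairly over all warps, starting from the pointer.
        for i in range(self.num_warps):
            w = (self.rr_ptr + i) % self.num_warps
            if w in ready_warps:
                self.rr_ptr = (w + 1) % self.num_warps
                return w
```

In a real implementation the counters would be small saturating hardware counters sampled every `interval` cycles; the sketch only shows the decision logic.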
Acknowledgements
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF2018R1A2B6005740).
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Kim, G.B., Kim, J.M., Kim, C.H. (2019). Dynamic Selective Warp Scheduling for GPUs Using L1 Data Cache Locality Information. In: Park, J., Shen, H., Sung, Y., Tian, H. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2018. Communications in Computer and Information Science, vol 931. Springer, Singapore. https://doi.org/10.1007/978-981-13-5907-1_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-5906-4
Online ISBN: 978-981-13-5907-1