Abstract
Graphics Processing Units (GPUs), with their massively parallel architecture, have been widely used to boost the performance of both graphics and general-purpose programs, and GPGPUs have become one of the most attractive platforms for exploiting abundant thread-level parallelism. Recent GPUs employ cache hierarchies to cope with applications that have irregular memory access patterns. Unfortunately, GPU caches exhibit poor efficiency due to performance challenges such as cache contention and resource congestion, both caused by the large number of active threads in GPUs. Cache bypassing can reduce the impact of cache contention and resource congestion. In this paper, we introduce a new cache bypassing technique that makes effective bypassing decisions. In particular, the proposed mechanism employs a small memory, which can be accessed before the actual cache access, to record the tag information of the L1 data cache. Using this information, the mechanism knows the status of the L1 data cache and uses it as a hint to make cache bypassing decisions close to optimal. Our experimental results on a modern GPU platform reveal that the proposed cache bypassing technique achieves an average IPC improvement of up to 10.4%.
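The idea in the abstract can be illustrated with a minimal sketch: a shadow tag memory mirrors only the tags of the L1 data cache, so it can be probed before the actual access to predict a hit or, when a fill would evict a live line from a full set, to route the request around the cache. The class and policy below (`BypassingL1Cache`, LRU-ordered tag lists, bypass-on-full-set) are illustrative assumptions, not the authors' exact hardware mechanism.

```python
class BypassingL1Cache:
    """Toy model of an L1 data cache whose tag state is shadowed
    in a small memory consulted before the real access."""

    def __init__(self, num_sets=32, ways=4, line_bytes=128):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # Shadow tag memory: one LRU-ordered tag list per set (MRU at index 0).
        self.tags = [[] for _ in range(num_sets)]

    def access(self, addr):
        line = addr // self.line_bytes
        set_idx = line % self.num_sets
        tag = line // self.num_sets
        tag_set = self.tags[set_idx]
        if tag in tag_set:
            # Shadow tags predict a hit: access the cache normally.
            tag_set.remove(tag)
            tag_set.insert(0, tag)
            return "hit"
        if len(tag_set) == self.ways:
            # Set is full: filling would evict a resident line, so use the
            # shadow-tag status as a hint to bypass the L1 for this request.
            return "bypass"
        # Room available: miss, fill the line as usual.
        tag_set.insert(0, tag)
        return "miss-fill"
```

In a real design the shadow tag probe would overlap the pipeline stages before cache access, and the bypass condition would typically also weigh contention or reuse information rather than set occupancy alone.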
Acknowledgements
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1A2B6005740), and it was also supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00314) supervised by the IITP (Institute for Information & communications Technology Promotion).
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Do, C.T., Moon, M.G., Kim, J.M., Kim, C.H. (2019). A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs. In: Park, J., Shen, H., Sung, Y., Tian, H. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2018. Communications in Computer and Information Science, vol 931. Springer, Singapore. https://doi.org/10.1007/978-981-13-5907-1_22
DOI: https://doi.org/10.1007/978-981-13-5907-1_22
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-5906-4
Online ISBN: 978-981-13-5907-1
eBook Packages: Computer Science (R0)