
A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs

  • Conference paper

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 931))

Abstract

Graphics Processing Units (GPUs), with their massively parallel architecture, have been widely used to boost the performance of both graphics and general-purpose programs, and GPGPUs have become one of the most attractive platforms for exploiting plentiful thread-level parallelism. Recent GPUs employ cache hierarchies to handle applications with irregular memory access patterns. Unfortunately, GPU caches exhibit poor efficiency because the large number of active threads gives rise to performance challenges such as cache contention and resource congestion. Cache bypassing can reduce the impact of cache contention and resource congestion. In this paper, we introduce a new cache bypassing technique that makes effective bypassing decisions. In particular, the proposed mechanism employs a small memory, which can be accessed before the actual cache access, to record the tag information of the L1 data cache. Using this information, the mechanism can determine the status of the L1 data cache and use it as a hint to make near-optimal bypassing decisions. Our experimental results on a modern GPU platform reveal that the proposed technique achieves up to a 10.4% IPC improvement on average.
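The mechanism described above can be illustrated with a minimal sketch: a small shadow memory mirrors the L1 data cache's tag array so that a bypass decision can be made before the actual cache access. This is a hypothetical illustration under assumed parameters (set count, associativity, LRU replacement, and the `should_bypass` policy are all assumptions, not the authors' implementation):

```python
class BypassPredictor:
    """Shadow tag store consulted before the real L1D access."""

    def __init__(self, num_sets=32, ways=4, line_size=128):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One small list of tags per set, ordered LRU -> MRU.
        self.shadow_tags = [[] for _ in range(num_sets)]

    def _index_tag(self, addr):
        block = addr // self.line_size
        return block % self.num_sets, block // self.num_sets

    def should_bypass(self, addr):
        """Predict before the cache access: hit -> access cache;
        miss in a full (contended) set -> bypass the L1D."""
        set_idx, tag = self._index_tag(addr)
        tags = self.shadow_tags[set_idx]
        if tag in tags:
            return False              # predicted hit
        return len(tags) >= self.ways # predicted miss + full set

    def update(self, addr):
        """Mirror the L1D fill/LRU behaviour for non-bypassed accesses."""
        set_idx, tag = self._index_tag(addr)
        tags = self.shadow_tags[set_idx]
        if tag in tags:
            tags.remove(tag)
        elif len(tags) >= self.ways:
            tags.pop(0)               # evict the LRU tag
        tags.append(tag)              # insert at the MRU position
```

Because the shadow store holds only tags, it is much smaller than the cache itself and can be probed early in the load pipeline; requests predicted to miss in a contended set are sent directly to the lower memory level instead of polluting the L1D.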



Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1A2B6005740), and it was also supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00314) supervised by the IITP (Institute for Information & communications Technology Promotion).

Author information

Correspondence to Cheol Hong Kim.


Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Do, C.T., Moon, M.G., Kim, J.M., Kim, C.H. (2019). A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs. In: Park, J., Shen, H., Sung, Y., Tian, H. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2018. Communications in Computer and Information Science, vol 931. Springer, Singapore. https://doi.org/10.1007/978-981-13-5907-1_22


  • DOI: https://doi.org/10.1007/978-981-13-5907-1_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-5906-4

  • Online ISBN: 978-981-13-5907-1

  • eBook Packages: Computer Science (R0)
