
A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs

  • Conference paper

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 931))

Abstract

Graphics Processing Units (GPUs), with their massively parallel architecture, have been widely used to boost the performance of both graphics and general-purpose programs, and GPGPUs have become one of the most attractive platforms for exploiting plentiful thread-level parallelism. Recent GPUs employ cache hierarchies to handle applications with irregular memory access patterns. Unfortunately, GPU caches exhibit poor efficiency because the large number of active threads gives rise to performance challenges such as cache contention and resource congestion. Cache bypassing can reduce the impact of cache contention and resource congestion. In this paper, we introduce a new cache bypassing technique that makes effective bypassing decisions. In particular, the proposed mechanism employs a small memory, which can be accessed before the actual cache access, to record the tag information of the L1 data cache. Using this information, the mechanism can determine the status of the L1 data cache and use it as a hint to make near-optimal bypassing decisions. Our experimental results on a modern GPU platform reveal that the proposed technique achieves up to a 10.4% IPC improvement on average.
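The mechanism described above can be illustrated with a minimal sketch: a small shadow memory mirrors the L1 data cache's tag array so that a bypass decision can be made before the actual cache access. This is a hypothetical illustration under assumed parameters (set count, associativity, LRU replacement, and the `should_bypass` policy are all assumptions, not the authors' implementation):

```python
class BypassPredictor:
    """Shadow tag store consulted before the real L1D access."""

    def __init__(self, num_sets=32, ways=4, line_size=128):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One small list of tags per set, ordered LRU -> MRU.
        self.shadow_tags = [[] for _ in range(num_sets)]

    def _index_tag(self, addr):
        block = addr // self.line_size
        return block % self.num_sets, block // self.num_sets

    def should_bypass(self, addr):
        """Predict before the cache access: hit -> access cache;
        miss in a full (contended) set -> bypass the L1D."""
        set_idx, tag = self._index_tag(addr)
        tags = self.shadow_tags[set_idx]
        if tag in tags:
            return False              # predicted hit
        return len(tags) >= self.ways # predicted miss + full set

    def update(self, addr):
        """Mirror the L1D fill/LRU behaviour for non-bypassed accesses."""
        set_idx, tag = self._index_tag(addr)
        tags = self.shadow_tags[set_idx]
        if tag in tags:
            tags.remove(tag)
        elif len(tags) >= self.ways:
            tags.pop(0)               # evict the LRU tag
        tags.append(tag)              # insert at the MRU position
```

Because the shadow store holds only tags, it is much smaller than the cache itself and can be probed early in the load pipeline; requests predicted to miss in a contended set are sent directly to the lower memory level instead of polluting the L1D.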



Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1A2B6005740), and it was also supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00314) supervised by the IITP (Institute for Information & communications Technology Promotion).

Author information

Correspondence to Cheol Hong Kim.


Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Do, C.T., Moon, M.G., Kim, J.M., Kim, C.H. (2019). A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs. In: Park, J., Shen, H., Sung, Y., Tian, H. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2018. Communications in Computer and Information Science, vol 931. Springer, Singapore. https://doi.org/10.1007/978-981-13-5907-1_22


  • DOI: https://doi.org/10.1007/978-981-13-5907-1_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-5906-4

  • Online ISBN: 978-981-13-5907-1

  • eBook Packages: Computer Science (R0)
