A Profiling-Based Approach to Cache Partitioning of Program Data

Breiter, Sergej; Weidendorfer, Josef; Chung, Minh Thanh; Fürlinger, Karl

doi:10.1007/978-3-031-29927-8_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13798)

Included in the following conference series:

International Conference on Parallel and Distributed Computing: Applications and Technologies

Abstract

Cache efficiency is important to avoid unnecessary data transfers and to keep processors active. Cache partitioning, a technique to virtually divide a cache into multiple partitions, has become available in recent hardware. Cache partitioning can improve efficiency by isolating data with high temporal locality to avoid its early eviction before reuse. However, deciding on the partitioning is challenging, because it depends on the locality of reference. To facilitate the decision-making, we propose a profiling-based approach that measures locality, providing knowledge for cache partitioning without requiring manual code analysis. We present a profiling tool and confirm its benefits through experiments on Fujitsu’s A64FX processor, which supports the cache partitioning mechanism called sector cache. Our results show ways to optimize program codes to improve cache efficiency.
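
The isolation effect described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' profiling tool and does not model the A64FX sector cache. It assumes a fully associative LRU cache and a synthetic access trace, and all names (`LRUCache`, `make_trace`, `run_shared`, `run_partitioned`) are hypothetical. It only shows why reserving a partition for frequently reused data can keep that data from being evicted by streaming data.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache (an assumption) that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def make_trace(reused_lines, stream_lines, rounds):
    """Each round touches the same reused lines, then fresh streaming lines."""
    trace = []
    next_stream = 0
    for _ in range(rounds):
        for i in range(reused_lines):
            trace.append(("reused", i))
        for _ in range(stream_lines):
            trace.append(("stream", next_stream))
            next_stream += 1
    return trace


def run_shared(trace, capacity):
    """All data competes for one cache; returns the number of hits."""
    cache = LRUCache(capacity)
    for tag, addr in trace:
        cache.access((tag, addr))
    return cache.hits


def run_partitioned(trace, capacity, reused_share):
    """Reused data gets its own partition; returns the total number of hits."""
    parts = {
        "reused": LRUCache(reused_share),
        "stream": LRUCache(capacity - reused_share),
    }
    for tag, addr in trace:
        parts[tag].access(addr)
    return sum(p.hits for p in parts.values())


if __name__ == "__main__":
    trace = make_trace(reused_lines=4, stream_lines=8, rounds=10)
    print("shared hits:     ", run_shared(trace, capacity=8))
    print("partitioned hits:", run_partitioned(trace, capacity=8, reused_share=4))
```

With these parameters the shared cache yields 0 hits, because the 8 streaming lines of each round push out the 4 reused lines before they are touched again. The partitioned run yields 36 hits, since the reused lines miss only in the first round. Choosing `reused_share` is exactly the kind of decision that depends on the measured locality of the program data.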

Acknowledgement

This work has received funding from the European High-Performance Computing Joint Undertaking under grant agreement no. 956213 (SparCity), and the Federal Ministry of Education and Research of Germany (project number 16HPC045). Performance results have been obtained on systems in the test environment BEAST (Bavarian Energy Architecture & Software Testbed) (https://www.lrz.de/presse/ereignisse/2020-11-06_BEAST/) at the Leibniz Supercomputing Centre.

Author information

Authors and Affiliations

MNM Team, Ludwig-Maximilians-Universität München, Munich, Germany
Sergej Breiter, Minh Thanh Chung & Karl Fürlinger
Leibniz Supercomputing Centre (LRZ), Garching, Germany
Josef Weidendorfer

Corresponding author

Correspondence to Sergej Breiter.

Editor information

Editors and Affiliations

Tohoku University, Aoba-ku, Japan
Hiroyuki Takizawa
Sun Yat-sen University, Guangzhou, China
Hong Shen
The University of Tokyo, Tokyo, Japan
Toshihiro Hanawa
Seoul National University of Science and Technology, Seoul, Korea (Republic of)
Jong Hyuk Park
Griffith University, Queensland, QLD, Australia
Hui Tian
Tokyo Denki University, Tokyo, Japan
Ryusuke Egawa

About this paper

Cite this paper

Breiter, S., Weidendorfer, J., Chung, M.T., Fürlinger, K. (2023). A Profiling-Based Approach to Cache Partitioning of Program Data. In: Takizawa, H., Shen, H., Hanawa, T., Hyuk Park, J., Tian, H., Egawa, R. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2022. Lecture Notes in Computer Science, vol 13798. Springer, Cham. https://doi.org/10.1007/978-3-031-29927-8_35

DOI: https://doi.org/10.1007/978-3-031-29927-8_35
Published: 08 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-29926-1
Online ISBN: 978-3-031-29927-8
eBook Packages: Computer Science, Computer Science (R0)
