Abstract
Cache efficiency is important to avoid unnecessary data transfers and to keep processors active. Cache partitioning, a technique that virtually divides a cache into multiple partitions, has become available in recent hardware. It can improve efficiency by isolating data with high temporal locality so that the data is not evicted before it is reused. However, deciding how to partition is challenging because the best choice depends on the program's locality of reference. To facilitate this decision, we propose a profiling-based approach that measures locality and provides the knowledge needed for cache partitioning without requiring manual code analysis. We present a profiling tool and confirm its benefits through experiments on Fujitsu’s A64FX processor, which supports a cache partitioning mechanism called sector cache. Our results show ways to optimize program code to improve cache efficiency.
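To make "measuring locality" concrete, the following is a minimal sketch of one widely used locality metric, the reuse (LRU stack) distance: the number of distinct cache lines touched between two accesses to the same line. The abstract does not state how the profiling tool computes locality, so this is an illustration only. The function names, the trace format (a list of integer byte addresses), and the 64-byte line size are assumptions made for the example.

```python
from collections import OrderedDict


def reuse_distances(trace, line_size=64):
    """Return the reuse (LRU stack) distance of every access in `trace`.

    `trace` is a hypothetical input format: an iterable of integer byte
    addresses. The first access to a cache line has no previous use and is
    reported as None (a cold miss).
    """
    stack = OrderedDict()  # cache lines, least recently used first
    distances = []
    for addr in trace:
        line = addr // line_size
        if line in stack:
            # Distinct lines touched since the previous access to `line`.
            keys = list(stack)
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def would_hit(distance, capacity_lines):
    """Hit test for an idealized fully associative LRU cache.

    An access hits if fewer than `capacity_lines` distinct lines were
    touched since the previous access to the same line.
    """
    return distance is not None and distance < capacity_lines


if __name__ == "__main__":
    # Lines 0, 1, 2 are touched once, then 0 and 1 are reused.
    example = [0, 64, 128, 0, 64, 64]
    print(reuse_distances(example))
```

This linear-scan version costs time proportional to the number of distinct lines per access, which is fine for small traces but not for production profiling. Under the idealized LRU model, data whose reuse distances exceed the cache capacity only because streaming data is interleaved with it would be a natural candidate for isolation in its own partition.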
Acknowledgement
This work has received funding from the European High-Performance Computing Joint Undertaking under grant agreement no. 956213 (SparCity) and from the Federal Ministry of Education and Research of Germany (project number 16HPC045). Performance results were obtained on systems in the test environment BEAST (Bavarian Energy Architecture & Software Testbed) (https://www.lrz.de/presse/ereignisse/2020-11-06_BEAST/) at the Leibniz Supercomputing Centre.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Breiter, S., Weidendorfer, J., Chung, M.T., Fürlinger, K. (2023). A Profiling-Based Approach to Cache Partitioning of Program Data. In: Takizawa, H., Shen, H., Hanawa, T., Hyuk Park, J., Tian, H., Egawa, R. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2022. Lecture Notes in Computer Science, vol 13798. Springer, Cham. https://doi.org/10.1007/978-3-031-29927-8_35
DOI: https://doi.org/10.1007/978-3-031-29927-8_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-29926-1
Online ISBN: 978-3-031-29927-8
eBook Packages: Computer Science, Computer Science (R0)