Rapid Execution Time Estimation for Heterogeneous Memory Systems Through Differential Tracing

Denoyelle, Nicolas; Perarnau, Swann; Iskra, Kamil; Gerofi, Balazs

doi:10.1007/978-3-031-07312-0_13

Conference paper
First Online: 29 May 2022

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13289)

Abstract

As the complexity of compute nodes in high-performance computing (HPC) keeps increasing, systems equipped with heterogeneous memory devices are becoming paramount. Efficiently utilizing heterogeneous memory-based systems, however, poses significant challenges to application developers. System-software-level transparent solutions utilizing artificial intelligence and machine learning approaches, in particular nonsupervised learning-based methods such as reinforcement learning, may come to the rescue. However, such methods require rapid estimation of execution runtime as a function of the data layout across memory devices for exploring different data placement strategies, rendering architecture-level simulators impractical for this purpose.

In this paper we propose a differential tracing-based approach using memory access traces obtained by high-frequency sampling-based methods (e.g., Intel’s PEBS) on real hardware using different memory devices. We develop a runtime estimator based on such traces that provides an execution time estimate orders of magnitude faster than full-system simulators. On a number of HPC miniapplications we show that the estimator predicts runtime with an average error of 4.4% compared to measurements on real hardware.
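The abstract does not spell out the estimator, so the following is only a minimal sketch of one way a differential estimate could be formed, not the authors' method. It assumes two reference runs (all data in fast memory, all data in slow memory) split into matching phases, and linearly interpolates each phase's time by the fraction of sampled accesses that a candidate placement sends to slow memory. The names `Phase` and `estimate_runtime`, and the linear model itself, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass
class Phase:
    """One slice of the run, observed in two reference executions (assumed model)."""
    t_fast: float              # seconds with all data in fast memory
    t_slow: float              # seconds with all data in slow memory
    accesses: Dict[str, int]   # sampled access counts per data object


def estimate_runtime(phases: Iterable[Phase], in_fast: Set[str]) -> float:
    """Estimate total runtime when only the objects in `in_fast` use fast memory."""
    total = 0.0
    for p in phases:
        n = sum(p.accesses.values())
        if n == 0:
            # No sampled memory traffic: placement has no effect in this model.
            total += p.t_fast
            continue
        # Sampled accesses that would hit slow memory under this placement.
        slow = sum(c for obj, c in p.accesses.items() if obj not in in_fast)
        # Linear interpolation between the two measured phase times.
        total += p.t_fast + (slow / n) * (p.t_slow - p.t_fast)
    return total


phases = [
    Phase(t_fast=1.0, t_slow=3.0, accesses={"a": 75, "b": 25}),
    Phase(t_fast=2.0, t_slow=2.0, accesses={}),
]
print(estimate_runtime(phases, in_fast={"a"}))
```

Because each estimate is a single pass over the recorded phases, many candidate placements can be scored quickly, which is the property the abstract asks of an estimator used inside a learning loop.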

Notes

1. The exact availability of events depends on the processor’s microarchitecture.

Acknowledgment

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The material was based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This research was also supported by the JSPS KAKENHI Grant Number JP19K11993.

Author information

Authors and Affiliations

Argonne National Laboratory, Lemont, USA
Nicolas Denoyelle, Swann Perarnau & Kamil Iskra
RIKEN Center for Computational Science, Kobe, Japan
Balazs Gerofi

Corresponding author

Correspondence to Balazs Gerofi.

Editor information

Editors and Affiliations

University of Twente, Enschede, The Netherlands
Ana-Lucia Varbanescu
University of Maryland, College Park, MD, USA
Abhinav Bhatele
University of Tennessee, Knoxville, TN, USA
Piotr Luszczek
Université Paris-Saclay, Orsay, France
Baboulin Marc

About this paper

Cite this paper

Denoyelle, N., Perarnau, S., Iskra, K., Gerofi, B. (2022). Rapid Execution Time Estimation for Heterogeneous Memory Systems Through Differential Tracing. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Marc, B. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_13

DOI: https://doi.org/10.1007/978-3-031-07312-0_13
Published: 29 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07311-3
Online ISBN: 978-3-031-07312-0
eBook Packages: Computer Science, Computer Science (R0)