Rapid Execution Time Estimation for Heterogeneous Memory Systems Through Differential Tracing

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13289)

Abstract

As the complexity of compute nodes in high-performance computing (HPC) keeps increasing, systems equipped with heterogeneous memory devices are becoming paramount. Efficiently utilizing heterogeneous memory-based systems, however, poses significant challenges to application developers. System-software-level transparent solutions utilizing artificial intelligence and machine learning approaches, in particular nonsupervised learning-based methods such as reinforcement learning, may come to the rescue. However, such methods require rapid estimation of execution runtime as a function of the data layout across memory devices for exploring different data placement strategies, rendering architecture-level simulators impractical for this purpose.

In this paper we propose a differential tracing-based approach that uses memory access traces obtained by high-frequency sampling-based methods (e.g., Intel’s PEBS) on real hardware using different memory devices. We develop a runtime estimator based on such traces that provides an execution time estimate orders of magnitude faster than full-system simulators. On a number of HPC miniapplications we show that the estimator predicts runtime with an average error of 4.4% compared to measurements on real hardware.
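To make the idea of estimating runtime "as a function of the data layout" concrete, the following is a minimal sketch and not the estimator from the paper. It assumes two hypothetical reference traces of the same application, one with all data in a fast memory and one with all data in a slow memory, split into matching phases with sampled per-region access counts. It further assumes that, within a phase, the slowdown grows linearly with the share of sampled accesses that land in the slow memory. All names (`Phase`, `estimate_runtime`, `relative_error`) are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Phase:
    """One interval of a sampled trace (hypothetical structure)."""
    time_fast: float          # seconds measured with all data in the fast memory
    time_slow: float          # seconds measured with all data in the slow memory
    accesses: Dict[str, int]  # sampled access counts per data region


def estimate_runtime(phases: List[Phase], in_slow: Set[str]) -> float:
    """Estimate total runtime when the regions in `in_slow` live in slow memory.

    Assumption (not from the paper): per phase, the slowdown is linear in the
    fraction of sampled accesses that hit regions placed in the slow memory.
    """
    total = 0.0
    for p in phases:
        n = sum(p.accesses.values())
        slow = sum(c for region, c in p.accesses.items() if region in in_slow)
        slow_share = slow / n if n else 0.0
        total += p.time_fast + slow_share * (p.time_slow - p.time_fast)
    return total


def relative_error(estimate: float, measured: float) -> float:
    """Relative error of an estimate against a measured runtime."""
    return abs(estimate - measured) / measured


if __name__ == "__main__":
    trace = [
        Phase(time_fast=1.0, time_slow=3.0, accesses={"a": 75, "b": 25}),
        Phase(time_fast=2.0, time_slow=2.5, accesses={"a": 0, "b": 10}),
    ]
    # Place only region "b" in the slow memory and estimate the runtime.
    print(estimate_runtime(trace, in_slow={"b"}))
```

Because each estimate is a single pass over the phase list, many candidate layouts can be scored quickly, which is the property a placement-exploring learner would need; the linear interpolation itself is only a stand-in for whatever model the trace data actually supports.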

Notes

  1. The exact availability of events depends on the processor’s microarchitecture.

Acknowledgment

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The material was based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This research was also supported by the JSPS KAKENHI Grant Number JP19K11993.

Author information

Corresponding author

Correspondence to Balazs Gerofi.

Copyright information

© 2022 Balazs Gerofi and UChicago Argonne, LLC, Operator of Argonne National Laboratory, under exclusive license to Springer Nature Switzerland AG, part of Springer Nature

About this paper

Cite this paper

Denoyelle, N., Perarnau, S., Iskra, K., Gerofi, B. (2022). Rapid Execution Time Estimation for Heterogeneous Memory Systems Through Differential Tracing. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Marc, B. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_13

  • DOI: https://doi.org/10.1007/978-3-031-07312-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07311-3

  • Online ISBN: 978-3-031-07312-0

  • eBook Packages: Computer Science, Computer Science (R0)
