ABSTRACT
HPC application developers and administrators need to understand the complex interplay between compute clusters and storage systems to make effective optimization decisions. Ad hoc investigations of this interplay based on isolated case studies can lead to conclusions that are incorrect or difficult to generalize. The I/O Trace Initiative aims to improve the scientific community's understanding of I/O operations by building a searchable, collaborative archive of I/O traces from a wide range of applications and machines, with a focus on high-performance computing and scalable AI/ML. The initiative makes I/O trace data more accessible by enabling users to locate and compare traces based on user-specified criteria. It also provides a visual analytics platform for in-depth analysis, paving the way for advanced performance optimization techniques. By acting as a hub for trace data, the initiative fosters collaborative research through data sharing and collective learning.
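The criteria-based search the abstract describes can be sketched in miniature. The snippet below filters a small collection of trace metadata records by user-specified criteria; the field names (`app`, `machine`, `api`, `agg_perf_mb_s`) and the sample records are illustrative assumptions, not the initiative's actual schema or data.

```python
# Hypothetical sketch of criteria-based trace search, assuming a simple
# flat metadata schema. The fields and records below are invented for
# illustration only.
from typing import Any

TRACES: list[dict[str, Any]] = [
    {"app": "hacc_io", "machine": "cori", "api": "MPI-IO", "agg_perf_mb_s": 12000},
    {"app": "vpic_io", "machine": "cori", "api": "HDF5", "agg_perf_mb_s": 8500},
    {"app": "resnet50", "machine": "summit", "api": "POSIX", "agg_perf_mb_s": 3100},
]

def search_traces(criteria: dict[str, Any]) -> list[dict[str, Any]]:
    """Return traces whose metadata matches every user-specified criterion."""
    return [t for t in TRACES if all(t.get(k) == v for k, v in criteria.items())]

# Locate all MPI-IO traces collected on one machine.
matches = search_traces({"machine": "cori", "api": "MPI-IO"})
print([t["app"] for t in matches])
```

In a production archive this exact-match filter would be replaced by a full-text and range-query engine (the paper's bibliography points to Elasticsearch as one such backend), but the user-facing contract is the same: a set of criteria in, a set of matching traces out.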