ABSTRACT
HPC application developers and administrators need to understand the complex interplay between compute clusters and storage systems to make effective optimization decisions. Ad hoc investigations of this interplay based on isolated case studies can lead to conclusions that are incorrect or difficult to generalize. The I/O Trace Initiative aims to improve the scientific community's understanding of I/O operations by building a searchable, collaborative archive of I/O traces from a wide range of applications and machines, with a focus on high-performance computing and scalable AI/ML. The initiative makes I/O trace data more accessible by enabling users to locate and compare traces based on user-specified criteria. It also provides a visual analytics platform for in-depth analysis, paving the way for advanced performance optimization techniques. By acting as a hub for trace data, the initiative fosters collaborative research through data sharing and collective learning.
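The criteria-based search the abstract describes can be sketched in miniature. The snippet below filters a small collection of trace metadata records by user-specified criteria; the field names (`app`, `machine`, `api`, `agg_perf_mb_s`) and the sample records are illustrative assumptions, not the initiative's actual schema or data.

```python
# Hypothetical sketch of criteria-based trace search, assuming a simple
# flat metadata schema. The fields and records below are invented for
# illustration only.
from typing import Any

TRACES: list[dict[str, Any]] = [
    {"app": "hacc_io", "machine": "cori", "api": "MPI-IO", "agg_perf_mb_s": 12000},
    {"app": "vpic_io", "machine": "cori", "api": "HDF5", "agg_perf_mb_s": 8500},
    {"app": "resnet50", "machine": "summit", "api": "POSIX", "agg_perf_mb_s": 3100},
]

def search_traces(criteria: dict[str, Any]) -> list[dict[str, Any]]:
    """Return traces whose metadata matches every user-specified criterion."""
    return [t for t in TRACES if all(t.get(k) == v for k, v in criteria.items())]

# Locate all MPI-IO traces collected on one machine.
matches = search_traces({"machine": "cori", "api": "MPI-IO"})
print([t["app"] for t in matches])
```

In a production archive this exact-match filter would be replaced by a full-text and range-query engine (the paper's bibliography points to Elasticsearch as one such backend), but the user-facing contract is the same: a set of criteria in, a set of matching traces out.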