Demystifying asynchronous I/O Interference in HPC applications

Tseng, Shu-Mei; Nicolae, Bogdan; Cappello, Franck; Chandramowlishwaran, Aparna

doi:10.1177/10943420211016511

Journal Article · May 13, 2021 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/10943420211016511 · OSTI ID: 1831116

Tseng, Shu-Mei ^[1]; Nicolae, Bogdan ^[2]; Cappello, Franck ^[2]; Chandramowlishwaran, Aparna ^[1]

[1] Univ. of California, Irvine, CA (United States)
[2] Argonne National Lab. (ANL), Argonne, IL (United States)

With the increasing complexity of HPC workflows, data management services need to perform expensive I/O operations asynchronously in the background, aiming to overlap the I/O with the application runtime. However, this may cause interference due to competition for resources: CPU, memory bandwidth, and network bandwidth. The advent of multi-core architectures has exacerbated this problem, as many I/O operations are issued concurrently, thereby competing not only with the application but also among themselves. Furthermore, the interference patterns can change dynamically in response to variations in application behavior and I/O subsystems (e.g., multiple users sharing a parallel file system). Without a thorough understanding of this interference, asynchronous I/O operations may perform suboptimally, potentially even worse than in the blocking case. To fill this gap, we investigate the causes and consequences of interference due to asynchronous I/O on HPC systems. Specifically, we focus on multi-core CPUs and memory bandwidth, isolating the interference due to each resource. We then perform an in-depth study to explain the interplay and contention in a variety of resource-sharing scenarios, such as varying the priority and number of background I/O threads and using different I/O strategies (sendfile, read/write, and mmap/write), underlining the trade-offs among them. The insights from this study are important both to enable guided optimizations of existing background I/O and to open new opportunities to design advanced asynchronous I/O strategies.
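For readers unfamiliar with the three I/O strategies named in the abstract, the following is a minimal Python sketch of what each one looks like when copying a file, together with a helper that runs a copy in a background thread. It is an illustration only and is not taken from the paper: the function names, the 1 MiB chunk size, and the use of Python threads are assumptions made here, and `copy_sendfile` assumes a Linux system where `os.sendfile` accepts a regular file as the destination.

```python
import mmap
import os
import threading

CHUNK = 1 << 20  # 1 MiB buffer; an arbitrary choice for this sketch


def copy_read_write(src, dst):
    """read/write: stage every chunk in a user-space buffer."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(CHUNK)
            if not buf:
                break
            fout.write(buf)


def copy_mmap_write(src, dst):
    """mmap/write: map the source file, then write() from the mapping."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        if os.fstat(fin.fileno()).st_size == 0:
            return  # an empty file cannot be mapped
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            fout.write(mapped)


def copy_sendfile(src, dst):
    """sendfile: ask the kernel to move the bytes (Linux assumed)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        size = os.fstat(fin.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(fout.fileno(), fin.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent


def copy_in_background(strategy, src, dst):
    """Start one copy strategy in a background thread and return the thread."""
    worker = threading.Thread(target=strategy, args=(src, dst))
    worker.start()
    return worker
```

A caller would start the copy with `copy_in_background(copy_read_write, src, dst)`, continue computing, and call `join()` on the returned thread before relying on the output file. The sketch does not control thread priority or the number of concurrent background threads, which are among the parameters the study varies.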

Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC02-06CH11357

OSTI ID: 1831116

Journal Information: International Journal of High Performance Computing Applications, Vol. 35, Issue 4; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
I/O interference
asynchronous and concurrent I/O
checkpointing
HPC applications
performance analysis
