ABSTRACT
Large-scale HPC workflows are increasingly implemented in dynamic languages such as Python, which allow for more rapid development than traditional techniques. However, the cost of executing Python applications at scale is often dominated by the distribution of common datasets and complex software dependencies. As an application scales up, data distribution becomes a bottleneck that prevents scaling beyond a few hundred nodes. To address this problem, we present the integration of Parsl (a Python-native parallel programming library) with TaskVine (a data-intensive workflow execution engine). Instead of relying on a shared filesystem to deliver data to tasks on demand, Parsl expresses data needs to TaskVine in advance, and TaskVine then performs efficient data distribution at runtime. This combination yields an overall workflow speedup of 1.48x over the typical method of on-demand paging from the shared filesystem, and an average per-task speedup of 1.79x with 2048 tasks on 256 nodes.
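To make the integration concrete, the following is a minimal configuration sketch showing how a Parsl workflow can be routed through TaskVine so that input files are declared up front and cached on workers rather than paged on demand from shared storage. The executor and config class names follow Parsl's public `parsl.executors.taskvine` module, but the port number, file name, and app function here are illustrative assumptions, not details from the paper:

```python
# Sketch: running Parsl apps through the TaskVine executor.
# Declaring files via inputs= lets TaskVine know data needs in advance,
# so it can distribute and cache them across cluster nodes at runtime.
import parsl
from parsl import python_app, File
from parsl.config import Config
from parsl.executors.taskvine import TaskVineExecutor, TaskVineManagerConfig

config = Config(
    executors=[
        TaskVineExecutor(
            label="taskvine",
            # Port is an illustrative choice; workers connect to the manager here.
            manager_config=TaskVineManagerConfig(port=9123),
        )
    ]
)
parsl.load(config)

@python_app
def count_lines(inputs=()):
    # The input file has already been staged to this worker by TaskVine,
    # so the open() below reads from node-local storage.
    with open(inputs[0].filepath) as f:
        return sum(1 for _ in f)

# A hypothetical shared dataset; TaskVine caches it on workers so that
# repeated tasks do not re-fetch it from the shared filesystem.
future = count_lines(inputs=[File("shared_dataset.txt")])
```

The key difference from a plain shared-filesystem setup is the explicit `inputs=[File(...)]` declaration: it turns an implicit on-demand read into an advance data dependency that the execution engine can schedule and replicate.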
Index Terms
- Maximizing Data Utility for HPC Python Workflow Execution