ABSTRACT
Large-scale HPC workflows are increasingly implemented in dynamic languages such as Python, which allow for more rapid development than traditional techniques. However, the cost of executing Python applications at scale is often dominated by the distribution of common datasets and complex software dependencies. As an application scales up, data distribution becomes a bottleneck that prevents scaling beyond a few hundred nodes. To address this problem, we present the integration of Parsl (a Python-native parallel programming library) with TaskVine (a data-intensive workflow execution engine). Instead of relying on a shared filesystem to deliver data to tasks on demand, Parsl expresses data needs to TaskVine in advance, and TaskVine then performs efficient data distribution at runtime. This combination yields an overall workflow speedup of 1.48x over the typical method of on-demand paging from the shared filesystem, and an average per-task speedup of 1.79x with 2048 tasks on 256 nodes.
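To make the integration concrete, the following is a minimal configuration sketch showing how a Parsl workflow can be routed through TaskVine so that input files are declared up front and cached on workers rather than paged on demand from shared storage. The executor and config class names follow Parsl's public `parsl.executors.taskvine` module, but the port number, file name, and app function here are illustrative assumptions, not details from the paper:

```python
# Sketch: running Parsl apps through the TaskVine executor.
# Declaring files via inputs= lets TaskVine know data needs in advance,
# so it can distribute and cache them across cluster nodes at runtime.
import parsl
from parsl import python_app, File
from parsl.config import Config
from parsl.executors.taskvine import TaskVineExecutor, TaskVineManagerConfig

config = Config(
    executors=[
        TaskVineExecutor(
            label="taskvine",
            # Port is an illustrative choice; workers connect to the manager here.
            manager_config=TaskVineManagerConfig(port=9123),
        )
    ]
)
parsl.load(config)

@python_app
def count_lines(inputs=()):
    # The input file has already been staged to this worker by TaskVine,
    # so the open() below reads from node-local storage.
    with open(inputs[0].filepath) as f:
        return sum(1 for _ in f)

# A hypothetical shared dataset; TaskVine caches it on workers so that
# repeated tasks do not re-fetch it from the shared filesystem.
future = count_lines(inputs=[File("shared_dataset.txt")])
```

The key difference from a plain shared-filesystem setup is the explicit `inputs=[File(...)]` declaration: it turns an implicit on-demand read into an advance data dependency that the execution engine can schedule and replicate.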
Index Terms
- Maximizing Data Utility for HPC Python Workflow Execution