ABSTRACT
Computational scientists often need to execute large, loosely coupled parallel applications, such as workflows and bags of tasks, in order to do their research. These applications are typically composed of many short-running, serial tasks that together demand large amounts of computation and storage. To produce results in a reasonable amount of time, scientists would like to execute these applications on petascale resources. This has historically been a challenge because petascale systems are not designed to execute such fine-grained workloads efficiently. In this paper we describe a new approach to executing large, fine-grained workflows on distributed petascale systems. Our solution partitions the workflow into independent subgraphs and submits each subgraph as a self-contained MPI job to the available (often remote) resources. We describe how the partitioning and job management have been implemented in the Pegasus Workflow Management System, and explain how this approach provides an end-to-end solution for challenges related to system architecture, queue policies and priorities, and application reuse and development. Finally, we describe how the system is being used to enable the execution of a very large seismic hazard analysis application on XSEDE resources.
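The partitioning step described above can be illustrated with a minimal sketch: treat the workflow as a DAG and group tasks into weakly connected components, each of which has no dependencies on the others and can therefore be submitted as one self-contained master/worker job. This is a simplification for illustration only; Pegasus's actual partitioner also supports other strategies (the function and task names below are hypothetical).

```python
from collections import defaultdict

def partition_workflow(tasks, edges):
    """Split a workflow DAG into independent subgraphs
    (weakly connected components). Each returned component
    shares no dependencies with the others, so each can be
    submitted as a separate self-contained MPI job."""
    # Build an undirected adjacency map: for partitioning we
    # only care about connectivity, not edge direction.
    neighbors = defaultdict(set)
    for parent, child in edges:
        neighbors[parent].add(child)
        neighbors[child].add(parent)

    seen, components = set(), []
    for task in tasks:
        if task in seen:
            continue
        # Flood-fill one component starting from this task.
        stack, comp = [task], set()
        while stack:
            t = stack.pop()
            if t in comp:
                continue
            comp.add(t)
            stack.extend(neighbors[t] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Two independent task chains -> two subgraph jobs.
tasks = ["a1", "a2", "a3", "b1", "b2"]
edges = [("a1", "a2"), ("a2", "a3"), ("b1", "b2")]
print(partition_workflow(tasks, edges))
# → [['a1', 'a2', 'a3'], ['b1', 'b2']]
```

In a real deployment each component would then be wrapped as a single MPI master/worker job (one rank distributing the component's tasks to the others), so the batch scheduler sees a few large parallel jobs instead of thousands of short serial ones.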
Index Terms
- Enabling large-scale scientific workflows on petascale resources using MPI master/worker