ABSTRACT
Computational scientists often need to execute large, loosely coupled parallel applications, such as workflows and bags of tasks, in order to do their research. These applications are typically composed of many short-running, serial tasks that together demand large amounts of computation and storage. To produce results in a reasonable amount of time, scientists would like to execute these applications on petascale resources. This has historically been a challenge because petascale systems are not designed to execute such fine-grained workloads efficiently. In this paper we describe a new approach to executing large, fine-grained workflows on distributed petascale systems. Our solution partitions the workflow into independent subgraphs and submits each subgraph as a self-contained MPI job to the available (often remote) resources. We describe how the partitioning and job management have been implemented in the Pegasus Workflow Management System, and explain how this approach provides an end-to-end solution for challenges related to system architecture, queue policies and priorities, and application reuse and development. Finally, we describe how the system is being used to enable the execution of a very large seismic hazard analysis application on XSEDE resources.
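The partitioning step described above can be illustrated with a minimal sketch: treat the workflow as a DAG and group tasks into weakly connected components, each of which has no dependencies on the others and can therefore be submitted as one self-contained master/worker job. This is a simplification for illustration only; Pegasus's actual partitioner also supports other strategies (the function and task names below are hypothetical).

```python
from collections import defaultdict

def partition_workflow(tasks, edges):
    """Split a workflow DAG into independent subgraphs
    (weakly connected components). Each returned component
    shares no dependencies with the others, so each can be
    submitted as a separate self-contained MPI job."""
    # Build an undirected adjacency map: for partitioning we
    # only care about connectivity, not edge direction.
    neighbors = defaultdict(set)
    for parent, child in edges:
        neighbors[parent].add(child)
        neighbors[child].add(parent)

    seen, components = set(), []
    for task in tasks:
        if task in seen:
            continue
        # Flood-fill one component starting from this task.
        stack, comp = [task], set()
        while stack:
            t = stack.pop()
            if t in comp:
                continue
            comp.add(t)
            stack.extend(neighbors[t] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Two independent task chains -> two subgraph jobs.
tasks = ["a1", "a2", "a3", "b1", "b2"]
edges = [("a1", "a2"), ("a2", "a3"), ("b1", "b2")]
print(partition_workflow(tasks, edges))
# → [['a1', 'a2', 'a3'], ['b1', 'b2']]
```

In a real deployment each component would then be wrapped as a single MPI master/worker job (one rank distributing the component's tasks to the others), so the batch scheduler sees a few large parallel jobs instead of thousands of short serial ones.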
Index Terms
- Enabling large-scale scientific workflows on petascale resources using MPI master/worker