
Enabling large-scale scientific workflows on petascale resources using MPI master/worker

DOI: 10.1145/2335755.2335846
Published: 16 July 2012

ABSTRACT

Computational scientists often need to execute large, loosely-coupled parallel applications such as workflows and bags of tasks in order to do their research. These applications are typically composed of many short-running, serial tasks that frequently demand large amounts of computation and storage. In order to produce results in a reasonable amount of time, scientists would like to execute these applications using petascale resources. In the past this has been a challenge because petascale systems are not designed to execute such workloads efficiently. In this paper we describe a new approach to executing large, fine-grained workflows on distributed petascale systems. Our solution involves partitioning the workflow into independent subgraphs and then submitting each subgraph as a self-contained MPI job to the available resources (often remote). We describe how the partitioning and job management have been implemented in the Pegasus Workflow Management System. We also explain how this approach provides an end-to-end solution for challenges related to system architecture, queue policies and priorities, and application reuse and development. Finally, we describe how the system is being used to enable the execution of a very large seismic hazard analysis application on XSEDE resources.
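To make the approach concrete, the following is a minimal sketch of the generic MPI master/worker pattern the abstract describes: rank 0 dispatches tasks from one workflow subgraph to worker ranks, which in a real system would exec the corresponding serial application. This is an illustrative sketch only, not the Pegasus implementation; the task count, message tags, and plain integer task IDs are stand-ins introduced here for clarity.

/* Sketch of an MPI master/worker task dispatcher (illustrative only).
 * Rank 0 hands out task IDs; workers "execute" them and report back.
 * Run with at least 2 MPI ranks. */
#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1   /* message carries a task ID to execute */
#define TAG_DONE 2   /* worker reports a finished task */
#define TAG_STOP 3   /* no more work; worker should exit */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ntasks = 100;  /* number of fine-grained tasks in this subgraph (assumed) */

    if (rank == 0) {         /* master: dispatch tasks, collect acknowledgements */
        int next = 0, active = 0, result;
        MPI_Status st;

        /* seed every worker with one task, or a stop message if none remain */
        for (int w = 1; w < size; w++) {
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        /* as each worker reports back, hand it the next task or tell it to stop */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE, MPI_COMM_WORLD, &st);
            active--;
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            }
        }
    } else {                 /* worker: run tasks until told to stop */
        int task;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            printf("rank %d executing task %d\n", rank, task);  /* stand-in for exec'ing a serial job */
            MPI_Send(&task, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpiexec (for example, mpiexec -n 4 ./master_worker, file name assumed), one such job per subgraph turns a group of fine-grained serial tasks into a single, scheduler-friendly parallel job, which is the key idea behind running these workloads on petascale machines.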

Published in

XSEDE '12: Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
July 2012, 423 pages
ISBN: 9781450316026
DOI: 10.1145/2335755

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall Acceptance Rate: 129 of 190 submissions, 68%
