A Fault-Tolerance Architecture for Kepler-Based Distributed Scientific Workflows

  • Conference paper
Scientific and Statistical Database Management (SSDBM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6187)

Abstract

Fault tolerance and failure recovery in scientific workflows are still relatively young topics. Work in the domain so far mostly applies classic fault-tolerance mechanisms, such as "alternative versions" and "checkpointing", to scientific workflows. Scientific workflow systems often simply rely on the fault-tolerance capabilities of their third-party subcomponents, such as schedulers, Grid resources, or the underlying operating systems. When failures occur at those underlying layers, a workflow system typically sees them only as failed steps in the process, without additional detail, and its ability to recover from them may be limited. In this paper, we present an architecture that tries to address this problem for Kepler-based scientific workflows by providing more information about the failures and faults we have observed, together with a supporting implementation that offers more comprehensive failure coverage and recovery options. We discuss our framework in the context of the failures observed in two production-level Kepler-based workflows, XGC and S3D. The framework is divided into three major components: (i) a general contingency Kepler actor that provides recovery-block functionality at the workflow level, (ii) an external monitoring module that tracks the underlying workflow components and monitors the overall health of the workflow execution, and (iii) a checkpointing mechanism that provides smart resume capabilities for cases in which an unrecoverable error occurs. The framework uses the provenance data collected by Kepler-based workflows to detect failures and to support fault-tolerance decision making.
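To make two of the mechanisms named in the abstract more concrete, the following is a minimal Python sketch of the general recovery-block pattern (component i) and of checkpoint-based resume (component iii). It is not the paper's Kepler implementation: the names `recovery_block`, `run_with_checkpoint`, `flaky_primary`, and `stable_backup`, as well as the JSON checkpoint file, are assumptions introduced purely for illustration, and a real Kepler actor would be written against Kepler's own Java actor interfaces and provenance store.

```python
import json
import os


def recovery_block(alternates, acceptance_test, value):
    """Try each alternate in order and return the first result that
    passes the acceptance test (the classic recovery-block pattern)."""
    errors = []
    for alternate in alternates:
        try:
            result = alternate(value)
        except Exception as exc:  # a failed alternate triggers the next one
            errors.append(f"{alternate.__name__}: {exc!r}")
            continue
        if acceptance_test(result):
            return result
        errors.append(f"{alternate.__name__}: rejected by acceptance test")
    raise RuntimeError("all alternates failed: " + "; ".join(errors))


def run_with_checkpoint(steps, checkpoint_path):
    """Run (name, function) steps in order, skipping any step whose result
    is already recorded in the checkpoint file (hypothetical JSON format)."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)
    executed = []
    for name, func in steps:
        if name in done:
            continue  # completed in an earlier run: resume past it
        done[name] = func()  # an exception here leaves the checkpoint intact
        executed.append(name)
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp_path, checkpoint_path)  # atomic update of the checkpoint
    return done, executed


# Hypothetical alternates standing in for a primary and a backup resource.
def flaky_primary(x):
    raise ConnectionError("primary resource unavailable")


def stable_backup(x):
    return x * 2
```

Calling `recovery_block([flaky_primary, stable_backup], lambda r: r > 0, 3)` falls through to the backup after the primary raises. Rerunning `run_with_checkpoint` after a failed step re-executes only the steps that were not yet recorded, which is the sense of "smart resume" assumed in this sketch.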

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mouallem, P., Crawl, D., Altintas, I., Vouk, M., Yildiz, U. (2010). A Fault-Tolerance Architecture for Kepler-Based Distributed Scientific Workflows. In: Gertz, M., Ludäscher, B. (eds) Scientific and Statistical Database Management. SSDBM 2010. Lecture Notes in Computer Science, vol 6187. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13818-8_31

  • DOI: https://doi.org/10.1007/978-3-642-13818-8_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13817-1

  • Online ISBN: 978-3-642-13818-8

  • eBook Packages: Computer Science, Computer Science (R0)
