Abstract
Fault tolerance and failure recovery in scientific workflows are still relatively young topics. The work done in the domain so far mostly applies classic fault-tolerance mechanisms, such as "alternative versions" and "checkpointing", to scientific workflows. Scientific workflow systems often simply rely on the fault-tolerance capabilities provided by their third-party subcomponents, such as schedulers, Grid resources, or the underlying operating systems. When failures occur at these underlying layers, a workflow system typically sees them only as failed steps in the process, without additional detail, and its ability to recover from those failures may be limited. In this paper, we present an architecture that aims to address this limitation for Kepler-based scientific workflows by providing more information about the failures and faults we have observed, and by implementing more comprehensive failure coverage and recovery options. We discuss our framework in the context of the failures observed in two production-level Kepler-based workflows, XGC and S3D. The framework is divided into three major components: (i) a general contingency Kepler actor that provides recovery-block functionality at the workflow level, (ii) an external monitoring module that tracks the underlying workflow components and monitors the overall health of the workflow execution, and (iii) a checkpointing mechanism that provides smart resume capabilities for cases in which an unrecoverable error occurs. The framework takes advantage of the provenance data collected by the Kepler-based workflows to detect failures and to support fault-tolerance decision making.
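As an illustration of two of the mechanisms named above, the following minimal Python sketch shows a recovery block (try a primary routine, fall back to alternates, and accept only results that pass an acceptance test) and a checkpoint-based resume that skips steps already completed in an earlier run. It is not the paper's implementation and does not use Kepler's API: the names `run_with_recovery` and `run_workflow`, the acceptance-test callback, and the JSON checkpoint file (standing in here for recorded provenance) are all assumptions made for the example.

```python
import json
import os


def run_with_recovery(primary, alternates, accept):
    """Recovery block: try the primary, then each alternate in order.

    A result is returned only if it passes the acceptance test `accept`.
    (Hypothetical helper; not part of Kepler.)
    """
    errors = []
    for attempt in [primary, *alternates]:
        try:
            result = attempt()
        except Exception as exc:  # a failed step, seen at the workflow level
            errors.append(f"{attempt.__name__}: {exc}")
            continue
        if accept(result):
            return result
        errors.append(f"{attempt.__name__}: rejected by acceptance test")
    raise RuntimeError("all alternatives failed: " + "; ".join(errors))


def run_workflow(steps, checkpoint_path):
    """Run (name, callable) steps in order, resuming from a checkpoint.

    Results of finished steps are stored in a JSON file (an assumed
    stand-in for provenance data), so a rerun skips completed steps.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)
    for name, step in steps:
        if name in done:
            continue  # completed in an earlier run
        done[name] = step()
        # Write to a temporary file and rename it, so an interrupted
        # write does not leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp_path, checkpoint_path)
    return done
```

In this sketch a step that raises an exception aborts `run_workflow`, but the results of the steps before it stay in the checkpoint file, so calling `run_workflow` again with the same path reruns only the unfinished steps. Wrapping an individual step in `run_with_recovery` gives it alternates to try before the whole run is abandoned.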
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mouallem, P., Crawl, D., Altintas, I., Vouk, M., Yildiz, U. (2010). A Fault-Tolerance Architecture for Kepler-Based Distributed Scientific Workflows. In: Gertz, M., Ludäscher, B. (eds) Scientific and Statistical Database Management. SSDBM 2010. Lecture Notes in Computer Science, vol 6187. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13818-8_31
DOI: https://doi.org/10.1007/978-3-642-13818-8_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13817-1
Online ISBN: 978-3-642-13818-8
eBook Packages: Computer Science, Computer Science (R0)