DOI: 10.1145/1851476.1851525
research-article

Design and evaluation of a self-healing Kepler for scientific workflows

Published: 21 June 2010

ABSTRACT

Kepler is a popular open-source scientific workflow (SWF) system because its visual interface simplifies the construction of complex data flow models. As the workflow applications that run on heterogeneous distributed systems grow more complex, fault management becomes a critical design issue for large-scale scientific and engineering applications. Because these applications have long execution times, it is important that they be fault tolerant; i.e., the workflow application can recover gracefully from faults without restarting from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend Kepler to support fault-tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly and autonomically whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We have evaluated FT-Kepler on a distributed application used by ecosystem researchers, measuring the performance of the workflow under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint mechanism overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that the number of checkpoints should be increased as the fault rate increases.
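The trade-off described in the abstract, recovering from the last checkpoint instead of restarting from the beginning, can be illustrated with a toy model. The sketch below is not the FT-Kepler implementation and does not reproduce the paper's measurements. It assumes a strictly linear workflow in which each fault occurs once, and it ignores the cost of writing a checkpoint. The function name `run_with_checkpoints` and its parameters are hypothetical.

```python
def run_with_checkpoints(num_steps, checkpoint_interval, fault_steps):
    """Count step executions needed to finish a linear workflow.

    Toy model (assumption, not FT-Kepler): steps are numbered
    1..num_steps, state is saved after every `checkpoint_interval`
    completed steps, and each step listed in `fault_steps` fails on
    its first attempt only.
    Returns (total_step_executions, checkpoints_taken).
    """
    pending_faults = set(fault_steps)
    last_checkpoint = 0  # last step whose state was saved (0 = start)
    executed = 0         # step executions, including redone work
    checkpoints = 0
    step = 1
    while step <= num_steps:
        executed += 1
        if step in pending_faults:
            # Fault detected: roll back to the step after the last checkpoint.
            pending_faults.discard(step)
            step = last_checkpoint + 1
            continue
        if step % checkpoint_interval == 0:
            last_checkpoint = step
            checkpoints += 1
        step += 1
    return executed, checkpoints


if __name__ == "__main__":
    faults = [50, 95]
    # A single checkpoint at the very end means every fault restarts the run.
    print(run_with_checkpoints(100, 100, faults))  # (245, 1)
    # Checkpointing every 10 steps bounds the work lost per fault.
    print(run_with_checkpoints(100, 10, faults))   # (115, 10)
```

In this model, more frequent checkpoints reduce the work redone after each fault, which matches the abstract's qualitative conclusion that the number of checkpoints should grow with the fault rate. A realistic model would also charge a cost per checkpoint.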


    • Published in

      HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
      June 2010
      911 pages
      ISBN:9781605589428
      DOI:10.1145/1851476

      Copyright © 2010 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 166 of 966 submissions, 17%
