ABSTRACT
Kepler is a popular open source scientific workflow (SWF) system because it simplifies the effort required to construct complex data flow models through a visual interface. As the complexity of workflow applications that run on heterogeneous distributed systems increases, fault management becomes a critical design issue for large-scale scientific and engineering applications. Because these applications have long execution times, it is important that they be fault tolerant; i.e., the workflow application can recover gracefully from faults without having to restart from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend Kepler's capabilities to support fault-tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly and autonomically whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We have evaluated FT-Kepler on a distributed application used by ecosystem researchers. We evaluated the performance of the workflow under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint mechanism overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that, as the fault rate increases, the number of checkpoints should be increased.
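The recovery idea described in the abstract (resume from the most recent checkpoint instead of restarting from the beginning) can be illustrated with a minimal sketch. This is not FT-Kepler's implementation and does not use Kepler's API: the names `run_with_checkpoints`, `StepFault`, `make_step`, and the `fail_once_at` fault-injection parameter are hypothetical, and the workflow is assumed to be a simple linear sequence of steps that share an in-memory state dictionary.

```python
import copy
from typing import Callable, Dict, FrozenSet, List, Tuple

Step = Callable[[Dict], None]


class StepFault(Exception):
    """Hypothetical exception signalling that a fault was detected in a step."""


def make_step(n: int) -> Step:
    """Build a toy workflow step that records its index in the shared state."""
    def step(state: Dict) -> None:
        state.setdefault("log", []).append(n)
    return step


def run_with_checkpoints(
    steps: List[Step],
    checkpoint_every: int,
    fail_once_at: FrozenSet[int] = frozenset(),
) -> Tuple[Dict, int]:
    """Run steps in order, saving a checkpoint after every `checkpoint_every`
    completed steps. On a fault, roll back to the last checkpoint and resume.

    `fail_once_at` is an assumed fault-injection hook: each listed step index
    fails the first time it is attempted. Returns the final state and the
    number of step attempts (including failed ones).
    """
    state: Dict = {}
    saved_state, saved_index = copy.deepcopy(state), 0  # initial checkpoint
    pending_faults = set(fail_once_at)
    attempts = 0
    i = 0
    while i < len(steps):
        try:
            attempts += 1
            if i in pending_faults:
                pending_faults.discard(i)
                raise StepFault(f"injected fault at step {i}")
            steps[i](state)
            i += 1
            if i % checkpoint_every == 0:
                saved_state, saved_index = copy.deepcopy(state), i
        except StepFault:
            # Recovery: discard partial work and resume from the checkpoint.
            state = copy.deepcopy(saved_state)
            i = saved_index
    return state, attempts


if __name__ == "__main__":
    workflow = [make_step(n) for n in range(6)]
    for interval in (2, 6):
        final_state, attempts = run_with_checkpoints(workflow, interval, frozenset({3}))
        print(interval, final_state["log"], attempts)
```

In this toy setting, a single fault at step 3 costs 8 step attempts when a checkpoint is taken every 2 steps and 10 attempts when the only checkpoint is at the end of the 6-step run, which is the kind of trade-off between checkpoint frequency and re-executed work that the abstract refers to. The sketch ignores the cost of writing a checkpoint, which a real evaluation would have to measure.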
Index Terms
- Design and evaluation of a self-healing Kepler for scientific workflows