Failure-aware workflow scheduling in cluster environments

Yu, Zhifeng; Wang, Chenjia; Shi, Weisong

doi:10.1007/s10586-010-0126-7

Published: 13 March 2010

Cluster Computing, Volume 13, pages 421–434 (2010)

Abstract

The goal of workflow application scheduling is to minimize the makespan of each workflow. Scheduling workflow applications in high-performance cluster environments is an NP-complete problem, and it becomes more complicated when potential resource failures are considered. Although failure prediction has attracted growing research attention in recent years as a means of improving system availability and reliability, very little of that work addresses the problem in the context of workflow application scheduling. In this paper, we study how a workflow scheduler benefits from failure prediction and propose FLAW, a failure-aware workflow scheduling algorithm. We observe that prediction accuracy as conventionally defined has different performance implications for different applications and fails to measure how much prediction improves scheduling effectiveness. We therefore propose two definitions of accuracy, Application Oblivious Accuracy (AOA) and Application Aware Accuracy (AAA), from the perspectives of the system and of scheduling, respectively. A comprehensive evaluation using real failure traces shows that FLAW performs well with practically achievable prediction accuracy, reducing the average makespan, the loss time, and the number of job reschedulings.
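
As a rough illustration of the terms used in the abstract, the following Python sketch computes the makespan of a small workflow and picks a node whose predicted failures do not fall inside a task's execution window. It is not the FLAW algorithm, and the abstract does not specify how FLAW uses predictions. The function names (`makespan`, `pick_node`), the diamond-shaped workflow, the node names, and the failure times are all hypothetical, and the makespan calculation assumes enough nodes that every task starts as soon as its predecessors finish.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def makespan(runtime, deps):
    """Finish time of the last task, assuming each task starts as soon as
    all of its predecessors have finished (unlimited nodes assumed)."""
    finish = {}
    # deps maps each task to the set of tasks it depends on.
    for task in TopologicalSorter(deps).static_order():
        start = max((finish[p] for p in deps.get(task, ())), default=0.0)
        finish[task] = start + runtime[task]
    return max(finish.values())


def pick_node(start, runtime, predicted_failures):
    """Return the first node with no predicted failure inside the window
    [start, start + runtime), or None if every node has one."""
    end = start + runtime
    for node, times in predicted_failures.items():
        if not any(start <= t < end for t in times):
            return node
    return None


# Hypothetical four-task diamond workflow: A -> (B, C) -> D
runtime = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 2.0}
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(makespan(runtime, deps))  # 7.0 (A: 0-2, B: 2-5, D: 5-7)

# Hypothetical predicted failure times per node.
predicted = {"n1": [4.0], "n2": [9.0]}
print(pick_node(2.0, 3.0, predicted))  # n2 (n1 is predicted to fail at t=4)
```

In this toy setting, placing task B on `n1` would lose the work done before the failure at t=4 and force a reschedule, which is the kind of loss time and rescheduling the abstract reports FLAW as reducing.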

Author information

Authors and Affiliations

Wayne State University, Detroit, USA
Zhifeng Yu, Chenjia Wang & Weisong Shi

Corresponding author

Correspondence to Weisong Shi.

Additional information

This work is supported in part by National Science Foundation CAREER grant CCF-0643521.

About this article

Cite this article

Yu, Z., Wang, C. & Shi, W. Failure-aware workflow scheduling in cluster environments. Cluster Comput 13, 421–434 (2010). https://doi.org/10.1007/s10586-010-0126-7

Received: 25 June 2008
Accepted: 24 February 2010
Published: 13 March 2010
Issue Date: December 2010
DOI: https://doi.org/10.1007/s10586-010-0126-7
