Evaluation of a Failure Prediction Model for Large Scale Cloud Applications

Jassas, Mohammad S.; Mahmoud, Qusay H.

doi:10.1007/978-3-030-47358-7_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12109)

Included in the following conference series:

Canadian Conference on Artificial Intelligence

Abstract

Modern cloud-based applications, including smart homes and smart cities, require high levels of reliability and availability. All cloud services, including hardware and software, experience failures because of their large scale and heterogeneous nature. The main objective of this paper is to develop a failure prediction model that can detect failed jobs early. The proposed model aims to improve resource utilization and increase the efficiency of cloud applications. It is evaluated on three publicly available traces: Google cluster, Mustang, and Trinity. Four different machine learning algorithms were applied to the traces in order to select the most accurate model, and the prediction accuracy was further improved using different feature selection techniques. The evaluation results show that the proposed model achieves high precision, recall, and F1-score.
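The abstract evaluates the model by precision, recall, and F1-score and selects the most accurate of several algorithms. As a minimal sketch, assuming binary job labels where 1 marks a failed job and 0 a job that did not fail, the Python below computes these three metrics for the failed-job class and picks the candidate model with the highest F1-score. The helper names (`precision_recall_f1`, `select_best_model`), the label encoding, the use of F1 as the selection criterion, and the sample predictions are hypothetical illustrations, not the authors' actual procedure or data.

```python
from typing import Dict, Sequence


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int],
                        positive: int = 1) -> Dict[str, float]:
    """Precision, recall and F1 for the positive class (assumed: 1 = failed job)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # failures caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed failures
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def select_best_model(y_true: Sequence[int],
                      predictions_by_model: Dict[str, Sequence[int]]) -> str:
    """Hypothetical selection rule: return the model name with the highest F1-score."""
    return max(predictions_by_model,
               key=lambda name: precision_recall_f1(y_true, predictions_by_model[name])["f1"])


if __name__ == "__main__":
    # Made-up labels for eight jobs; not taken from any trace.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    candidates = {
        "model_a": [1, 0, 1, 0, 0, 1, 1, 0],  # hypothetical classifier output
        "model_b": [1, 1, 1, 1, 1, 1, 1, 1],  # predicts every job fails
    }
    for name, preds in candidates.items():
        print(name, precision_recall_f1(y_true, preds))
    print("best:", select_best_model(y_true, candidates))
```

In this toy run, `model_b` reaches perfect recall by flagging every job, but its precision drops to 0.5, so the F1-score (about 0.67) favors the more balanced `model_a` (0.75). This is why F1 is a reasonable single criterion when failed jobs are the minority class, though other criteria could be used.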

Acknowledgement

The first author would like to thank Umm Al-Qura University, Saudi Arabia for funding this work as part of his graduate scholarship.

Author information

Authors and Affiliations

Department of Electrical, Computer and Software Engineering, Ontario Tech University, Oshawa, ON, L1G 0C5, Canada
Mohammad S. Jassas & Qusay H. Mahmoud

Corresponding author

Correspondence to Mohammad S. Jassas.

Editor information

Editors and Affiliations

National Research Council Canada, Ottawa, ON, Canada
Cyril Goutte
Queen’s University, Kingston, ON, Canada
Xiaodan Zhu

About this paper

Cite this paper

Jassas, M.S., Mahmoud, Q.H. (2020). Evaluation of a Failure Prediction Model for Large Scale Cloud Applications. In: Goutte, C., Zhu, X. (eds) Advances in Artificial Intelligence. Canadian AI 2020. Lecture Notes in Computer Science, vol 12109. Springer, Cham. https://doi.org/10.1007/978-3-030-47358-7_32

DOI: https://doi.org/10.1007/978-3-030-47358-7_32
Published: 06 May 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-47357-0
Online ISBN: 978-3-030-47358-7
eBook Packages: Computer Science, Computer Science (R0)