Autonomous Management of Virtual Machine Failures in IaaS Using Fault Tree Analysis

Butoi, Alexandru; Stan, Alexandru; Silaghi, Gheorghe Cosmin

doi:10.1007/978-3-319-14609-6_14

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 8914)

Included in the following conference series:

International Conference on Grid Economics and Business Models

Abstract

Cloud IaaS services introduce elastic delivery of computational resources in virtualized form, with resource management through easy replication of virtual nodes and live migration. In such dynamic and volatile environments, where resources are virtualized, availability and reliability are mandatory for assuring an acceptable quality of service for end users. In this context, specific fault tolerance strategies are needed. Using concepts from fault tree analysis, we propose a distributed and autonomous approach in which each virtualized node can assess and predict its own health state. In our setup, each node can proactively decide whether to accept future jobs, delegate jobs to its own replicated instances, or start a live migration process. We evaluate our model in practice using real Xen log traces.
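
As a rough illustration of the idea in the abstract, the sketch below evaluates a small fault tree and maps the resulting failure probability to one of the three actions mentioned (accept jobs, delegate to a replica, migrate). It is not the model from the paper: the event names, the probabilities, the thresholds, and the assumption that basic events are statistically independent are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Union


@dataclass
class Basic:
    """A basic event with an estimated failure probability (hypothetical values)."""
    name: str
    p: float


@dataclass
class Gate:
    """An AND/OR gate combining child events."""
    kind: str  # "AND" or "OR"
    children: List["Node"]


Node = Union[Basic, Gate]


def failure_probability(node: Node) -> float:
    """Probability of the event at `node`, assuming independent basic events."""
    if isinstance(node, Basic):
        return node.p
    probs = [failure_probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must fail.
        return prod(probs)
    if node.kind == "OR":
        # At least one child fails.
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


def decide(p_fail: float, accept_below: float = 0.05, migrate_above: float = 0.5) -> str:
    """Map a failure probability to an action; both thresholds are assumptions."""
    if p_fail < accept_below:
        return "accept"
    if p_fail < migrate_above:
        return "delegate"
    return "migrate"


# Hypothetical tree: the VM fails if both disks fail OR memory fails.
tree = Gate("OR", [
    Gate("AND", [Basic("disk_a", 0.1), Basic("disk_b", 0.2)]),
    Basic("memory", 0.05),
])

p_top = failure_probability(tree)
action = decide(p_top)
```

In this toy setup each node would recompute `p_top` from its own monitoring data and act locally, which is the sense in which the approach is distributed and autonomous. How the basic-event probabilities are estimated from Xen logs is not covered by this sketch.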

Acknowledgements

This work was co-financed from the European Social Fund through Sectoral Operational Programme Human Resources Development 2007-2013, project number POSDRU/159/1.5/S/134197 “Performance and excellence in doctoral and postdoctoral research in Romanian economics science domain”. G.C. Silaghi acknowledges support from UEFISCDI under project JustASR - PN-II-PT-PCCA-2013-4-1644.

Author information

Authors and Affiliations

Business Information Systems Department, Babeş-Bolyai University, Cluj-Napoca, Romania
Alexandru Butoi, Alexandru Stan & Gheorghe Cosmin Silaghi

Corresponding author

Correspondence to Alexandru Butoi.

Editor information

Editors and Affiliations

Technology, Economics, and Policy Program, College of Engineering, Seoul National University, Gwanak-Gu, Seoul, Republic of Korea
Jörn Altmann
Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
Kurt Vanmechelen
School of Computer Science, Cardiff University, Cardiff, United Kingdom
Omer F. Rana

About this paper

Cite this paper

Butoi, A., Stan, A., Silaghi, G.C. (2014). Autonomous Management of Virtual Machine Failures in IaaS Using Fault Tree Analysis. In: Altmann, J., Vanmechelen, K., Rana, O. (eds) Economics of Grids, Clouds, Systems, and Services. GECON 2014. Lecture Notes in Computer Science, vol 8914. Springer, Cham. https://doi.org/10.1007/978-3-319-14609-6_14

DOI: https://doi.org/10.1007/978-3-319-14609-6_14
Published: 24 December 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14608-9
Online ISBN: 978-3-319-14609-6
eBook Packages: Computer Science, Computer Science (R0)