Autonomous Management of Virtual Machine Failures in IaaS Using Fault Tree Analysis

  • Conference paper
Economics of Grids, Clouds, Systems, and Services (GECON 2014)

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 8914)

Abstract

Cloud IaaS services introduce elastic delivery of virtualized computational resources, with resource management through easy replication of virtual nodes and live migration. In such dynamic and volatile environments, availability and reliability are mandatory for assuring an acceptable quality of service for end users, so specific fault tolerance strategies are needed. Using concepts from fault tree analysis, we propose a distributed and autonomous approach in which each virtualized node can assess and predict its own health state. In our setup, each node can proactively decide whether to accept future jobs, delegate jobs to its own replicated instances, or start a live migration process. We evaluate our model in practice using real Xen log traces.
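
To make the idea in the abstract concrete, the following is a minimal sketch of how a node might evaluate a fault tree and map the result to one of the three actions. It is not the authors' model: the event names, the probabilities, the independence assumption between basic events, and the thresholds `delegate_at` and `migrate_at` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence, Union


@dataclass(frozen=True)
class BasicEvent:
    name: str            # hypothetical label for a low-level fault
    probability: float   # assumed estimate of the fault occurring


@dataclass(frozen=True)
class Gate:
    kind: str                      # "AND" or "OR"
    children: Sequence["Node"]


Node = Union[BasicEvent, Gate]


def failure_probability(node: Node) -> float:
    """Probability of the event at this node, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    probs = [failure_probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must fail for the gate to fail.
        return prod(probs)
    if node.kind == "OR":
        # The gate fails if at least one child fails.
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


def decide(p_fail: float, delegate_at: float = 0.2, migrate_at: float = 0.5) -> str:
    """Map a predicted failure probability to an action (thresholds are assumptions)."""
    if p_fail >= migrate_at:
        return "migrate"
    if p_fail >= delegate_at:
        return "delegate"
    return "accept"


# Hypothetical tree: the VM fails on CPU exhaustion, or on disk AND network faults together.
tree = Gate("OR", [
    BasicEvent("cpu_exhaustion", 0.1),
    Gate("AND", [BasicEvent("disk_fault", 0.5), BasicEvent("network_fault", 0.4)]),
])

p_top = failure_probability(tree)
action = decide(p_top)
```

With these illustrative numbers the top event probability is about 0.28, which falls between the two assumed thresholds, so the node would delegate new jobs to a replica rather than accept them or migrate.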

Acknowledgements

This work was co-financed from the European Social Fund through Sectoral Operational Programme Human Resources Development 2007-2013, project number POSDRU/159/1.5/S/134197 “Performance and excellence in doctoral and postdoctoral research in Romanian economics science domain”. G.C. Silaghi acknowledges support from UEFISCDI under project JustASR - PN-II-PT-PCCA-2013-4-1644.

Author information

Corresponding author

Correspondence to Alexandru Butoi.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Butoi, A., Stan, A., Silaghi, G.C. (2014). Autonomous Management of Virtual Machine Failures in IaaS Using Fault Tree Analysis. In: Altmann, J., Vanmechelen, K., Rana, O. (eds) Economics of Grids, Clouds, Systems, and Services. GECON 2014. Lecture Notes in Computer Science, vol 8914. Springer, Cham. https://doi.org/10.1007/978-3-319-14609-6_14

  • DOI: https://doi.org/10.1007/978-3-319-14609-6_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14608-9

  • Online ISBN: 978-3-319-14609-6

  • eBook Packages: Computer Science, Computer Science (R0)
