Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

Sun, Dawei; Chang, Guiran; Miao, Changsheng; Wang, Xingwei

doi:10.1007/s11227-013-0898-7

Published: 21 March 2013

Volume 66, pages 193–228, (2013)

The Journal of Supercomputing

Dawei Sun^1,2, Guiran Chang^3, Changsheng Miao^1 & Xingwei Wang^1

Abstract

Failures are normal rather than exceptional in cloud computing environments. Because fault tolerance plays a key role in ensuring cloud serviceability, achieving high fault tolerance is one of the major obstacles to opening up a new era of highly serviceable cloud computing. Fault-tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet a high level of cloud SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed with reference to the fault tolerance theories suitable for large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two fault tolerance strategies, namely the checkpointing fault tolerance strategy and the data replication fault tolerance strategy; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results demonstrate that DAFT has high potential: it provides efficient fault tolerance enhancements, significant improvement in cloud serviceability, and high SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.
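
The abstract does not reproduce the paper's models, so the following is only a rough sketch of how a failure rate can drive both a checkpoint interval and a replica count. It does not implement DAFT. It assumes two textbook approximations that are not taken from the paper: Young's first-order checkpoint interval, sqrt(2C/λ), and the availability of n independently failing replicas, 1 - (1 - a)^n. The function and parameter names are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, failure_rate_per_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumption (not from the paper): failures arrive at a constant rate
    lambda, and each checkpoint costs C seconds. The interval is
    sqrt(2 * C / lambda).
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("checkpoint cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)


def replicas_needed(replica_availability: float, target_availability: float) -> int:
    """Smallest replica count n such that 1 - (1 - a)^n >= target.

    Assumption (not from the paper): replicas fail independently and each
    is available with probability a.
    """
    if not 0.0 < replica_availability < 1.0:
        raise ValueError("replica availability must be in (0, 1)")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target availability must be in (0, 1)")
    unavailable = 1.0 - replica_availability
    n = 1
    # Add replicas until the combined availability reaches the target.
    while 1.0 - unavailable ** n < target_availability:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    for rate in (1e-5, 1e-4, 1e-3):
        interval = young_checkpoint_interval(50.0, rate)
        print(f"failure rate {rate:g}/s: checkpoint every {interval:.0f} s")
    print("replicas for 99.5% availability at 90% each:", replicas_needed(0.9, 0.995))
```

Under these assumptions, a higher failure rate shortens the checkpoint interval, and a lower per-replica availability raises the replica count. An adaptive strategy would need to balance both effects against overhead.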



Acknowledgements

This work is supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. 61225012; the National Natural Science Foundation of China under Grant No. 61070162, No. 71071028 and No. 70931001; the Specialized Research Fund of the Doctoral Program of Higher Education for the Priority Development Areas under Grant No. 20120042130003; the Specialized Research Fund for the Doctoral Program of Higher Education under Grant No. 20100042110025 and No. 20110042110024; the Specialized Development Fund for the Internet of Things from the Ministry of Industry and Information Technology of the P.R. China; and the Fundamental Research Funds for the Central Universities under Grant No. N100604012 and No. N110204003. The authors gratefully thank Junling Hu for her help and comments.

Author information

Authors and Affiliations

School of Information Science and Engineering, Northeastern University, Shenyang, 110819, P.R. China
Dawei Sun, Changsheng Miao & Xingwei Wang
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, P.R. China
Dawei Sun
Computing Center, Northeastern University, Shenyang, 110819, P.R. China
Guiran Chang

Corresponding author

Correspondence to Dawei Sun.

About this article

Cite this article

Sun, D., Chang, G., Miao, C. et al. Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments. J Supercomput 66, 193–228 (2013). https://doi.org/10.1007/s11227-013-0898-7

Published: 21 March 2013
Issue Date: October 2013
DOI: https://doi.org/10.1007/s11227-013-0898-7
