Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud

Kumari, Priti; Kaur, Parmeet

doi:10.1007/s11277-020-07949-0

Published: 16 November 2020

Volume 117, pages 1853–1877 (2021)

Wireless Personal Communications

Abstract

Cloud computing offers seemingly unlimited resources and a suitable environment for executing large-scale computing applications. However, it is also susceptible to frequent failures, which can adversely affect both users and service providers. Fault tolerance techniques are therefore necessary for the reliable execution of applications in the cloud. This work presents checkpointing-based fault tolerance protocols for two types of distributed applications. The first type is Bag-of-Tasks (BoT) applications, in which an application comprises a set of independent tasks that do not communicate with each other during execution. Because the tasks are independent, an uncoordinated checkpointing algorithm is proposed for fault tolerance of BoT applications. The second type is large-scale distributed applications composed of multiple tasks that depend on each other through inter-task message passing. An uncoordinated checkpointing and message logging protocol is presented for this type of application. The proposed protocols use storage at the edge switches of a data center to reduce the bandwidth consumed in saving checkpoints and message logs. Simulation results demonstrate that the proposed protocols provide a higher rate of successful recovery from failures and incur lower resource overhead than other contemporary and related schemes.
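
The idea of uncoordinated checkpointing for independent tasks can be illustrated with a minimal Python sketch. It is not the protocol proposed in the article, whose details are not given in the abstract; it only shows the general technique under simplifying assumptions. The names `CheckpointStore`, `run_task`, and `run_bag` are hypothetical, the store is an in-memory dictionary standing in for checkpoint storage near the workers (for example at an edge switch), and failures are injected artificially. Each task saves and restores only its own state, so a failed task restarts from its latest checkpoint without rolling back any other task.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointStore:
    """Hypothetical stand-in for checkpoint storage close to the workers."""
    saved: dict = field(default_factory=dict)

    def save(self, task_id, state):
        # Copy the state so later changes in the task do not alter the checkpoint.
        self.saved[task_id] = dict(state)

    def load(self, task_id):
        # A task with no checkpoint starts from its initial state.
        return dict(self.saved.get(task_id, {"next_item": 0, "total": 0}))


def run_task(task_id, items, store, interval=3, fail_at=None):
    """Sum `items`, checkpointing after every `interval` processed items.

    The task resumes from its own latest checkpoint; no other task is consulted.
    `fail_at` simulates a crash just before the item at that index is processed.
    """
    state = store.load(task_id)
    for i in range(state["next_item"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"task {task_id} failed at item {i}")
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % interval == 0:
            store.save(task_id, state)
    return state["total"]


def run_bag(bag, store, failures=None):
    """Run every task in the bag; restart a failed task from its last checkpoint."""
    failures = dict(failures or {})
    results = {}
    for task_id, items in bag.items():
        try:
            results[task_id] = run_task(
                task_id, items, store, fail_at=failures.pop(task_id, None)
            )
        except RuntimeError:
            # Recovery touches only the failed task: work done since its last
            # checkpoint is repeated, and all other tasks are unaffected.
            results[task_id] = run_task(task_id, items, store)
    return results
```

For instance, calling `run_bag({"t1": [1, 2, 3, 4, 5], "t2": [10, 20, 30, 40, 50, 60, 70]}, CheckpointStore(), failures={"t2": 5})` makes task `t2` crash after five items; it then resumes from the checkpoint taken after its third item and repeats only the two items processed since. Applications whose tasks exchange messages need more than this, since a restarted task must also replay logged messages to stay consistent with its peers.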

Author information

Authors and Affiliations

Department of CSE/IT, Jaypee Institute of Information Technology, Noida, India
Priti Kumari & Parmeet Kaur

Corresponding author

Correspondence to Parmeet Kaur.

About this article

Cite this article

Kumari, P., Kaur, P. Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud. Wireless Pers Commun 117, 1853–1877 (2021). https://doi.org/10.1007/s11277-020-07949-0

Accepted: 05 November 2020
Published: 16 November 2020
Issue Date: April 2021
DOI: https://doi.org/10.1007/s11277-020-07949-0
