Abstract
Cloud computing offers virtually unlimited resources and a suitable environment for executing large-scale computing applications. However, it is also susceptible to frequent failures, which can adversely affect both users and service providers. Fault tolerance techniques are therefore necessary for the reliable execution of applications in the cloud. This work presents checkpointing-based fault tolerance protocols for two types of distributed applications. The first type is Bag-of-Tasks (BoT) applications, in which an application comprises a set of independent tasks that do not communicate with each other during execution. Because the tasks are independent, an uncoordinated checkpointing algorithm is proposed for the fault tolerance of BoT applications. The second type is large-scale distributed applications composed of multiple tasks that depend on each other through inter-task message passing. An uncoordinated checkpointing and message logging protocol is presented for this type of application. The proposed protocols use storage at the edge switches of a data center to reduce the bandwidth consumed in saving checkpoints and message logs. Simulation results show that the proposed protocols achieve a higher rate of successful recovery from failures and incur lower resource overhead than other contemporary and related schemes.
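The general idea of uncoordinated checkpointing for independent tasks can be illustrated with a minimal sketch. This is not the protocol proposed in the paper, whose details the abstract does not give. It only shows the textbook pattern: each task saves its own state on its own schedule and, after a failure, rolls back to its own latest checkpoint without involving any other task. The names `Task`, `run_with_checkpoints`, and `run_bag` are hypothetical, the failure model is a simple random draw, and the `store` dictionary is an assumed stand-in for stable storage (for example, storage at an edge switch).

```python
import random


class Task:
    """One independent task in a bag: sums the integers 1..total_steps."""

    def __init__(self, task_id, total_steps, checkpoint_interval):
        self.task_id = task_id
        self.total_steps = total_steps
        self.checkpoint_interval = checkpoint_interval
        self.step = 0  # volatile state, lost on failure
        self.acc = 0   # volatile state, lost on failure

    def snapshot(self):
        """Return a copy of the state that must survive a failure."""
        return {"step": self.step, "acc": self.acc}

    def restore(self, state):
        """Roll back to a previously saved state."""
        self.step = state["step"]
        self.acc = state["acc"]


def run_with_checkpoints(task, store, fail_prob, rng):
    """Run one task to completion and return the number of rollbacks.

    `store` is a hypothetical stand-in for stable storage. The task only
    ever reads and writes its own entry, so no coordination is needed.
    """
    rollbacks = 0
    store[task.task_id] = task.snapshot()  # initial checkpoint
    while task.step < task.total_steps:
        if rng.random() < fail_prob:
            # Simulated failure: discard progress since the last checkpoint.
            task.restore(store[task.task_id])
            rollbacks += 1
            continue
        task.step += 1
        task.acc += task.step
        if task.step % task.checkpoint_interval == 0:
            store[task.task_id] = task.snapshot()
    return rollbacks


def run_bag(n_tasks=4, total_steps=50, fail_prob=0.05, seed=0):
    """Run a bag of independent tasks, each with its own checkpoint interval."""
    rng = random.Random(seed)
    store = {}
    results = {}
    for i in range(n_tasks):
        task = Task(i, total_steps, checkpoint_interval=5 + i)
        run_with_checkpoints(task, store, fail_prob, rng)
        results[i] = task.acc
    return results
```

Because each task's computation is deterministic, re-executing from its last checkpoint yields the same final result as a failure-free run. Tasks that exchange messages lack this property unless received messages are also logged and replayed, which is why the second protocol in the abstract pairs checkpointing with message logging.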
About this article
Cite this article
Kumari, P., Kaur, P. Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud. Wireless Pers Commun 117, 1853–1877 (2021). https://doi.org/10.1007/s11277-020-07949-0
DOI: https://doi.org/10.1007/s11277-020-07949-0