Abstract
With the development of cloud-edge collaborative computing, more and more cloud applications are being migrated to edge devices. Some cloud applications deployed in relatively unstable edge environments place higher demands on fault tolerance. We therefore design and implement a flexible supervision system. The system detects faults at a higher frequency than existing cloud management platforms such as Kubernetes, and it implements a more efficient checkpoint-restart fault-handling scheme based on a distributed in-memory database. We also aim to minimize the extra time cost introduced by fault-tolerance operations and to conserve cloud system resources, including computing, storage, and network.
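The abstract names two mechanisms: frequent fault detection and checkpoint-restart backed by a distributed in-memory database. The Python sketch below shows one way such a supervision loop could fit together. It is not the authors' implementation. It assumes heartbeat-timeout detection on a simulated integer clock and uses a plain dict as a stand-in for the distributed in-memory database. The names `InMemoryStore`, `App`, `Supervisor`, and `simulate`, the `timeout` parameter, and the five-tick checkpoint interval are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class InMemoryStore:
    """Hypothetical stand-in for a distributed in-memory database (a local dict here)."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        self._data[key] = dict(value)  # store a copy so later mutation does not leak in

    def get(self, key: str) -> dict | None:
        value = self._data.get(key)
        return dict(value) if value is not None else None


@dataclass
class App:
    """Hypothetical application whose progress is a single step counter."""
    name: str
    state: dict = field(default_factory=lambda: {"step": 0})
    last_heartbeat: int = 0
    alive: bool = True

    def work(self, now: int) -> None:
        # A live app makes progress and emits a heartbeat; a crashed app does neither.
        if self.alive:
            self.state["step"] += 1
            self.last_heartbeat = now


class Supervisor:
    """Detects faults by heartbeat timeout and restarts from the last checkpoint."""

    def __init__(self, store: InMemoryStore, timeout: int) -> None:
        self.store = store
        self.timeout = timeout  # assumed knob: a shorter timeout means faster detection

    def checkpoint(self, app: App) -> None:
        self.store.put(app.name, app.state)

    def is_faulty(self, app: App, now: int) -> bool:
        return now - app.last_heartbeat > self.timeout

    def restart(self, app: App, now: int) -> None:
        # Roll back to the most recent checkpoint, or to the initial state if none exists.
        saved = self.store.get(app.name)
        app.state = saved if saved is not None else {"step": 0}
        app.alive = True
        app.last_heartbeat = now


def simulate() -> App:
    store = InMemoryStore()
    sup = Supervisor(store, timeout=3)
    app = App("edge-app")
    for now in range(1, 21):  # simulated clock, one tick per iteration
        if now == 8:
            app.alive = False  # inject a crash at tick 8
        app.work(now)
        if now % 5 == 0 and app.alive:
            sup.checkpoint(app)  # periodic checkpoint every 5 ticks
        if sup.is_faulty(app, now):
            sup.restart(app, now)
    return app
```

In this run the crash at tick 8 is detected at tick 11, once the heartbeat is more than three ticks old. The app restarts from the tick-5 checkpoint, so the progress made after that checkpoint is lost. The trade-off the abstract points to is visible in the two knobs: a smaller `timeout` detects faults sooner, and more frequent checkpoints lose less work but cost more time and storage.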
Acknowledgments
The work described in this paper was supported in part by the Key Basic Research Program of the China Basic Strengthening Program (2019-JCJQ-ZD-041).
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Cai, W., Chen, H., Zhuo, Z., Wang, Z., An, N. (2022). Flexible Supervision System: A Fast Fault-Tolerance Strategy for Cloud Applications in Cloud-Edge Collaborative Environments. In: Liu, S., Wei, X. (eds) Network and Parallel Computing. NPC 2022. Lecture Notes in Computer Science, vol 13615. Springer, Cham. https://doi.org/10.1007/978-3-031-21395-3_10
DOI: https://doi.org/10.1007/978-3-031-21395-3_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21394-6
Online ISBN: 978-3-031-21395-3
eBook Packages: Computer Science, Computer Science (R0)