ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing

Xu, Chen; Yang, Yi; Pan, Qingfeng; Zhou, Hongfu

doi:10.1007/978-3-031-25158-0_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13421)

Included in the following conference series:

Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data

Abstract

Iterative computation in distributed graph processing systems typically incurs a long runtime. Hence, it is crucial for graph processing to tolerate and quickly recover from intermittent failures. Existing solutions can be categorized into checkpoint-based and checkpoint-free solutions. The former write checkpoints periodically during execution, which leads to significant overhead. In contrast, the latter require no checkpoints: once a failure happens, they reload the input data and directly reset the values of the lost vertices. However, reloading input data involves repartitioning, which incurs additional overhead. Moreover, we observe that checkpoint-free solutions cannot effectively handle failures for graph algorithms with topological mutations. To address these issues, we propose ACF2 with a partition-aware backup strategy and an incremental protocol. In particular, the partition-aware backup strategy backs up the sub-graphs of all nodes after initial partitioning. Once a failure happens, it recovers the lost sub-graphs from the backups and then resumes computation as a checkpoint-free solution would. To effectively handle failures involving topological mutations, the incremental protocol logs topological mutations during normal execution, and the log is exploited during recovery. We implement ACF2 based on Apache Giraph, and our experiments show that ACF2 significantly outperforms existing solutions.
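The recovery flow described in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions, not the authors' Giraph implementation: the names `ToyCluster`, `Worker`, `add_edge`, `remove_edge`, `fail`, and `recover` are hypothetical, vertices are assumed to have integer ids, partitioning is assumed to be a simple modulo hash, and the backups and mutation logs are held in plain Python lists standing in for storage that survives a worker failure.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Worker:
    """One simulated node: its sub-graph (vertex -> out-neighbours) and vertex values."""
    edges: dict = field(default_factory=dict)
    values: dict = field(default_factory=dict)


class ToyCluster:
    """Hypothetical stand-in for a cluster; not an Apache Giraph API."""

    def __init__(self, edge_list, num_workers, initial_value=0.0):
        self.num_workers = num_workers
        self.initial_value = initial_value
        self.workers = [Worker() for _ in range(num_workers)]
        # Initial partitioning: each vertex goes to the worker chosen by owner().
        for src, dst in edge_list:
            for v in (src, dst):
                w = self.workers[self.owner(v)]
                w.edges.setdefault(v, set())
                w.values.setdefault(v, initial_value)
            self.workers[self.owner(src)].edges[src].add(dst)
        # Partition-aware backup: copy every sub-graph once, after partitioning.
        self.backups = [copy.deepcopy(w.edges) for w in self.workers]
        # Incremental protocol: one mutation log per partition.
        self.mutation_log = [[] for _ in range(num_workers)]

    def owner(self, vertex):
        # Assumed partitioning function (modulo hash on integer ids).
        return vertex % self.num_workers

    def add_edge(self, src, dst):
        # Only the source side is tracked, to keep the sketch short.
        wid = self.owner(src)
        self.workers[wid].edges.setdefault(src, set()).add(dst)
        self.workers[wid].values.setdefault(src, self.initial_value)
        self.mutation_log[wid].append(("add", src, dst))

    def remove_edge(self, src, dst):
        wid = self.owner(src)
        self.workers[wid].edges.get(src, set()).discard(dst)
        self.mutation_log[wid].append(("remove", src, dst))

    def fail(self, wid):
        # Simulate losing everything a worker held in memory.
        self.workers[wid] = Worker()

    def recover(self, wid):
        # 1. Restore the lost sub-graph from its backup: no reload, no repartitioning.
        edges = copy.deepcopy(self.backups[wid])
        # 2. Replay the logged topological mutations in order.
        for op, src, dst in self.mutation_log[wid]:
            if op == "add":
                edges.setdefault(src, set()).add(dst)
            else:
                edges.get(src, set()).discard(dst)
        # 3. Reset the values of the lost vertices, as a checkpoint-free solution does.
        values = {v: self.initial_value for v in edges}
        self.workers[wid] = Worker(edges=edges, values=values)
```

In this sketch, recovering a worker touches only its own backup and its own log, so surviving workers keep their current vertex values and the computation can continue from them. How the real system stores backups and logs, and how it resumes the iteration, is not specified in the abstract.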

Acknowledgments

This work has been supported by the National Natural Science Foundation of China (No. 61902128).

Author information

Authors and Affiliations

East China Normal University, Shanghai, China
Chen Xu, Yi Yang & Qingfeng Pan
Shanghai Engineering Research Center of Big Data Management, Shanghai, China
Chen Xu, Yi Yang & Qingfeng Pan
Shanghai Ruanzhong Information Technology Company Limited, Shanghai, China
Hongfu Zhou

Corresponding author

Correspondence to Chen Xu.

Editor information

Editors and Affiliations

Nanjing University of Aeronautics and Astronautics, Nanjing, China
Bohan Li
Newcastle University, Callaghan, NSW, Australia
Lin Yue
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Chuanqi Tao
Jinan University, Guangzhou, China
Xuming Han
Free University of Bozen-Bolzano, Bolzano, Italy
Diego Calvanese
University of Tsukuba, Tsukuba, Japan
Toshiyuki Amagasa

About this paper

Cite this paper

Xu, C., Yang, Y., Pan, Q., Zhou, H. (2023). ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing. In: Li, B., Yue, L., Tao, C., Han, X., Calvanese, D., Amagasa, T. (eds) Web and Big Data. APWeb-WAIM 2022. Lecture Notes in Computer Science, vol 13421. Springer, Cham. https://doi.org/10.1007/978-3-031-25158-0_5

DOI: https://doi.org/10.1007/978-3-031-25158-0_5
Published: 10 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25157-3
Online ISBN: 978-3-031-25158-0
eBook Packages: Computer Science, Computer Science (R0)
